Proceedings of ISP RAS


Extracting Objects and Their Attributes from Tables in Text Documents

Nikita Astrakhantsev

Abstract

Extracting information from tables is an important and rather complex part of information retrieval.

For extracting objects from HTML tables, we introduce methods for determining table orientation and for processing aggregating objects (such as Total) and scattered headers (super row labels, subheaders).
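
To make two of the tasks named above concrete, here is a minimal illustrative sketch. It is not the method described in the paper. It assumes a table already parsed into a list of rows of strings with the header row removed. The function names, the AGGREGATE_LABELS list, and the type-uniformity heuristic for orientation are all hypothetical choices made for this example.

```python
# Assumed (hypothetical) list of labels that mark aggregating rows.
AGGREGATE_LABELS = {"total", "sum", "overall", "average"}


def is_numeric(cell):
    """Return True if the cell text parses as a number (commas ignored)."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def uniform(cells):
    """Return True if all cells are numeric or all are non-numeric."""
    return len({is_numeric(c) for c in cells}) == 1


def guess_orientation(body):
    """Guess whether objects are laid out in rows or in columns.

    Heuristic (an assumption of this sketch): attribute values of the same
    kind line up along one axis, so the axis with more type-uniform lines
    is taken to be the attribute axis.
    """
    col_score = sum(uniform(col) for col in zip(*body)) / len(body[0])
    row_score = sum(uniform(row) for row in body) / len(body)
    return "objects_in_rows" if col_score >= row_score else "objects_in_columns"


def split_aggregates(rows, label_col=0):
    """Separate ordinary object rows from aggregating rows such as Total."""
    objects, aggregates = [], []
    for row in rows:
        label = row[label_col].strip().lower()
        (aggregates if label in AGGREGATE_LABELS else objects).append(row)
    return objects, aggregates


# Made-up example data: item name, quantity, price.
body = [
    ["Item A", "10", "2.50"],
    ["Item B", "4", "1.25"],
    ["Total", "14", "3.75"],
]
orientation = guess_orientation(body)
objects, aggregates = split_aggregates(body)
```

In this sketch the Total row is set aside rather than discarded, so it stays available for checks such as comparing it with the column sums.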

Keywords

information extraction; information retrieval; natural language processing; table processing; table extraction; semi-structured information extraction; html; wiki markup

Edition

Proceedings of the Institute for System Programming, vol. 21, 2011, pp. 297-310.

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).
