Extracting Objects and Their Attributes from Tables in Text Documents

Nikita Astrakhantsev

Abstract

Information extraction from tables is an important and rather difficult part of information retrieval. For the task of extracting objects from HTML tables, the paper proposes methods that address the following problems: determining table orientation, handling aggregating objects (such as Total), and handling scattered headers (subheaders and cut-in headings).

Keywords

Information extraction, information retrieval, natural language processing, table processing, table extraction, HTML, wiki markup

About the author

Nikita Astrakhantsev

ISP RAS, Moscow
Russia

For citation:

Astrakhantsev N. Extracting Objects and Their Attributes from Tables in Text Documents. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2011;21. (In Russ.)

Content is available under the Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
