Proceedings of ISP RAS


Texterra: A Framework for Text Analysis

Denis Turdakov, Nikita Astrakhantsev, Yaroslav Nedumov, Andrey Sysoev, Ivan Andrianov, Vladimir Mayorov, Denis Fedorenko, Anton Korshunov, Sergey Kuznetsov.

Abstract

The paper presents a framework for fast text analytics developed in the Texterra project. Texterra is a technology for multilingual text mining based on novel text processing methods that exploit knowledge extracted from user-generated content. It delivers a fast, scalable solution for text mining without expensive customization. Depending on the use case, Texterra can be used as a library, an extensible framework, or a scalable cloud-based service. This paper describes the project in detail, its use cases, and the evaluation results for all developed tools.

Texterra uses Wikipedia as its primary knowledge source to facilitate text mining in arbitrary documents (news, blogs, etc.). We mine the graph of Wikipedia's links to compute semantic relatedness between all concepts described in Wikipedia. As a result, we build a semantic graph with more than 5 million concepts. This graph is used to interpret the meanings and relationships of terms in text documents.
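
The idea of deriving relatedness from a link graph can be illustrated with a minimal Python sketch. The abstract does not state which relatedness measure Texterra uses, so the Dice coefficient over link neighbourhoods below is an assumption chosen for illustration only, and the function name `dice_relatedness` and the toy graph are hypothetical.

```python
from typing import Dict, Set


def dice_relatedness(links: Dict[str, Set[str]], a: str, b: str) -> float:
    """Dice coefficient over the link neighbourhoods of concepts a and b.

    Assumption: two concepts are related to the degree that their pages
    link to the same other pages. This is an illustrative measure, not
    necessarily the one used in Texterra.
    """
    neighbours_a = links.get(a, set())
    neighbours_b = links.get(b, set())
    if not neighbours_a and not neighbours_b:
        return 0.0  # no link information for either concept
    shared = len(neighbours_a & neighbours_b)
    return 2 * shared / (len(neighbours_a) + len(neighbours_b))


# Toy, hand-made link graph (hypothetical data, not taken from Wikipedia).
toy_links: Dict[str, Set[str]] = {
    "Python (programming language)": {"Programming language", "Interpreter", "Guido van Rossum"},
    "Ruby (programming language)": {"Programming language", "Interpreter", "Yukihiro Matsumoto"},
    "Python (snake)": {"Snake", "Reptile"},
}

languages_score = dice_relatedness(
    toy_links, "Python (programming language)", "Ruby (programming language)"
)
snake_score = dice_relatedness(
    toy_links, "Python (programming language)", "Python (snake)"
)
```

In this toy graph the two programming languages share two of their three outgoing links, so they score higher than the programming language and the snake, which share none. A score of this kind is what would let a system prefer one meaning of an ambiguous term such as "Python" given the other concepts found in the same document.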

Despite its large size, Wikipedia lacks information about many domain-specific concepts. To broaden the applicability of the technology, we developed several automatic knowledge extraction tools. These include systems for knowledge extraction from MediaWiki resources and Linked Data resources, as well as a system that extends the knowledge base with concepts described in arbitrary text documents using original information extraction techniques.

In addition, using information from Wikipedia makes it easy to extend Texterra to support new natural languages. The paper presents an evaluation of Texterra on several text processing tasks (part-of-speech tagging, word sense disambiguation, keyword extraction, and sentiment analysis) for English and Russian.

Keywords

Text mining, natural language processing, Wikipedia, computational linguistics, machine learning, knowledge base, semantic ontology, information retrieval, terminology extraction.

Edition

Proceedings of the Institute for System Programming, vol. 26, issue 1, 2014, pp. 421-438.

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2014-26(1)-18
