Named Entity Recognition in Russian: the Power of Wiki-based Approach.

Authors

Sysoev A., Andrianov I.

Abstract

Named entity recognition and classification is an important natural language processing task aimed at finding words and word sequences that denote named entities of different types in plain text. This challenge was addressed in Task 1 of the FactRuEval-2016 evaluation. In the context of this evaluation, our team, representing the Institute for System Programming of the Russian Academy of Sciences, proposed two approaches that exploit information mined from Wikidata and Wikipedia to improve the quality of named entity detection methods. In the first approach, word2vec word embeddings computed on Wikipedia are used along with basic features in token classification. The second approach uses both Wikipedia and Wikidata to automatically construct a representative corpus for training named entity recognition and classification. Additionally, Wikidata, treated as a property graph, is used to collect named-entity-specific word dictionaries. Our approaches (marked with the identifier 'Orange' in the FactRuEval-2016 organizers’ quality evaluation reports) show promising results: they perform especially well on a well-defined class such as person, while remaining suitable for detecting named entities of other types.
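
As a rough illustration of the first approach, the sketch below concatenates basic token features, a dictionary-membership flag, and a word embedding into one feature vector per token. It is a minimal sketch under stated assumptions, not the authors' implementation: the two-dimensional vectors in `EMBEDDINGS`, the word list in `PERSON_WORDS`, the choice of basic features, and the function name `token_features` are all hypothetical stand-ins for the word2vec embeddings computed on Wikipedia and the dictionaries collected from Wikidata.

```python
from typing import Dict, List, Set

# Hypothetical toy embeddings; the paper uses word2vec vectors computed on Wikipedia.
EMBEDDINGS: Dict[str, List[float]] = {
    "москва": [0.9, 0.1],
    "иван": [0.1, 0.8],
}
EMBEDDING_DIM = 2

# Hypothetical word dictionary, standing in for one collected from Wikidata.
PERSON_WORDS: Set[str] = {"иван", "петров"}


def token_features(token: str) -> List[float]:
    """Build one feature vector: basic features, dictionary flag, embedding."""
    lower = token.lower()
    basic = [
        1.0 if token[:1].isupper() else 0.0,  # starts with a capital letter
        1.0 if token.isdigit() else 0.0,      # consists only of digits
    ]
    in_dictionary = [1.0 if lower in PERSON_WORDS else 0.0]
    # Words without a known embedding fall back to a zero vector.
    embedding = EMBEDDINGS.get(lower, [0.0] * EMBEDDING_DIM)
    return basic + in_dictionary + embedding


# One feature vector per token; a token classifier would be trained on these.
features = [token_features(t) for t in ["Иван", "живет", "в", "Москве"]]
```

The resulting vectors could be passed to any token classifier. Note that the inflected form "Москве" misses the toy embedding table here, which suggests why lemmatization or embeddings trained on surface forms would matter for Russian.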

Edition

Computational Linguistics and Intellectual Technologies (Proceedings of the Annual International Conference “Dialogue”). Issue 15(22). 2016. pp. 746-755.

Research Group

Information Systems
