Application of the RuThes thesaurus and Word2vec vector representations in the lexical typology problem

Polozov I.K. (MSU, Moscow, Russia)
Volkova I.A. (MSU, Moscow, Russia)

Abstract

The article describes the use of the RuThes thesaurus and Word2vec vector representations for determining the lexical typology of languages. The work is motivated by the need for typological studies of languages and by the lack of mature automation tools in this area. The article reviews existing methods for determining lexical typology, noting the advantages and disadvantages of each, and proposes an approach to automated typology extraction. The types of RuThes relations and the text corpora used are described. The semantic zones "push-pull" and "fix-spoil" are studied, and frames for these zones are obtained. The extracted words that realize the semantic zones are analyzed and compared with the results of manual extraction. Three methods for extracting lexical typology are compared: using the thesaurus only, using the thesaurus with Word2vec filtering, and using the thesaurus with the addition of the closest Word2vec words. The methods are evaluated and compared with existing ones; recall, precision, and F-score are calculated for each. For the "push-pull" semantic zone, the best results are achieved by combining the thesaurus with Word2vec filtering. Adding the closest Word2vec words degrades all metrics except the F-score for the "push" semantic zone. Using the thesaurus alone nevertheless yields good results that could be helpful to language researchers. For the "fix-spoil" semantic zone, the best results are achieved by using the thesaurus, filtering, and adding the closest Word2vec words. An explanation for the obtained results is offered. The software was implemented in Python 3, using the Gensim library to generate Word2vec vectors, Scikit-learn for vector comparison, NumPy for array manipulation, Pymorphy2 for lemmatization, NLTK for stopword filtering, and xml.etree for working with the thesaurus. The practical significance of the work lies in the development of an automated method for assisting linguists and in the evaluation of its performance.
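The three compared methods and the reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: the English words, the two-dimensional vectors, the similarity threshold, the number of added neighbours, and all function names are hypothetical, and cosine similarity is assumed as the measure of vector closeness. In the paper the candidates come from RuThes relations and the vectors from a Word2vec model, so the numbers this toy example produces say nothing about the paper's results.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_by_similarity(candidates, seed, vectors, threshold):
    """Method 2: keep thesaurus candidates whose vector is close to the seed."""
    return {
        w for w in candidates
        if w in vectors and cosine(vectors[w], vectors[seed]) >= threshold
    }


def expand_with_neighbours(candidates, seed, vectors, top_n):
    """Method 3: add the top_n words nearest to the seed that are not yet candidates."""
    others = [w for w in vectors if w != seed and w not in candidates]
    others.sort(key=lambda w: cosine(vectors[w], vectors[seed]), reverse=True)
    return set(candidates) | set(others[:top_n])


def precision_recall_f1(predicted, gold):
    """Precision, recall, and F-score of a predicted word set against a gold set."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data standing in for Word2vec vectors.
vectors = {
    "push": (1.0, 0.0),
    "shove": (0.9, 0.1),
    "press": (0.8, 0.3),
    "promote": (0.1, 1.0),   # related in the toy thesaurus, distant in vector space
    "nudge": (0.95, 0.05),   # absent from the toy thesaurus, close in vector space
    "table": (0.0, 1.0),
}
seed = "push"
thesaurus_candidates = {"shove", "press", "promote"}  # Method 1: thesaurus only
gold = {"shove", "press", "nudge"}                    # stand-in for manual extraction

filtered = filter_by_similarity(thesaurus_candidates, seed, vectors, threshold=0.5)
expanded = expand_with_neighbours(thesaurus_candidates, seed, vectors, top_n=1)

for name, predicted in [
    ("thesaurus only", thesaurus_candidates),
    ("thesaurus + filtering", filtered),
    ("thesaurus + closest words", expanded),
]:
    p, r, f = precision_recall_f1(predicted, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

With a trained model, the hand-written `cosine` and the sorted neighbour list would typically be replaced by the similarity queries of the vector library in use; the set arithmetic for precision and recall stays the same.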

Keywords

lexical typology; RuThes; Word2vec; text classification; computational linguistics.

Edition

Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 227-240

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2026-38(2)-15

For citation

Polozov I.K., Volkova I.A. Application of the RuThes thesaurus and Word2vec vector representations in the lexical typology problem. Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 227-240. DOI: 10.15514/ISPRAS-2026-38(2)-15.

Full text of the paper in pdf (in Russian)
