Application of the RuThes thesaurus and Word2vec vector representations in the lexical typology problem
Abstract
The article describes the use of the RuThes thesaurus and Word2vec vector representations for determining the lexical typology of languages. The work is motivated by the need for typological studies of languages and by the lack of mature automation tools in this area. The article reviews existing methods for determining lexical typology, noting the advantages and disadvantages of each, and proposes an approach to automated typology extraction. The types of RuThes relations and the text corpora used are described. The semantic zones "push-pull" and "fix-spoil" are studied, and frames for these zones are obtained. The extracted words that realize the semantic zones are analyzed and compared with the results of a manual method. Three methods for extracting lexical typology are compared: using the thesaurus only, using the thesaurus with Word2vec filtering, and using the thesaurus with the addition of the closest Word2vec words. The methods are evaluated and compared with existing ones; recall, precision, and F-score are calculated for each method. For the "push-pull" semantic zone, the best results are achieved by combining the thesaurus with Word2vec filtering. Adding the closest Word2vec words degrades all metrics except the F-score for the "push" semantic zone. Using the thesaurus alone nevertheless yields good results that could be helpful to language researchers. For the "fix-spoil" semantic zone, the best results are achieved by using the thesaurus, filtering, and adding the closest Word2vec words. An explanation for the obtained results is offered. The software was implemented in Python 3, using the Gensim library to generate Word2vec vectors, Scikit-learn for vector comparison, NumPy for array manipulation, Pymorphy2 for lemmatization, NLTK for stopword filtering, and xml.etree for working with the thesaurus.
The practical significance of the work lies in developing an automated method to assist linguists and in evaluating its performance.
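The three extraction strategies and the evaluation metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper works with the RuThes XML data, Gensim, and Scikit-learn, whereas this example uses pure Python and invented toy data. The function names, the English placeholder words, the 2-D vectors, the similarity threshold of 0.5, and the `top_n` value are all assumptions made for illustration, and the toy scores do not reproduce the results reported in the article.

```python
import math

# Hypothetical toy data: a one-entry "thesaurus" and 2-D stand-ins for Word2vec vectors.
RELATIONS = {"push": ["shove", "press", "advertise"]}
VECTORS = {
    "push": (1.0, 0.0),
    "shove": (0.9, 0.1),
    "press": (0.8, 0.3),
    "advertise": (0.1, 1.0),
    "nudge": (0.95, 0.2),
    "table": (0.0, 1.0),
}
# Hypothetical manually compiled reference set for the semantic zone.
GOLD = {"push", "shove", "press", "nudge"}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def thesaurus_only(seed, relations):
    """Strategy 1: the seed word plus every word the thesaurus relates to it."""
    return {seed} | set(relations.get(seed, ()))


def filter_by_similarity(candidates, seed, vectors, threshold):
    """Strategy 2: keep candidates whose vector is close enough to the seed."""
    return {
        w for w in candidates
        if w in vectors and cosine(vectors[seed], vectors[w]) >= threshold
    }


def add_closest_words(base, seed, vectors, top_n):
    """Strategy 3: extend a word set with the top_n nearest neighbours of the seed."""
    others = [w for w in vectors if w not in base]
    others.sort(key=lambda w: cosine(vectors[seed], vectors[w]), reverse=True)
    return set(base) | set(others[:top_n])


def precision_recall_f1(predicted, gold):
    """Precision, recall, and F-score of a predicted word set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


method_1 = thesaurus_only("push", RELATIONS)
method_2 = filter_by_similarity(method_1, "push", VECTORS, threshold=0.5)
method_3 = add_closest_words(method_2, "push", VECTORS, top_n=1)

for name, words in [("thesaurus only", method_1),
                    ("thesaurus + filtering", method_2),
                    ("thesaurus + filtering + closest words", method_3)]:
    p, r, f = precision_recall_f1(words, GOLD)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In this sketch the expansion step is applied to the filtered set, which matches the combination the abstract reports for the "fix-spoil" zone; passing `method_1` to `add_closest_words` instead would give the unfiltered variant.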
Keywords
Edition
Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 227-240
ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).
DOI: 10.15514/ISPRAS-2026-38(2)-15
For citation
Full text of the paper in pdf (in Russian)