Computing semantic similarity of concepts using shortest paths in Wikipedia link graph.
A measure of semantic similarity between concepts characterizes the degree of relatedness between their senses. The Texterra system uses a Wikipedia-based Dice semantic similarity measure for word sense disambiguation. Since concepts in Texterra are Wikipedia articles, precise link-based semantic similarity measures are of particular interest. This work presents a global semantic similarity measure based on distances between concepts in the Wikipedia link graph. The graph distance is estimated as the shortest path length between a pair of nodes (Wikipedia articles). The proposed method differs from existing shortest-path measures in that it takes into account the differences between link types. A special data structure is used that allows the shortest paths to be computed efficiently with acceptable memory costs. Compared to the Dice measure, using shortest paths both increases the correlation between computed and expert similarity and achieves better results in the word sense disambiguation task. It is also demonstrated that regular and category links are the most relevant for semantic similarity estimation. This work shows that distances between articles in the Wikipedia link graph can provide an effective basis for computing semantic similarity between the corresponding concepts.
Full text of the paper in PDF (in Russian)
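The idea of a link-type-aware shortest-path measure can be illustrated with a minimal sketch. It is not the implementation from the paper: the abstract does not give the link weights, the distance-to-similarity transform, or the data structure used, so the `LINK_COST` values, the `1 / (1 + d)` transform, the directed adjacency-list graph, and all function names below are assumptions made for illustration only. Plain Dijkstra search stands in for the paper's efficient data structure, and a standard set-based Dice coefficient over link sets is included for comparison.

```python
import heapq

# Hypothetical per-link-type costs (assumption: the abstract gives no weights).
LINK_COST = {"regular": 1.0, "category": 1.0, "see_also": 2.0}


def shortest_path_length(graph, source, target, link_cost=LINK_COST):
    """Weighted shortest path length from source to target (Dijkstra).

    graph maps an article title to a list of (neighbor, link_type) pairs.
    Returns float('inf') when target is unreachable from source.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, link_type in graph.get(node, ()):
            nd = d + link_cost[link_type]
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return float("inf")


def path_similarity(graph, a, b):
    """Map a distance to a similarity in [0, 1] (assumed transform 1 / (1 + d))."""
    return 1.0 / (1.0 + shortest_path_length(graph, a, b))


def dice_similarity(links_a, links_b):
    """Dice coefficient of two link sets: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))


# Toy directed link graph (made-up data, not taken from Wikipedia).
graph = {
    "Cat": [("Mammal", "category"), ("Dog", "see_also")],
    "Dog": [("Mammal", "category")],
    "Mammal": [("Animal", "regular")],
}
```

On this toy graph, `shortest_path_length(graph, "Cat", "Animal")` follows the category link and then the regular link, while `"Cat"` reaches `"Dog"` only through the more expensive `see_also` link. Because the sketch treats links as directed, `"Dog"` cannot reach `"Cat"` and the similarity in that direction is zero; whether the paper uses directed or undirected links is not stated in the abstract.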
Machine Learning and Data Analysis. 2014. V. 1, № 8. Pp. 1107 - 1125.