Proceedings of ISP RAS

A category-driven approach to deriving domain specific subsets of Wikipedia.

Anton V. Korshunov, Denis Yu. Turdakov, Jinguk Jeong, Minho Lee, Changsung Moon.


While many researchers attempt to build various kinds of ontologies from Wikipedia, the possibility of deriving a high-quality domain-specific subset of Wikipedia using its own category structure remains undervalued. In this paper we demonstrate the need for such processing and propose an appropriate technique. As a result, the size of the knowledge base for our text processing framework has been reduced by more than an order of magnitude, while the precision of disambiguating musical metadata (ID3 tags) has increased from 64% to 98%.


Wikipedia; ontology; automated ontology building; category; taxonomy; semantic relatedness; natural language processing; Texterra


Proceedings of the Institute for System Programming, vol. 21, 2011, pp. 323-348.

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).
