Proceedings of ISP RAS


Topic modeling in natural language texts

Anton Korshunov, Andrey Gomzin

Abstract

Topic modeling is a method for building a model of a collection of text documents that identifies the topics of each document. Moving from the term space to the space of extracted topics helps resolve synonymy and polysemy of terms. It also enables more efficient topic-sensitive search, classification, summarization, and annotation of document collections and news feeds. The paper traces the evolution of topic modeling techniques. The earliest methods are based on clustering and rely on a similarity function defined on pairs of documents. The next generation of techniques is based on Latent Semantic Indexing (LSI), which analyzes word co-occurrences in documents. Currently, the most popular approaches are based on Bayesian networks: directed probabilistic graphical models that can incorporate various kinds of entities and metadata, such as document authorship and connections between words, topics, documents, and authors. The paper contains a comparative survey of different models, along with methods for parameter estimation and accuracy measurement. The following topic models are considered: Probabilistic Latent Semantic Indexing, Latent Dirichlet Allocation, nonparametric models, dynamic models, and semi-supervised models. The paper describes well-known quality evaluation metrics, namely perplexity and topic coherence. Freely available implementations are listed as well.

Keywords

topic modeling; topic-sensitive search; document classification; probabilistic graphical models; Bayesian networks; latent Dirichlet allocation; dimensionality reduction; text mining; information retrieval; machine learning

Edition

Proceedings of the Institute for System Programming, vol. 23, 2012, pp. 215-244.

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2012-23-13

Full text of the paper in PDF (in Russian)