Proceedings of ISP RAS

Detecting Content Spam on the Web through Text Diversity Analysis

Anton Pavlov, Boris Dobrov


Web spam is considered one of the greatest threats to modern search engines. Spammers use a variety of content generation techniques, collectively known as content spam, to fill search results with low-quality pages. We argue that content spam must be tackled with a broad set of content quality features. In this paper we propose a set of content diversity features based on frequency rank distributions for terms and topics. We combine them with a range of other content features to produce a content spam classifier that outperforms existing results.
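To make the idea of a term frequency rank distribution concrete, the following is a minimal sketch, not the feature set used in the paper. It assumes a simple lowercase alphanumeric tokenizer, and the function names (`rank_frequency`, `diversity_features`) and the three summary statistics are hypothetical choices made for illustration only.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return term counts sorted in descending order (rank 1 first).

    Tokenization is an assumption: lowercase runs of letters and digits.
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return sorted(Counter(terms).values(), reverse=True)


def diversity_features(text, top_k=3):
    """Summarize the rank distribution with a few illustrative statistics.

    These statistics are hypothetical examples of diversity features,
    not the ones defined in the paper.
    """
    freqs = rank_frequency(text)
    total = sum(freqs)
    if total == 0:
        return {"distinct_ratio": 0.0, "top_k_share": 0.0, "entropy": 0.0}
    probs = [f / total for f in freqs]
    return {
        # Share of distinct terms among all tokens.
        "distinct_ratio": len(freqs) / total,
        # Mass concentrated in the top_k most frequent terms.
        "top_k_share": sum(freqs[:top_k]) / total,
        # Shannon entropy (bits) of the term distribution.
        "entropy": -sum(p * math.log2(p) for p in probs),
    }


if __name__ == "__main__":
    print(diversity_features("buy cheap pills buy cheap buy", top_k=1))
```

Under these assumptions, a page that repeats a handful of terms would tend to show a high `top_k_share` and low `entropy`; values like these could then be passed to a classifier alongside other content features.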


search spam; feature analysis; topical diversity


Proceedings of the Institute for System Programming, vol. 21, 2011, pp. 277-296.

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).
