Proceedings of ISP RAS


Detecting Content Spam on the Web through Text Diversity Analysis

Anton Pavlov, Boris Dobrov

Abstract

Web spam is considered one of the greatest threats to modern search engines. Spammers use a variety of content generation techniques, collectively known as content spam, to fill search results with low-quality pages. We argue that content spam must be tackled using a broad set of content quality features. In this paper we propose a set of content diversity features based on frequency rank distributions of terms and topics. We combine them with a wide range of other content features to produce a content spam classifier that improves on existing results.
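
As a rough illustration of what a diversity feature derived from a term frequency rank distribution might look like, the Python sketch below ranks the terms of a text by frequency and computes a few summary statistics. The tokenization rule, the function names (rank_frequencies, diversity_features) and the three chosen statistics are assumptions made for this example only; the abstract does not specify the actual feature set used in the paper.

```python
import math
import re
from collections import Counter


def rank_frequencies(text):
    """Return term frequencies sorted from most to least frequent.

    Tokenization (lowercased runs of letters and digits) is an
    assumption for this sketch, not the paper's preprocessing.
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return sorted(Counter(terms).values(), reverse=True)


def diversity_features(text, top_k=2):
    """Compute illustrative (hypothetical) diversity features."""
    freqs = rank_frequencies(text)
    total = sum(freqs)
    if total == 0:
        # Empty text: no terms, so every feature is defined as zero.
        return {"top_k_share": 0.0, "entropy": 0.0, "type_token_ratio": 0.0}
    probs = [f / total for f in freqs]
    return {
        # Share of all tokens covered by the top_k most frequent terms.
        "top_k_share": sum(freqs[:top_k]) / total,
        # Shannon entropy (bits) of the term distribution.
        "entropy": -sum(p * math.log2(p) for p in probs),
        # Number of distinct terms divided by number of tokens.
        "type_token_ratio": len(freqs) / total,
    }


if __name__ == "__main__":
    repetitive = "cheap pills cheap pills cheap pills buy cheap pills"
    varied = "the quick brown fox jumps over a lazy dog"
    print(diversity_features(repetitive))
    print(diversity_features(varied))
```

In this sketch a highly repetitive text yields a large top_k_share and a low entropy, while a varied text yields the opposite; values like these could serve as inputs to a classifier alongside other content features.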

Keywords

search spam; feature analysis; topical diversity

Edition

Proceedings of the Institute for System Programming, vol. 21, 2011, pp. 277-296.

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).
