Detecting Content Spam on the Web through Text Diversity Analysis.
Web spam is considered one of the greatest threats to modern search engines. Spammers use a variety of content generation techniques, collectively known as content spam, to fill search results with low-quality pages. We argue that content spam must be tackled using a broad set of content quality features. In this paper, we propose a set of content diversity features based on frequency rank distributions for terms and topics. We combine them with a wide range of other content features to produce a content spam classifier that outperforms existing results.
Proceedings of the Institute for System Programming, vol. 21, 2011, pp. 277-296.
ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).