Detection of demographic attributes of microblog users.
Users of internet services often make errors or intentionally provide misleading information about their demographic attributes, including gender, age, marital status, education, and religious and political views. At the same time, knowing the values of these attributes makes it possible to improve recommender systems, internet marketing solutions, and other applications that rely on personalized results. This paper proposes a method for automatically detecting demographic attributes of Twitter users by analyzing their text messages and other data from their profiles. The method is based on a machine learning algorithm trained on binary vectors of token N-grams extracted from user posts. Its distinctive features are the fully automatic compilation of training and testing data sets and support for a broad, extensible range of languages and demographic attributes. Both are achieved by exploiting the Facebook accounts associated with Twitter user profiles. Additional steps detect the language of posts and filter out borrowed content. An experimental study showed high accuracy in detecting gender, age, and marital status for the most popular languages: English, Russian, German, French, Italian, and Spanish. For English, detection of education, religious views, and political views is also supported.
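The feature representation named in the abstract, binary vectors of token N-grams, can be illustrated with a minimal sketch. The abstract does not specify the tokenizer, the maximum N, or the learning algorithm, so the regular-expression tokenizer, the default `max_n=2`, and all function names below are assumptions made for illustration only, not the authors' implementation.

```python
import re
from typing import Dict, List, Set, Tuple

# Assumption: a simple word-character tokenizer; the paper's tokenizer is not specified.
TOKEN_RE = re.compile(r"\w+", re.UNICODE)

NGram = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    """Lowercase a post and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def token_ngrams(tokens: List[str], max_n: int = 2) -> Set[NGram]:
    """Collect all token N-grams of length 1..max_n from one post."""
    grams: Set[NGram] = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


def user_ngrams(posts: List[str], max_n: int = 2) -> Set[NGram]:
    """Union of the N-grams found in all posts of one user."""
    grams: Set[NGram] = set()
    for post in posts:
        grams |= token_ngrams(tokenize(post), max_n)
    return grams


def build_vocabulary(users: List[List[str]], max_n: int = 2) -> Dict[NGram, int]:
    """Map every N-gram seen in the training users to a vector index."""
    all_grams: Set[NGram] = set()
    for posts in users:
        all_grams |= user_ngrams(posts, max_n)
    return {gram: index for index, gram in enumerate(sorted(all_grams))}


def binary_vector(posts: List[str], vocab: Dict[NGram, int], max_n: int = 2) -> List[int]:
    """Encode a user as 0/1 flags: 1 if the N-gram occurs in any of the posts."""
    vector = [0] * len(vocab)
    for gram in user_ngrams(posts, max_n):
        index = vocab.get(gram)
        if index is not None:  # N-grams unseen during training are ignored
            vector[index] = 1
    return vector


# Hypothetical toy data: each inner list holds the posts of one user.
train_users = [["I love football", "love football"], ["my husband cooks"]]
vocab = build_vocabulary(train_users)
features = binary_vector(["love football forever"], vocab)
```

Each vector records only whether an N-gram is present, not how often it occurs, which is what makes it binary. Such vectors could then be passed to any standard classifier, one per demographic attribute.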
Proceedings of the Institute for System Programming, vol. 25, 2013, pp. 179-194.
ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).
DOI: 10.15514/ISPRAS-2013-25-10
Full text of the paper (PDF) is in Russian.