Proceedings of ISP RAS


Methods for Construction of Socio-Demographic Profile of Internet Users

A. Gomzin (ISP RAS, Moscow; MSU, Moscow), S. Kuznetsov (ISP RAS, Moscow; MSU, Moscow; MIPT, Moscow)

Abstract

The paper is devoted to methods for constructing socio-demographic profiles of Internet users. Examples of demographic attributes include gender, age, political and religious views, region, and relationship status. This work surveys methods that detect demographic attributes from a user's profile and messages. Most of the surveyed works are devoted to gender detection, although age, political views, and region have also attracted researchers' interest.
The most popular data sources for extracting demographic attributes are social networks such as Facebook, Twitter, and YouTube.
Most solutions are based on supervised machine learning, which learns how the target values (demographic attributes) depend on the input data and uses these dependencies to predict the target attribute for new data. The paper surveys the following steps of the solution: feature extraction, feature selection, model training, and evaluation.
Researchers use different kinds of data to predict demographic attributes. The most popular data source is text. Word sequences (n-grams), parts of speech, emoticons, and resource-specific features (e.g., @mentions and #hashtags on Twitter) are extracted and used as input to machine learning algorithms. Social graphs are also used as source data: communities of users that are automatically extracted from the social graph are used as features for attribute prediction.
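As an illustration of the text features listed above, the following minimal Python sketch counts word n-grams, emoticons, @mentions and #hashtags in a single message. The function name `extract_features`, the feature-name prefixes and the regular expressions are assumptions made for this example and are not taken from the surveyed works. Part-of-speech tags and social-graph communities are left out.

```python
import re
from collections import Counter

# Hypothetical patterns; real systems use richer tokenizers and emoticon lists.
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
EMOTICON_RE = re.compile(r"[:;=]-?[)(DPp]")
WORD_RE = re.compile(r"[a-z']+")


def extract_features(message, n=2):
    """Return a Counter of simple text features for one message."""
    features = Counter()
    features["num_mentions"] = len(MENTION_RE.findall(message))
    features["num_hashtags"] = len(HASHTAG_RE.findall(message))
    for emoticon in EMOTICON_RE.findall(message):
        features["emoticon=" + emoticon] += 1

    # Remove mentions, hashtags and emoticons before extracting word n-grams.
    text = MENTION_RE.sub(" ", message)
    text = HASHTAG_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    words = WORD_RE.findall(text.lower())

    # Count all word n-grams of length 1..n.
    for size in range(1, n + 1):
        for i in range(len(words) - size + 1):
            features["ngram=" + " ".join(words[i:i + size])] += 1
    return features


example = extract_features("@anna loved the match :) #football")
```

Here `example` holds one mention, one hashtag, one emoticon, three unigrams and two bigrams. Such sparse counts are the kind of input that a learning algorithm would consume.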
Text data produces a large number of features, so feature selection algorithms are needed to reduce the feature space.
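One common way to reduce a text feature space is to rank features by a chi-square statistic and keep the top k. The sketch below assumes binary (present/absent) features and a two-class problem. The choice of chi-square, the function names and the toy data are illustrative assumptions, not a method attributed to any particular surveyed work.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 feature/class contingency table."""
    total = n11 + n10 + n01 + n00
    numerator = total * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return numerator / denominator if denominator else 0.0


def select_features(documents, labels, positive_label, k):
    """Keep the k features most associated with the class labels.

    Each document is a set of feature names (hypothetical representation).
    """
    vocabulary = set().union(*documents)
    scores = {}
    for feature in vocabulary:
        n11 = n10 = n01 = n00 = 0
        for doc, label in zip(documents, labels):
            present = feature in doc
            positive = label == positive_label
            if present and positive:
                n11 += 1
            elif present:
                n10 += 1
            elif positive:
                n01 += 1
            else:
                n00 += 1
        scores[feature] = chi_square(n11, n10, n01, n00)
    # Sort by descending score; ties are broken alphabetically.
    return sorted(vocabulary, key=lambda f: (-scores[f], f))[:k]


# Toy data: "the" occurs in every document and carries no class information.
docs = [{"lol", "the"}, {"lol", "the"}, {"game", "the"}, {"game", "the"}]
labels = ["under_30", "under_30", "over_30", "over_30"]
selected = select_features(docs, labels, "under_30", k=2)
```

In this toy case the uninformative feature "the" scores zero and is dropped, while "game" and "lol" are kept.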
The paper surveys feature selection, classification, and regression algorithms, as well as evaluation metrics.
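The training and evaluation steps can be sketched with a small multinomial Naive Bayes classifier and an F1 score. Naive Bayes and F1 are chosen here only as familiar examples of a classifier and a metric. The abstract does not say which algorithms or metrics the paper covers, and all names and the toy data below are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Collect class and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in zip(documents, labels):
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, words):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word in vocabulary:  # words unseen in training are ignored
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def f1_score(true_labels, predicted_labels, positive_label):
    """Harmonic mean of precision and recall for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy age-group data, invented for this example.
train_docs = [["exam", "lecture"], ["lecture", "party"],
              ["pension", "garden"], ["garden", "news"]]
train_labels = ["under_30", "under_30", "over_30", "over_30"]
model = train_naive_bayes(train_docs, train_labels)

test_docs = [["exam", "party"], ["garden", "pension"], ["lecture"]]
test_labels = ["under_30", "over_30", "under_30"]
predictions = [predict(model, doc) for doc in test_docs]
score = f1_score(test_labels, predictions, "under_30")
```

In practice the evaluation would use held-out or cross-validated data that is far larger than this toy set.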

Keywords

demographic attributes; social networks; text processing; machine learning

Edition

Proceedings of the Institute for System Programming, vol. 27, issue 4, 2015, pp. 129-144.

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2015-27(4)-7

Full text of the paper in PDF (in Russian)