Proceedings of ISP RAS


Applying Time Series to the Task of Background User Identification Based on Their Text Data Analysis

D.V. Tsarev (MSU, Moscow), M.I. Petrovskiy (MSU, Moscow), I.V. Mashechkin (MSU, Moscow), A.Y. Korchagin (MSU, Moscow), V.Y. Korolev (MSU, Moscow)

Abstract

This paper presents a novel approach to user identification based on behavioral analysis of how users work with text information. User behavior is described by the content of the text documents the user works with. This behavioral information is given a structured representation by mapping the text content of documents into the user's topic space, which is built by nonnegative matrix factorization. The topic weights of a document characterize the user's topical focus while working with that document. The variation of the topic weights over time forms a multidimensional time series that describes the history of the user's work with text data. Forecasting this time series makes it possible to identify the user by estimating how far the observed topic weights deviate from the predicted values. The paper also presents a new time series forecasting method based on orthogonal nonnegative matrix factorization (ONMF), which is used within the proposed user identification approach. It is worth noting that nonnegative matrix factorization methods have not previously been used for time series forecasting. The proposed user identification approach has been experimentally verified on real corporate email correspondence drawn from the Enron dataset. In addition, experiments with other currently popular forecasting methods have shown that the proposed forecasting method gives better quality in classifying users' topic weights. We also investigated two approaches to estimating the deviation of a time series point from its predicted value: absolute deviation and p-value estimation. The experiments have shown that both approaches are applicable within the proposed user identification approach.
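The pipeline described in the abstract (topic weights from matrix factorization, a forecast of those weights, and a deviation score) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the data are synthetic, the factorization is plain multiplicative-update NMF rather than the authors' ONMF, the forecaster is a simple moving average standing in for the proposed forecasting method, the p-value is an empirical one computed from past deviations, and all function names are hypothetical. For brevity the newest document is factorized together with the history instead of being projected onto an existing topic space.

```python
import numpy as np


def nmf(V, k, iters=200, seed=0):
    """Basic multiplicative-update NMF (not the paper's ONMF): V ~ W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 1e-3   # terms x topics
    H = rng.random((k, V.shape[1])) + 1e-3   # topics x documents
    eps = 1e-9                               # avoids division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def moving_average_forecast(series, window=3):
    """Stand-in forecaster: column t is the mean of the previous `window` columns."""
    series = np.asarray(series, dtype=float)
    preds = np.full(series.shape, np.nan)    # first `window` columns stay NaN
    for t in range(window, series.shape[1]):
        preds[:, t] = series[:, t - window:t].mean(axis=1)
    return preds


def absolute_deviation(observed, predicted):
    """Sum over topics of |observed - predicted|, one score per time step."""
    return np.abs(np.asarray(observed) - np.asarray(predicted)).sum(axis=0)


def empirical_p_value(score, reference_scores):
    """Share of reference deviations at least as large as `score` (+1 smoothing)."""
    ref = np.asarray(reference_scores, dtype=float)
    return (1 + np.sum(ref >= score)) / (1 + ref.size)


# Synthetic data: 30 terms, 40 chronologically ordered documents, 4 topics.
rng = np.random.default_rng(42)
V = rng.random((30, 40))
W, H = nmf(V, k=4)                           # H[:, t] = topic weights of document t

window = 3
preds = moving_average_forecast(H, window)
# Deviations of earlier documents serve as the reference distribution.
history = absolute_deviation(H[:, window:-1], preds[:, window:-1])
# Deviation of the newest document from its predicted topic weights.
latest = absolute_deviation(H[:, -1:], preds[:, -1:])[0]
p = empirical_p_value(latest, history)
print(f"absolute deviation = {latest:.3f}, empirical p-value = {p:.3f}")
```

In such a setup, a large absolute deviation or a small p-value would suggest that the newest document does not fit the forecast topic trend of the account owner. Any decision threshold would have to be chosen experimentally.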

Keywords

computer security; user identification; topic modeling; orthogonal nonnegative matrix factorization; time series forecasting

Edition

Proceedings of the Institute for System Programming, vol. 27, issue 1, 2015, pp. 151-172.

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2015-27(1)-8

Full text of the paper: PDF (in Russian).