Deep Web Users Deanonymization System

Privacy enhancing technologies (PETs) are ubiquitous nowadays. They benefit a wide range of users, including businesses, journalists and bloggers. However, PETs are not always used for legal activity. There are many anonymous networks and technologies that grant anonymous access to digital resources. The most popular anonymous network today is Tor, which is also a valuable tool for hackers and for drug and gun dealers. The present paper focuses on the deanonymization of Tor users using out-of-the-box technologies and a basic machine learning algorithm. The aim of the work is to show that it is possible to deanonymize a small fraction of users without extensive resources or state-of-the-art machine learning techniques. The first stage of the research was an investigation of contemporary anonymous networks. The second stage was an investigation of deanonymization techniques: traffic analysis, timing attacks and attacks at the level of autonomous systems. For our system we chose the website fingerprinting attack, because it requires the fewest resources to implement successfully. Finally, we conducted an experiment with 5 people in one room and one corrupted Tor entry relay. We achieved reasonably good accuracy (70%) in classifying the webpage that the user visits, using the set of resources provided by a global cybersecurity company. Deanonymization is a very important task from the point of view of national security.


Introduction
Internet privacy is considered an integral part of freedom of speech. Many people are concerned about their anonymity in public, and therefore there is a growing need for privacy enhancing technologies.
The Deep Web is a layer of the Internet that cannot be accessed by traditional search engines, so its content is not indexed. A typical deep web website is static, with potentially no links to outside resources. For that reason, it is very hard to measure the real size of the deep web. Today there are many networks and technologies that grant access to deep web resources, for example Tor, I2P and Freenet. Each of these tools hides users' traffic from adversaries, which makes deanonymization difficult. A detailed overview of such technologies can be found in [1]. Currently, the largest and most widely used system is Tor [2]. Our research focuses on the deanonymization of Tor users because of Tor's popularity and prevalence.

Tor background
Tor is the largest active anonymous network in the world. It has more than two million users per month, and the number of relays is close to 7000 [3]. Tor is a distributed overlay network consisting of volunteer servers. Any user in the world can provide Tor with the computational resources needed to relay traffic through the network. Despite being a great privacy enhancing technology for law-abiding citizens, Tor is also an essential tool for criminals. Terrorists, drug and arms dealers and other offenders use Tor for their criminal activities. Thus, solving the deanonymization problem is very important for government special services [4]. For example, the Russian Ministry of Internal Affairs (MIA) has recently announced a tender for a Tor deanonymization system [5].
Another key component of Tor is Hidden Services (HS). Tor HS provides users with anonymous servers to host their websites or other applications. HS are accessed via the special pseudo-domain ".onion", where the Deep Web is located. From the user's point of view, accessing a particular hidden service is as easy as visiting a normal website.
In order to establish a connection with the Tor network, the user must have the Tor client software installed. The easiest way is to install Tor Browser, a customized version of Mozilla Firefox with built-in Tor software. To initiate the connection, a Tor client obtains a list of Tor nodes from a directory server. Then the client builds a circuit of encrypted connections through relays in the network. The circuit is extended hop by hop, and each relay on the path knows only which relay it receives data from and which relay it passes data to. No single relay in the circuit (see Fig. 1) knows the user's complete path through the network. Layered encryption is used along the path.
The most interesting relays for a potential attacker are the entry and exit relays. Every piece of information in the network is transferred in Tor cells of equal size. The entry relay (also called the guard) knows the IP address of the user, and the exit relay knows the destination resource. Intercepting traffic at a middle relay gives the attacker no advantage, because a middle relay sees only encrypted traffic and knows neither the user nor the destination.

Deanonymization techniques
There is a wide range of deanonymization methods (attacks). Some of them are passive: the adversary only observes traffic and makes no attempt to modify it. Others are active: the attacker modifies traffic by causing delays, inserting patterns, etc. Earlier, we proposed a classification of attacks based on the amount of resources an attacker needs to perform the deanonymization (see Table 1). More information about the attacks mentioned in Table 1 can be found in [6]. We focus on the resource-efficient website fingerprinting (WF) attack, which only requires the attacker to control the user's entry relay. A relay that is fully controlled by an attacker is called a corrupted relay.

Website Fingerprinting Attack Overview
A website fingerprinting (WF) attack allows a local passive eavesdropper to determine which destination the client is communicating with, using features extracted from packet sequences. Generally speaking, WF breaks the privacy that a proxy, VPN or Tor is meant to provide. It is an application of machine learning techniques in the field of privacy. WF was first discussed in [7]. The attack has been widely discussed in the research community because it has proven effective against various privacy enhancing technologies, such as Tor, SSL and VPN.

Fig. 2. Configuration of Tor circuit suitable for the WF attack
To perform a WF attack, an eavesdropper has to simulate the user's behavior in the network under the same conditions as the victim. In the case of Tor, the attacker must have a corrupted entry relay (see Fig. 2) that is used for collecting data. The attacker visits each site from the list and stores all packet sequences related to the request. Afterwards, the attacker uses this traffic to train a classifier in a supervised way. The machine learning problem can be stated as a binary classification problem or as a multiclass classification problem. In the first case, the classifier is trained to answer the question "Does the user visit a site from our list?". In the second case, the classifier guesses which particular website the user visits.

The Oracle Problem
Since WF works with packet sequences, determining which sequences belong to a given webpage is quite a difficult task. This issue is known as the Oracle problem. Researchers make two major assumptions that simplify WF considerably: 1) the attacker has such an oracle at his disposal, and 2) the victim loads pages one by one in a single tab. The oracle helps to find the precise subsequence of packets within the overall captured traffic. Any excess packets sent to the classifier can significantly reduce its accuracy, which is why splitting the whole sequence correctly is crucially important. Another difficulty is the user's web-browsing behavior. The majority of people use multi-tab browsing instead of loading a page in a single tab, working with it and then loading another one. This behavior makes WF difficult in real life.
The Oracle problem for packet sequences has not been solved yet, but Wang proposed a solution for Tor that works with a single tab [8]. He considered a three-step process for determining the correct split between two pages in the case of single-tab browsing. Wang used Tor cells instead of packets. The first step is a time-based split: if the time gap between two adjacent cells is greater than some constant, the sequence is split there into two subsequences. If the time gap is too small for this, classification-based splitting is used: machine learning techniques decide whether to split and where. After splitting, the result is ready for further classification. This method achieves quite good accuracy. However, the proposed solution does not work with multi-tab browsing or with raw packet sequences, which narrows the range of real implementations. Study [9] proposed a time-based way to split traffic traces when the user has 2 open tabs; it classifies the first page with 75.9% accuracy and the second with 40.5% accuracy.
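The time-based step can be illustrated with a minimal Python sketch. It is not Wang's implementation: the function name split_by_time_gap, the representation of a cell as a (timestamp, direction) pair and the threshold value used in the usage note are assumptions made here for illustration only.

```python
from typing import List, Optional, Tuple

# Assumed representation: (timestamp in seconds, direction),
# where direction is -1 for outgoing and +1 for incoming cells.
Cell = Tuple[float, int]


def split_by_time_gap(cells: List[Cell], max_gap: float) -> List[List[Cell]]:
    """Split a time-ordered cell sequence wherever the pause between
    two adjacent cells is greater than max_gap seconds."""
    segments: List[List[Cell]] = []
    current: List[Cell] = []
    previous_time: Optional[float] = None
    for timestamp, direction in cells:
        # A long pause closes the current segment and starts a new one.
        if previous_time is not None and timestamp - previous_time > max_gap:
            segments.append(current)
            current = []
        current.append((timestamp, direction))
        previous_time = timestamp
    if current:
        segments.append(current)
    return segments
```

For example, calling split_by_time_gap(trace, 5.0) on a trace in which the user paused for more than five seconds between two page loads returns two segments, one per page. The sketch covers only the time-based step; the classification-based step is not shown.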

Real World Scenario
Overall, the applicability of WF in a real-world scenario is still questionable. Users may visit hundreds of thousands of webpages every day. So, can the attacker successfully apply WF in reality? Panchenko et al. [10] evaluated the attack on a very large dataset, and their approach outperformed the previous state-of-the-art attack proposed by Wang. To conclude, WF attacks remain a serious threat to anonymous communication systems.
The aim of the current work is to show that an attacker can build a deanonymization system using machine learning libraries for the most popular programming languages, and that such a system is able to deanonymize a group of users trying to access deep web content.

Deanonymization system scheme
For the sake of simplicity, we use as much preconfigured software as possible.
To deal with the deanonymization problem, our system must have two modules. The first module mines Tor data, that is, it collects traffic traces. The second module applies machine learning techniques to these traces.

Data Mining
The data mining module uses various software that can be easily installed on Mac OS or any Linux distribution. Since packet traces can be collected either on the relay side or on the client side (the only difference is the source/destination pair), the data mining module can run on a local machine or on a remote server. We use a local machine for data mining (see Fig. 3). A simple data transformation can be applied to the packet traces to make them look exactly like those collected on the relay. The following software must be installed on the machine:
- Tor: free software for enabling anonymous communication,
- Torsocks: free software that allows any kind of application to be used via the Tor network,
- Wget: a program that retrieves content from a web server and supports downloading via http, https and ftp,
- Tshark: a free and open packet analyzer used for network troubleshooting, analysis, etc.,
- Mozilla Firefox or Tor Browser: an open web browser (Mozilla Firefox has to be configured manually to use Tor).
Any of these programs can be replaced by a corresponding library, but the simplest solution is to use the proposed software.
We must have full control over Tor circuit construction in order to use our own relay. For this purpose we use Stem, a freely available Python controller library for Tor. We use Stem to create Tor circuits through our corrupted entry guard. Without this step, the accuracy of the classifier might become worse because of different Tor versions on the relays and for other reasons. Another option is to modify the Tor configuration so that it uses specified entry guards. It is very important to use the same entry guard that will be used in production.
Tshark is used as the main packet capturing tool. We also use Tshark for extracting TLS records from the data. Tshark can be substituted with any library that supports capturing TCP packets.
After that, the attacker has to automate the data gathering process. There are two ways to do it: using wget via torsocks, or using Mozilla Firefox. In the case of wget, the attacker simply launches the page download from the command line, whereas Mozilla Firefox requires more work. Firefox can be automated in two ways: the first is to launch it from the command line and wait while the page is loading; the other is to use Selenium WebDriver to automate the process.
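The configuration-based alternative to Stem mentioned above can be sketched as follows. This is an illustration under assumptions, not the configuration used in our experiment: the helper name entry_guard_torrc is hypothetical, and the fingerprint passed to it must be that of the relay under the experimenter's control. EntryNodes, StrictNodes and UseEntryGuards are options described in the Tor manual, whose exact semantics may vary between Tor versions.

```python
from typing import List


def entry_guard_torrc(fingerprints: List[str]) -> str:
    """Render torrc lines that restrict the first hop of every circuit
    to the given relays (identified by their hex fingerprints)."""
    if not fingerprints:
        raise ValueError("at least one relay fingerprint is required")
    # Tor accepts fingerprints written with a leading '$'.
    nodes = ",".join("$" + fp.lstrip("$").upper() for fp in fingerprints)
    lines = [
        "UseEntryGuards 1",     # keep using a fixed set of entry relays
        "EntryNodes " + nodes,  # relays allowed as the first hop
        "StrictNodes 1",        # treat the restriction as mandatory
    ]
    return "\n".join(lines) + "\n"
```

The returned text can be appended to the torrc file of the client that generates the training traffic, so that training traces pass through the same entry relay as the traffic to be classified.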
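The wget-based variant of data gathering can be sketched in Python as shown here. The function names, the choice of capture filter and the two-second start-up delay are assumptions made for illustration; the sketch expects tshark, torsocks and wget to be installed and a local Tor client to be running.

```python
import subprocess
import time
from typing import List


def tshark_capture_command(interface: str, pcap_path: str) -> List[str]:
    # -i selects the capture interface, -f sets a capture filter (TCP only),
    # -w writes the raw packets to a pcap file.
    return ["tshark", "-i", interface, "-f", "tcp", "-w", pcap_path]


def wget_via_tor_command(url: str) -> List[str]:
    # torsocks routes wget through the local Tor client;
    # the downloaded page itself is discarded.
    return ["torsocks", "wget", "--quiet", "--output-document=/dev/null", url]


def capture_page_load(url: str, interface: str, pcap_path: str,
                      settle_seconds: float = 2.0) -> None:
    """Record one traffic instance: start tshark, load the page, stop tshark."""
    capture = subprocess.Popen(tshark_capture_command(interface, pcap_path))
    try:
        time.sleep(settle_seconds)  # give tshark time to start capturing
        subprocess.run(wget_via_tor_command(url), check=False, timeout=120)
    finally:
        capture.terminate()
        capture.wait()
```

Calling capture_page_load once per page and per repetition, each time with a different pcap_path, yields one capture file per traffic instance. Note that wget fetches only the requested document, so the resulting traces may differ from those produced by a full browser.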

Feature Extraction
We can extract features of the traffic at three different levels (see Fig. 4): Tor cells, TLS and TCP. At the application level, Tor relays data in fixed-size packets called cells. All cells have an equal length of 512 bytes and travel through the network in TLS records. It is noteworthy that several cells can be packed into a single TLS record. The last level is the transport level: a TLS record is fragmented into several TCP packets, whose size is limited by the MTU. Furthermore, several TLS records can be packed into a single TCP packet. It remains an open question which level is the most informative from the perspective of the website fingerprinting attack. The majority of researchers assume that the most informative level is the cell level.

Fig. 4. Information extraction levels
First, the cell traces are extracted in the following way. The attacker extracts TLS records from the TCP packets; this can be done with the tshark software.
In the tshark invocation, file_name should be replaced by the .pcap file with the TCP packets, whereas ouput_file is the desired output file with a textual representation of the TLS records. A simple regular expression can then be used to extract the record lengths. Once a length has been extracted, the attacker multiplies it by -1 if the record is outgoing. The resulting array of TLS record lengths is then transformed into Tor cells: the attacker divides each number by 512 and appends to the cells vector as many -1's or 1's as the integer part of the quotient. For example, if the length of an incoming TLS record is 2048, the resulting cells vector is [1,1,1,1]. After the cell traces have been extracted, the data is represented in the form [-1,1,1,1,-1,…]. Such arrays are used as features, and the corresponding webpages are used as labels. However, the arrays have different lengths, while the majority of machine learning algorithms require input vectors of equal length. To keep the process simple, we append zeros to the end of the shorter vectors, which equalizes the length of all cell vectors.
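The conversion from TLS record lengths to padded cell vectors can be sketched in Python as follows. The function names and the representation of a record as a (length, is_outgoing) pair are assumptions made for illustration; the extraction of the lengths from the tshark output is not shown.

```python
from typing import List, Sequence, Tuple

CELL_SIZE = 512  # bytes per Tor cell, as stated in the text


def records_to_cells(records: Sequence[Tuple[int, bool]]) -> List[int]:
    """Turn (tls_record_length, is_outgoing) pairs into a cell vector:
    -1 for every outgoing cell and +1 for every incoming cell."""
    cells: List[int] = []
    for length, is_outgoing in records:
        direction = -1 if is_outgoing else 1
        # Integer division gives the number of whole cells in the record.
        cells.extend([direction] * (length // CELL_SIZE))
    return cells


def pad_traces(traces: Sequence[Sequence[int]]) -> List[List[int]]:
    """Append zeros so that all cell vectors have the length of the longest."""
    width = max((len(trace) for trace in traces), default=0)
    return [list(trace) + [0] * (width - len(trace)) for trace in traces]
```

With these definitions, records_to_cells([(2048, False)]) gives [1, 1, 1, 1], matching the example above, and pad_traces([[1, -1], [1]]) gives [[1, -1], [1, 0]]. New traffic samples have to be padded or truncated to the same width as the training vectors before classification.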

Machine Learning Module
For machine learning purposes we use scikit-learn (sklearn), one of the most popular Python libraries for machine learning. This module works in a straightforward way: the attacker trains a model on the collected cell traces and then uses the trained model to classify new traffic samples.

Experimental setup
We have implemented this scheme using the Java and Python programming languages (Fig. 5). The aim of our experiment is to show that we can deanonymize a small fraction of users in the real world even without cutting-edge deanonymization techniques.

Experimental Environment
Consider the following situation: a group of terrorists is trying to gain access to illegal content from a small room in a dormitory. The list of resources was provided by the Group-IB cybersecurity company. In our experiment, three users played the role of terrorists. Each of them visited the resources from the list according to the following rules: only single-tab browsing is used, and at least 5 seconds are spent reading each webpage. According to [10], the described situation is fairly realistic. These rules allow us to simplify the process of splitting packet sequences and extracting traces.

Data Gathering
Before trying to deanonymize users, we performed a preparation step and collected 80 traffic instances from our list of resources. Such a low number of traffic instances is sufficient, because larger datasets do not noticeably affect the accuracy of the classifier for the same number of websites. We studied 7 resources related to drugs, weapons and extremism. Our users repeated the process of loading and reading a webpage 5 times for each webpage from the list. After that, we downloaded the collected packet sequences and preprocessed the data. We used time-based splitting as proposed by Wang [11]. After this step, our data was ready for classification.

Machine Learning Model
Support vector machines (SVMs) are supervised learning models, with associated learning algorithms, that analyze data for classification and regression.

Evaluation metrics
- True positives (tp): equivalent to a hit.
- False positives (fp): type I error, equivalent to a false alarm.
- False negatives (fn): type II error, equivalent to a miss.
- Precision: the ratio tp / (tp + fp); intuitively, the ability of the classifier to avoid labelling a negative sample as positive.
- Recall: the ratio tp / (tp + fn); intuitively, the ability of the classifier to find all the positive samples (the best value is 1, the worst is 0).
- F1-score: the harmonic mean of precision and recall, 2 * (precision * recall) / (precision + recall) (the best value is 1, the worst is 0).
- Score: the subset accuracy returned in a multilabel classification. If the entire set of predicted labels for a sample strictly matches the true set of labels, then the subset accuracy is 1.0, otherwise it is 0.0.
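The formulas above can be checked with a small Python sketch. It is a plain re-implementation for illustration, not the built-in sklearn evaluation function used in the experiment; the function names and the toy labels in the usage note are assumptions.

```python
from typing import Hashable, Sequence, Tuple


def precision_recall_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                        positive: Hashable) -> Tuple[float, float, float]:
    """Precision, recall and F1-score for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Ratios with an empty denominator are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    if not y_true:
        return 0.0
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

For the toy labels y_true = ["a", "a", "b", "b", "c"] and y_pred = ["a", "b", "b", "b", "c"], class "b" has tp = 2, fp = 1 and fn = 0, so its precision is 2/3, its recall is 1 and its F1-score is 0.8; the overall accuracy is 4/5 = 0.8.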

Experimental results
We evaluated the classifier using a built-in sklearn function. For ethical reasons documented in the Tor research ethics guidelines [12], we have anonymized the websites used in the experiment.
Our simple model achieved the results presented in Table 2. The total score of the classifier is 0.714. These results are not outstanding in comparison with state-of-the-art techniques, but they show that we can deanonymize users with a relatively simple program and achieve sufficient accuracy.

Conclusion
We have shown that an attacker can apply website fingerprinting without cutting-edge machine learning techniques. An attacker with enough experience and technical competence is able to build such a system and use it for deanonymization. Moreover, the proposed solution works better if the attacker sniffs Wi-Fi or another local network, because it is then very easy to find Tor-related traffic and collect traces. In this case, the deanonymization is targeted and easy to implement.