Proceedings of ISP RAS

Scalable code clone detection tool based on semantic analysis

Sevak Sargsyan (ISP RAS, Moscow), Shamil Kurmnagaleev (ISP RAS, Moscow), Andrey Belevantsev (ISP RAS, Moscow), Hayk Aslanyan (ISP RAS, Moscow), Artiom Baloian (ISP RAS, Moscow)


This article describes methods for code clone detection. Based on an analysis of existing methods, a new approach to code clone detection is proposed for the C/C++ languages. The method is based on semantic analysis of the project, which allows code clones to be detected with high accuracy. It is implemented as part of the LLVM compiler, which allows it to outperform existing methods. The tool consists of three basic parts. The first part is Program Dependence Graph (PDG) generation and serialization. PDGs are constructed at compile time from LLVM's intermediate representation. Several simple optimizations are applied to these graphs, and they are then serialized to a file. The second part is the analysis of the stored PDGs. The PDGs are loaded from files and split into subgraphs, and every subgraph is considered a clone candidate. A new splitting method is proposed that increases the number of detected clones. Two types of algorithms are used for clone detection. Algorithms of the first type try to prove that a pair of PDGs cannot be clones. These algorithms have linear complexity, which allows a huge number of PDG pairs to be processed. If they fail to rule a pair out, graph isomorphism algorithms are applied to detect similar subgraphs. The last part is an integrated system for automatic testing of the algorithms' accuracy. For a given project, a set of clones is generated automatically, and the clone detection algorithms are then applied to the original source and the generated one.
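
The two-stage detection scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the tool's implementation: the `PDG` class, the choice of filter (node count, edge count and label histogram) and the brute-force isomorphism check are all assumptions made for illustration, and the sketch tests exact label-preserving isomorphism rather than the similar-subgraph matching the paper refers to.

```python
from collections import Counter
from itertools import permutations


class PDG:
    """Hypothetical toy PDG: labelled nodes plus directed dependence edges."""

    def __init__(self, labels, edges):
        self.labels = list(labels)   # labels[i] is an opcode-like label of node i
        self.edges = set(edges)      # set of (source, destination) node indices


def cannot_be_clones(a, b):
    """Cheap filter, linear in graph size (assumed, not taken from the paper).

    Returns True when the pair is provably not an exact match.
    """
    if len(a.labels) != len(b.labels) or len(a.edges) != len(b.edges):
        return True
    return Counter(a.labels) != Counter(b.labels)


def isomorphic(a, b):
    """Exhaustive label-preserving isomorphism check.

    Exponential in the number of nodes, so usable only on tiny graphs.
    """
    n = len(a.labels)
    for perm in permutations(range(n)):
        labels_match = all(a.labels[i] == b.labels[perm[i]] for i in range(n))
        if labels_match and {(perm[s], perm[d]) for s, d in a.edges} == b.edges:
            return True
    return False


def are_clones(a, b):
    """Run the cheap filter first; fall back to the expensive check."""
    if cannot_be_clones(a, b):
        return False
    return isomorphic(a, b)


original = PDG(["load", "add", "store"], [(0, 1), (1, 2)])
reordered = PDG(["store", "load", "add"], [(1, 2), (2, 0)])
other_op = PDG(["load", "mul", "store"], [(0, 1), (1, 2)])
other_shape = PDG(["load", "add", "store"], [(0, 1), (0, 2)])
```

In this sketch `other_op` is rejected by the cheap filter alone, while `other_shape` passes the filter and is rejected only by the isomorphism check, which is why the inexpensive stage is worth running on every pair first.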


semantic analysis; code clones; PDG; LLVM


Proceedings of the Institute for System Programming, vol. 27, issue 1, 2015, pp. 39-50

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2015-27(1)-3

Full text of the paper in pdf (in Russian)