Discovering Near Duplicate Text in Software Documentation

Development of software documentation often involves copy-pasting, which produces a lot of duplicate text. Such duplicates make documentation maintenance difficult and expensive, especially when the software and its documentation have a long life cycle. The situation is further complicated by duplicate information frequently being near duplicate, i.e., the same information may be presented many times with different levels of detail, in various contexts, etc. There are a number of approaches to dealing with duplicates in software documentation. However, most of them use software clone detection techniques, which makes efficient near duplicate detection difficult: source code algorithms ignore document structure and produce a lot of false positives. In this paper, we present an algorithm for detecting near duplicates in software documentation using a natural language processing technique called the N-gram model. The algorithm has a considerable limitation: it only detects single sentences as near duplicates. However, it is very simple and can easily be improved in the future. It is implemented using the Natural Language Toolkit (NLTK). Evaluation results are presented for five real-life documents from various industrial projects. Manual analysis shows 39% false positives among the automatically detected duplicates. The algorithm demonstrates reasonable performance: documents of 0.8–3 MB are processed in 5–22 min.


Introduction
Software projects produce a lot of textual information, and analysis of this data is a truly significant task in practice [1]. One particular problem in this context is duplicate management in software documentation. As documentation is developed, many copy-pasted text fragments appear in it, and they are often not tracked properly. According to the classification in [2], there are different kinds of software documents. For some of them duplicate text is undesirable, while others should contain duplicate text. In any case, duplicates increase documentation complexity and maintenance costs. The situation is further complicated by duplicate information frequently being "near duplicate", i.e., the same information may be presented many times with different levels of detail, in various contexts, etc.
The most popular technique for detecting duplicates in software documentation is software clone detection [3]. A number of approaches use this technique in software documentation research [4], [5], [6]. However, these approaches operate only on exact duplicates. Near duplicate clone detection techniques [7], [8], [9], [10] are not directly capable of detecting duplicates in text documents, as they involve some degree of parsing of the underlying source code. In our previous studies [11], [12], [13] we presented a near duplicate detection approach based on software clone detection. We adapted the clone detection tool Clone Miner [14] to detect exact duplicates in documents, and near duplicates were then extracted as combinations of exact duplicates. However, this approach produces a lot of false positives because it cannot manage exact duplicate detection and operates with poor-quality "bricks" for combining near duplicates. Meanwhile, the false positive problem is one of the major obstacles to duplicate management in practice [4].
In this paper we propose a near duplicate detection algorithm based on the N-gram model [1]. The algorithm does not use software clone detection, omitting the intermediate phase of exact duplicate detection. We have implemented the algorithm using the Natural Language Toolkit (NLTK) [15]. The algorithm was evaluated on the documentation of five industrial projects.

Related Work
The problem of duplicate management in software project documents is being actively explored. Juergens et al. [4] analyze redundancy in requirement specifications. Horie et al. [16] consider the problem of text fragment duplicates in Java API documentation. Wingkvist et al. [5] detect exact duplicates to manage document maintenance. Rago et al. [17] detect duplicate functionality in textual requirement specifications. However, the problem of near duplicate detection is still open. It is mentioned in [4], and Nosál and Porubän [18] suggest using near duplicates but do not describe a way to detect them.
For software engineering, the conceptual background of near duplicate analysis is provided by Bassett [19]. He introduced the terms archetype (the common part of various occurrences of variable information) and delta (the varying part). Based on this concept, Jarzabek developed an XML-based software reuse method [20]. Koznov and Romanovsky [21], [22] applied the ideas of Bassett and Jarzabek to software documentation reuse, including automated documentation refactoring. However, these studies did not resolve the problem of detecting duplicates in documents.
There are various techniques for detecting near duplicate clones in source code. SourcererCC [7] detects near duplicates of code blocks using a static bag-of-tokens strategy that is resilient to minor differences between code blocks. Deckard [8] computes certain characteristic vectors of code to approximate the structure of abstract syntax trees in Euclidean space. Locality-sensitive hashing (LSH) [9] is used to group similar vectors by Euclidean distance. NICAD [10] is a text-based near duplicate detection tool that also uses tree-based structural analysis. However, these techniques are not directly capable of detecting duplicates in text documents, as they involve some degree of parsing of the underlying source code. A suitable customization for this purpose can be explored in the future.
Finally, there is a need for mature near duplicate detection methods that provide proper duplicate analysis of software documentation. New information retrieval methods should be applied to increase search quality. Natural language processing methods appear attractive for that purpose [1].

Background
Modern natural language processing and computational linguistics employ numerous standard approaches to analyzing and transforming texts. One of them is the N-gram model [23]. Let us consider a text as a set of sentences. For every sentence, the N-gram model includes all sequences (N-grams) of n words in which every word directly follows the previous one, in the same order as in the sentence. Therefore, every N-gram is a substring of the corresponding sentence. For example, if we want to detect that two sentences are similar, we can compare their N-gram sets. The N-gram model is used to perform different kinds of text analysis. One of the most common programming tools for practical use of the N-gram model is the Natural Language Toolkit (NLTK) [15]. It provides a number of standard linguistic operations and is implemented in Python, which makes it easy to integrate NLTK into our Documentation Refactoring Toolkit [24] environment.
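To illustrate how N-gram sets of two sentences can be compared, here is a minimal sketch in plain Python. The helper names (tokenize, ngram_set), the regular-expression tokenizer, and the lowercasing step are assumptions made for this example and are not taken from the implementation described in this paper; NLTK offers tokenizers and nltk.util.ngrams for the same purpose.

```python
import re

def tokenize(sentence):
    """Split a sentence into lowercase word tokens (simplified, assumed tokenizer)."""
    return re.findall(r"\w+", sentence.lower())

def ngram_set(sentence, n=3):
    """Return the set of word n-grams of a sentence, each n-gram as a tuple."""
    tokens = tokenize(sentence)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Two example sentences that differ in two words only.
a = ngram_set("The function returns the number of elements in the set.")
b = ngram_set("The function returns the number of items in the list.")

# The 3-grams present in both sentences indicate how similar they are.
shared = a & b
print(sorted(shared))
print(len(shared), "of", len(a), "3-grams of the first sentence are shared")
```

In this example half of the 3-grams of the first sentence also occur in the second one, which is the kind of evidence a similarity test on N-gram sets can rely on.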

The Algorithm
The proposed algorithm requires the raw input document to be preprocessed: it should be divided into sentences, the sentences should be divided into words (tokens), and an N-gram set is built for every sentence. The algorithm collects document sentences into groups if they are close to each other and were likely derived from one source by copy and paste. The algorithm works as follows. First, it extracts sentences and builds a 3-gram set for each of them. After that, for each sentence, the algorithm scans the existing groups and chooses the best one, i.e., the group that already contains the largest number of the sentence's 3-grams. Then, if the best group already contains at least half of the sentence's 3-grams, the sentence is added to this group, and the group's 3-gram set is extended with the new sentence's 3-grams. If no such group is found, a new group is created. Finally, the algorithm outputs the groups that contain two or more sentences. These groups are the near duplicate groups.
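The steps above can be sketched in Python as follows. This is an illustrative reading of the description, not the original implementation: the function names, the naive regular-expression sentence splitter and tokenizer, the tie-breaking rule (the first group with the largest overlap wins), and the decision to skip sentences with fewer than three words are all assumptions.

```python
import re

def split_sentences(text):
    """Naive sentence splitter (assumption): break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def trigrams(sentence):
    """Set of word 3-grams of a sentence; tokens are lowercased words."""
    tokens = re.findall(r"\w+", sentence.lower())
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def find_near_duplicate_groups(text):
    groups = []  # each group: {"sent": [sentences], "ngrams": set of 3-grams}
    for sentence in split_sentences(text):
        grams = trigrams(sentence)
        if not grams:
            continue  # assumption: sentences without 3-grams are skipped
        # Choose the group sharing the largest number of the sentence's 3-grams.
        best, best_overlap = None, 0
        for group in groups:
            overlap = len(grams & group["ngrams"])
            if overlap > best_overlap:
                best, best_overlap = group, overlap
        # Join the best group if it holds at least half of the sentence's 3-grams.
        if best is not None and 2 * best_overlap >= len(grams):
            best["sent"].append(sentence)
            best["ngrams"] |= grams
        else:
            groups.append({"sent": [sentence], "ngrams": set(grams)})
    # Only groups with two or more sentences are near duplicate groups.
    return [g["sent"] for g in groups if len(g["sent"]) >= 2]

doc = ("The install command copies the files to the target directory. "
       "The install command copies the files to the backup directory. "
       "Configuration files are stored in the home directory.")
result = find_near_duplicate_groups(doc)
print(result)
```

For this sample text the first two sentences share six of their eight 3-grams and end up in one group, while the third sentence shares none and forms a single-sentence group that is dropped from the output.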

Evaluation
We follow the GQM framework [25] to organize the evaluation of our algorithm. We formulate the following evaluation questions:
Question 1: How many false positives (incorrect and irrelevant duplicate groups) and meaningful near duplicates are found?
Question 2: What is the performance of the algorithm?
We use the notion of reuse amount [26], which is the ratio of the reusable part to the document length. For exact duplicates, the reusable part is the total number of symbols covered by duplicates; for near duplicates, we consider only their archetypes.
In [4] the same metric is called clone coverage. Following [12], [13], we selected the documentation of four open source projects as evaluation objects and added the documentation of one commercial project:
• Linux Kernel documentation (LKD), 892 KB in total [27];
• Zend Framework documentation (Zend), 2924 KB in total [28];
• DocBook 4 Definitive Guide (DocBook), 686 KB in total [29];
• Version Control with Subversion (SVN), 1810 KB in total [30];
• Commercial project user guide (CProj), 164 KB in total.
To answer question 1, we performed a manual analysis of the near duplicates detected. The results are presented in Table 1. The table includes the column Document (evaluation documents) and two sections: Proposed algorithm (data concerning the algorithm presented in this paper) and Manual analysis (results of manual analysis of the algorithm output). The Proposed algorithm section is organized as follows:
• automatically detected shows the number of groups found by the algorithm;
• raw reuse amount contains the reuse amount values for the evaluated documents.
The Manual analysis section contains the following columns:
• markup-only contains the number of groups without human-readable text (they contain only markup);
• irrelevant presents the number of false-positive groups identified by a human during manual review of the algorithm output;
• total meaningful shows the number of meaningful duplicates identified manually by analyzing the algorithm output;
• meaningful reuse amount presents the reuse amount values for the meaningful near duplicates detected.
Of all groups, 14.4% contain no human-readable text, only markup, and 24.6% contain text that is similar only formally: the duplicates in those groups are not semantically connected. Together, these two categories account for the 39% of groups that are false positives. The remaining 61% of groups are meaningful duplicate groups. Their count varies from a few dozen to several hundred depending on the size and nature of the document; therefore, we can say that the proposed algorithm detects a considerable number of near duplicates, and most of them are meaningful. The reuse amount decreased by a factor of two after manual processing. These data indicate that the false positive problem needs to be resolved for the algorithm.
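The reuse amount used in this evaluation can be illustrated for the simpler case of exact duplicates, where the reusable part is the number of symbols covered by duplicates. The sketch below is only an illustration under stated assumptions: the function name, the representation of duplicates as (start, end) character offsets, and the merging of overlapping spans are hypothetical and are not taken from [26] or from our tooling.

```python
def reuse_amount(doc_length, duplicate_spans):
    """Share of the document covered by duplicates.

    duplicate_spans holds (start, end) character offsets, end exclusive.
    Overlapping spans are merged so that no character is counted twice
    (an assumption of this sketch).
    """
    covered = 0
    current_end = 0
    for start, end in sorted(duplicate_spans):
        start = max(start, current_end)  # skip the part already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered / doc_length

# Hypothetical document of 1000 characters with three duplicate occurrences,
# two of which overlap.
spans = [(0, 100), (50, 150), (400, 450)]
print(reuse_amount(1000, spans))  # 0.2
```

For near duplicates the same ratio would be computed over the archetypes only, i.e., over the common parts of the duplicate occurrences.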
Finally, to answer question 2, we measured the running time of the algorithm on the evaluation documents. For our experiments we used an ordinary workstation: Intel i5-2400, 3.10 GHz, 4 GiB RAM, Windows 10. The results are presented in Table 2. The first column of the table contains the acronyms of the evaluated documents. The second one contains the size of the documents. The third column presents the processing time of the algorithm. The fourth column presents the processing speed.
The processing speed depends on two parameters: the size of the document and the reuse amount. It decreases as the document size grows and increases as the reuse amount grows. The first statement is obvious. The second one follows from the fact that, roughly speaking, the larger the reuse amount is, the fewer single-sentence groups exist, and therefore the number of operations in the loop of the best group selection (see Listing 1, lines 5-13) decreases. However, this is a rough estimate, because the size of the groups also contributes to the processing speed, and we cannot say for certain whether a larger reuse amount can compensate for a larger document size. The five documents presented in Table 2 confirm our assumption. For these documents, the processing speed decreases as the document size increases, with one exception: the processing speed for Zend was higher than that for SVN, although the Zend document is larger than the SVN one. At the same time, the reuse amount of Zend is substantially higher than that of SVN. The assumption concerning the reuse amount also holds in our experiments beyond the results presented in this paper. However, further research is needed to verify this assumption. In addition, implementation factors that can influence the algorithm's performance need to be explored.
Finally, the performance of the algorithm appears sufficient for practical applications. The algorithm demonstrates an acceptable processing time for rather large documents, i.e., from 1 to 3 MB. Larger documents are quite rare in practice.

Conclusion
We have presented an algorithm for detecting near duplicates in software documentation based on the N-gram model. The proposed algorithm is close to the naive voting clustering algorithm [31] and uses a similarity measure resembling the Jaccard index [32]. Compared to [12], [13], the algorithm is much simpler, while also making use of the techniques and apparatus conventionally used for text analysis. It should be noted that the algorithm has a considerable limitation: it only detects single sentences as near duplicates. Our primary goal for future research is to extend the algorithm to process arbitrary text fragments. Some additional directions for future research are as follows:
1. The false positive problem needs to be resolved. The algorithm output should be compared with manual document analysis.
2. A classification of false positives and meaningful near duplicates should be developed. False positives may include markup, document metadata, etc. Meaningful near duplicates usually describe entities of the same nature (function descriptions, command line parameters, data type specifications, etc.).
3. The experimental model should be improved. For example, Juergens et al. [4] spend much effort to obtain objective results in analyzing duplicates in real industrial documents.
The research results could be applied in various fields of software engineering, e.g., in model-based testing [33], [34], to ensure the correctness of the initial requirement specifications used for test generation.
Let us describe the algorithm in more detail. The formal specification of the algorithm is presented in the listing. The main functions and data structures of the algorithm are briefly described below:
• intersect(A,B) function returns the elements that exist in both sets A and B
• size(A) function returns the number of elements in the set A
• sent is an array of the sentences in the document text
  ◦ sent[i].nGrams is the 3-gram set of the i-th sentence
• groups is an array of near duplicate groups
  ◦ groups[i].nGrams is the 3-gram set of the i-th group
  ◦ groups[i].sent is the set of sentences of the i-th group
Details of the proposed algorithm are described below:
1. Lines 1-22: the main loop of the algorithm, which iterates over all sentences of the document.
2. Lines 5-13: the loop for the best group selection. For each group:
2.1. Line 7: the intersection of the group's 3-gram set with the 3-gram set of the current sentence is calculated.
3.2. Lines 19, 20: otherwise, we put the sentence into the best group found.
4. Lines 23-25: groups with a single sentence are not near duplicate groups, therefore we remove them.
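In terms of the functions described above, the test that decides whether a sentence joins the best group (at least half of the sentence's 3-grams are already present in the group) corresponds to comparing size(intersect(sent[i].nGrams, groups[j].nGrams)) with size(sent[i].nGrams). It can be written as shown below, where N(s) and N(G) denote the 3-gram sets of a sentence s and a group G; this notation is introduced here for illustration only. The Jaccard index J is given for comparison: the measure used by the algorithm resembles it, but normalizes by the sentence's own 3-gram set rather than by the union of the two sets.

```latex
\[
\frac{|N(s) \cap N(G)|}{|N(s)|} \ge \frac{1}{2},
\qquad
J\bigl(N(s), N(G)\bigr) = \frac{|N(s) \cap N(G)|}{|N(s) \cup N(G)|}.
\]
```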

Table 1. Near-duplicate groups detected

Table 2. Performance analysis