CN111324380A

CN111324380A - Efficient multi-version cross-project software code clone detection method

Info

Publication number: CN111324380A
Application number: CN202010122695.6A
Authority: CN
Inventors: 吴毅坚; 方维康
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2020-02-27
Filing date: 2020-02-27
Publication date: 2020-06-23

Abstract

The invention belongs to the technical field of software code analysis and specifically relates to an efficient multi-version cross-project software code clone detection method. The method first obtains the version information of each software project that has multiple versions. Based on method names and file paths, it then groups the copies of the same method that appear in different versions with identical or highly similar code into a method version group, and selects the earliest version in each group as a sample method; the set of sample methods is called the history image. Clone detection is then performed on all history images, and an index between each sample method and its method version group, called the method index, is established at the same time. Finally, the original full clone relation is recovered from the clone detection result of the sample methods and the method index. Because the versions of a project share a large amount of repeated code, the invention masks this repeated code during clone detection, thereby improving the efficiency of multi-version cross-project code clone detection.

Description

An efficient multi-version cross-project software code clone detection method

Technical Field

The invention belongs to the technical field of software development and specifically relates to an efficient multi-version cross-project software code clone detection method.

Background

Cloned code arises for many reasons. The main one is that software developers copy and paste code, intentionally or unintentionally, usually with some slight modifications. While cloning speeds up development, it also introduces many hidden risks. Understanding the code that recurs inside a complex software project and its relationship to other software projects, and improving the stability of the target software's key code, requires code clone detection and automated code analysis techniques. Code clone information can also help code reviewers and analysts locate the key code of the target software system and form a complete, comprehensive understanding of it, laying a data foundation for understanding the system in depth and thereby improving its robustness and stability.

Researchers in China and abroad have produced many results on code clone detection, including but not limited to the following. Baker et al. proposed a text-based clone detection method that treats source code as text and compares it with the text line as the basic unit. CCFinder by Kamiya et al. uses a lexical analyzer to split each line of source code into tokens, transforms the token sequence, and then applies a suffix-tree-based matching algorithm to the transformed token sequence to detect clones. Baxter et al. proposed an abstract-syntax-tree-based detection method that parses source code into a syntax tree for clone analysis, thereby preserving more of the syntactic structure. Ferrante et al. proposed a clone detection method based on program dependence graphs, which converts source code into program dependence graphs and detects clones through the similarity between graphs.

Current clone detection techniques and tools in academia focus mainly on the clone detection algorithm itself. For projects with multiple versions, however, the versions contain a large amount of repeated code, so whichever clone detection tool is used, a large amount of unnecessary repeated detection is unavoidable. To detect clones more efficiently, and considering the regularity of the code repeated between versions, a clone detection method better suited to multi-version projects than traditional code clone detection methods is proposed, which greatly improves the efficiency of clone detection for multi-version projects.

Summary of the Invention

Multi-version projects contain a large amount of repeated code, so traditional clone detection methods perform many unnecessary repeated detections. To remedy this shortcoming, the present invention provides an efficient code clone detection method for multi-version software projects based on the code mapping relationship between versions.

The main idea of the method is as follows. For a project with multiple versions, heuristic rules are used to group the copies of the same method that appear in different versions with identical or highly similar code content into method version groups. One method is then selected from each method version group as a sample to take part in clone detection. Finally, the clone detection result of the sample methods is restored to the complete clone detection result.

The specific steps of the method are as follows:

a. Obtain the historical version information of each software project to be analyzed, including version names and release times;

b. For each project: (1) Establish method version groups: group the copies of the same method that appear in different versions with identical or highly similar code content into a method version group. (2) Build the history image: select the earliest version in each method version group as the sample method; the set of these sample methods is called the history image of the project. (3) Establish the method index: build an index between each sample method and the method version group it belongs to; this index is called the method index. If a project has only one version, that version is the project's history image;

c. Use a code clone detection tool to perform clone detection on the history images of all projects and obtain the clone detection result;

d. Use the clone detection result together with the method index saved in step b to restore the original full clone relation.

In step a, the projects to be detected are the set of projects that the user specifies for code clone detection. This step requires the user to provide the version information of these projects.

The version information of the software to be analyzed includes the names of all versions of the project and their release times, and is stored in a predefined format. It can be exported directly from a version control tool such as SVN or Git; if the project is not managed by a version control tool, the version information can be added manually in the predefined format.
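As an illustration of step a, the following minimal sketch parses a version information file and orders the versions by release time. The text does not specify the predefined format, so the `name,ISO-8601 time` line layout, the `Version` type, and the `parse_version_info` function are all hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class Version(NamedTuple):
    name: str            # version name, for example a release tag
    released: datetime   # release time


def parse_version_info(text: str) -> list[Version]:
    """Parse 'name,ISO-8601 time' lines (a hypothetical format) and
    return the versions sorted by release time, earliest first."""
    versions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        name, released = (field.strip() for field in line.split(",", 1))
        versions.append(Version(name, datetime.fromisoformat(released)))
    return sorted(versions, key=lambda v: v.released)


sample = """
# name,release time (hypothetical format)
v1.1,2019-06-01T12:00:00
v1.0,2019-01-15T09:30:00
"""
ordered = parse_version_info(sample)
print([v.name for v in ordered])
```

Sorting at parse time is a convenience: the later steps process versions in release order, so the rest of a pipeline can rely on list order alone.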

In step b, method version groups are established per project. If a project has multiple versions, the versions are very likely to share a large number of identical methods, and these methods usually sit under the same relative path and have the same method name. Thanks to this property, building method version groups costs very little yet improves the efficiency of the subsequent clone detection considerably. The process handles each version of the project in order of release time as follows. First, all methods in the current version are extracted. For each method, if it already belongs to a method version group, it is skipped. Otherwise a new method version group is created: the method's name and the relative path of its file are extracted, all subsequent versions are searched for methods with the same relative path, the same method name, and highly similar text, and these methods are added to the new group. Next, the earliest version in each method version group is selected as the sample method; the set of all sample methods in a project is called the history image of the project. Finally, an index between the sample methods and the method version groups is established, called the method index.
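The grouping procedure of step b can be sketched as follows. This is a minimal illustration, not the patented implementation: the in-memory data layout, the `MethodId` tuple, and the `build_history_image` function are assumptions, and methods are keyed only by relative path and name, so overloaded methods in one file would need extra handling. The similarity test is passed in as a predicate; the usage lines use exact text equality as a stand-in.

```python
from typing import Callable

# (version name, relative file path, method name) identifies one method copy
MethodId = tuple[str, str, str]


def build_history_image(
    versions: list[tuple[str, dict[tuple[str, str], str]]],
    same_method: Callable[[str, str], bool],
) -> dict[MethodId, list[MethodId]]:
    """Return the method index: sample method -> its method version group.

    `versions` must be ordered by release time; each entry is
    (version name, {(relative path, method name): method text}).
    The keys of the returned dict form the history image.
    """
    grouped: set[MethodId] = set()
    method_index: dict[MethodId, list[MethodId]] = {}
    for i, (version, methods) in enumerate(versions):
        for (path, name), text in methods.items():
            sample = (version, path, name)
            if sample in grouped:
                continue  # already belongs to an earlier group
            group = [sample]
            grouped.add(sample)
            # Search all subsequent versions for the same path and name.
            for later_version, later_methods in versions[i + 1:]:
                later_text = later_methods.get((path, name))
                later_id = (later_version, path, name)
                if (later_text is not None and later_id not in grouped
                        and same_method(text, later_text)):
                    group.append(later_id)
                    grouped.add(later_id)
            method_index[sample] = group
    return method_index


versions = [
    ("v1", {("src/A.java", "foo"): "int foo() { return 1; }",
            ("src/A.java", "bar"): "int bar() { return 2; }"}),
    ("v2", {("src/A.java", "foo"): "int foo() { return 1; }",
            ("src/A.java", "bar"): "String bar() { return name; }"}),
]
index = build_history_image(versions, same_method=lambda a, b: a == b)
print(len(index), "sample methods")
```

In this toy input, `foo` is unchanged between the two versions and collapses into one group, while `bar` was rewritten and therefore yields one sample per version. Because the first member of every group comes from the earliest version that contains it, the dictionary key is always the sample method.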

The criterion for the same method is the edit distance between the method texts. Specifically, for methods A and B, if the ratio of the edit distance between their texts to the smaller of their two text lengths is less than 0.05, that is, the text similarity between A and B exceeds 95%, then A and B are considered the same method.
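The criterion above can be sketched in a few lines. The text does not say which edit distance is meant or at what granularity it is computed, so this sketch assumes character-level Levenshtein distance; the function names `edit_distance` and `same_method` and the handling of empty texts are likewise assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using a rolling row."""
    if len(a) < len(b):
        a, b = b, a  # keep the row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def same_method(text_a: str, text_b: str, threshold: float = 0.05) -> bool:
    """True if edit distance / min(len A, len B) is below the threshold."""
    shorter = min(len(text_a), len(text_b))
    if shorter == 0:
        # Assumption: an empty text matches only another empty text.
        return text_a == text_b
    return edit_distance(text_a, text_b) / shorter < threshold


old = "int add(int a, int b) { int sum = a + b; return sum; }"
new = "int add(int a, int b) { int sum = a + b; return sum ; }"
print(same_method(old, new))                    # one inserted space
print(same_method("return a;", "return b;"))    # 1 edit in 9 characters
```

Dividing by the shorter text makes the test stricter than dividing by the longer one: a one-character change in a nine-character method already exceeds the 0.05 threshold, whereas the same change in a method of several dozen characters does not.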

In step c, clone detection is performed on the history images of all projects. The detection scope covers clones both within a project and between projects, and the detection result is a set of clone groups. The detection tool is configurable: either an off-the-shelf detection tool or a self-developed one can be used.

In step d, restoring the original full clone relation means mapping the partial clone relation to the complete clone relation, based on the clone detection result of the projects' history images combined with the method index.

Here, the full clone relation is the result that a clone detection tool would produce on the multi-version projects without any additional processing.

Compared with the prior art, the present invention has the following advantages and positive effects. It gives software maintainers and developers an effective means of understanding the clone relationships inside a multi-version software system and between that system and other projects. Unlike traditional code clone detection techniques and tools, which focus mainly on the algorithm itself, the invention starts from the structural characteristics of multi-version projects and reduces the amount of code to be examined, which can greatly improve the efficiency of multi-version cross-project code clone detection.
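Step d can be illustrated with a short sketch that expands each detected clone group through the method index. The string identifiers such as `A@v1` and the function `restore_full_clones` are hypothetical; the text describes the mapping only in general terms.

```python
def restore_full_clones(
    clone_groups: list[list[str]],
    method_index: dict[str, list[str]],
) -> list[list[str]]:
    """Expand each clone group of sample methods into all method copies."""
    full_groups = []
    for group in clone_groups:
        expanded: list[str] = []
        for sample in group:
            # A sample stands for every member of its method version group.
            expanded.extend(method_index.get(sample, [sample]))
        full_groups.append(expanded)
    return full_groups


# Method index saved in step b: sample method -> method version group.
method_index = {
    "A@v1": ["A@v1", "A@v2", "A@v3"],
    "B@v1": ["B@v1", "B@v2"],
}
# Result of step c on the history images: A and B are clones.
clone_groups = [["A@v1", "B@v1"]]

full = restore_full_clones(clone_groups, method_index)
print(full)
```

Here one detected pair of samples grows into a clone group of five method copies. How a sample with several versions but no clone partner should appear in the full relation is not stated in the text; this sketch simply leaves such groups out.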

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the basic process of the present invention. It comprises extracting project version information; establishing method version groups, the history image, and the method index; clone detection; and clone restoration.

FIG. 2 is a schematic diagram of the example implementation process, showing the specific process of clone detection over multiple release versions of a set of software projects.

Detailed Description

The objectives, specific structural features, and advantages of the present invention can be further understood through the following description of an embodiment in conjunction with the accompanying drawings. FIG. 2 is a schematic diagram of the example implementation process.

This embodiment selects 251 open-source Java projects from GitHub, each with more than 50 stars and at least two release versions and drawn from different domains, as the target code for clone detection. The following is an example of a concrete implementation of multi-version cross-project code clone detection for this set of software projects.

The main process of this embodiment is:

(1) From the user-specified path of the target project set, the path of the version information file, and so on, parse out the set of target projects to be detected and their version information. Using the version management tool Git, extract and save the source code of all their release versions, and save all their version information to a database. Because this embodiment targets only Java code, only the Java source files (.java files) are kept. There are 3234 release versions in total, with about 300 million lines of code;

(2) Build the history images and the method index. Building the history images for the selected 3234 release versions with this method and saving the method index took 1129 seconds in total (to make the result reliable, the value is the average over multiple builds; the same applies below). The history images generated for the 251 projects total about 40 million lines of code and 788,120 sample methods;

(3) Clone detection. An existing code clone detection tool was used to perform cross-project clone detection on the history images generated in step (2), yielding 82,595 clone groups with 644,653 clone instances in total, in 96 seconds;

(4) Restore the full clone relation. From the detected clone groups and the established method index, the full clone relation is restored, with 3,821,507 instances in total.

Result analysis: the history images generated in step (2) contain about 40 million lines of code, about 87% less than the code of all original release versions (300 million lines). Clone detection on the history images in step (3) took 96 seconds. For comparison, the same code clone detection tool was also run on the same code set (the original 300 million lines of code across the 3234 release versions), which took about 5800 seconds. With this method, the overall clone detection time is therefore considerably reduced, improving the efficiency of multi-version cross-project code clone detection.
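The figures reported in this embodiment can be cross-checked with a little arithmetic. The sketch below only recombines numbers stated above; the variable names are illustrative, and the end-to-end ratio is a derived comparison that assumes the 5800-second baseline covers detection only, which the text does not state explicitly.

```python
total_loc = 300_000_000      # all 3234 release versions (approximate)
image_loc = 40_000_000       # history images of the 251 projects (approximate)
build_seconds = 1129         # building history images and the method index
detect_seconds = 96          # clone detection on the history images
baseline_seconds = 5800      # clone detection on all versions (approximate)
sample_instances = 644_653   # clone instances found among sample methods
restored_instances = 3_821_507  # instances after restoring the full relation

reduction = 1 - image_loc / total_loc
detection_speedup = baseline_seconds / detect_seconds
end_to_end_speedup = baseline_seconds / (build_seconds + detect_seconds)
expansion = restored_instances / sample_instances

print(f"code reduction:        {reduction:.1%}")
print(f"detection speedup:     {detection_speedup:.1f}x")
print(f"end-to-end speedup:    {end_to_end_speedup:.1f}x")
print(f"instances per sample:  {expansion:.2f}")
```

The code reduction comes out at roughly 86.7%, which the text rounds to 87%. Detection alone is about 60 times faster on the history images; when the 1129 seconds of building time are added, the total of 1225 seconds is still under a quarter of the baseline.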

Claims

1. An efficient multi-version cross-project software code clone detection method, characterized by comprising the following specific steps:

a. acquiring historical version information of each software project to be analyzed;

b. for each project, first establishing method version groups: establishing a method version group for the same method appearing in different versions with the same or highly similar code content; then building a history image: selecting the earliest version in each method version group as a sample method, wherein the set of the sample methods is called the history image of the project; and finally establishing a method index: establishing an index relation between each sample method and the method version group to which the sample method belongs, wherein the index is called the method index;

c. performing clone detection on the history images of all the projects by adopting a code clone detection tool to obtain a clone detection result;

d. restoring the original full clone relation by combining the obtained clone detection result with the method index stored in step b.

2. The method according to claim 1, wherein in step a, the version information of the software to be analyzed includes names of all versions of the project and corresponding release times; the version information is stored in a predetermined format.

3. The method according to claim 1, wherein in step b, the specific process of establishing the method version groups is to sequentially perform the following processing on each version of the project in order of the release times of the versions: firstly, extracting all methods in the current version, judging for each method whether it already belongs to a method version group, and skipping the method if it does; otherwise, establishing a new method version group, extracting the method name and the relative path of the file where the method is located, searching all subsequent versions for methods that have the same relative path and method name and highly similar text, and adding these methods to the new method version group; then, selecting the earliest version in each method version group as a sample method, wherein the set of all sample methods in a project is called the history image of the project; finally, establishing an index relation between the sample methods and the method version groups, wherein the index is called the method index.

4. The method according to claim 3, wherein in step b, the criterion for the same method is the edit distance between method texts, specifically: for method A and method B, if the ratio of the edit distance between the texts of method A and method B to the smaller of the text lengths of method A and method B is less than 0.05, i.e., the text similarity between method A and method B exceeds 95%, then method A and method B are considered to be the same method.

5. The method according to claim 3, wherein in step c, the clone detection is performed on the history image of each project, the detection scope includes clones both within a project and between projects, and the detection result is clone groups; and the detection tool is configurable.

6. The method according to any one of claims 1 to 5, wherein in step d, the restoring of the original full clone relation refers to mapping from a partial clone relation to a full clone relation based on the clone detection result of the history images of the projects in combination with the method index;

wherein the full clone relation refers to the result detected by a clone detection tool when the multi-version projects are not additionally processed.