CN111459799B

CN111459799B - Software defect detection model establishing and detecting method and system based on Github

Info

Publication number: CN111459799B
Application number: CN202010140642.7A
Authority: CN
Inventors: 柯鑫; 叶贵鑫; 汤战勇; 尹小燕; 龚晓庆; 房鼎益
Original assignee: Northwest University
Current assignee: Northwest University
Priority date: 2020-03-03
Filing date: 2020-03-03
Publication date: 2023-03-10
Anticipated expiration: 2040-03-03
Also published as: CN111459799A

Abstract

The invention discloses a Github-based method and system for establishing a software defect detection model and for detecting software defects. Establishing the detection model comprises: first, preprocessing a data set from the Github platform to obtain the change records that meet the requirements and their corresponding Bug-Fix file pairs; then processing those change records to generate slice vectors and labels; and finally inputting the slice vectors and labels into a bidirectional LSTM model for training, yielding a trained detection model. A target file to be detected is processed into a vector, which is input into the detection model to obtain the detection result. The method addresses the data imbalance, insufficient data diversity and poor model generalization that current defect detection based on source code learning faces because its data sets are too small, and can achieve higher detection accuracy.

Description

A Github-based software defect detection model establishment, detection method and system

Technical field

The invention belongs to the technical field of code auditing, and in particular relates to a Github-based method and system for establishing a software defect detection model and for detecting software defects.

Background

A variety of defect analysis tools already exist in the field of code auditing; they attempt to detect common defects in software. Static detection tools (such as Clang) do this without executing the program. Dynamic detection tools detect defects by repeatedly executing many test cases on real or virtual processors. Both static and dynamic detection tools rely on manually defined defect rules, so they are limited to hand-designed rules and cannot guarantee complete testing of a code base. Symbolic execution replaces input data with symbolic values and analyzes the program over its control flow graph. Although it can explore all feasible program paths, symbolic execution is very expensive and does not scale well to large programs. Beyond these traditional tools, there has recently been a large body of work on program analysis using machine learning. Large open source platforms such as Github provide an opportunity to learn patterns of software defects directly from mined data.

Many techniques currently apply machine learning and deep learning to defect detection. Hovsepyan et al. represented Java source code with a bag-of-words model and then used a support vector machine (SVM) to predict source code labels. However, their work is limited to training and evaluation on a single software repository. Mou et al. embedded the nodes of the abstract syntax tree of the source code and trained a tree-based convolutional neural network to detect defects; this work explored the potential of deep learning in program analysis. Kapur et al. used eight machine learning methods to learn defect features and estimate the likelihood that a source file contains defects. Li et al. used a recurrent neural network (RNN) trained on code fragments related to library/API function calls to detect two types of defects caused by improper use of these API calls.

The limited data sets (limited in both size and variety) used by most previous work restrict the practicality of the results and prevent that work from exploiting the full power of deep learning. The related work above all operates on defect databases such as PROMISE, DEFECT4J, NVD and SARD. The largest of these data sets contains fewer than eight open source repositories, and they are updated slowly, cover few defect types and have low defect coverage, so existing models cannot be used to detect complex and diverse defects. No existing work mines data directly from large open source code hosting platforms and learns defect features from it for detection. The main reason is that data collected from a large platform such as Github suffers from a very high false positive rate, which further undermines the effectiveness of the model.

Summary of the invention

To address the deficiencies of the prior art, the invention provides a Github-based method and system for establishing a software defect detection model and for detecting software defects. It addresses the problems that arise because the defect databases used by existing methods are updated slowly, cover few defect types and have low defect coverage: static defect detection must cope with imbalanced samples, models cannot detect complex and diverse defects, and models become obsolete quickly.

To solve the above technical problems, the invention adopts the following technical solutions:

A method for establishing a Github-based software defect detection model comprises the following steps:

Step 1, data preprocessing:

Rank the repositories on Github and select the top-ranked repositories as the source data set; filter and deduplicate the change records of each repository in the source data set to obtain the change records that meet the requirements and their corresponding Bug-Fix file pairs;

Step 2, process the change records obtained in step 1 and their corresponding Bug-Fix file pairs to generate slice vectors and labels:

Step 2.1, parse and compare the Bug-Fix file pair of each change record obtained in step 1 to obtain the added and deleted code lines for that change record;

Step 2.2, based on the added and deleted code lines obtained in step 2.1 and the data flow of the Bug-Fix file pair, determine the defective code lines that cause the defect for each change record;

Step 2.3, generate slices of each Bug-Fix file pair from the data flow and control flow information of the Bug-Fix file pairs obtained in step 1;

Step 2.4, label the slices obtained in step 2.3 according to the defective code lines determined in step 2.2;

Step 2.5, for each slice obtained in step 2.3, replace the variable names with a unified token;

Step 2.6, tokenize each slice whose variable names were replaced in step 2.5, obtaining one token sequence per slice;

Step 2.7, convert each token sequence obtained in step 2.6 into a vector;

Step 3, model training:

Input the slice vectors and labels obtained in step 2 into a bidirectional LSTM model for training to obtain a trained detection model.

Specifically, step 1 comprises:

Step 1.1, rank the repositories on Github in descending order of their fork count, used as the impact factor, and select the top 30% to 35% of repositories as the source data set;

Step 1.2, filter the change records of each repository in the source data set by keyword search to obtain change records for different defect types;

Step 1.3, from the change records obtained in step 1.2, filter out those that do not match the verb/direct object pattern;

Step 1.4, screen the change records obtained in step 1.3 again using TextRank to eliminate mixed change records;

Step 1.5, deduplicate the change records obtained in step 1.4 to obtain the change records that meet the requirements and their corresponding Bug-Fix file pairs.

Specifically, in the labeling of step 2.4, a slice that contains a defective code line is labeled 0, and a slice that contains no defective code line is labeled 1.
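The labeling rule can be illustrated with a minimal sketch. The function name `label_slice` and the representation of a slice as a list of line numbers are assumptions made for illustration; only the 0/1 convention comes from the text.

```python
def label_slice(slice_lines, defective_lines):
    """Label a slice following the convention stated in the text.

    slice_lines: line numbers that make up the slice (assumed representation).
    defective_lines: set of line numbers identified as defective.
    Returns 0 if the slice contains a defective line, otherwise 1.
    """
    return 0 if any(line in defective_lines for line in slice_lines) else 1


# Hypothetical slices: one passes through defective line 5, one skips it.
defective = {5}
print(label_slice([1, 2, 3, 4, 5, 6, 7], defective))
print(label_slice([1, 2, 3, 4, 6, 7], defective))
```

Note that this convention is the reverse of the common practice of using 1 for the positive (defective) class, so any downstream metric has to treat 0 as the defective label.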

Specifically, in step 2.7, the word2vec tool is used to convert each token sequence into a vector.

The invention also discloses a system for establishing a Github-based software defect detection model, comprising the following modules:

A data preprocessing module, configured to rank the repositories on Github, select the top-ranked repositories as the source data set, and filter and deduplicate the change records of each repository in the source data set to obtain the change records that meet the requirements and their corresponding Bug-Fix file pairs;

A data extraction module, comprising:

An added-and-deleted code line acquisition module, configured to parse and compare the Bug-Fix file pair of each change record obtained by the data preprocessing module to obtain the added and deleted code lines for that change record;

A defective code line acquisition module, configured to determine, for each change record, the defective code lines that cause the defect, based on the added and deleted code lines obtained by the added-and-deleted code line acquisition module and the data flow of the Bug-Fix file pair;

A slice generation module, configured to generate slices of each Bug-Fix file pair from the data flow and control flow information of the Bug-Fix file pairs obtained by the data preprocessing module;

A labeling module, configured to label the slices obtained by the slice generation module according to the defective code lines determined by the defective code line acquisition module;

A variable name replacement module, configured to replace the variable names in each slice obtained by the slice generation module with a unified token;

A tokenization module, configured to tokenize each slice whose variable names were replaced by the variable name replacement module, obtaining one token sequence per slice;

A vector conversion module, configured to convert each token sequence obtained by the tokenization module into a vector;

A model training module, configured to input the slice vectors and labels obtained by the data extraction module into a bidirectional LSTM model for training to obtain a trained detection model.

Specifically, the data preprocessing module comprises:

A source data set acquisition module, configured to rank the repositories on Github in descending order of their fork count, used as the impact factor, and select the top 30% to 35% of repositories as the source data set;

A primary screening module, configured to filter the change records of each repository in the source data set by keyword search to obtain change records for different defect types;

A secondary screening module, configured to filter out, from the change records obtained by the primary screening module, those that do not match the verb/direct object pattern;

A tertiary screening module, configured to screen the change records obtained by the secondary screening module again using TextRank to eliminate mixed change records;

A deduplication module, configured to deduplicate the change records obtained by the tertiary screening module to obtain the change records that meet the requirements and their corresponding Bug-Fix file pairs.

Specifically, in the labeling module, a slice that contains a defective code line is labeled 0, and a slice that contains no defective code line is labeled 1.

Specifically, in the vector conversion module, the word2vec tool is used to convert each token sequence into a vector.

The invention also discloses a Github-based software defect detection method, comprising the following steps:

Step 1, process the target file to be detected according to steps 2.3, 2.6 and 2.7 above to obtain the vector of the target file;

Step 2, input the vector obtained in step 1 into the detection model to obtain the detection result.

The invention also discloses a Github-based software defect detection system, comprising:

A data processing module, configured to process the target file to be detected according to steps 2.3, 2.6 and 2.7 above to obtain the vector of the target file;

A detection module, configured to input the vector obtained by the data processing module into the detection model to obtain the detection result.

Compared with the prior art, the invention has the following beneficial effects:

(1) The invention provides a method for obtaining a data set directly from Github with a low false positive rate; the resulting data set is far larger than those used in current related work. This addresses the data imbalance, insufficient data diversity and poor model generalization that current defect detection based on source code learning faces because its data sets are too small.

(2) The invention uses slicing to locate defects at a finer granularity, whereas most current related work can only report an approximate location, such as a package or a file.

(3) When slicing, the invention combines the data flow and control flow features of the source code to better capture its execution order, so that the model can learn more code features and achieve higher detection accuracy.

Description of the drawings

Fig. 1 is a flow chart of model establishment and detection in the embodiment.

Fig. 2 shows the commit obtained in step 2.1 of the embodiment and its corresponding Bug-Fix file pair.

Fig. 3 shows slice extraction from the Bug file in step 2.3 of the embodiment.

Fig. 4 shows slice extraction from the Fixed file in step 2.3 of the embodiment.

Fig. 5 shows variable name replacement in a slice in step 2.5 of the embodiment.

Fig. 6 shows the tokenization of a slice in step 2.6 of the embodiment.

Fig. 7 is a schematic structural diagram of the bidirectional LSTM neural network in the embodiment.

Fig. 8 is a flow chart of the data preprocessing of step 1 in the embodiment.

Detailed description

A "slice" is a multi-line code fragment obtained by cutting code according to certain rules. The cut can follow data flow, control flow or other custom criteria. In the invention, slicing follows both data flow and control flow information, with the aim of extracting multiple lines of code that are semantically related and of highlighting defect information.

A "change record" is a change, correction or addition that the maintainers of a Github repository make to its code; each one is called a change record of that repository, i.e., a commit.

A "mixed change record" is a commit that bundles many reasons for modification, so that it is not apparent which part of the code was changed for which reason.

The "verb/direct object pattern" means that the description of a change record follows the verb + direct object rule. Take the description "fix a bug": the phrase contains a dependency relation that Stanford CoreNLP calls "dobj" (direct object), in which "fix" is the governor (the verb) and "bug" is the dependent (the direct object).

A "token" is a string produced by splitting source code; tokens represent the variables, keywords, special characters and so on in the source. For example, "if(str==null)" is split into the tokens "if", "(", "str", "==", "null", ")".

A "token sequence" is a series of tokens in a specific order. For example, the tokens of "if(str==null)" in source order form the sequence ["if", "(", "str", "==", "null", ")"].
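The token and token sequence definitions can be made concrete with a small sketch. The source does not say how the tokenizer is implemented; the regular expression below, and the name `tokenize`, are assumptions that merely reproduce the example given in the text.

```python
import re

# Assumed token pattern: identifiers/keywords, integer literals, a few
# two-character operators, then any other single non-space character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||\S")


def tokenize(source):
    """Split a line of Java-like source into an ordered token sequence."""
    return TOKEN_PATTERN.findall(source)


print(tokenize("if(str==null)"))
# → ['if', '(', 'str', '==', 'null', ')']
```

A production tokenizer would more likely reuse the lexer of a Java parser, which also handles string literals and comments; the regex is only enough to show what a token sequence looks like.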

Every change record that meets the requirements after data preprocessing contains one Bug-Fix file pair.

The invention does not depend on a specific programming language. For ease of explanation, the open source code hosting platform Github and the Java language are used as examples to describe the invention in detail. The invention is further described below with reference to the drawings and an embodiment.

The method for establishing a Github-based software defect detection model disclosed in this embodiment comprises the following steps:

Step 1, data acquisition and preprocessing

Github is currently the largest and most influential hosting platform for open source projects. Each repository on Github has a name, such as repository_1. Each time the maintainers change, correct or add code in the repository, the change is recorded as a change record of that repository, i.e., a commit. For every commit, the platform asks the committer to add a descriptive note that explains the intent of the change and supports iterative development and management of the software, for example "fix a bug". This note is called the commit message. Commits that meet the requirements can therefore be found by filtering commit messages. For every commit the platform also holds the corresponding file modification record, i.e., the Bug-Fix file pair: the file before the modification (Bug file) and the file after the modification (Fixed file).

Much existing research uses the modification history on Github. Zhou et al., for example, screen commits directly by keyword search or hand-defined rules, but related work points out that this approach has a high false positive rate and that many commits are of low quality. The low-quality problem mainly involves three kinds of commits that do not match the research intent yet end up in the search results: (1) "how" commits, which do not state what the change is for (the "why") and only describe how and where the code was changed (the "how"); (2) mixed commits, in which one commit bundles many reasons for modification, so that it is not apparent which part of the code was changed for which reason; (3) ID commits, which carry no description at all, only an event ID.

To solve this problem, we propose a general method for obtaining data from Github: a commit screening method based on V-DO (verb/direct object) filtering and TextRank keyword extraction. First, the repositories are ranked so that commits are taken from higher-quality repositories. Then the commits are searched by defect type in the traditional way; ID commits do not match the keyword search, so they never enter the data set at this step. The commits are then screened in two passes: (1) keep the commits whose commit message matches the verb/direct-object (V-DO) rule, which removes "how" commits [9]; (2) screen the commits again with TextRank, which removes mixed commits. Finally, the commits are deduplicated to remove duplicates caused by forks. The detailed process is shown in the data acquisition flow chart of Fig. 8.

The overall processing flow of step 1 is shown in Fig. 8 and comprises:

Step 1.1, rank the repositories on Github and select the top 30% to 35% of repositories as the source data set.

In this embodiment, the repositories are preferably ranked in descending order of their fork count, used as the impact factor. A fork is a copy of someone else's repository in your own account; unlike an ordinary copy, a fork contains all the commit history of the original repository. Most current work ranks repositories by star count first, but experiments showed that the commit messages in 100 repositories ranked by stars were of far lower quality than those in 100 repositories ranked by forks.
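Step 1.1 amounts to a sort followed by a cut. The sketch below assumes repository metadata has already been fetched into dictionaries with a `forks` field; that field name, the function name and the sample data are hypothetical, and the fraction 0.30 is the lower end of the range given in the text.

```python
def select_top_repositories(repos, fraction=0.30):
    """Rank repositories by fork count (descending) and keep the top fraction.

    repos: list of dicts with an assumed integer field "forks".
    fraction: share of repositories to keep (the text gives 30% to 35%).
    """
    ranked = sorted(repos, key=lambda repo: repo["forks"], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical metadata for ten repositories.
sample = [{"name": f"repository_{i}", "forks": i * 10} for i in range(10)]
for repo in select_top_repositories(sample):
    print(repo["name"], repo["forks"])
```

Swapping the sort key to a star count would reproduce the ranking that the text reports as yielding lower-quality commit messages.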

Step 1.2, filter the change records of each repository in the source data set by keyword search to obtain change records for different defect types;

This embodiment uses the API provided by Github for an initial screening by the traditional method of searching for "fix" plus a defect type keyword. Five defect types are selected: null pointer exceptions, file operation exceptions, illegal argument exceptions, SQL injection and XSS injection. The corresponding keywords are as follows:

Null pointer exception (NullPointerException): nullpointerexception, nullpointer, npe;

File addressing exception (FileNotFoundException): filenotfoundexception, filenotfound;

Illegal argument exception (IllegalArgumentException): illegalargumentException, illegalargument;

SQL injection: sql injection;

XSS injection: xss injection.
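The keyword screening can be sketched offline as a match against commit messages that have already been downloaded. The text performs this search through the Github API; the local matching below, the function name `classify_commit`, the whole-word matching and the case-insensitive comparison are all assumptions made for illustration.

```python
import re

# Keywords per defect type, taken from the list above (lower-cased).
DEFECT_KEYWORDS = {
    "NullPointerException": ["nullpointerexception", "nullpointer", "npe"],
    "FileNotFoundException": ["filenotfoundexception", "filenotfound"],
    "IllegalArgumentException": ["illegalargumentexception", "illegalargument"],
    "SQL injection": ["sql injection"],
    "XSS injection": ["xss injection"],
}


def classify_commit(message):
    """Return the defect type of a commit message, or None if it is rejected.

    A message qualifies when it mentions a word starting with "fix" and
    one of the defect keywords as a whole word or phrase.
    """
    text = message.lower()
    if not re.search(r"\bfix", text):
        return None
    for defect_type, keywords in DEFECT_KEYWORDS.items():
        for keyword in keywords:
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                return defect_type
    return None


print(classify_commit("Fix NPE in parser"))
print(classify_commit("Add new feature"))
```

Whole-word matching is used so that the short keyword "npe" does not fire inside unrelated words; whether the original search behaved this way is not stated in the text.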

Step 1.3, screen the change records of the five defect types obtained in step 1.2 again, filtering out those that do not match the verb/direct object pattern;

This step introduces the verb/direct object pattern. Filtering out messages that do not match it yields a set of commit messages with very similar formats. To find the pattern, this embodiment uses the natural language processing (NLP) tool Stanford CoreNLP to annotate sentences with grammatical dependencies, i.e., the set of dependency relations between the parts of a sentence. The phrase "fix a bug", for example, contains a dependency that Stanford CoreNLP calls "dobj", whose governor is "fix" and whose dependent is "bug". The V-DO filter looks for "dobj" dependencies, which represent the verb/direct object pattern. Each sentence is checked for whether it starts with a "dobj" dependency; if it does, the sentence is marked as a "dobj" sentence and its commit is considered to meet the requirements.
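The V-DO check can be sketched on top of an already computed dependency parse. The sketch does not call Stanford CoreNLP; it assumes the parser output has been converted into `(relation, governor_index, dependent_index)` triples with 1-based token indices and 0 for the root, and it reads "starts with a dobj dependency" as "the governor of a dobj relation is the first token of the sentence". Both the input format and that reading are assumptions.

```python
def is_vdo_sentence(dependencies):
    """Return True if the sentence opens with a verb that has a direct object.

    dependencies: iterable of (relation, governor_index, dependent_index)
    triples with 1-based token indices (assumed format).
    """
    for relation, governor_index, _dependent_index in dependencies:
        if relation == "dobj" and governor_index == 1:
            return True
    return False


# Hand-written parse of "fix a bug": fix(1) a(2) bug(3).
fix_a_bug = [("root", 0, 1), ("det", 3, 2), ("dobj", 1, 3)]
# Hand-written parse of "tests pass": tests(1) pass(2), no direct object.
tests_pass = [("nsubj", 2, 1), ("root", 0, 2)]

print(is_vdo_sentence(fix_a_bug))
print(is_vdo_sentence(tests_pass))
```

A commit message with several sentences would be split first and each sentence checked separately, as the text describes.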

Step 1.4, screen the change records obtained in step 1.3 again using TextRank to eliminate mixed change records. This embodiment implements the TextRank algorithm in Python: the commit message is input to TextRank, and if the top five extracted keywords include one of the keywords from step 1.2, the commit is considered to meet the requirements.
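The text states only that TextRank was implemented in Python and that a commit is kept when its top five keywords include a step 1.2 keyword. The following is a minimal sketch of such a check, not the authors' implementation: it builds an undirected word co-occurrence graph and runs a fixed number of PageRank-style iterations. The function names, the window size, the damping factor, the absence of stop-word filtering, and the rule that a multi-word keyword matches when all of its words are ranked are assumptions.

```python
import re
from collections import defaultdict


def textrank_keywords(text, top_k=5, window=2, damping=0.85, iterations=50):
    """Rank the words of a text with a simple TextRank and return the top_k."""
    words = re.findall(r"[a-z]+", text.lower())
    neighbors = defaultdict(set)
    # Link each word to the words that follow it within the window.
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        # Zero or one distinct word: nothing to rank.
        return list(dict.fromkeys(words))[:top_k]
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def commit_is_focused(message, defect_keywords, top_k=5):
    """Keep a commit if a defect keyword appears among its top keywords."""
    top = set(textrank_keywords(message, top_k=top_k))
    return any(all(part in top for part in keyword.split())
               for keyword in defect_keywords)


print(commit_is_focused("fix npe in parser", ["npe"]))
```

The intuition is that in a mixed commit the defect keyword competes with many other topics and drops out of the top five, while in a focused commit it stays near the top.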

Step 1.5: because Github supports forking, the same commit can appear in more than one repository. To keep duplicate commits from affecting the accuracy of later model training, this step deduplicates the change records obtained in step 1.4, deleting the duplicates to obtain the change records that meet the requirements and their corresponding Bug-Fix file pairs.
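The text does not say which key is used to detect duplicates. Since a fork keeps the commit hashes of the original repository, one plausible sketch deduplicates on the commit hash; the `sha` field name and the sample records are hypothetical.

```python
def deduplicate_commits(commits):
    """Keep the first occurrence of each commit, identified by its hash.

    commits: list of dicts with an assumed "sha" field.
    """
    seen = set()
    unique = []
    for commit in commits:
        if commit["sha"] not in seen:
            seen.add(commit["sha"])
            unique.append(commit)
    return unique


# Hypothetical records: the same commit collected from a repository and its fork.
collected = [
    {"sha": "a1b2c3", "repo": "repository_1", "message": "fix npe"},
    {"sha": "a1b2c3", "repo": "fork_of_repository_1", "message": "fix npe"},
    {"sha": "d4e5f6", "repo": "repository_2", "message": "fix sql injection"},
]
print([c["repo"] for c in deduplicate_commits(collected)])
```

Commits that were cherry-picked or rebased get new hashes, so a stricter variant could hash the diff content instead; that goes beyond what the text describes.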

Step 1 yields the commits that meet the requirements and the Bug-Fix file pairs they involve. A file, however, contains not only the defective part but also parts unrelated to the defect. Step 2 therefore extracts only the code related to the defect and uses it to reinforce the defect features so that the model trains better. The extracted defect-related code is called a slice. Slicing locates defects at a finer granularity, whereas most current related work can only report an approximate location, such as a package or a file.

Step 2, process the change records obtained in step 1 to generate slice vectors and labels.

Step 2.1, parse and compare the Bug-Fix file pair of each change record obtained in step 1 to obtain the added and deleted code lines for that change record. As shown in Fig. 2, a commit that fixes a null pointer exception corresponds to one file containing the defect and one repaired file. The commit fixes the null pointer exception by deleting line 5 and adding lines 5, 6, 7 and 8.
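The comparison in step 2.1 can be sketched with Python's standard `difflib` module. The text does not name the diff tool that was used, so `difflib.SequenceMatcher` is a stand-in, and the file contents below are invented placeholders arranged to give the same shape of result as Fig. 2 (line 5 deleted; lines 5 to 8 added).

```python
import difflib


def changed_lines(bug_lines, fixed_lines):
    """Return (deleted, added) 1-based line numbers for a Bug-Fix file pair.

    deleted refers to line numbers in the Bug file,
    added refers to line numbers in the Fixed file.
    """
    matcher = difflib.SequenceMatcher(a=bug_lines, b=fixed_lines, autojunk=False)
    deleted, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.extend(range(i1 + 1, i2 + 1))
        if tag in ("replace", "insert"):
            added.extend(range(j1 + 1, j2 + 1))
    return deleted, added


# Placeholder file contents; only the line structure matters here.
bug_file = ["line_a", "line_b", "line_c", "line_d", "old_line", "line_e"]
fixed_file = ["line_a", "line_b", "line_c", "line_d",
              "new_1", "new_2", "new_3", "new_4", "line_e"]
print(changed_lines(bug_file, fixed_file))
# → ([5], [5, 6, 7, 8])
```

When the commits come from the Github API, the same information is also available from the patch text attached to each changed file, which avoids downloading both file versions.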

Step 2.2: Based on the added and deleted code line information obtained in step 2.1 and the data flow of the Bug-Fix file pair, determine the defective code lines that cause the defect for each change record.

In this embodiment, the lines of code that cause the defect are determined from the change information established in step 2.1 (delete line 5; add lines 5, 6, 7 and 8) and the defect type (null pointer exception). For a null pointer, the data flow information identifies the variable msg2 on line 6 of the Fixed file as the source of the risk, and the data flow information further gives the data source of this variable, namely ex.getMessage(). Both elements are therefore treated as Buggy Items. The Bug file is then checked for code lines that contain a method call on a Buggy Item; any such line is considered a defective code line. By this method, line 5 of the Bug file is determined to be the code line containing the defect.
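The final matching step of step 2.2 can be sketched as a search of the Bug file for lines that mention a Buggy Item. This is a deliberate simplification: the Buggy Items are assumed to have been produced already by the data flow analysis, a plain substring test stands in for that analysis, and the Java lines are hypothetical.

```python
def find_defective_lines(bug_lines, buggy_items):
    """Return 1-based numbers of Bug file lines that mention any Buggy Item."""
    return [
        number
        for number, line in enumerate(bug_lines, start=1)
        if any(item in line for item in buggy_items)
    ]


# Hypothetical Bug file; only line 5 calls ex.getMessage().
bug_file = [
    'void report(Exception ex) {',
    '    String prefix = "error: ";',
    '    log(prefix);',
    '    if (ex != null) {',
    '        print(ex.getMessage().trim());',
    '    }',
    '}',
]
# Buggy Items taken from the example in the description.
buggy_items = ["msg2", "ex.getMessage()"]
defective = find_defective_lines(bug_file, buggy_items)
```

A real implementation would resolve identifiers through the data flow graph rather than by text, since a substring test can match unrelated names.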

Step 2.3: Generate the slices of the Bug-Fix file pair from the data flow and control flow information of the Bug-Fix file pairs of the change records that meet the requirements obtained in step 1. Each Bug-Fix file pair corresponds to multiple slices, namely the respective slices of the Bug file and the Fixed file.

As shown in Figure 3, in this embodiment slices are extracted from the Bug file. Because line 4 is an if conditional statement, execution splits into two branches at that point, so two slices are produced. The first, slice1, executes line 4, continues to line 5, and then runs to the end of the method body. The second, slice2, fails the condition on line 4, does not execute the statement on line 5, and runs to the end of the method body. The Fixed file is handled in the same way: as shown in Figure 4, three slices are generated, namely slice4, slice5 and slice6.
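The branch-by-branch splitting described above can be sketched as path enumeration over a structured method body. This is not a full program slicer: it ignores data dependence, handles only `if` statements without `else`, and uses a made-up nested-list encoding and hypothetical line numbers, because Figures 3 and 4 are not reproduced here.

```python
def enumerate_paths(block):
    """Return every execution path through `block` as a list of line numbers.

    A block is a list whose items are either an int (a plain statement) or a
    tuple ("if", condition_line, body_block).
    """
    paths = [[]]
    for stmt in block:
        if isinstance(stmt, int):
            paths = [path + [stmt] for path in paths]
        else:
            _, cond_line, body = stmt
            taken = [[cond_line] + rest for rest in enumerate_paths(body)]
            skipped = [[cond_line]]  # condition evaluated, body not executed
            paths = [path + branch for path in paths for branch in taken + skipped]
    return paths


# Hypothetical Bug method: one `if` on line 4 guarding line 5.
bug_method = [1, 2, 3, ("if", 4, [5]), 6]
# Hypothetical Fixed method: an `if` on line 4 containing a nested `if` on line 6.
fixed_method = [1, 2, 3, ("if", 4, [5, ("if", 6, [7])]), 8]

bug_paths = enumerate_paths(bug_method)
fixed_paths = enumerate_paths(fixed_method)
```

With these inputs the Bug method yields two paths and the Fixed method yields three, matching the slice counts given in the embodiment.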

Step 2.4: According to the defective code lines determined in step 2.2, add labels to the slices of the Bug-Fix file pair obtained in step 2.3.

In this embodiment, a slice that contains a defective code line is labeled 0 and a slice that does not is labeled 1. Note that, to reinforce the defect information, all code lines executed from the beginning of the method body up to the defective code line are marked as a new slice, which is recorded as a negative sample and labeled 0. The main reason is that the slices extracted in step 2.3 are imbalanced: as shown in Figure 4, the positive samples are slice1, slice3, slice4, slice5 and slice6, and the negative sample is slice2. As a result, the model cannot learn the features of negative samples well and instead focuses on the features of positive samples. Therefore, to improve how well defect information is learned, for each defective code line the code lines executed from the beginning of the method body up to that defective line are collected into a set and an additional slice is generated. For example, Slice3 is the set of all code lines executed from line 1 through the defective code line, line 5; it is labeled as a negative sample, which increases the number of negative samples. Duplicate samples among the positive and negative samples are then removed.
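The labeling rule and the extra negative slice can be sketched as follows. The sketch applies the stated rule (a slice containing a defective line gets 0, otherwise 1) and assumes, beyond what the text spells out, that every slice of the Fixed file is labeled 1 and that the method body starts at line 1. Slices are represented as tuples of hypothetical line numbers.

```python
def build_samples(bug_slices, fixed_slices, defective_lines, first_line=1):
    """Return de-duplicated (file, lines, label) samples; label 0 means defective."""
    samples = []
    for lines in bug_slices:
        label = 0 if any(n in defective_lines for n in lines) else 1
        samples.append(("bug", tuple(lines), label))
    for lines in fixed_slices:
        samples.append(("fixed", tuple(lines), 1))  # assumption: fixed code is clean
    # Extra negative sample: everything from the method start up to the defect.
    for defect in sorted(defective_lines):
        samples.append(("bug", tuple(range(first_line, defect + 1)), 0))
    # dict.fromkeys removes exact duplicates while keeping the original order.
    return list(dict.fromkeys(samples))


bug_slices = [[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 6]]
fixed_slices = [[1, 2, 3, 4, 8]]
samples = build_samples(bug_slices, fixed_slices, defective_lines={5})
labels = [label for _, _, label in samples]
```

Here the prefix slice covering lines 1 to 5 is appended as an additional sample with label 0.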

Step 2.5: To extract the features of a slice, the variable names in each slice of the Bug-Fix file pair obtained in step 2.3 are replaced one by one with a unified token. This embodiment uses "var" followed by the order in which the variable appears in the file, such as var1 and var2, as shown in Figure 5. This prevents differing variable names from affecting feature extraction.
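The renaming in step 2.5 can be sketched with a regular expression. The sketch assumes the set of variable names is already known (for example from a parser) and numbers the variables by first appearance within the given text rather than within the whole file; both are simplifications, and the Java snippet is hypothetical.

```python
import re


def rename_variables(code, variable_names):
    """Replace each known variable with var1, var2, ... in order of first use."""
    if not variable_names:
        return code, {}
    # Longest names first so a short name never shadows a longer alternative.
    ordered = sorted(variable_names, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(v) for v in ordered) + r")\b")
    mapping = {}

    def replace(match):
        name = match.group(1)
        if name not in mapping:
            mapping[name] = f"var{len(mapping) + 1}"
        return mapping[name]

    return pattern.sub(replace, code), mapping


code = "String msg2 = ex.getMessage(); print(msg2.trim());"
renamed, mapping = rename_variables(code, {"msg2", "ex"})
```

Method names such as `getMessage` are left untouched because they are not in the supplied variable set.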

Step 2.6: Perform word segmentation, one by one, on the slices whose variable names were replaced in step 2.5, so that each slice yields a token sequence. Specifically, each slice is split into individual words separated by spaces, as shown in Figure 6.
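A minimal sketch of the word segmentation in step 2.6, assuming a simple lexical rule: identifiers and numbers stay whole and every other non-space character becomes its own token. The exact rule behind Figure 6 is not given in the text, so multi-character operators such as `!=` are split here, which may differ from the original.

```python
import re

# Identifier, integer literal, or any single non-whitespace character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split a slice into tokens."""
    return TOKEN_PATTERN.findall(code)


tokens = tokenize("var1 = var2.getMessage();")
token_sequence = " ".join(tokens)
```

Joining the tokens with single spaces gives the space-separated form described in the step.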

Step 2.7: Convert each token sequence obtained in step 2.6 into a vector. This embodiment uses the word2vec tool for the conversion. The tool is based on the idea of distributed representation: it maps a token to an integer and then converts it into a fixed-length vector. A corresponding 50-dimensional vector is generated for each token, and these vectors are concatenated to form the vector of each slice.
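The concatenation at the end of step 2.7 can be sketched with a plain lookup table standing in for a trained word2vec model. The table values and the 3-dimensional size are toy assumptions (the embodiment uses 50 dimensions), and mapping an unknown token to a zero vector is a choice made for this sketch, not something the description specifies.

```python
def slice_vector(tokens, embeddings, dim):
    """Concatenate the per-token vectors of one slice into a single flat vector."""
    vector = []
    for token in tokens:
        # Assumption: tokens missing from the table become a zero vector.
        vector.extend(embeddings.get(token, [0.0] * dim))
    return vector


# Toy embedding table; a real one would come from a trained word2vec model.
toy_embeddings = {
    "var1": [0.1, 0.2, 0.3],
    "=": [0.0, 0.5, 0.0],
    "var2": [0.4, 0.1, 0.9],
}
vec = slice_vector(["var1", "=", "var2", "?"], toy_embeddings, dim=3)
```

The resulting length is the number of tokens times the embedding dimension, so slices of different lengths produce vectors of different lengths until they are padded.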

Step 3: Model training.

This embodiment uses a bidirectional LSTM neural network, which can accurately capture the data flow characteristics of defective code and improve the accuracy of defect recognition.

The structure of the bidirectional LSTM neural network is shown in Figure 7 and includes:

A bidirectional LSTM layer, comprising two LSTM neural networks. The input of one network is the vector sequence in front-to-back order, which uses preceding information to predict following information and thereby captures context. The input of the other network is the vector sequence in back-to-front order, which uses following information to predict preceding information and captures context from the other direction. Finally, the hidden-unit outputs of the two networks are concatenated to form the output of the bidirectional LSTM layer;

A fully connected layer, which maps the features learned by the two LSTM neural networks into the label space of the samples;

An activation layer, which maps the multidimensional vector output by the hidden layer onto the label space of the samples, so that the model outputs a predicted label value of 0 or 1 for a sample.
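The three layers above can be written compactly as follows. This is a sketch under stated assumptions: the description does not say which hidden states feed the fully connected layer, which activation function is used, or how the output is thresholded, so the final state of each direction, a sigmoid, and a threshold of 0.5 are assumed here. The symbols $x_t$ (the vector of the $t$-th token), $T$ (sequence length), $W$ and $b$ are introduced only for this illustration.

```latex
\begin{aligned}
\overrightarrow{h}_t &= \mathrm{LSTM}_{\mathrm{fwd}}\bigl(x_t,\ \overrightarrow{h}_{t-1}\bigr), & t &= 1, \dots, T \\
\overleftarrow{h}_t &= \mathrm{LSTM}_{\mathrm{bwd}}\bigl(x_t,\ \overleftarrow{h}_{t+1}\bigr), & t &= T, \dots, 1 \\
h &= \bigl[\,\overrightarrow{h}_T \,;\, \overleftarrow{h}_1\,\bigr] && \text{(concatenated output of the bidirectional layer)} \\
z &= W h + b && \text{(fully connected layer)} \\
\hat{y} &= \mathbb{1}\bigl[\sigma(z) \ge 0.5\bigr] && \text{(activation layer, predicted label 0 or 1)}
\end{aligned}
```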

Another embodiment of the present invention discloses a Github-based software defect detection model building system, comprising the following modules:

A data preprocessing module, comprising:

A source data set acquisition module, configured to sort the repositories in Github and select the top 30% to 35% of repositories as the source data set. The sorting principle in this embodiment is the same as in the model building method embodiment above.

A primary screening module, configured to screen the change records of each repository in the source data set by keyword search to obtain change records with different defect types. This embodiment likewise selects five defect types: null pointer exception, file operator exception, illegal argument exception, SQL injection and XSS injection.

A deduplication module, configured to deduplicate the change records obtained by the third screening module to obtain the change records that meet the requirements and their corresponding Bug-Fix file pairs;

A data extraction module, comprising:

A slice generation module, configured to generate the slices of the Bug-Fix file pair from the data flow and control flow information of the Bug-Fix file pairs of the change records that meet the requirements obtained by the data preprocessing module. Each Bug-Fix file pair corresponds to multiple slices, namely the respective slices of the Bug file and the Fixed file.

A label adding module, configured to add labels to the slices of the Bug-Fix file pair obtained by the slice generation module according to the defective code lines determined by the defective code line acquisition module. In this embodiment, a slice that contains a defective code line is labeled 0 and a slice that does not is labeled 1. Note that, to reinforce the defect information, all code lines executed from the beginning of the method body up to the defective code line are marked as a new slice, which is recorded as a negative sample and labeled 0.

A variable name replacement module, configured to replace, one by one, the variable names in the slices of the Bug-Fix file pair obtained by the slice generation module with a unified token, such as var1 and var2, as shown in Figure 5.

A vector conversion module, configured to convert each token sequence obtained by the word segmentation processing module into a vector. This embodiment uses the word2vec tool for the conversion.

A model training module, configured to input the slice vectors and labels obtained by the data extraction module into the bidirectional LSTM model for training and learning to obtain a trained detection model. This embodiment uses a bidirectional LSTM neural network whose structure is shown in Figure 7.

The detection model building method and system above yield a detection model for examining a target file. Based on this detection model, another embodiment of the present invention further discloses a Github-based software defect detection method, comprising the following steps:

Step 1: Process the target file to be examined according to steps 2.3, 2.6 and 2.7 of the model building method embodiment above to obtain the vector of the target file;

Step 2: Input the vector from step 1 into the detection model obtained in the above embodiment to obtain the detection result.

For every slice, the detection model generates a label, where 0 means defective and 1 means not defective, which achieves the prediction goal. From the prediction results and the slices that were predicted, the defective code lines of the target file can also be located, achieving fine-grained defect localization.
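One possible way to go from slice-level predictions to candidate lines is sketched below. The description does not specify the localization rule, so the heuristic here is purely an assumption made for illustration: a line is reported when it occurs in at least one slice predicted 0 (defective) and in no slice predicted 1. Slices are given as hypothetical lists of line numbers.

```python
def suspicious_lines(slices, predicted_labels):
    """Lines that appear only in slices predicted defective (label 0)."""
    flagged, cleared = set(), set()
    for lines, label in zip(slices, predicted_labels):
        if label == 0:
            flagged.update(lines)
        else:
            cleared.update(lines)
    return sorted(flagged - cleared)


# Two slices of one method: the first executes line 5, the second skips it.
slices = [[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 6]]
predicted = [0, 1]
candidates = suspicious_lines(slices, predicted)
```

In this example the only line unique to the slice predicted defective is line 5.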

Another embodiment of the present invention further discloses a Github-based software defect detection system, comprising:

A data processing module, configured to process the target file to be examined according to steps 2.3, 2.6 and 2.7 to obtain the vector of the target file;

A detection module, configured to input the vector obtained by the data processing module into the detection model described in the above embodiment to obtain the detection result.

Simulation experiment

Step 1: The inventors crawled from the Github platform as many as possible of the top-ranked repositories created in each year from 2009 to 2018, obtaining a list of 378,631 repositories in total, accounting for 6.4% of the total. Commits meeting the requirements were crawled from Github according to the steps above, and then 10,000 null pointer exception records, 10,000 file addressing exception records, 10,000 illegal argument exception records, 1,000 SQL injection records and 1,000 XSS injection records were labeled manually. The commits were then screened with the V-DO filtering technique and TextRank keyword extraction; the accuracy exceeded 98% in every case, so the method was applied to the whole data set. Finally, the data set was randomly shuffled and spot-checked manually, and no mislabeled records were found in any of the five defect types. After screening, 153,140 commits remained in total. The results are shown in Table 1.

Table 1. Number of commits crawled for each defect type and number of commits after screening

Step 2: Extracting slices for each defect type produced 155,070 slices in total: 54,000 for null pointer exceptions, 52,000 for file addressing exceptions, 37,000 for illegal argument exceptions, 5,070 for SQL injection and 5,070 for XSS injection. Details are shown in Table 1. The data obtained by the present invention is far larger than the data sets used in current related work.

Step 3: The hardware platform for training the bidirectional LSTM neural network was an NVIDIA GeForce GTX 1080 GPU and an Intel Xeon E5-1620 CPU. When tuning the parameters of the bidirectional LSTM network, the model defaults were left unchanged and the parameters were set to values widely used in the deep learning community. At input, the vector dimension of each word, word_dim, is 200; the maximum length of a sample, max_len, is limited to 200 words, and shorter samples are zero-padded. The number of samples per input batch, batch_size, is 8, and the learning rate, learning_rate, is 0.01. The BLSTM in the BLSTM model has 300 nodes.
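The length limit and zero padding mentioned in step 3 can be sketched as follows, using the stated values max_len = 200 and word_dim = 200. Padding at the end of the sequence and dropping tokens beyond max_len are assumptions; the text only says that short samples are filled with zeros.

```python
MAX_LEN = 200   # max_len from the experiment settings
WORD_DIM = 200  # word_dim from the experiment settings


def pad_or_truncate(token_vectors, max_len=MAX_LEN, word_dim=WORD_DIM):
    """Return exactly max_len vectors: clip long samples, zero-pad short ones."""
    clipped = [list(vector) for vector in token_vectors[:max_len]]
    padding = [[0.0] * word_dim for _ in range(max_len - len(clipped))]
    return clipped + padding


# A three-token sample becomes a 200 x 200 matrix with 197 zero rows.
short_sample = [[1.0] * WORD_DIM for _ in range(3)]
padded = pad_or_truncate(short_sample)
```

Every sample then has the same shape, which is what allows samples to be grouped into batches of batch_size = 8.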

Experimental results:

The data set was split 8:1:1 for training, validation and testing. The results after model training are shown in the table below. Because code blocks containing defect information are labeled 0 and those without are labeled 1, samples containing defect information are treated as negative samples and samples without defect information as positive samples. Each defect type was tested separately; the results are shown in Table 2.
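An 8:1:1 split can be sketched as follows. Shuffling before the split, the fixed seed, and assigning any remainder to the test portion are assumptions made for this illustration; the description only gives the ratio.

```python
import random


def split_8_1_1(samples, seed=0):
    """Shuffle and split samples into train, validation and test parts (8:1:1)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test part
    return train, val, test


train, val, test = split_8_1_1(range(100))
```

With 100 samples this produces parts of 80, 10 and 10 items that do not overlap.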

Table 2 demonstrates the feasibility of the present invention, which achieves high accuracy on all five defect types. Table 1 further shows that the amount of data used by the present invention is large: existing related work operates on data sets such as PROMISE and DEFECT4J, the largest of which contain fewer than 8 repositories. In addition, the present invention cuts the data with a slicing method and then trains and detects on slices, ultimately achieving fine-grained defect localization.

Table 2. F1-measure values of the detection results

It should be noted that the present invention is not limited to the above embodiments. On the basis of the technical solutions disclosed herein, those skilled in the art can, without creative effort, make substitutions and modifications to some of the technical features according to the disclosed technical content, and all such substitutions and modifications fall within the protection scope of the present invention.

Claims

1. A software defect detection model building method based on Github is characterized by comprising the following steps:

step 1, data preprocessing:

sorting the repositories in Github, selecting the top-ranked repositories as a source data set, and screening and deduplicating the change records of each repository in the source data set to obtain the change records meeting the requirements and the corresponding Bug-Fix file pairs;

the Bug-Fix file pair refers to the file before modification (Bug file) and the file after modification (Fixed file);

step 2, processing the change records meeting the requirements obtained in the step 1 and the corresponding Bug-Fix file pairs thereof to generate vectors and labels of slices:

step 2.1, analyzing and comparing the Bug-Fix file pairs of the change records meeting the requirements obtained in the step 1 to obtain added and deleted code line information corresponding to each change record;

step 2.2, determining the defect code line causing the defect of each change record according to the added and deleted code line information obtained in the step 2.1 and the data flow of the Bug-Fix file pair;

step 2.3, generating slices of the Bug-Fix file pairs according to the data flow and control flow information of the Bug-Fix file pairs of the change records meeting the requirements obtained in the step 1;

step 2.4, according to the defect code line determined in the step 2.2, adding a label to the slice of the Bug-Fix file pair obtained in the step 2.3;

step 2.5, replacing the variable names in the slices of the Bug-Fix file pair obtained in the step 2.3 one by one with a unified token;

step 2.6, the slices with the variable names replaced in the step 2.5 are subjected to word segmentation one by one, and each slice obtains a token sequence;

step 2.7, converting each token sequence obtained in the step 2.6 into a vector;

step 3, model training:

inputting the vectors and labels of the slices obtained in the step 2 into a bidirectional LSTM model for training and learning to obtain a trained detection model.

2. The method for building a Github-based software defect detection model according to claim 1, wherein the step 1 specifically comprises:

step 1.1, sorting the repositories in Github in descending order of the repository influence factor fork, and selecting the repositories ranked in the top 30%-35% as a source data set;

step 1.2, screening the change records of each repository in the source data set by adopting a keyword search method to obtain change records with different defect types;

step 1.3, filtering change records which do not accord with verb/direct object modes in the change records with different defect types obtained in the step 1.2;

step 1.4, re-screening the change records obtained in the step 1.3 by adopting a TextRank technology, and removing mixed change records;

the TextRank technology is a TextRank algorithm, and the specific method for screening the change record obtained in the step 1.3 by adopting the TextRank technology is as follows: inputting the change record obtained in the step 1.3 into a TextRank algorithm, if the first five extracted keywords comprise the keywords related in the step 1.2, considering that the change record is in accordance with the requirement, entering the step 1.5, and if not, deleting the change record;

and step 1.5, deduplicating the change records obtained in the step 1.4 to obtain the change records meeting the requirements and the corresponding Bug-Fix file pairs.

3. The method according to claim 1, wherein in the step 2.4, during the labeling process, the slice including the defect code line is marked as 0, and the slice not including the defect code line is marked as 1.

4. The method of claim 1, wherein in step 2.7, each token sequence is converted into a vector using a word2vec tool.

5. A software defect detection model building system based on Github is characterized by comprising the following modules:

the data preprocessing module is used for sorting the repositories in Github, selecting the top-ranked repositories as a source data set, and screening and deduplicating the change records of each repository in the source data set to obtain the change records meeting the requirements and the corresponding Bug-Fix file pairs; the Bug-Fix file pair refers to the file before modification (Bug file) and the file after modification (Fixed file);

a data extraction module comprising:

the added and deleted code line information acquisition module is used for analyzing and comparing the Bug-Fix file pairs of the change records meeting the requirements, which are obtained by the data preprocessing module, so as to obtain the added and deleted code line information corresponding to each change record;

the defect code line acquisition module is used for determining the defect code line causing the defect of each change record according to the added and deleted code line information acquired by the added and deleted code line information acquisition module and the data flow of the Bug-Fix file pair;

the slice generating module is used for generating slices of the Bug-Fix file pairs according to the data flow and control flow information of the Bug-Fix file pairs of the change records meeting the requirements obtained by the data preprocessing module;

the tag adding module is used for adding tags to the slices of the Bug-Fix file pair obtained by the slice generating module according to the defect code line determined by the defect code line obtaining module;

the variable name replacing module is used for replacing the variable names of the slices of the Bug-Fix file pair obtained by the slice generating module one by one into a uniform token;

the word segmentation processing module is used for carrying out word segmentation processing on the slices with the variable names replaced by the variable name replacing module one by one, and each slice obtains a token sequence;

the vector conversion module is used for converting each token sequence obtained by the word segmentation processing module into a vector;

and the model training module is used for inputting the vector and the label of the slice obtained by the data extraction module into the bidirectional LSTM model for training and learning to obtain a trained detection model.

6. The Github-based software defect detection model building system of claim 5, wherein the data preprocessing module comprises:

the source data set acquisition module is used for sorting the repositories in Github in descending order of the repository influence factor fork, and selecting the repositories ranked in the top 30%-35% as the source data set;

the primary screening module is used for screening the change records of each repository in the source data set by adopting a keyword search method to obtain the change records with different defect types;

the secondary screening module is used for filtering change records which do not accord with verb/direct object modes in the change records with different defect types obtained by the primary screening module;

the third screening module is used for re-screening the change records obtained by the secondary screening module by adopting a TextRank technology and rejecting mixed change records;

the TextRank technology refers to a TextRank algorithm, and the concrete method for screening the change records obtained in the step 1.3 by adopting the TextRank technology is as follows: inputting the change records obtained by the secondary screening module into a TextRank algorithm, if the first five extracted keywords contain the keywords related to the primary screening module, keeping the change records, entering a duplicate removal module, and otherwise, deleting the change records;

and the duplication removing module is used for removing duplication from the change records obtained by the third screening module to obtain the change records meeting the requirements and the corresponding Bug-Fix file pairs.

7. The Github-based software defect detection model building system of claim 5, wherein in the tag adding module, slices containing defect code lines are marked as 0 and slices not containing defect code lines are marked as 1 during tag addition.

8. The Github-based software defect detection model building system of claim 5, wherein the vector conversion module converts each token sequence into a vector using a word2vec tool.

9. A software defect detection method based on Github is characterized by comprising the following steps:

step 1, processing a target file to be detected according to steps 2.3 and 2.6 and step 2.7 of any one of claims 1 to 4 to obtain a vector of the target file;

step 2, inputting the vector obtained in the step 1 into the detection model in the claim 1 to obtain a detection result.

10. A software defect detection system based on Github, comprising:

a data processing module, configured to process the target file to be detected according to steps 2.3 and 2.6 and step 2.7 of any one of claims 1 to 4, to obtain a vector of the target file;

and the detection module is used for inputting the vector obtained by the data processing module into the detection model of claim 1 to obtain a detection result.