[go: up one dir, main page]

CN108572917A - Construction Method of Code Prediction Model Based on Method Constraint Relationship - Google Patents

Construction Method of Code Prediction Model Based on Method Constraint Relationship Download PDF

Info

Publication number
CN108572917A
CN108572917A CN201810327625.7A CN201810327625A CN108572917A CN 108572917 A CN108572917 A CN 108572917A CN 201810327625 A CN201810327625 A CN 201810327625A CN 108572917 A CN108572917 A CN 108572917A
Authority
CN
China
Prior art keywords
code
method call
prediction model
class
call statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810327625.7A
Other languages
Chinese (zh)
Other versions
CN108572917B (en
Inventor
刘琰
方文渊
魏强
刘楝
张文悦
左青松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PLA Information Engineering University
Original Assignee
PLA Information Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PLA Information Engineering University filed Critical PLA Information Engineering University
Priority to CN201810327625.7A priority Critical patent/CN108572917B/en
Publication of CN108572917A publication Critical patent/CN108572917A/en
Application granted granted Critical
Publication of CN108572917B publication Critical patent/CN108572917B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Prevention of errors by analysis, debugging or testing of software
    • G06F11/3604Analysis of software for verifying properties of programs
    • G06F11/3608Analysis of software for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

本发明属于软件技术领域,尤其涉及基于方法约束关系的代码预测模型构建方法。基于方法约束关系的代码预测模型构建方法,包括:将面向对象代码抽象为对象涉及的方法调用语句序列集合,所述方法调用语句序列集合包括多个方法调用语句序列;所述方法调用语句序列包括一个对象涉及的所有方法调用语句;根据方法之间的约束关系,将所述方法调用语句序列的方法调用语句表示为四元特征组,根据所述四元特征组得出所述面向对象代码的代码特征词序列集合;利用N‑gram模型将所述代码特征词序列集合滑动切分为多个3‑gram序列,提取调用所述3‑gram序列,构建代码预测模型。本发明构建的方法预测模型具有较高的代码预测准确率。

The invention belongs to the technical field of software, in particular to a method for constructing a code prediction model based on a method constraint relationship. A method for constructing a code prediction model based on a method constraint relationship, comprising: abstracting the object-oriented code into a set of method call statement sequences involved in the object, the set of method call statement sequences comprising a plurality of method call statement sequences; the method call statement sequence comprising All method call statements involved in an object; according to the constraint relationship between methods, the method call statement of the method call statement sequence is represented as a quaternary feature group, and the object-oriented code is obtained according to the quaternary feature group A code feature word sequence set; using the N-gram model to slide the code feature word sequence set into a plurality of 3-gram sequences, extract and call the 3-gram sequences, and build a code prediction model. The method prediction model constructed by the invention has higher code prediction accuracy.

Description

基于方法约束关系的代码预测模型的构建方法Construction Method of Code Prediction Model Based on Method Constraint Relationship

技术领域technical field

本发明属于软件技术领域,尤其涉及基于方法约束关系的代码预测模型的构建方法。The invention belongs to the technical field of software, in particular to a method for constructing a code prediction model based on a method constraint relationship.

背景技术Background technique

在软件开发过程中,开发人员需要调用大量的API,以实现软件的各种功能。但遗憾的是,开发人员通常无法掌握所有API的使用方法。当需要使用一个陌生的类时,即使是有经验的开发人员也需要花费大量的时间来学习该类以及其包含的数十种API的使用方法。为提高软件开发的效率,构建代码预测模型并实现代码推荐已成为目前代码分析领域的研究热点。During the software development process, developers need to call a large number of APIs to realize various functions of the software. Unfortunately, developers often don't know how to use all APIs. When an unfamiliar class needs to be used, even experienced developers need to spend a lot of time learning how to use the class and the dozens of APIs it contains. In order to improve the efficiency of software development, building a code prediction model and implementing code recommendation has become a research hotspot in the field of code analysis.

之前的研究表明,编程语言具有良好的重复性。Gabel等人(Gabel M,Su Z.Astudy of the uniqueness of source code[C].Eighteenth ACM SigsoftInternational Symposium on Foundations of Software Engineering.ACM,2010:147-156)观察到代码重复,并以6-40个代码元素的粒度报告语法冗余。如循环语句for(inti=0;i<n;i++)在许多源代码中频繁出现。Hindle等(Hindle A,Barr E T,Su Z,et al.On thenaturalness of software[C].International Conference on SoftwareEngineering.ACM,2012:837-847)的研究表明,这种代码规律可以通过自然语言处理技术(例如,n-gram语言模型)进行捕获,并依此建立模型,实现代码预测与推荐。Raychev等(Raychev V,Vechev M,Yahav E.Code completion with statistical language models[C].ACMSigplan Symposium on Programming Language Design&Implementation.ACM,2014:419-428)认为代码预测的核心是方法的预测,并提取源代码中的方法调用构建统计语言模型,取得了良好的预测结果。相比于自然语言,编程语言在语法结构和数据类型上有着更为严格的要求,不同的对象、方法甚至参数之间都可能存在着一定的约束关系。这些约束关系体现了语言本身的规则要求和开发人员潜在的思想逻辑。充分利用这些约束关系可以使统计语言模型突破单纯依靠自然语言中句子出现概率的局限性,使代码预测更符合开发人员的编程思维。遗憾的是,在现阶段的研究中,基于自然语言技术构建的统计模型往往缺少对这些约束关系的研究与利用,使得预测的准确率仍有提升的空间。Previous research has shown that programming languages have good reproducibility. Gabel et al. (Gabel M, Su Z. Astudy of the uniqueness of source code [C]. Eighteenth ACM Sigsoft International Symposium on Foundations of Software Engineering. ACM, 2010: 147-156) observed code duplication, and with 6-40 Syntax redundancy is reported at the granularity of code elements. For example, the loop statement for (inti=0; i<n; i++) frequently appears in many source codes. Research by Hindle et al. (Hindle A, Barr E T, Su Z, et al. On the naturalness of software[C]. International Conference on Software Engineering. ACM, 2012: 837-847) shows that this code rule can be processed by natural language processing technology (For example, n-gram language model) to capture and build a model based on this to realize code prediction and recommendation. Raychev et al. (Raychev V, Vechev M, Yahav E. Code completion with statistical language models [C]. ACMSigplan Symposium on Programming Language Design & Implementation. ACM, 2014: 419-428) believe that the core of code prediction is method prediction, and extract the source The method call in the code builds a statistical language model and achieves good prediction results. Compared with natural language, programming language has stricter requirements on grammatical structure and data type, and there may be certain constraints between different objects, methods and even parameters. These constraints reflect the language's own rule requirements and the developer's potential thinking logic. Making full use of these constraint relationships can make the statistical language model break through the limitation of relying solely on the probability of sentences in natural language, and make code prediction more in line with the programming thinking of developers. Unfortunately, in the current research, statistical models based on natural language technology often lack the research and utilization of these constraint relationships, so there is still room for improvement in the accuracy of predictions.

早期的研究人员尝试通过规则匹配、代码搜索等方法完成代码预测,但是一直难以得到较高的准确率。随着计算机性能的提高以及自然语言处理技术的日益成熟,基于代码结构或者自然语言处理来构建代码预测模型以实现代码预测渐渐成为研究热点。Early researchers tried to complete code prediction through rule matching, code search and other methods, but it has been difficult to obtain a high accuracy rate. With the improvement of computer performance and the maturity of natural language processing technology, building a code prediction model based on code structure or natural language processing to achieve code prediction has gradually become a research hotspot.

由于大规模源代码带来的复杂性,基于规则匹配、代码搜索的方法难以捕获一些潜在的规则,预测的准确率普遍不高。基于程序结构的方法在预测准确率上取得了重大突破,但是复杂的模型必然带来昂贵的时空代价。基于自然语言处理的方法虽然在时空复杂度方面有着较大优势,但是由于没有充分利用源代码中的语法结构信息,使得准确率上仍有提升的空间。Due to the complexity brought by large-scale source codes, methods based on rule matching and code search are difficult to capture some potential rules, and the prediction accuracy is generally not high. The method based on the program structure has made a major breakthrough in the prediction accuracy, but the complex model will inevitably bring expensive time and space costs. Although methods based on natural language processing have great advantages in terms of time and space complexity, there is still room for improvement in accuracy because they do not make full use of the grammatical structure information in the source code.

发明内容Contents of the invention

本发明针对现有基于自然语言的代码预测模型对代码之间语法结构关系的利用不足的问题,提出基于方法约束关系的代码预测模型的构建方法,本发明构建的方法预测模型具有较高的代码预测准确率。The present invention aims at the problem that the existing natural language-based code prediction model makes insufficient use of the grammatical structure relationship between codes, and proposes a method for constructing a code prediction model based on the method constraint relationship. The method prediction model constructed by the present invention has a higher code prediction accuracy.

为了实现上述目的,本发明采用以下技术方案:In order to achieve the above object, the present invention adopts the following technical solutions:

基于方法约束关系的代码预测模型的构建方法,包括以下步骤:A method for constructing a code prediction model based on a method constraint relationship, comprising the following steps:

步骤1:将面向对象代码抽象为对象涉及的方法调用语句序列集合,所述方法调用语句序列集合包括多个方法调用语句序列;所述方法调用语句序列包括一个对象涉及的所有方法调用语句;Step 1: abstracting the object-oriented code into a set of method call statement sequences involved in an object, the set of method call statement sequences including a plurality of method call statement sequences; the method call statement sequence including all method call statements involved in an object;

步骤2:根据方法之间的约束关系,将所述方法调用语句序列的方法调用语句表示为四元特征组,根据所述四元特征组得出所述面向对象代码的代码特征词序列集合;Step 2: According to the constraint relationship between methods, the method call statement of the method call statement sequence is represented as a quaternary feature group, and the code feature word sequence set of the object-oriented code is obtained according to the quaternary feature group;

步骤3:利用N-gram模型将所述代码特征词序列集合滑动切分为多个3-gram序列,提取调用所述3-gram序列,构建代码预测模型。Step 3: Use the N-gram model to slide and segment the code feature word sequence set into multiple 3-gram sequences, extract and call the 3-gram sequences, and construct a code prediction model.

进一步地,所述步骤1包括:Further, said step 1 includes:

步骤1.1:将面向对象代码中的方法调用语句根据对象的不同进行分类,方法调用语句之间的相互顺序保持不变,对于包含多个对象的方法调用语句复制多份,放入每个对象中,Code={Object1,Object2,…,Objectn},Object=Method1·Method2…Methodn Step 1.1: Classify the method call statements in the object-oriented code according to different objects. The mutual order of the method call statements remains unchanged. For the method call statements containing multiple objects, copy multiple copies and put them in each object , Code={Object 1 ,Object 2 ,...,Object n }, Object=Method 1 · Method 2 ...Method n

其中Objectn表示第n个对象涉及到的所有方法调用语句,即第n个对象涉及到的方法调用语句序列;Methodn表示同一个方法调用语句序列中的第n个方法调用语句;Among them, Object n represents all method call statements involved in the nth object, that is, the method call statement sequence involved in the nth object; Method n represents the nth method call statement in the same method call statement sequence;

源代码被转化为若干个对象涉及的方法调用语句序列集合;The source code is converted into a set of method call statement sequences involved in several objects;

步骤1.2:对面向对象代码中的别名现象和代码结构进行分析,并根据分析结果调整所述步骤1.1中所述方法调用语句序列集合,包括:Step 1.2: Analyze the alias phenomenon and code structure in the object-oriented code, and adjust the method call statement sequence set described in the step 1.1 according to the analysis results, including:

对指向同一地址的不同别名进行归并,视为同一对象;Merge different aliases pointing to the same address and treat them as the same object;

对于条件选择结构,将Objectn根据条件的不同分成若干个相互独立的、不含有条件选择结构的子序列;For the conditional selection structure, Object n is divided into several subsequences that are independent of each other and do not contain the conditional selection structure according to different conditions;

对于循环结构,将循环的迭代次数统一限定为2次;For the loop structure, the number of iterations of the loop is uniformly limited to 2;

步骤1.3:将所述步骤1.2调整后的方法调用语句序列集合根据所属类的不同进行分类Code={Class1,Class2,…,Classn},将类中每个对象涉及的方法调用语句视为一个单独序列Class={SeqObject1,SeqObject2,…,SeqObjectn},Seq=Method1·Method2…Methodn,其中Classn表示第n个类涉及到的所有方法调用语句,SeqObjectn表示同一个类中第n个对象涉及到的方法调用语句序列;Step 1.3: Classify the set of method call statement sequences adjusted in step 1.2 according to the class to which they belong Code={Class 1 ,Class 2 ,...,Class n }, treat the method call statement involved in each object in the class as It is a separate sequence Class={Seq Object1 ,Seq Object2 ,...,Seq Objectn }, Seq=Method 1 Method 2 ...Method n , where Class n represents all method call statements involved in the nth class, Seq Objectn represents the same A sequence of method call statements involved in the nth object in a class;

综上,面向对象代码被抽象为3个层次,即所述方法调用语句序列集合为:In summary, the object-oriented code is abstracted into three levels, that is, the set of method call statement sequences is:

进一步地,所述四元特征组为:Further, the four-element feature group is:

Method=<ClassN,Name,Return,Relation>(2)Method=<ClassN,Name,Return,Relation>(2)

其中Method为对象的一个方法调用语句,ClassN为类名,Name为方法名,Return为方法的返回类型,Relation为对象和方法的关系;Among them, Method is a method call statement of the object, ClassN is the class name, Name is the method name, Return is the return type of the method, and Relation is the relationship between the object and the method;

所述Relation取值方式如下:The value of the Relation is as follows:

进一步地,所述代码特征词序列集合为:Further, the set of code feature word sequences is:

进一步地,所述步骤3包括:Further, said step 3 includes:

步骤3.1:利用N-gram模型,将所述步骤2中代码特征词序列集合的代码特征序列滑动切分成3-gram片段,Code={s1,s2,…,sn},s=Method1·Method2·Method3Step 3.1: Use the N-gram model to slide the code feature sequence of the code feature word sequence set in step 2 into 3-gram segments, Code={s 1 ,s 2 ,...,s n }, s=Method 1 Method 2 Method 3 ,

Method=<ClassN,Name,Return,Relation>,其中sn为第n个3-gram片段,s为一个3-gram片段,表示对象连续涉及到的3次方法调用语句序列;Method=<ClassN, Name, Return, Relation>, where s n is the nth 3-gram segment, and s is a 3-gram segment, representing the sequence of 3 method call statements involved in the object;

步骤3.2:通过对面向对象代码进行训练,统计不同的3-gram片段s出现的频率;Step 3.2: By training the object-oriented code, count the frequency of occurrence of different 3-gram segments s;

步骤3.3:利用马尔科夫假设思想,通过所述步骤3.2的统计结果构建代码预测模型,所述代码预测模型用于预测方法调用语句Method3出现的概率,所述代码预测模型为:Step 3.3: Utilize the Markov assumption idea, construct a code prediction model through the statistical results of the step 3.2, the code prediction model is used to predict the probability of occurrence of the method call statement Method 3 , and the code prediction model is:

其中α和V为平滑参数。where α and V are smoothing parameters.

与现有技术相比,本发明具有的有益效果:Compared with the prior art, the present invention has the beneficial effects:

本发明以同一对象涉及的方法调用语句序列为研究对象,根据面向对象语言中方法之间的约束关系,将类名、方法名、返回类型、对象和方法的关系等要素作为代码特征词,构建代码预测模型。通过大量实验得出本发明构建的代码预测模型具有较高的代码预测准确率。The present invention takes the method call statement sequence involved in the same object as the research object, and according to the constraint relationship between the methods in the object-oriented language, elements such as the class name, method name, return type, and the relationship between the object and the method are used as code feature words to construct Code prediction model. Through a large number of experiments, it is found that the code prediction model constructed by the present invention has a high code prediction accuracy rate.

附图说明Description of drawings

图1为本发明实施例的基于方法约束关系的代码预测模型的构建方法的基本流程示意图。FIG. 1 is a schematic flowchart of a method for constructing a code prediction model based on a method constraint relationship according to an embodiment of the present invention.

图2为本发明另一实施例的基于方法约束关系的代码预测模型的构建方法的基本流程示意图。FIG. 2 is a schematic flowchart of a method for constructing a code prediction model based on a method constraint relationship according to another embodiment of the present invention.

图3为本发明实施例的基于方法约束关系的代码预测模型的构建方法的代码特征词分析流程示意图。FIG. 3 is a schematic diagram of a code feature word analysis process of a method for constructing a code prediction model based on a method constraint relationship according to an embodiment of the present invention.

图4为本发明实施例的基于方法约束关系的代码预测模型的构建方法的实验总体框图。FIG. 4 is an overall experimental block diagram of a method for constructing a code prediction model based on a method constraint relationship according to an embodiment of the present invention.

图5为本发明实施例的基于方法约束关系的代码预测模型的构建方法的三种模型的实验结果对比图之一。FIG. 5 is one of the comparison diagrams of experimental results of three models of the construction method of the code prediction model based on the method constraint relationship according to the embodiment of the present invention.

图6为本发明实施例的基于方法约束关系的代码预测模型的构建方法的三种模型的实验结果对比图之二。FIG. 6 is the second comparison diagram of experimental results of three models of the construction method of the code prediction model based on the method constraint relationship according to the embodiment of the present invention.

图7为本发明实施例的基于方法约束关系的代码预测模型的构建方法的CPMMC模型对不同Java类的预测结果图。FIG. 7 is a diagram of the prediction results of different Java classes by the CPMMC model of the construction method of the code prediction model based on the method constraint relationship according to the embodiment of the present invention.

具体实施方式Detailed ways

下面结合附图和具体的实施例对本发明做进一步的解释说明:The present invention will be further explained below in conjunction with accompanying drawing and specific embodiment:

实施例一:Embodiment one:

如图1所示,本发明的一种基于方法约束关系的代码预测模型的构建方法,包括以下步骤:As shown in Figure 1, a method for constructing a code prediction model based on a method constraint relationship of the present invention includes the following steps:

步骤S101:将面向对象代码抽象为对象涉及的方法调用语句序列集合,所述方法调用语句序列集合包括多个方法调用语句序列;所述方法调用语句序列包括一个对象涉及的所有方法调用语句;Step S101: Abstracting the object-oriented code into a set of method call statement sequences involved in an object, the set of method call statement sequences including a plurality of method call statement sequences; the method call statement sequence including all method call statements involved in an object;

步骤S102:根据方法之间的约束关系,将所述方法调用语句序列的方法调用语句表示为四元特征组,根据所述四元特征组得出所述面向对象代码的代码特征词序列集合;Step S102: According to the constraint relationship between methods, express the method call statement of the method call statement sequence as a quaternary feature group, and obtain the code feature word sequence set of the object-oriented code according to the quaternary feature group;

步骤S103:利用N-gram模型将所述代码特征词序列集合滑动切分为多个3-gram序列,提取调用所述3-gram序列,构建代码预测模型。Step S103: using the N-gram model to slide and segment the code feature word sequence set into multiple 3-gram sequences, extract and call the 3-gram sequences, and construct a code prediction model.

实施例二:Embodiment two:

如图2所示,本发明的另一种基于方法约束关系的代码预测模型的构建方法,包括以下步骤:As shown in Figure 2, another method for constructing a code prediction model based on the method constraint relationship of the present invention includes the following steps:

步骤S201:将面向对象代码抽象为对象涉及的方法调用语句序列集合,所述方法调用语句序列集合包括多个方法调用语句序列;所述方法调用语句序列包括一个对象涉及的所有方法调用语句;Step S201: abstracting the object-oriented code into a set of method call statement sequences involved in an object, the set of method call statement sequences including multiple method call statement sequences; the method call statement sequence including all method call statements involved in an object;

所述步骤S201包括:The step S201 includes:

步骤S2011:将面向对象代码中的方法调用语句根据对象的不同进行分类,方法调用语句之间的相互顺序保持不变,对于包含多个对象的方法调用语句复制多份,放入每个对象中,Code={Object1,Object2,…,Objectn},Object=Method1·Method2…Methodn Step S2011: classify the method call statements in the object-oriented code according to different objects, keep the mutual order of the method call statements unchanged, copy multiple copies of the method call statements containing multiple objects, and put them into each object , Code={Object 1 ,Object 2 ,...,Object n }, Object=Method 1 · Method 2 ...Method n

其中Objectn表示第n个对象涉及到的所有方法调用语句,即第n个对象涉及到的方法调用语句序列;Methodn表示同一个方法调用语句序列中的第n个方法调用语句;Among them, Object n represents all method call statements involved in the nth object, that is, the method call statement sequence involved in the nth object; Method n represents the nth method call statement in the same method call statement sequence;

源代码被转化为若干个对象涉及的方法调用语句序列集合;The source code is converted into a set of method call statement sequences involved in several objects;

步骤S2012:对面向对象代码中的别名现象和代码结构进行分析,并根据分析结果调整所述步骤S2011中所述方法调用语句序列集合,包括:Step S2012: Analyzing the alias phenomenon and code structure in the object-oriented code, and adjusting the method call statement sequence set in the step S2011 according to the analysis result, including:

对指向同一地址的不同别名进行归并,视为同一对象;Merge different aliases pointing to the same address and treat them as the same object;

对于条件选择结构,将Objectn根据条件的不同分成若干个相互独立的、不含有条件选择结构的子序列;For the conditional selection structure, Object n is divided into several subsequences that are independent of each other and do not contain the conditional selection structure according to different conditions;

对于循环结构,将循环的迭代次数统一限定为2次;For the loop structure, the number of iterations of the loop is uniformly limited to 2;

步骤S2013:将所述步骤S2012调整后的方法调用语句序列集合根据所属类的不同进行分类Code={Class1,Class2,…,Classn},将类中每个对象涉及的方法调用语句视为一个单独序列Class={SeqObject1,SeqObject2,…,SeqObjectn},Seq=Method1·Method2…Methodn,其中Classn表示第n个类涉及到的所有方法调用语句,SeqObjectn表示同一个类中第n个对象涉及到的方法调用语句序列;Step S2013: Classify the set of method call statement sequences adjusted in step S2012 according to the class to which they belong Code={Class 1 , Class 2 ,...,Class n }, treat the method call statement involved in each object in the class as It is a separate sequence Class={Seq Object1 ,Seq Object2 ,...,Seq Objectn }, Seq=Method 1 Method 2 ...Method n , where Class n represents all method call statements involved in the nth class, Seq Objectn represents the same A sequence of method call statements involved in the nth object in a class;

综上,面向对象代码被抽象为3个层次,即所述方法调用语句序列集合为:In summary, the object-oriented code is abstracted into three levels, that is, the set of method call statement sequences is:

步骤S202:根据方法之间的约束关系,将所述方法调用语句序列的方法调用语句表示为四元特征组,根据所述四元特征组得出所述面向对象代码的代码特征词序列集合;Step S202: According to the constraint relationship between methods, express the method call statement of the method call statement sequence as a four-element feature group, and obtain the code feature word sequence set of the object-oriented code according to the four-element feature group;

所述四元特征组为:The four-element feature set is:

Method=<ClassN,Name,Return,Relation> (2)Method=<ClassN, Name, Return, Relation> (2)

其中Method为对象的一个方法调用语句,ClassN为类名,Name为方法名,Return为方法的返回类型,Relation为对象和方法的关系;Among them, Method is a method call statement of the object, ClassN is the class name, Name is the method name, Return is the return type of the method, and Relation is the relationship between the object and the method;

所述Relation取值方式如下:The value of the Relation is as follows:

所述代码特征词序列集合为:The set of code feature word sequences is:

步骤S203:利用N-gram模型将所述代码特征词序列集合滑动切分为多个3-gram序列,提取调用所述3-gram序列,构建代码预测模型CPMMC(Code-Predicting Model basedon Method Constraints)。Step S203: Use the N-gram model to slide the set of code feature words into multiple 3-gram sequences, extract and call the 3-gram sequences, and construct a code prediction model CPMMC (Code-Predicting Model based on Method Constraints) .

所述步骤S203包括:The step S203 includes:

步骤S2031:利用N-gram模型,将所述步骤S202中代码特征词序列集合的代码特征序列滑动切分成3-gram片段,Code={s1,s2,…,sn},s=Method1·Method2·Method3Step S2031: Use the N-gram model to slide and segment the code feature sequence of the code feature word sequence set in step S202 into 3-gram segments, Code={s 1 ,s 2 ,...,s n }, s=Method 1 Method 2 Method 3 ,

Method=<ClassN,Name,Return,Relation>,其中sn为第n个3-gram片段,s为一个3-gram片段,表示对象连续涉及到的3次方法调用语句序列;Method=<ClassN, Name, Return, Relation>, where s n is the nth 3-gram segment, and s is a 3-gram segment, representing the sequence of 3 method call statements involved in the object;

步骤S2032:通过对面向对象代码进行训练,统计不同的3-gram片段s出现的频率;Step S2032: by training the object-oriented code, count the frequency of occurrence of different 3-gram segments s;

步骤S2033:利用马尔科夫假设思想,通过所述步骤S2032的统计结果构建代码预测模型CPMMC,所述代码预测模型CPMMC用于预测方法调用语句Method3出现的概率,所述代码预测模型CPMMC为:Step S2033: Utilize the idea of Markov hypothesis, construct the code prediction model CPMMC through the statistical results of the step S2032, the code prediction model CPMMC is used to predict the probability of occurrence of the method call statement Method 3 , and the code prediction model CPMMC is:

其中α和V为平滑参数。where α and V are smoothing parameters.

作为一种可实施方式,本实施例基于编程语言为Java语言的面向对象代码对本发明做详细介绍:As a kind of implementable mode, the present embodiment is based on the object-oriented code of Java language that the programming language is described in detail to the present invention:

自然语言处理技术之所以能够在一定程度上捕获编程语言中的方法调用规律,是因为开发人员在使用这些方法时遵循着某种思维方式。通常情况,程序中的代码不是独立存在的,在它前后的若干行代码共同决定了该代码的出现和使用方式。基于这些分析,本发明提出了面向对象编程语言中,方法的约束关系这一概念。The reason why natural language processing technology can capture the method calling rules in programming languages to a certain extent is that developers follow a certain way of thinking when using these methods. Usually, the code in the program does not exist independently, and several lines of code before and after it jointly determine the appearance and usage of the code. Based on these analyses, the present invention proposes the concept of constraint relationship of methods in object-oriented programming language.

方法的约束关系:在面向对象编程语言中,不同方法之间对彼此使用方式的影响关系。方法的约束关系可以分成两类,方法的内部约束关系和方法的外部约束关系。Method constraint relationship: In an object-oriented programming language, the influence relationship between different methods on how to use each other. Method constraints can be divided into two categories, method internal constraints and method external constraints.

方法的内部约束关系:在面向对象编程语言中,来自同一个对象的方法之间的约束关系。Internal constraints of methods: In object-oriented programming languages, constraints between methods from the same object.

方法的外部约束关系:在面向对象编程语言中,来自不同对象的方法之间的约束关系。External constraint relationship of methods: In object-oriented programming languages, the constraint relationship between methods from different objects.

代码特征词:在面向对象代码模型中,用来表示某一代码片段的词或词组。Code feature word: In the object-oriented code model, a word or phrase used to represent a certain code fragment.

代码抽象是构建代码预测模型的核心,它确定了模型的整体结构以及方法调用在模型中的表现形式(代码特征词)。Code abstraction is the core of building a code prediction model, which determines the overall structure of the model and the representation of method calls in the model (code feature words).

在确定代码预测模型的整体结构之前,我们需要分析方法约束关系的范围,也就是尽可能寻找强约束关系。Alan Kay(BruceEckel.Java编程思想:第4版[M].机械工业出版社,2007)曾对面向对象语言的特性进行分析,他提出,程序是对象的集合,它们通过发送消息来告知彼此所要做的事。因此,Java程序可以被看做一个由对象为节点,方法调用(消息传递)为边的网络。基于网络的基本特征,本发明可以定性的认为,同一个节点的边之间存在较为紧密的关系。也就是说,通常情况,同一个对象涉及的方法调用之间存在较强的约束关系。因此,本发明将代码预测模型CPMMC的基本结构定为同一个对象涉及的方法调用语句序列。Before determining the overall structure of the code prediction model, we need to analyze the scope of the method constraint relationship, that is, to find strong constraint relationships as much as possible. Alan Kay (BruceEckel.Java Programming Thoughts: Fourth Edition [M]. Mechanical Industry Press, 2007) once analyzed the characteristics of object-oriented languages. He proposed that a program is a collection of objects, and they send messages to inform each other what they want. do. Therefore, a Java program can be viewed as a network with objects as nodes and method calls (message passing) as edges. Based on the basic characteristics of the network, the present invention can qualitatively consider that there is a relatively close relationship between the edges of the same node. That is to say, under normal circumstances, there is a strong constraint relationship between method calls involved in the same object. Therefore, the present invention defines the basic structure of the code prediction model CPMMC as a sequence of method call statements involved in the same object.

在CPMMC模型中,面向对象代码被抽象为3个层次,形式化表示为:In the CPMMC model, the object-oriented code is abstracted into three levels, formalized as:

其中Classn表示第n个类涉及到的方法调用语句的集合,SeqObjectn表示同一个类中第n个对象涉及到的方法调用语句序列,Methodn则表示同一个方法调用语句序列中的第n个方法调用语句。CPMMC模型将源代码转化为若干个方法调用语句序列Seq的集合,每一个序列都描述了一个对象涉及到的所有方法调用语句。Among them, Class n represents the set of method call statements involved in the nth class, Seq Objectn represents the method call statement sequence involved in the nth object in the same class, and Method n represents the nth method call statement sequence in the same method call statement sequence A method call statement. The CPMMC model converts the source code into a collection of several method call statement sequences Seq, and each sequence describes all the method call statements involved in an object.

面向对象代码被抽象为3个层次的具体过程如步骤S2011-步骤S2013所示。The specific process of object-oriented code being abstracted into three levels is shown in step S2011-step S2013.

本发明从约束关系的主体、约束关系的表现形式以及约束关系的拓展三个层次分析方法调用语句Method,并提取代码特征词对其进行描述。如图3所示。The present invention invokes the statement Method from three hierarchical analysis methods of the subject of the constraint relationship, the expression form of the constraint relationship, and the expansion of the constraint relationship, and extracts code feature words to describe it. As shown in Figure 3.

约束关系的主体即为存在约束关系的另一个方法调用语句。这个方法可以来自于对象本身,也可以来自于其他对象,所以需要通过该对象的对象名ObjectN来表示。考虑到参数列表(包括参数个数和类型)和方法名(它们合起来被称为“方法签名”)唯一地标识出某个方法,方法名Name和参数列表Parameter也被定为候选特征词。为了解决开发人员在命名对象时存在的个性化问题,本发明将对象名ObjectN抽象为类名ClassN。经过实验分析,直接将参数列表加入代码特征词中并不能提高代码预测模型的准确性,所以参数列表Parameter也最终被排除。经过上述分析,代表约束关系主体的特征被确定为<ClassN,Name>。The subject of the constraint relationship is another method call statement in which the constraint relationship exists. This method can come from the object itself or from other objects, so it needs to be represented by the object name ObjectN of the object. Considering that the parameter list (including the number and type of parameters) and the method name (they are collectively called "method signature") uniquely identify a certain method, the method name Name and parameter list Parameter are also selected as candidate feature words. In order to solve the personalization problem that developers have when naming objects, the present invention abstracts the object name ObjectN into a class name ClassN. After experimental analysis, directly adding the parameter list to the code feature words does not improve the accuracy of the code prediction model, so the parameter list Parameter is finally excluded. After the above analysis, the feature representing the subject of the constraint relationship is determined as <ClassN,Name>.

约束关系的表现形式在代码中体现为方法与SeqObjectn中的对象Objectn的关系,这种关系在Java语言中分为三种情况:对象为方法的调用主体、对象为方法的参数、方法的返回值赋值给对象。为了描述这三种情况,本发明借鉴了Raychev等(Raychev V,Vechev M,Yahav E.Code completion with statistical language models[C].ACMSigplanSymposium on Programming Language Design&Implementation.ACM,2014:419-428)的研究成果,用一个占位符Relation来表示约束关系的表现形式。The expression form of the constraint relationship is reflected in the code as the relationship between the method and the object Objectn in Seq Objectn . This relationship is divided into three situations in the Java language: the object is the calling body of the method, the object is the parameter of the method, and the return of the method The value is assigned to the object. In order to describe these three situations, the present invention draws on the research results of Raychev et al. , use a placeholder Relation to represent the representation of the constraint relationship.

同一行代码中不同代码元素之间也存在相互影响关系,其中赋值符号所代表的关系最强。为了体现这种关系,本发明拓展了方法的返回值Return作为方法约束关系的补充。There are also mutual influence relationships between different code elements in the same line of code, among which the relationship represented by the assignment symbol is the strongest. In order to reflect this relationship, the present invention expands the return value Return of the method as a supplement to the method constraint relationship.

综上所述,一次方法的调用可以通过一个四元组来表示,方法调用语句形式化表示为:To sum up, a method call can be represented by a quadruple, and the method call statement is formalized as:

Method=<ClassN,Name,Return,Relation> (2)Method=<ClassN, Name, Return, Relation> (2)

其中Method为对象的一个方法调用语句,ClassN为类名,Name为方法名,Return为方法的返回类型,Relation为对象和方法的关系。Relation的取值方式如下:Among them, Method is a method call statement of the object, ClassN is the class name, Name is the method name, Return is the return type of the method, and Relation is the relationship between the object and the method. The value of Relation is as follows:

至此,面向对象代码被抽象为一个四层的代码特征词序列集合:So far, the object-oriented code has been abstracted into a four-layer code feature word sequence collection:

此外,两种常见的情况会对抽取出的代码特征词序列SeqObjectn产生影响。1)当出现条件选择语句(如if/else、switch/case)时,因条件的不同,程序执行的是若干条完整且相互独立的指令,所以需要根据条件的不同构建若干条相互独立的代码特征词序列。2)当同一个对象出现在不同的代码结构体中时,为了降低复杂度,适当的结构体限制变得必不可少。考虑到程序中的代码与其前若干行代码紧密相关,而且耦合最紧密的代码语句往往在同一函数内,所以本发明将代码特征词序列的范围定为函数体,同一对象出现在多个函数体的情况被视为多个独立的代码特征词序列。In addition, two common situations will affect the extracted code feature word sequence Seq Objectn . 1) When a conditional selection statement (such as if/else, switch/case) appears, due to different conditions, the program executes several complete and independent instructions, so it is necessary to construct several independent codes according to different conditions character sequence. 2) When the same object appears in different code structures, in order to reduce the complexity, appropriate structure restrictions become essential. Considering that the code in the program is closely related to its previous lines of code, and the most tightly coupled code statements are often in the same function, so the present invention defines the code feature word sequence as the function body, and the same object appears in multiple function bodies The case of is treated as multiple independent code feature word sequences.

如下为Stack overflow网站中的一段面向对象的代码片段,该片段可被抽象为3个对象的代码特征词序列,如表1所示。The following is an object-oriented code snippet from the Stack overflow website, which can be abstracted into a sequence of code feature words for three objects, as shown in Table 1.

表1 CPMMC模型的代码特征词序列Table 1 Code feature word sequence of CPMMC model

CPMMC基于代码抽象后的代码特征词序列集合构建N-gram统计语言模型来实现方法调用语句的预测。统计语言模型用于通过将语言单位(如单词,短语,句子和文档)的发生概率赋予自然语言来捕捉自然语言的规律。由于语言单位被表示为一个或多个基本符号的序列,所以通过计算这种序列的概率来构建统计语言模型。在形式上可表示为:CPMMC builds an N-gram statistical language model based on the code feature word sequence set after code abstraction to realize the prediction of method call sentences. Statistical language models are used to capture the regularities of natural language by assigning probabilities of occurrence of linguistic units such as words, phrases, sentences, and documents to natural language. Since a language unit is represented as a sequence of one or more basic symbols, a statistical language model is constructed by computing the probability of such a sequence. In form it can be expressed as:

统计语言模型:统计语言模型L是由基本单元词汇V、生成过程G和似然函数P(.|L)组成的统计模型。P(s|L)是L由过程G之后生成V中元素序列s的概率。Statistical language model: Statistical language model L is a statistical model composed of basic unit vocabulary V, generation process G and likelihood function P(.|L). P(s|L) is the probability that L generates sequence s of elements in V after process G.

当对统计语言模型L的讨论语境清楚时,可以使用P(s)来表示P(s|L),称为序列s的生成概率。因此,可以简单地将统计语言模型视为具有每个可能序列的概率分布。When the context of the discussion of the statistical language model L is clear, P(s) can be used to represent P(s|L), which is called the generation probability of the sequence s. Therefore, a statistical language model can simply be viewed as having a probability distribution for each possible sequence.

N-gram模型是一个具有两个假设的统计语言模型。首先,它假设一个序列可以从左到右生成。其次,该序列中的单词的产生概率仅取决于其本地语境。n个单词的序列称为N-gram。当n的值是1、2或3时,该模型被称为unigram、bigram或trigram。The N-gram model is a statistical language model with two assumptions. First, it assumes that a sequence can be generated from left to right. Second, the probabilities of words in this sequence depend only on their local context. A sequence of n words is called an N-gram. When the value of n is 1, 2 or 3, the model is called unigram, bigram or trigram.

现有的研究已经证明,基于源代码的N-gram模型是合理的。即下一个代码特征词是可以根据先前的代码特征词预测的。例如,在如下面向对象代码中,FileWriter类的实例化对象fw分别调用了以下方法:Existing research has proven that source code-based N-gram models are reasonable. That is, the next code feature word can be predicted according to the previous code feature words. For example, in the following object-oriented code, the instantiated object fw of the FileWriter class calls the following methods respectively:

fw=new FileWriter(filepath);fw = new FileWriter(filepath);

fw.write(str);fw.write(str);

那么,这组方法调用语句序列可以被视为下一句方法调用语句的上下文,并用于预测接下来对象fw调用的方法,如fw.close();。Then, this group of method call statement sequences can be regarded as the context of the next method call statement, and used to predict the next method called by the object fw, such as fw.close();.

假设从左到右产生代码特征词,序列s=m1·m2·m3...mi-1之后的下一个代码特征词c=mi的概率可被表示为Assuming that code feature words are generated from left to right, the probability of the next code feature word c=m i after the sequence s=m 1 ·m 2 ·m 3 ...m i-1 can be expressed as

P(c|s)=P(mi|m1·m2...mi-1) (6)P(c|s)=P(m i |m 1 ·m 2 ... m i-1 ) (6)

根据马尔科夫假设,条件概率P(c|s)可被近似的计算为According to the Markov assumption, the conditional probability P(c|s) can be approximated as

P(c|s)=P(mi|mi-n+1·mi-n+2...mi-1) (7)P(c|s)=P(m i |m i-n+1 m i-n+2 ... m i-1 ) (7)

其中mi-n+1·mi-n+2...mi-1是由序列s的最后n-1个代码特征词构成的子序列。通过这种近似,统计语言模型只需要计算和存储涉及最多n个连续代码特征词的条件概率。Among them, m i-n+1 ·m i-n+2 ... m i-1 is a subsequence composed of the last n-1 code feature words of the sequence s. With this approximation, the statistical language model only needs to compute and store conditional probabilities involving at most n consecutive code feature words.

CPMMC采用3-gram模型,即trigram模型,将一个长度为l的代码特征词序列SeqObjectn切分为l-2个连续的序列CPMMC adopts the 3-gram model, that is, the trigram model, and divides a code feature word sequence Seq Objectn with a length of l into l-2 continuous sequences

Seqt i表示由3个连续的代码特征词构成的序列Seq t i represents a sequence composed of 3 consecutive code feature words

Seqt i=Methodi 1·Methodi 2·Methodi 3 (9)Seq t i =Method i 1 ·Method i 2 ·Method i 3 (9)

在进行切分后,源代码Code成为由n个Seqt构成的集合After splitting, the source code Code becomes a collection of n Seq t

在此基础上,代码特征词序列Methodi 1·Methodi 2之后生成Methodi 3的概率可被计算为On this basis, the probability of generating Method i 3 after the code feature word sequence Method i 1 ·Method i 2 can be calculated as

其中a和V为平滑参数,以处理极小值甚至0值出现的情况。Among them, a and V are smooth parameters to deal with the occurrence of extremely small values or even zero values.

因为存在数据稀疏的问题,对于少数无法构建trigram模型的情况,CPMMC会采用bigram甚至unigram模型。Because of the problem of data sparsity, for a few cases where the trigram model cannot be constructed, CPMMC will use the bigram or even unigram model.

例如Java程序员需要实现文件读取操作,编写了一段不完整的代码,如下所示,For example, Java programmers need to implement file reading operations and write an incomplete code, as shown below,

BufferedReaderreader=BufferedReaderreader=

newBufferedReader(new FileReader(file));newBufferedReader(new FileReader(file));

String tempString=readerString tempString = reader

reader.xreader.x

reader对象的下一个方法调用语句x的使用受之前的两个方法调用语句(初始化、readLine)的约束。初始化、readLine、x相互之间一定符合某种逻辑并且往往会在不同代码中多次出现。但是单纯依靠方法的约束关系很难判断出reader对象是否还会继续读取文件。在加入拓展特征词后,CPMMC可以通过x的返回类型void排除继续使用readLine方法的可能性。经过代码预测模型CPMMC可以得到x为close的概率较高。The use of the next method call statement x of the reader object is subject to the constraints of the previous two method call statements (initialization, readLine). Initialization, readLine, and x must be consistent with each other and often appear multiple times in different codes. However, it is difficult to judge whether the reader object will continue to read files purely relying on the constraint relationship of the method. After adding the extended feature word, CPMMC can exclude the possibility of continuing to use the readLine method through the return type void of x. Through the code prediction model CPMMC, it can be obtained that the probability of x being close is relatively high.

经过上述分析,CPMMC模型的预测方法具有合理性,本发明会通过实验进一步论证。Through the above analysis, the prediction method of the CPMMC model is reasonable, and the present invention will be further demonstrated through experiments.

CPMMC在训练过程中主要分为代码抽象和构建trigram模型。CPMMC is mainly divided into code abstraction and building trigram model in the training process.

对于一个大小为N的训练集,包含的对象个数为o,对象涉及到的方法数为m。For a training set of size N, the number of objects contained is o, and the number of methods involved in the objects is m.

在代码抽象过程中,假设提取方法的代码特征词所需要的时间为t,时间复杂度T1可表示为In the process of code abstraction, assuming that the time required to extract the code feature words of the method is t, the time complexity T 1 can be expressed as

根据编程语言的一般规律,o与N正相关,存在线性关系,上述公式可近似表示为According to the general law of programming languages, o is positively correlated with N, and there is a linear relationship. The above formula can be approximately expressed as

其中为平均每个对象涉及到的方法数,为每个方法提取特征词所需要的平均时间。所以代码抽象过程的时间复杂度为in is the average number of methods involved per object, The average time required to extract feature words for each method. So the time complexity of the code abstraction process is

T1(N)=O(N) (14)T 1 (N)=O(N) (14)

在构建trigram模型过程中,假设构建并存储一个新的trigram需要的时间为τ,新trigram出现的概率为p,那么构建trigram模型的时间复杂度T2可表示为In the process of constructing a trigram model, assuming that the time required to construct and store a new trigram is τ, and the probability of a new trigram appearing is p, then the time complexity T 2 of constructing a trigram model can be expressed as

上述公式可以近似表示为The above formula can be approximated as

因为代码中的方法序列存在一定的重复性,所以pi与i的大小负相关,当i无限大时,pi趋于0,即p与N存在负相关。显然,时间复杂度T2小于N×C,即Because there is a certain repetition in the method sequence in the code, p i is negatively correlated with the size of i. When i is infinitely large, p i tends to 0, that is, there is a negative correlation between p and N. Obviously, the time complexity T2 is less than N×C, namely

T2(N)<O(N) (17)T 2 (N)<O(N) (17)

本发明基于方法的约束关系提出CPMMC模型。事实上,方法约束关系的强弱是存在差异的,例如open/close往往成对出现。换而言之,在同一个对象中,方法open和close之间存在巨大的约束关系。The invention proposes the CPMMC model based on the constraint relationship of the method. In fact, there are differences in the strength of the method constraint relationship, for example, open/close often appear in pairs. In other words, in the same object, there is a huge constraint relationship between the methods open and close.

根据功能的不同,Java语言中的类可以分成三类,分别为数据结构类、内部功能类和外部接口类,如表2所示。According to different functions, classes in the Java language can be divided into three categories, namely data structure classes, internal function classes and external interface classes, as shown in Table 2.

表2 Java类的分类Table 2 Classification of Java classes

一般情况下,外部接口类因为要与外部程序(如文件系统、数据库、浏览器等)进行交互,需要遵循一些严格的规范(如文件的打开和关闭),经常需要连续调用若干方法以实现某种功能,方法之间的约束关系较强;数据结构类中的方法往往实现一些最简单的数据操作,相互的依赖性不强,调用灵活,方法之间的约束关系较弱;而内部功能类虽然不需要遵循外部程序提供的接口,但是相比数据结构类的方法,调用过程较为复杂,方法约束关系的强弱通常在另外两类之间。In general, external interface classes need to follow some strict specifications (such as opening and closing files) because they need to interact with external programs (such as file systems, databases, browsers, etc.), and often need to continuously call several methods to achieve certain The constraints between the methods are strong; the methods in the data structure class often implement some of the simplest data operations, the mutual dependence is not strong, the call is flexible, and the constraints between the methods are weak; while the internal function class Although there is no need to follow the interface provided by the external program, the calling process is more complicated than the method of the data structure class, and the strength of the method constraint relationship is usually between the other two classes.

根据CPMMC模型,以开源的Java代码为样本,对代码中的方法调用做预测,验证CPMMC模型的效果和性能,分析CPMMC模型的适用范围。According to the CPMMC model, using the open source Java code as a sample, predict the method calls in the code, verify the effect and performance of the CPMMC model, and analyze the scope of application of the CPMMC model.

验证CPMMC模型在代码预测中的效果需要大量的面向对象代码。本发明的实验在Github网站上获取了评分在1000Stars以上的11个Java开源代码作为源代码,源代码总数为139KLOC。为了去除开发人员自定义的类对预测的干扰,实验采集了JavaTM PlatformStandard Ed.7版本中定义的4024个类,以及这些类包含的51641个方法,并将其定为实验预测的范围。Validating the effectiveness of the CPMMC model in code prediction requires a lot of object-oriented code. The experiment of the present invention has obtained 11 Java open source codes scoring more than 1000 Stars as source codes on the Github website, and the total number of source codes is 139KLOC. In order to remove the interference of the developer's custom classes on the prediction, the experiment collected 4024 classes defined in the JavaTM Platform Standard Ed.7 version, and 51641 methods contained in these classes, and set them as the scope of the experimental prediction.

考虑到Java语言中的一些结构会对预测产生负面影响,本发明基于Java分析工具Soot构建程序分析框架对源代码进行预处理。主要针对以下三种情况,针对别名现象,采用SPARK框架对源代码进行别名分析,将指向同一地址的对象标记为相同对象;针对条件选择结构,将函数体根据条件的不同分成若干个相互独立的、不含有条件选择结构的函数体;针对循环结构,将循环的迭代次数统一限定为2次。Considering that some structures in the Java language will have a negative impact on the prediction, the present invention builds a program analysis framework based on the Java analysis tool Soot to preprocess the source code. Mainly for the following three situations, for the aliasing phenomenon, the SPARK framework is used to analyze the source code aliases, and the objects pointing to the same address are marked as the same object; for the conditional selection structure, the function body is divided into several independent ones according to different conditions , A function body that does not contain a conditional selection structure; for a loop structure, the number of iterations of the loop is uniformly limited to 2 times.

实验将90%的源代码作为训练集,剩下的10%作为测试集。在测试集中随机删除100个方法调用,标记为待预测的点P.P(Predicting Point),如表3所示,并确保调用P.P的对象在之前至少涉及到2个方法的调用。In the experiment, 90% of the source code is used as the training set, and the remaining 10% is used as the test set. Randomly delete 100 method calls in the test set and mark them as P.P (Predicting Point), as shown in Table 3, and ensure that the object calling P.P involves at least 2 method calls before.

表3测试集中删除方法的示例Table 3 Examples of removal methods in the test set

实验整体分为模型训练部分和代码预测部分,如图4所示。在模型训练部分,我们对训练集进行程序分析,得到中间代码,并抽取方法调用语句序列,对得到的方法调用语句进行N-gram切分,构建CPMMC模型。在代码预测部分,我们将测试集中不完整的代码片段进行程序分析,并抽取带有P.P的方法调用语句序列,构建CPMMC模型。然后根据训练得到的代码特征词序列的概率分布,给出P.P的推荐列表。The whole experiment is divided into model training part and code prediction part, as shown in Figure 4. In the model training part, we analyze the training set to obtain the intermediate code, extract the sequence of method call statements, segment the obtained method call statements into N-grams, and construct the CPMMC model. In the code prediction part, we analyze the incomplete code fragments in the test set for program analysis, and extract the method call statement sequence with P.P to construct the CPMMC model. Then, according to the probability distribution of the code feature word sequence obtained through training, a recommendation list of P.P is given.

为了更好的评估代码预测的准确率,实验设计了3个指标,分别为Top1、Top3和Top15。Top1表示代码预测模型推荐的首个方法即为正确方法的概率,Top3表示推荐列表中前3个方法包含正确方法的概率,Top15表示推荐列表中前15个方法包含正确方法的概率。In order to better evaluate the accuracy of code prediction, three indicators were designed in the experiment, namely Top1, Top3 and Top15. Top1 indicates the probability that the first method recommended by the code prediction model is the correct method, Top3 indicates the probability that the first three methods in the recommendation list contain the correct method, and Top15 indicates the probability that the first 15 methods in the recommendation list contain the correct method.

为了评估程序分析在构建代码预测模型时的重要性,本发明分别构建了经过程序分析得到的CPMMC模型、未进行别名分析得到的CPMMC_A模型、未处理条件选择结构的CPMMC_S模型,并对100个相同的P.P进行预测。实验结果如图5所示。In order to assess the importance of program analysis in building code prediction models, the present invention respectively constructs the CPMMC model obtained through program analysis, the CPMMC_A model obtained without alias analysis, and the CPMMC_S model of the unprocessed condition selection structure, and compares 100 identical The P.P for prediction. The experimental results are shown in Figure 5.

通过对比实验可以看出,别名分析和对条件选择结构的处理对预测的准确率的影响非常大,尤其是对条件选择结构的处理十分关键。这也验证了在构建代码预测模型前对源代码进行程序分析的必要性。Through comparative experiments, it can be seen that the alias analysis and the processing of the conditional selection structure have a great influence on the prediction accuracy, especially the processing of the conditional selection structure is very critical. This also verifies the necessity of program analysis on the source code before building the code prediction model.

为了评估CPMMC模型的先进性,我们将本发明的CPMMC模型与当前预测准确率最高的SLANG(Statistical Language Model)模型进行比较。此外,本发明没有将方法的参数列表作为代码特征词之一,为验证本发明模型的合理性,我们还构建了带有参数列表的CPMMC模型(用CPMMC_P表示)。我们用这3种模型对100个相同的P.P上进行预测。实验结果如下图6所示。In order to evaluate the advancement of the CPMMC model, we compare the CPMMC model of the present invention with the SLANG (Statistical Language Model) model with the highest prediction accuracy. In addition, the present invention does not use the parameter list of the method as one of the code feature words, for verifying the rationality of the model of the present invention, we have also constructed the CPMMC model (represented by CPMMC_P) with the parameter list. We use these 3 models to make predictions on 100 identical P.P. The experimental results are shown in Figure 6 below.

可以看出,相比于SLANG模型,CPMMC模型的准确率有明显的提高,Top1从0.56提高到0.64,Top3从0.80提高到0.87,Top15从0.94提高到0.95。这也验证了CPMMC模型在解决了对象的个性化命名问题,以及加入了同一行代码中的不同元素的约束关系后,明显提高了代码预测的准确性。两种模型在Top15指标上相差不大,都达到了94%以上,主要是因为随着指标的放宽,这两种模型都可以有效的捕获开发人员对Java方法的使用规则。It can be seen that compared with the SLANG model, the accuracy of the CPMMC model has been significantly improved. Top1 has increased from 0.56 to 0.64, Top3 has increased from 0.80 to 0.87, and Top15 has increased from 0.94 to 0.95. This also verifies that the CPMMC model significantly improves the accuracy of code prediction after solving the problem of personalized naming of objects and adding the constraint relationship of different elements in the same line of code. There is little difference between the two models in the Top15 indicators, both reaching more than 94%, mainly because as the indicators are relaxed, the two models can effectively capture the usage rules of developers for Java methods.

通过比较CPMMCP模型和CPMMC_P模型的实验结果,可以发现CPMMC_P模型预测的准确率有明显的下降。从表面上看,加入参数列表会使模型数据更加稀疏,例如面向对象代码中常见的String类,它的一个方法indexOf经重载后拥有4个同名方法indexOf(int)、indexOf(string)、indexOf(int,int)、indexOf(string,int),虽然indexOf可能得到较高的概率值,但是如果分散到4个方法中,每个方法概率值反而不高。单从本质上分析,由于预测的目标是方法名称,并未深入到方法的重载,加入参数列表会造成过拟合现象,所以CPMMC_P模型的效果反而不好。By comparing the experimental results of the CPMMCP model and the CPMMC_P model, it can be found that the prediction accuracy of the CPMMC_P model has decreased significantly. On the surface, adding a parameter list will make the model data more sparse. For example, the String class, which is common in object-oriented codes, has a method indexOf that has four methods with the same name indexOf(int), indexOf(string), and indexOf after being overloaded. (int, int), indexOf(string, int), although indexOf may get a higher probability value, but if it is dispersed into 4 methods, the probability value of each method is not high. From the essential analysis, since the target of the prediction is the method name, and it does not go deep into the overloading of the method, adding the parameter list will cause overfitting, so the effect of the CPMMC_P model is not good.

为了评估CPMMC模型的时间复杂度,我们分别计算了对于不同大小的训练集,代码抽象和构建N-gram模型所需要的时间,并与SLANG模型进行比较。实验结果如表4所示。In order to evaluate the time complexity of the CPMMC model, we calculated the time required for code abstraction and building the N-gram model for different sizes of training sets, and compared it with the SLANG model. The experimental results are shown in Table 4.

表4训练阶段所需时间Table 4 Time required for training phase

通过实验结果可以发现,CPMMC模型的时间复杂度随着样本量的增加呈线性增长,与SLANG模型相比较,增加的时间在可控范围内,可以应用于大规模开源代码。Through the experimental results, it can be found that the time complexity of the CPMMC model increases linearly with the increase of the sample size. Compared with the SLANG model, the increased time is within a controllable range and can be applied to large-scale open source code.

为了研究方法约束关系的强弱对基于方法约束关系的模型的影响,进一步讨论这类模型在代码预测领域的适用范围,本发明将Java的类分成三类。对此,本发明通过实验数据统计得出Java代码中使用频率最高的60个类,并根据数据结构类、内部功能类和外部接口类分别选取6个有代表性的类,如表5所示。实验对每一个类都分别构建CPMMC模型并对100个P.P点进行预测,实验结果如图7所示。In order to study the influence of the strength of the method constraint relationship on the model based on the method constraint relationship, and further discuss the scope of application of this type of model in the field of code prediction, the present invention divides Java classes into three categories. In this regard, the present invention obtains the 60 most frequently used classes in Java codes through experimental data statistics, and selects 6 representative classes respectively according to data structure classes, internal function classes and external interface classes, as shown in Table 5 . In the experiment, a CPMMC model was constructed for each class and 100 P.P points were predicted. The experimental results are shown in Figure 7.

表5 3种类型的Java类代表Table 5 3 types of Java class representatives

通过实验结果可以看出,CPMMC模型在对外部接口类的预测准确率最高,而对于数据结构类的预测准确率最低。考虑到一般情况下,外部接口类的方法约束关系最强,数据结构类的方法约束关系最弱,我们可以得出这样一个结论,在基于方法约束关系的代码预测模型中,预测的准确率与方法约束关系的强弱成正相关。与此同时,这也说明基于方法约束关系的代码预测模型对外部接口类的适用性最好。It can be seen from the experimental results that the CPMMC model has the highest prediction accuracy rate for external interface classes, and the lowest prediction accuracy rate for data structure classes. Considering that in general, the method constraint relationship of the external interface class is the strongest, and the method constraint relationship of the data structure class is the weakest, we can draw such a conclusion that in the code prediction model based on the method constraint relationship, the prediction accuracy is the same as The strength of the method constraint relationship is positively correlated. At the same time, this also shows that the code prediction model based on the method constraint relationship is the best for external interface classes.

针对现有基于自然语言的代码预测模型未能充分利用编程语言中方法之间相互关系这一问题,本发明提出了方法的约束关系这一概念,并基于此概念,提出CPMMC模型,以同一对象涉及的方法调用语句序列为研究对象,将类名、方法名、返回类型、对象和方法的关系等要素作为代码特征词,构建代码预测模型,对某个对象的下一个方法调用语句进行预测。实验结果表明,该CPMMC模型能较好的利用代码中方法的约束关系,提高了代码预测的准确率,并对Java语言中的外部接口类有良好的预测能力。Aiming at the problem that existing natural language-based code prediction models fail to make full use of the relationship between methods in programming languages, the present invention proposes the concept of method constraint relationship, and based on this concept, proposes the CPMMC model, which uses the same object The sequence of method call statements involved is the research object. Elements such as class name, method name, return type, and the relationship between objects and methods are used as code feature words to build a code prediction model to predict the next method call statement of an object. Experimental results show that the CPMMC model can make good use of the constraint relationship of methods in the code, improve the accuracy of code prediction, and has good prediction ability for external interface classes in Java language.

以上所示仅是本发明的优选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本发明原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本发明的保护范围。What is shown above is only a preferred embodiment of the present invention. It should be pointed out that for those of ordinary skill in the art, some improvements and modifications can also be made without departing from the principles of the present invention. It should be regarded as the protection scope of the present invention.

Claims (5)

1. the construction method of the code prediction model based on method restriction relation, which is characterized in that include the following steps:
Step 1:Object-oriented code is abstracted as the method call statement sequence set that object is related to, the method call statement Arrangement set includes multiple method call statement sequences;The method call statement sequence includes all sides that an object is related to Method call statement;
Step 2:According to the restriction relation between method, the method call sentence of the method call statement sequence is expressed as four First feature group obtains the code characteristic word sequence set of the object-oriented code according to the quaternary feature group;
Step 3:It is multiple 3-gram sequences, extraction that the code characteristic word sequence set, which is slided cutting, using N-gram models The 3-gram sequences are called, code prediction model is built.
2. the construction method of the code prediction model according to claim 1 based on method restriction relation, which is characterized in that The step 1 includes:
Step 1.1:Method call sentence in object-oriented code is classified according to the difference of object, method call sentence Between mutually sequentially remain unchanged, more parts are replicated for the method call sentence comprising multiple objects, is put into each object, Code={ Object1,Object2,…,Objectn, Object=Method1·Method2…Methodn
Wherein ObjectnIndicate all method call sentences that n-th of object is related to, i.e. the method tune that n-th of object is related to Use statement sequence;MethodnIndicate n-th of method call sentence in the same method call statement sequence;
Source code is converted into the method call statement sequence set that several objects are related to;
Step 1.2:To in object-oriented code alias phenomenon and code structure analyze, and according to analysis result adjust institute The set of method call statement sequence described in step 1.1 is stated, including:
Different alias to being directed toward same address carry out merger, are considered as same target;
Structure is selected for condition, by ObjectnIt is divided into that several are mutually independent, without choosing of having ready conditions according to the difference of condition Select the subsequence of structure;
For loop structure, the iterations of cycle are uniformly limited to 2 times;
Step 1.3:Method call statement sequence set after the step 1.2 is adjusted is classified according to the difference of affiliated class Code={ Class1,Class2,…,Classn, the method call sentence that each object is related in class is considered as an independent sequence Arrange Class={ SeqObject1,SeqObject2,…,SeqObjectn, Seq=Method1·Method2…Methodn, wherein ClassnIndicate all method call sentences that n-th of class is related to, SeqObjectnIndicate that n-th of object is related in same class The method call statement sequence arrived;
To sum up, object-oriented code is conceptualized as 3 levels, i.e. the method call statement arrangement set is:
3. the construction method of the code prediction model according to claim 1 based on method restriction relation, which is characterized in that The quaternary feature group is:
Method=<ClassN,Name,Return,Relation> (2)
Wherein Method is a method call sentence of object, and ClassN is class name, and Name is method name, and Return is method Return type, Relation be object and method relationship;
The Relation value modes are as follows:
4. the construction method of the code prediction model according to claim 2 or 3 based on method restriction relation, feature exist In the code characteristic word sequence collection is combined into:
5. the construction method of the code prediction model according to claim 2 based on method restriction relation, which is characterized in that The step 3 includes:
Step 3.1:Using N-gram models, the code characteristic sequence of code characteristic word sequence set in the step 2 is slided It is cut into 3-gram segments, Code={ s1,s2,…,sn, s=Method1·Method2·Method3,
Method=<ClassN,Name,Return,Relation>, wherein snFor n-th of 3-gram segment, s is a 3- Gram segments indicate 3 method call statement sequences that object is continuously related to;
Step 3.2:By being trained to object-oriented code, the frequency that the 3-gram segments s for counting different occurs;
Step 3.3:Assume thought using Markov, code prediction model, institute are built by the statistical result of the step 3.2 Code prediction model is stated for prediction technique call statement Method3The probability of appearance, the code prediction model are:
Wherein α and V is smoothing parameter.
CN201810327625.7A 2018-04-12 2018-04-12 Construction method of code prediction model based on method constraint relationship Active CN108572917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810327625.7A CN108572917B (en) 2018-04-12 2018-04-12 Construction method of code prediction model based on method constraint relationship

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810327625.7A CN108572917B (en) 2018-04-12 2018-04-12 Construction method of code prediction model based on method constraint relationship

Publications (2)

Publication Number Publication Date
CN108572917A true CN108572917A (en) 2018-09-25
CN108572917B CN108572917B (en) 2021-05-25

Family

ID=63574645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810327625.7A Active CN108572917B (en) 2018-04-12 2018-04-12 Construction method of code prediction model based on method constraint relationship

Country Status (1)

Country Link
CN (1) CN108572917B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538807A (en) * 2020-04-16 2020-08-14 上海交通大学 System and method for acquiring knowledge of Web API based on Stack Overflow website

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929781A (en) * 2012-11-12 2013-02-13 桂林电子科技大学 Queue communication concurrency recursive program verification method based on context delimiting
US9069755B2 (en) * 2010-03-11 2015-06-30 Microsoft Technology Licensing, Llc N-gram model smoothing with independently controllable parameters
US20150254162A1 (en) * 2014-03-05 2015-09-10 Concurix Corporation N-Gram Analysis of Software Behavior in Production and Testing Environments
US9594665B2 (en) * 2014-03-05 2017-03-14 Microsoft Technology Licensing, Llc Regression evaluation using behavior models of software applications
CN107766337A (en) * 2017-09-25 2018-03-06 沈阳航空航天大学 Translation Forecasting Methodology based on deep semantic association

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9069755B2 (en) * 2010-03-11 2015-06-30 Microsoft Technology Licensing, Llc N-gram model smoothing with independently controllable parameters
CN102929781A (en) * 2012-11-12 2013-02-13 桂林电子科技大学 Queue communication concurrency recursive program verification method based on context delimiting
US20150254162A1 (en) * 2014-03-05 2015-09-10 Concurix Corporation N-Gram Analysis of Software Behavior in Production and Testing Environments
US9594665B2 (en) * 2014-03-05 2017-03-14 Microsoft Technology Licensing, Llc Regression evaluation using behavior models of software applications
CN107766337A (en) * 2017-09-25 2018-03-06 沈阳航空航天大学 Translation Forecasting Methodology based on deep semantic association

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
S.EMAIL AUTHOR等: "Using web corpus statistics for program analysi", 《ACM SIGPLAN NOTICES》 *
SHARON SHOHAM等: "Typestate-Based Semantic Code Search over Partial Programs", 《ACM SIGPLAN NOTICES》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538807A (en) * 2020-04-16 2020-08-14 上海交通大学 System and method for acquiring knowledge of Web API based on Stack Overflow website
CN111538807B (en) * 2020-04-16 2023-04-07 上海交通大学 System and method for acquiring Web API knowledge based on Stack Overflow website

Also Published As

Publication number Publication date
CN108572917B (en) 2021-05-25

Similar Documents

Publication Publication Date Title
CN111209749B (en) Method for applying deep learning to Chinese word segmentation
CN108416058B (en) A Relation Extraction Method Based on Bi-LSTM Input Information Enhancement
US10409911B2 (en) Systems and methods for text analytics processor
Passos et al. Lexicon infused phrase embeddings for named entity resolution
US7478038B2 (en) Language model adaptation using semantic supervision
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN112215013B (en) A deep learning-based clone code semantic detection method
US8239349B2 (en) Extracting data
CN107273358B (en) End-to-end English chapter structure automatic analysis method based on pipeline mode
CN111881264B (en) A method and electronic device for long text retrieval in open domain question answering tasks
CN110147444A (en) neural network language model, text prediction method, device and storage medium
CN101751385A (en) Multilingual information extraction method adopting hierarchical pipeline filter system structure
CN110413972A (en) A kind of table name field name intelligence complementing method based on NLP technology
CN115098848B (en) Small sample password set guessing method based on multitask learning
CN110298044A (en) A kind of entity-relationship recognition method
Qian et al. Fine-grained entity typing without knowledge base
EP3598321A1 (en) Method for parsing natural language text with constituent construction links
CN120012777A (en) Text similarity recognition method, system, device and storage medium
US10810368B2 (en) Method for parsing natural language text with constituent construction links
CN112818110B (en) Text filtering method, equipment and computer storage medium
CN118585641A (en) A text summary generation method based on pre-training model
Sarikaya et al. Shrinkage based features for slot tagging with conditional random fields.
CN108572917A (en) Construction Method of Code Prediction Model Based on Method Constraint Relationship
Kotiyal et al. Text Classification Using N-Grams for Providing Effective Response in Chatbot
Ekbal et al. Classifier ensemble selection using genetic algorithm for named entity recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant