CN117130645A

CN117130645A - Automatic program repairing method and system based on large language model and completion engine

Info

Publication number: CN117130645A
Application number: CN202311384703.4A
Authority: CN
Inventors: 胡鹏飞; 郝立鹏; 刘健雄; 李峰; 金岩
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2023-10-25
Filing date: 2023-10-25
Publication date: 2023-11-28

Abstract

The invention belongs to the field of program repair, and specifically relates to an automatic program repair method and system based on a large language model and a completion engine. First, the large language model is used to provide the probability of the next identifier in the generated patch code, and then the completion engine is queried. , modify the probability list by dynamically zeroing out the probability of an invalid identifier, picking from the new probability list to select the next identifier. In addition, given the completion engine's ability to provide code completion suggestions, when the completion engine only provides a candidate identifier suffix to complete the context, we use that identifier. This not only allows the method proposed in this patent to generate efficient, uncommon and long identifier patch codes, but also reduces the workload of large language models required to iteratively generate long identifier names.

Description

Automatic program repair method and system based on large language model and completion engine

技术领域Technical field

本发明属于程序修复领域，具体涉及一种基于大型语言模型和补全引擎的自动程序修复方法及系统。The invention belongs to the field of program repair, and specifically relates to an automatic program repair method and system based on a large language model and a completion engine.

背景技术Background technique

随着软件系统变得越来越复杂，软件漏洞也越来越复杂。传统的最先进的自动程序修复工具主要基于手工修复模板来匹配有缺陷的代码并应用相应的代码补丁。虽然优于其他传统技术，但此类工具只能修复预设模板内的错误类型，而不能泛化到新的漏洞类型。As software systems become more complex, so do software vulnerabilities. Traditional state-of-the-art automated program repair tools are mainly based on manual repair templates to match defective code and apply corresponding code patches. While superior to other traditional techniques, such tools can only fix bug types within preset templates and cannot generalize to new vulnerability types.

随着深度学习技术的发展，研究人员基于神经机器翻译架构构建了基于深度学习的自动程序修复工具。如现有技术CN116755753A使用训练神经机器翻译模型，通过学习从开源项目提交的代码库中抓取的有缺陷代码和修复代码，将有缺陷的代码转换为正确的代码。然而，这些工具的训练集的大小可能受到限制，并且还包含不相关或嘈杂的数据，降低了模型预训练的准确性。With the development of deep learning technology, researchers have built automatic program repair tools based on deep learning based on neural machine translation architecture. For example, the existing technology CN116755753A uses a trained neural machine translation model to convert defective codes into correct codes by learning defective codes and repair codes captured from code libraries submitted by open source projects. However, the training sets of these tools may be limited in size and also contain irrelevant or noisy data, reducing the accuracy of model pre-training.

近期，随着大型语言模型技术的发展，越来越多的研究已经证明其在帮助开发人员完成各种编码任务上的作用，也可直接应用在自动化程序修复工作中，用于生成补丁代码。然而，大多数的大型语言模型将程序视为由标识符序列构成的逻辑体，这意味着它们不了解目标编程语言的底层语义约束。这会导致生成大量无效的补丁代码，从而降低了该技术的实用性。Recently, with the development of large-scale language model technology, more and more studies have proven its role in helping developers complete various coding tasks. It can also be directly applied in automated program repair work to generate patch codes. However, most large-scale language models treat programs as logical entities composed of sequences of identifiers, which means that they do not understand the underlying semantic constraints of the target programming language. This results in the generation of a large amount of invalid patch code, thereby reducing the usefulness of the technique.

发明内容Contents of the invention

为了解决上述问题，本专利提出了基于大型语言模型和编程语言自动补全引擎的自动化程序修复方法，通过将大型语言模型与补全引擎融合来生成更有效的补丁代码，利用人工智能技术增强自动化程序修复工作的准确性和可用性。补全引擎可以解析不完整的程序代码，并以高容错的方式推理代码语义。可以将大型语言模型的自回归标识符生成过程比作人类开发人员的代码编写工作，其中补全引擎可以提供实时更新以检查人类/大型语言模型编写的部分代码是否有效。其技术方案为，In order to solve the above problems, this patent proposes an automated program repair method based on a large language model and a programming language automatic completion engine. It generates more effective patch code by integrating the large language model with the completion engine, and uses artificial intelligence technology to enhance automation. Accuracy and availability of program fixes. The completion engine can parse incomplete program code and reason about code semantics in a highly fault-tolerant manner. The autoregressive identifier generation process for large language models can be likened to the work of a human developer writing code, where a completion engine can provide real-time updates to check whether parts of the code written by humans/large language models are valid. Its technical solution is,

一种基于大型语言模型和补全引擎的自动程序修复方法，包括以下步骤：An automatic program repair method based on large language models and completion engines, including the following steps:

S1.使用大型语言模型生成补丁代码的标识符库T（t₁,t₂,…t_n）以及对应的概率P(p₁,p₂,…p_n)；S1. Use a large language model to generate the identifier library T (t ₁ , t ₂ ,...t _n ) of the patch code and the corresponding probability P (p ₁ , p ₂ ,...p _n );

S2.按设定顺序对标识符库T中的标识符进行采样，获取标识符t_i（1≤i≤n），并检查记忆体中是否存储有该标识符对应的记录：若采样的标识符t_i在记忆体中有记录被命中，则无需调用补全引擎一，根据情况选择保留或直接进行剪枝操作；若采样的标识符t_i在记忆体中没有被命中，则调用补全引擎一，根据采样的标识符t_i、当前程序代码上下文和当前插入符位置，生成补全结果，如果结果不是未知的且补全结果为空，则判定当前采样的标识符t_i被拒绝，通过将采样的标识符t_i对应的概率p_i（1≤i≤n）置零，完成剪枝操作并更新记忆体记录，否则判定当前采样的标识符t_i被采纳，并更新记忆体记录，继续对标识符库T中的标识符进行采样，重复步骤S2，直到标识符库T中所有的标识符采样完毕；S2. Sample the identifiers in the identifier library T in the set order, obtain the identifier _ti (1≤i≤n), and check whether the record corresponding to the identifier is stored in the memory: if the sampled identifier If the symbol t _i has a record hit in the memory, there is no need to call the completion engine 1. It can be retained or directly pruned according to the situation; if the sampled identifier t _i is not hit in the memory, the completion engine 1 is called. Engine 1 generates a completion result based on the sampled identifier _ti , the current program code context and the current caret position. If the result is not unknown and the completion result is empty, it is determined that the current sampled identifier _ti is rejected. By setting the probability p _i (1≤i≤n) corresponding to the sampled identifier _ti to zero, the pruning operation is completed and the memory record is updated. Otherwise, the current sampled identifier _ti is determined to be adopted, and the memory record is updated. , continue to sample the identifiers in the identifier library T, and repeat step S2 until all identifiers in the identifier library T are sampled;

S3.补全引擎二对更新后的标识符库T主动补全标识符字符串，以给出一系列候选的补全标识符作为被采纳的标识符字符串候选使用的延续，利用更新后的标识符库重复步骤 S1-S3 ，直至补全所有标识符字符串，形成完整的补丁代码。S3. The completion engine 2 automatically completes the identifier string for the updated identifier library T to provide a series of candidate completion identifiers as a continuation of the adopted identifier string candidate, using the updated The identifier library repeats steps S1-S3 until all identifier strings are completed to form a complete patch code.

优选的，使用大型语言模型提供生成补丁代码的标识符库T（t₁,t₂,…t_n）及对应的概率库P(p₁,p₂,…p_n)，将标识符库T及对应的概率库P映射到搜索空间上，采样时按照概率大小从高到低进行。Preferably, a large language model is used to provide the identifier library T (t ₁ , t ₂ ,...t _n ) and the corresponding probability library P (p ₁ , p ₂ ,...p _n ) for generating patch codes, and the identifier library T And the corresponding probability library P is mapped to the search space, and sampling is performed from high to low according to the probability.

优选的，剪枝操作为：剪枝模块将搜索空间中的标识符t_i（1≤i≤n）对应的概率p_i（1≤i≤n）置为零；当某一标识符概率为零时，不对其进行采样。Preferably, the pruning operation is: the pruning module sets the probability p _i (1≤i≤n) corresponding to the identifier _ti (1≤i≤n) in the search space to zero; when the probability of a certain identifier is When zero, no sampling is performed.

优选的，步骤S2对每一个标识符库T进行采样判断过程中，不涉及大语言模型再次生成新的标识符库，即步骤S2只对大语言模型的本次输出结果进行判断。Preferably, the process of sampling and judging each identifier library T in step S2 does not involve the large language model generating a new identifier library again, that is, step S2 only judges the current output result of the large language model.

优选的，补丁代码生产方法：Preferred, patch code production method:

首先使用标识符作为屏蔽代码来替换有漏洞的代码块，形成补丁雏形；First, use the identifier as a shielding code to replace the vulnerable code block to form a prototype patch;

然后使用大型语言模型从漏洞位置周围的代码的上下文合成修复代码块来替换标识符。A large language model is then used to synthesize a fix code block from the context of the code surrounding the vulnerability location to replace the identifier.

优选的，步骤S2中，记忆体存储已知不可行或/和可行的标识符，标识符被命中分为以下三种情况：Preferably, in step S2, the memory stores identifiers that are known to be infeasible or/and feasible. The identifier is hit in the following three situations:

如果命中记忆体中被拒绝的记录，则判定当前标识符不可用，剪枝模块直接对搜索空间进行剪枝操作；If a rejected record in the memory is hit, it is determined that the current identifier is unavailable, and the pruning module directly prunes the search space;

如果命中记忆体中被拒绝标识符的前缀树Trie，则检查命中记录中任何标识符是否是新生成的t_i（1≤i≤n）的前缀，若是，剪枝模块直接对搜索空间进行剪枝操作，并重新从标识符库T中采样；If the prefix tree Trie of the rejected identifier in the memory is hit, check whether any identifier in the hit record is the prefix of the newly generated _ti (1≤i≤n). If so, the pruning module directly prunes the search space. branch operation and resample from the identifier library T;

如果命中记忆体中存储的可行记录，被采纳的标识符需要保留。If a hit is made to a feasible record stored in memory, the adopted identifier needs to be retained.

优选的，步骤S2中，修剪搜索空间过程为：Preferably, in step S2, the process of pruning the search space is:

首先，根据大型语言模型给出的标识符库T与对应的概率P的映射对候选的下一个标识符进行采样，相应地更新当前程序代码体，并将插入符号移动到新采样生成的标识符t_i（1≤i≤n）之后；First, the candidate next identifier is sampled according to the mapping between the identifier library T given by the large language model and the corresponding probability P, the current program code body is updated accordingly, and the caret is moved to the newly sampled identifier. After t _i (1≤i≤n);

然后，根据当前生成的标识符字串结果，调用补全引擎一，检查当前的字串结果；如果结果不是未知且没有完成补全，则意味着当前生产的标识符之后无法形成进一步候选的延续，因此，通过剪枝模块将搜索空间中的标识符t_i（1≤i≤n）对应的概率p_i（1≤i≤n）置零来对其进行剪枝操作，重新从标识符库T进行采样，执行下一个循环；否则，认为当前标识符可行。Then, based on the currently generated identifier string result, the completion engine 1 is called to check the current string result; if the result is not unknown and the completion is not completed, it means that the currently produced identifier cannot form a further candidate continuation. , therefore, through the pruning module, the probability p _i (1≤i≤n) corresponding to the identifier _ti (1≤i≤n) in the search space is set to zero to perform the pruning operation, and the identifier library is re-used. T is sampled and the next loop is executed; otherwise, the current identifier is considered feasible.

优选的，步骤S2中，所述记忆体构建被拒绝标识符的前缀树，具体步骤为：如果一个标识符被拒绝，意味着在该标识符之后无法形成候选的延续来获得静态有效的补丁代码，那么任何有此类前缀的标识符都应该被拒绝，构建并不断更新给定程序代码体和插入符位置的所有被拒绝的标识符的前缀树，并检查其中的任何标识符是否是新生成的下一个标识符的前缀；如果是，则重新从T中采样。Preferably, in step S2, the memory constructs a prefix tree of rejected identifiers. The specific steps are: if an identifier is rejected, it means that a candidate continuation cannot be formed after the identifier to obtain a statically valid patch code. , then any identifier with such a prefix should be rejected, build and continuously update a prefix tree of all rejected identifiers given the program code body and caret position, and check whether any identifier in it is newly generated The prefix of the next identifier; if so, resample from T.

优选的，步骤S3中，主动补全操作步骤为：Preferably, in step S3, the active completion operation steps are:

根据给定的程序代码体和当前插入符位置，获取补全结果，并检查当前标识符字串是否为未知；如果是，则结果将设置为空字符串，这意味着不会生成额外的补全；否则，计算所有补全结果的公共前缀，并调整补全结果以适应语言模型的词汇表要求，并返回结果。Gets the completion result based on the given program code body and the current caret position, and checks whether the current identifier string is unknown; if so, the result will be set to an empty string, which means no additional completions will be generated. Complete; otherwise, calculate the common prefix of all completion results, adjust the completion results to adapt to the vocabulary requirements of the language model, and return the results.

一种基于大型语言模型和补全引擎的自动程序修复系统，包括大型语言模型、剪枝模块、补全引擎一和补全引擎二；An automatic program repair system based on a large language model and a completion engine, including a large language model, a pruning module, a completion engine one and a completion engine two;

大型语言模型用于提供生成补丁代码的标识符库T（t₁,t₂,…t_n）及对应的概率P(p₁,p₂,…p_n)，并将标识符库T以及对应的概率P映射至搜索空间；The large language model is used to provide the identifier library T (t ₁ , t ₂ ,...t _n ) and the corresponding probability P (p ₁ , p ₂ ,...p _n ) that generate the patch code, and combine the identifier library T and the corresponding The probability P is mapped to the search space;

剪枝模块用于检查标识符库T中被采样的标识符t_i（1≤i≤n）是否命中记忆体中的内容，并根据情况对搜索空间进行剪枝；The pruning module is used to check whether the sampled identifier _ti (1≤i≤n) in the identifier library T hits the content in the memory, and prunes the search space according to the situation;

在标识符t_i没有命中记忆体时，补全引擎一获取记忆体查询结果，对标识符字串进行补全操作，若补全结果不是未知的且补全结果为空，则触发剪枝模块进行搜索空间剪枝，否则标识符t_i被采纳；When the identifier _ti does not hit the memory, the completion engine obtains the memory query result and performs a completion operation on the identifier string. If the completion result is not unknown and the completion result is empty, the pruning module is triggered. Perform search space pruning, otherwise identifier _ti is adopted;

补全引擎二主动补全标识符字符串，以给出一系列候选的补全标识符作为当前被命中的标识符字符串候选使用的延续，生成补全结果。The completion engine 2 actively completes the identifier string to provide a series of candidate completion identifiers as a continuation of the currently hit identifier string candidate, and generates a completion result.

与现有技术相比，本申请具有以下有益效果：Compared with the existing technology, this application has the following beneficial effects:

1）本专利提出的方法针对与修复单一漏洞的场景，补丁是在已经提供了准确的故障代码位置定位的前提下开展的，通过更改定位点出的连续代码来获得补丁代码。更进一步的，本专利提出的方法可以通过使用单独的填充标识符来替换故障代码位置出的代码，并借助大型语言模型生成替换代码来扩展实现对多个漏洞代码的修补。本专利提出的方法可以直接适用于通用编程语言，同时引入最小的开销，并且可以使用补全引擎主动完成当前补丁代码标识符的生成，而无需多次调用大型语言模型。1) The method proposed in this patent is aimed at repairing a single vulnerability. The patch is carried out on the premise that accurate location of the fault code has been provided, and the patch code is obtained by changing the consecutive codes pointed out by the location. Furthermore, the method proposed in this patent can be extended to patch multiple vulnerable codes by using a separate padding identifier to replace the code at the faulty code location, and using a large language model to generate replacement code. The method proposed in this patent can be directly applied to general-purpose programming languages while introducing minimal overhead, and can use the completion engine to proactively complete the generation of current patch code identifiers without the need to call large language models multiple times.

2）本专利提出的方法，首先使用大型语言模型提供生成的补丁代码中下一个标识符的概率，然后查询补全引擎，通过动态将无效令牌的概率归零来修改概率列表，从新的概率列表中进行选取以选择下一个标识符。此外，鉴于补全引擎具备的提供代码补全建议的能力，当补全引擎只提供一个候选的标识符后缀来补全上下文时，我们就使用采用该标识符。这不仅允许本专利提出的方法可以生成有效的、不常见的和长标识符的补丁代码，而且还减少了迭代生成长标识符名称所需的大型语言模型的工作量。2) The method proposed in this patent first uses a large language model to provide the probability of the next identifier in the generated patch code, and then queries the completion engine to modify the probability list by dynamically returning the probability of invalid tokens to zero, starting from the new probability Select the next identifier in the list. In addition, given the completion engine's ability to provide code completion suggestions, when the completion engine only provides a candidate identifier suffix to complete the context, we use that identifier. This not only allows the method proposed in this patent to generate efficient, uncommon and long identifier patch codes, but also reduces the workload of large language models required to iteratively generate long identifier names.

附图说明Description of the drawings

图1为本申请流程图。Figure 1 is a flow chart of this application.

图2为本申请实施例图。Figure 2 is a diagram of an embodiment of the present application.

具体实施方式Detailed ways

以下详细说明都是例示性的，旨在对本申请提供进一步的说明。除非另有指明，本文使用的所有技术和科学术语具有与本申请所属技术领域的普通技术人员通常理解的相同含义。需要注意的是，这里所使用的术语仅是为了描述具体实施方式，而非意图限制根据本申请的示例性实施方式。The following detailed description is illustrative and is intended to provide further explanation of the present application. Unless otherwise defined, all technical and scientific terms used herein have the same meanings commonly understood by one of ordinary skill in the art to which this application belongs. It should be noted that the terms used herein are only for describing specific embodiments and are not intended to limit the exemplary embodiments according to the present application.

S1.使用大型语言模型提供生成补丁代码的标识符库T（t₁,t₂,…t_n）及对应的概率库P(p₁,p₂,…p_n)，将标识符库T及对应的概率库P映射到搜索空间上，采样时按照概率大小从高到低进行。S1. Use a large language model to provide the identifier library T (t ₁ , t ₂ ,...t _n ) and the corresponding probability library P (p ₁ , p ₂ ,...p _n ) for generating patch codes, and combine the identifier library T and The corresponding probability library P is mapped to the search space, and sampling is performed from high to low according to the probability.

S2.按设定顺序对标识符库T中的标识符进行采样，获取标识符t_i（1≤i≤n），并检查记忆体中是否存储有该标识符对应的记录：若采样的标识符t_i在记忆体中有记录被命中，则无需调用补全引擎一，直接进行剪枝操作；若采样的标识符t_i在记忆体中没有被命中，则调用补全引擎一，根据采样的标识符t_i、当前程序代码上下文和当前插入符位置，生成补全结果，如果结果不是未知的且补全结果为空，则判定当前采样的标识符t_i被拒绝，通过将采样的标识符t_i对应的概率p_i（1≤i≤n）置零，完成剪枝操作并更新记忆体记录，否则判定当前采样的标识符t_i被采纳，并更新记忆体记录，继续对标识符库T中的标识符进行采样，重复步骤S2，直到标识符库T中所有的标识符采样完毕。可以把步骤S2看作是标识符采样内循环过程。S2. Sample the identifiers in the identifier library T in the set order, obtain the identifier _ti (1≤i≤n), and check whether the record corresponding to the identifier is stored in the memory: if the sampled identifier If the symbol t _i has a record hit in the memory, there is no need to call the completion engine 1, and the pruning operation is performed directly; if the sampled identifier t _i is not hit in the memory, the completion engine 1 is called, and the pruning operation is performed directly according to the sample. The identifier _ti , the current program code context and the current caret position are used to generate a completion result. If the result is not unknown and the completion result is empty, it is determined that the currently sampled identifier _ti is rejected, and the sampled identifier is The probability p _i (1≤i≤n) corresponding to symbol _ti is set to zero, the pruning operation is completed and the memory record is updated, otherwise it is determined that the currently sampled identifier _ti is adopted, and the memory record is updated, and the identifier is continued. The identifiers in the identifier library T are sampled, and step S2 is repeated until all identifiers in the identifier library T are sampled. Step S2 can be regarded as an identifier sampling inner loop process.

如果结果不是未知的且补全结果为空，补全引擎一判断条件的逻辑相当于：If the result is not unknown and the completion result is empty, the completion engine's logic for judging the condition is equivalent to:

（1）假设采样的t_i可用，把它放到已经生成的代码字符串后边，形成标识符字串S`。(1) Assume that the sampled _ti is available, put it behind the generated code string to form the identifier string S`.

（2）把S`交给补全引擎一，判断S`是否合法。若S`合法，则要么t_i是最后一个单词（补全结果是未知的），要么可以在S`的基础上继续生成可能的延续（补全结果不为空）；否则，若补全引擎一生成的结果不是未知的且补全结果为空，则表明S`不合法，即把t_i放在已经生成的标识符字串后边不合适，也就是t_i是被拒绝的。(2) Hand S` to completion engine 1 to determine whether S` is legal. If S` is legal, then either t _i is the last word (the completion result is unknown), or possible continuations can be generated based on S` (the completion result is not empty); otherwise, if the completion engine Once the generated result is not unknown and the completion result is empty, it means that S` is illegal, that is, it is inappropriate to put _ti after the generated identifier string, that is, _ti is rejected.

（3）根据第（2）中结果判定t_i是被拒绝还是被采纳。(3) Determine whether _ti is rejected or accepted based on the result in (2).

剪枝操作为：剪枝模块将搜索空间中的标识符t_i（1≤i≤n）对应的概率p_i（1≤i≤n）置为零；当某一标识符概率为零时，不对其进行采样。The pruning operation is: the pruning module sets the probability p _i (1≤i≤n) corresponding to the identifier _ti (1≤i≤n) in the search space to zero; when the probability of a certain identifier is zero, It is not sampled.

步骤S2对每一个标识符库T进行采样判断过程中，不涉及大语言模型再次生成新的标识符库，即步骤S2只对大语言模型的本次输出结果进行判断。The process of sampling and judging each identifier library T in step S2 does not involve the large language model regenerating a new identifier library, that is, step S2 only judges the current output result of the large language model.

记忆体存储已知可行的标识符时，若采样的标识符t_i（1≤i≤n）在记忆体中有记录被命中，则无需调用补全引擎一，直接进行剪枝操作。When the memory stores known feasible identifiers, if the sampled identifier _ti (1≤i≤n) has a record hit in the memory, there is no need to call the completion engine 1 and the pruning operation is performed directly.

记忆体存储已知被拒绝的标识符、已知被采纳的标识符和已知被拒绝的标识符的前缀树，标识符记录被命中分为以下三种情况：The memory stores the prefix tree of known-rejected identifiers, known-adopted identifiers, and known-rejected identifiers. The identifier record is hit in the following three situations:

（1）如果命中记忆体中被拒绝的记录，则判定当前标识符t_i（1≤i≤n）不可用，剪枝模块直接对搜索空间进行剪枝操作，并继续从标识符库T中采样。(1) If a rejected record in the memory is hit, it is determined that the current identifier t _i (1≤i≤n) is not available. The pruning module directly performs pruning operation on the search space and continues to extract the identifier from the identifier library T. sampling.

（2）如果命中记忆体中被采纳的记录，则判定当前标识符t_i（1≤i≤n）可用，维持当前标识符的概率，并继续从标识符库T中采样。(2) If the adopted record in the memory is hit, it is determined that the current identifier _ti (1≤i≤n) is available, the probability of the current identifier is maintained, and sampling from the identifier library T is continued.

（3）如果命中记忆体中被拒绝标识符的前缀树Trie，则检查命中记录中任何标识符是否是新生成的标识符字串的前缀，若是，剪枝模块直接对搜索空间进行剪枝操作，并重新从标识符库T中采样；(3) If the prefix tree Trie of the rejected identifier in the memory is hit, check whether any identifier in the hit record is the prefix of the newly generated identifier string. If so, the pruning module directly prunes the search space. , and resample from the identifier library T;

所述记忆体构建被拒绝标识符的前缀树，具体步骤为：如果一个标识符被拒绝，意味着在该标识符之后无法形成候选的延续来获得静态有效的补丁代码，那么任何有此类前缀的标识符都应该被拒绝，构建并不断更新给定程序代码体和插入符位置的所有被拒绝的标识符的前缀树，并检查其中的任何标识符是否是新生成的下一个标识符的前缀；如果是，则重新从T中采样。The memory constructs a prefix tree of rejected identifiers. The specific steps are: if an identifier is rejected, which means that no candidate continuation can be formed after the identifier to obtain a statically valid patch code, then any prefix with such a of identifiers should all be rejected, construct and continuously update a prefix tree of all rejected identifiers for a given program code body and caret position, and check whether any identifier in it is a prefix of the newly generated next identifier ; If so, resample from T.

如记忆体中没有存储当前采样的标识符时，则调用补全引擎一，根据采样的标识符t_i（1≤i≤n）、当前程序代码上下文和当前插入符位置，生成补全结果，如果结果不是未知的且补全结果为空，则判定当前采样的标识符t_i被拒绝，通过将采样的标识符t_i对应的概率p_i（1≤i≤n）置零，完成剪枝操作并更新记忆体记录，否则判定当前采样的标识符t_i被采纳，并更新记忆体记录。If the identifier of the current sample is not stored in the memory, the completion engine 1 is called to generate the completion result based on the sample identifier _ti (1≤i≤n), the current program code context and the current caret position. If the result is not unknown and the completion result is empty, it is determined that the currently sampled identifier t _i is rejected, and pruning is completed by setting the probability p _i (1≤i≤n) corresponding to the sampled identifier t _i to zero. Operate and update the memory record, otherwise it is determined that the identifier _ti of the current sample is adopted, and the memory record is updated.

S3.补全引擎二主动补全标识符字符串，以给出一系列可能的补全标识符作为当前被命中的标识符字符串候选使用的延续，利用更新后的标识符库重复步骤 S1-S3，直至补全所有标识符字符串，形成完整的补丁代码。S3. The completion engine 2 actively completes the identifier string to provide a series of possible completion identifiers as a continuation of the currently hit identifier string candidate, and repeats step S1- using the updated identifier library. S3 until all identifier strings are completed to form a complete patch code.

补丁代码生产方法：Patch code production method:

步骤S3中，根据给定的程序代码体和当前插入符位置，获取补全结果，并检查当前标识符字串是否为未知；如果是，则结果将设置为空字符串，这意味着不会生成额外的补全；否则，计算所有补全结果的公共前缀，并调整补全结果以适应语言模型的词汇表要求，并返回结果。In step S3, obtain the completion result based on the given program code body and the current caret position, and check whether the current identifier string is unknown; if so, the result will be set to an empty string, which means there will be no Generate additional completions; otherwise, compute the common prefix of all completion results, adjust the completion results to fit the language model's vocabulary requirements, and return the results.

整个步骤S1-S3可以看成是一个大的外循环。The entire steps S1-S3 can be regarded as a large outer loop.

大型语言模型用于提供生成补丁代码的标识符库T（t₁,t₂,…t_n）以及对应的概率P(p₁,p₂,…p_n)，并将标识符库T以及对应的概率P映射至搜索空间；The large language model is used to provide the identifier library T (t ₁ , t ₂ ,...t _n ) and the corresponding probability P (p ₁ , p ₂ ,...p _n ) that generate the patch code, and the identifier library T and the corresponding The probability P is mapped to the search space;

在标识符t_i没有命中记忆体时，补全引擎一获取记忆体查询结果，对标识符字串进行补全操作，若补全结果不是未知的且补全结果为空，则触发剪枝模块进行搜索空间剪枝，否则标识符t_i被采纳。When the identifier _ti does not hit the memory, the completion engine obtains the memory query result and performs a completion operation on the identifier string. If the completion result is not unknown and the completion result is empty, the pruning module is triggered. Search space pruning is performed, otherwise identifier _ti is adopted.

实施例，如图2所示，展示的是本专利提出的实现补丁代码生成的过程。The embodiment, as shown in Figure 2, shows the process of patch code generation proposed in this patent.

生成过程由内、外两个循环组成一个大循环，大循环不断使用大型语言模型和补全引擎二之间的协同生成新的标识符来更新生成的结果。The generation process consists of two inner and outer loops forming a large loop. The large loop continuously uses the collaboration between the large language model and the completion engine to generate new identifiers to update the generated results.

首先外部循环应用当前产生的标识符作为大型语言模型的输入（图2中的①），大型语言模型返回从给出的当前采样的标识符库T（String,Name,End,…）及对应的概率库P（91%,3%,0.2%,…），将标识符库T及对应的概率库P映射到内循环的搜索空间上，采样时按照概率大小从高到低进行。然后，进入内循环的标识符采样阶段，从搜索空间中重复进行标识符采样过程，检查其可行性，并修剪搜索空间，直到标识符被采纳。First, the outer loop applies the currently generated identifier as the input of the large language model (① in Figure 2). The large language model returns the given currently sampled identifier library T (String, Name, End,...) and the corresponding The probability library P (91%, 3%, 0.2%,...) maps the identifier library T and the corresponding probability library P to the search space of the inner loop, and the sampling is performed from high to low according to the probability. Then, enter the identifier sampling phase of the inner loop, repeat the identifier sampling process from the search space, check its feasibility, and prune the search space until the identifier is adopted.

每次对标识符库T进行采样获取t_i（1≤i≤n）后，首先检查标识符t_i是否命中记忆体中的内容(图2中的②)，该记忆体存储了已知可行或不可行的标识符记录。不可行标识符的记忆体记录包括了我们自定义的前缀树数据结构（Trie），将在后续内容中介绍。Each time the identifier library T is sampled to obtain _ti (1≤i≤n), first check whether the identifier _ti hits the content in the memory (② in Figure 2). The memory stores known feasible or infeasible identifier record. The memory record of infeasible identifiers includes our custom prefix tree data structure (Trie), which will be introduced in the following content.

当采样的标识符t_i命中记忆体中的不可行记录时，通过将该标识符t_i的概率p_i（1≤i≤n）设置为零（图2中的③）来修剪搜索空间，并且下一个采样操作将在更新的搜索空间上进行。这样，在标识符选择阶段就不会再次采样相同的标识符了，避免进行无用操作。如果采样的标识符t_i没有命中记忆体中的存储内容（即记忆体中没有存储标识符Name），则调用补全引擎一，根据采样的标识符t_i、当前程序代码上下文和当前插入符位置，生成补全结果，如果结果不是未知的且补全结果为空，则判定当前采样的标识符t_i被拒绝，通过将采样的标识符t_i对应的概率p_i（1≤i≤n）置零，完成剪枝操作并更新记忆体记录，否则判定当前采样的标识符t_i被采纳，并更新记忆体记录。在这两种情况下，记忆体都会更新（图2中的⑤）。接受标识符t_i后（图2中的⑥），我们进一步利用补全引擎二，尝试主动补全标识符字串（图2中的⑦）。When the sampled identifier _ti hits an infeasible record in the memory, the search space is pruned by setting the probability p _i (1≤i≤n ₎ of this identifier ti to zero (③ in Figure 2), And the next sampling operation will be performed on the updated search space. In this way, the same identifier will not be sampled again during the identifier selection phase, avoiding useless operations. If the sampled identifier _ti does not hit the storage content in the memory (that is, the identifier Name is not stored in the memory), the completion engine 1 is called. According to the sampled identifier _ti , the current program code context and the current caret position, generate the completion result. If the result is not unknown and the completion result is empty, it is determined that the currently sampled identifier t _i is rejected, and the probability p _i corresponding to the sampled identifier t _i is determined (1≤i≤n ) is set to zero, complete the pruning operation and update the memory record, otherwise it is determined that the identifier _ti of the current sample is adopted, and the memory record is updated. In both cases, the memory is updated (⑤ in Figure 2). After accepting the identifier t _i (⑥ in Figure 2), we further use the completion engine 2 to try to actively complete the identifier string (⑦ in Figure 2).

内循环进行时，不涉及大语言模型再次生成新的标识符库，即内循环只对大语言模型的本次输出结果进行判断。When the inner loop is in progress, the large language model is not involved in generating a new identifier library again, that is, the inner loop only judges the current output result of the large language model.

最后，本专利提出的方法将所有新生成的被接受的标识符附加到当前代系并开始新的循环，直到生成完整的补丁代码。当模型生成表示结束的特殊标志时循环停止。Finally, the method proposed by this patent appends all newly generated accepted identifiers to the current generation and starts a new cycle until a complete patch code is generated. The loop stops when the model generates a special flag indicating the end.

为了缓解内循环过程中采样测试和剪枝操作循环带来的效率下降，加快搜索过程，本专利提出的方法应用了记忆体技术来减少调用补全引擎一进行分析的频率。In order to alleviate the efficiency drop caused by the sampling test and pruning operation loop during the inner loop process and speed up the search process, the method proposed in this patent applies memory technology to reduce the frequency of calling the completion engine for analysis.

记忆体主要实现以下3个作用：Memory mainly fulfills the following three functions:

（1）记住被拒绝的标识符：(1) Remember rejected identifiers:

在实践操作中修复漏洞需要生成大量的样本，这意味着相同的程序代码体和当前插入符位置可能会在剪枝操作中重复出现，因此，我们可以通过将在采样过程中经过判断而被修剪掉的标识符存储在记忆体中，并将被拒绝的标识符概率置零，从而加快剪枝操作进程。Fixing vulnerabilities in practice requires generating a large number of samples, which means that the same program code body and current caret position may appear repeatedly in the pruning operation. Therefore, we can be pruned by judging during the sampling process. The dropped identifiers are stored in the memory and the probability of rejected identifiers is set to zero, thereby speeding up the pruning process.

（2）记住采纳的标识符：(2) Remember the adopted identifier:

除了被拒绝的标识符之外，我们还可以将之前被采纳的标识符存储在记忆体中，从而避免对补全引擎一的调用，而直接判定该标识符可用。In addition to rejected identifiers, we can also store previously accepted identifiers in memory, thereby avoiding a call to the completion engine and directly determining that the identifier is available.

（3）构建被拒绝标识符的前缀树：(3) Construct a prefix tree of rejected identifiers:

语言模型词汇表中的许多标识符可能是另一个标识符的前缀，这在语言模型中是很常见的。显然，如果一个标识符被拒绝，意味着在该标识符之后无法形成可能的延续来获得静态有效的补丁代码，那么任何有此类前缀的标识符都应该被拒绝。因此，我们构建并不断更新给定程序代码体和插入符位置的所有被拒绝的标识符的前缀树（表示为Trie），并检查其中的任何标识符是否是新生成的下一个标识符的前缀。如果是，则直接跳到下一次迭代，避免进一步分析。It is common in language models that many identifiers in a language model vocabulary may be a prefix of another identifier. Obviously, if an identifier is rejected, meaning that no possible continuation can be formed after that identifier to obtain a statically valid patch code, then any identifier with such a prefix should be rejected. Therefore, we build and continuously update a prefix tree (denoted as a Trie) of all rejected identifiers for a given program code body and caret position, and check whether any identifier in it is a prefix of the newly generated next identifier . If so, jump directly to the next iteration and avoid further analysis.

以上所述仅为本申请的优选实施例而已，并不用于限制本申请，对于本领域的技术人员来说，本申请可以有各种更改和变化。凡在本申请的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本申请的保护范围之内。The above descriptions are only preferred embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and changes. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of this application shall be included in the protection scope of this application.

Claims

1. An automatic program repair method based on a large language model and completion engine, which is characterized by including the following steps:

S1. Use a large language model to generate the identifier library T (t ₁ , t ₂ ,...t _n ) of the patch code and the corresponding probability P (p ₁ , p ₂ ,...p _n );

S2. Sample the identifiers in the identifier library T in the set order, obtain the identifier _ti (1≤i≤n), and check whether the record corresponding to the identifier is stored in the memory: if the sampled identifier If the symbol t _i has a record hit in the memory, there is no need to call the completion engine 1. It can be retained or directly pruned according to the situation; if the sampled identifier t _i is not hit in the memory, the completion engine 1 is called. Engine 1 generates a completion result based on the sampled identifier _ti , the current program code context and the current caret position. If the result is not unknown and the completion result is empty, it is determined that the current sampled identifier _ti is rejected. By setting the probability p _i (1≤i≤n) corresponding to the sampled identifier _ti to zero, the pruning operation is completed and the memory record is updated. Otherwise, the current sampled identifier _ti is determined to be adopted, and the memory record is updated. , continue to sample the identifiers in the identifier library T, and repeat step S2 until all identifiers in the identifier library T are sampled;

S3. The completion engine 2 automatically completes the identifier string for the updated identifier library T to provide a series of candidate completion identifiers as a continuation of the adopted identifier string candidate, using the updated The identifier library repeats steps S1-S3 until all identifier strings are completed to form a complete patch code.

2. An automatic program repair method based on a large language model and a completion engine according to claim 1, characterized in that the identifier library T and the corresponding probability library P are mapped to the search space, and the sampling is carried out according to the probability Size proceeds from high to low.

3. An automatic program repair method based on a large language model and a completion engine according to claim 1, characterized in that the pruning operation is: the pruning module removes the identifier _ti (1≤i ≤n) the corresponding probability p _i (1≤i≤n) is set to zero; when the probability of a certain identifier is zero, it is not sampled.

4. An automatic program repair method based on a large language model and a completion engine according to claim 1, characterized in that, in the process of sampling and judging each identifier library T in step S2, the large language model is not involved again. Generate a new identifier library, that is, step S2 only judges the current output result of the large language model.

5. An automatic program repair method based on a large language model and a completion engine according to claim 1, characterized in that the patch code production method:

First, use the identifier as a shielding code to replace the vulnerable code block to form a prototype patch;

A large language model is then used to synthesize a fix code block from the context of the code surrounding the vulnerability location to replace the identifier.

6. An automatic program repair method based on a large language model and a completion engine according to claim 1, characterized in that in step S2, the memory stores identifiers that are known to be infeasible or/and feasible. Being hit is divided into the following three situations:

If a rejected record in the memory is hit, it is determined that the current identifier is unavailable, and the pruning module directly prunes the search space;

If the prefix tree Trie of the rejected identifier in the memory is hit, check whether any identifier in the hit record is the prefix of the newly generated _ti (1≤i≤n). If so, the pruning module directly prunes the search space. branch operation and resample from the identifier library T;

If a hit is made to a feasible record stored in memory, the adopted identifier needs to be retained.

7. An automatic program repair method based on a large language model and a completion engine according to claim 1, characterized in that in step S2, the process of pruning the search space is:

First, the candidate next identifier is sampled according to the mapping between the identifier library T given by the large language model and the corresponding probability P, the current program code body is updated accordingly, and the caret is moved to the newly sampled identifier. After t _i (1≤i≤n);

Then, based on the currently generated identifier string result, the completion engine 1 is called to check the current string result; if the result is not unknown and the completion is not completed, it means that the currently produced identifier cannot form a further candidate continuation. , therefore, through the pruning module, the probability p _i (1≤i≤n) corresponding to the identifier _ti (1≤i≤n) in the search space is set to zero to perform the pruning operation, and the identifier library is re-used. T is sampled and the next loop is executed; otherwise, the current identifier is considered feasible.

8. An automatic program repair method based on a large language model and a completion engine according to claim 1, characterized in that, in step S2, the memory constructs a prefix tree of rejected identifiers, and the specific steps are: If an identifier is rejected, it means that no candidate continuation can be formed after the identifier to obtain statically valid patch code, then any identifier with such a prefix should be rejected, building and continuously updating the given program code body. and a prefix tree of all rejected identifiers at the caret position, and checks whether any identifier in it is a prefix of the newly generated next identifier; if so, resamples from T.

9. An automatic program repair method based on a large language model and a completion engine according to claim 1, characterized in that in step S3, the active completion operation steps are:

Gets the completion result based on the given program code body and the current caret position, and checks whether the current identifier string is unknown; if so, the result will be set to an empty string, which means no additional completions will be generated. Complete; otherwise, calculate the common prefix of all completion results, adjust the completion results to adapt to the vocabulary requirements of the language model, and return the results.

10. An automatic program repair system based on a large language model and a completion engine, characterized by including a large language model, a pruning module, a completion engine one and a completion engine two;

The large language model is used to provide the identifier library T (t ₁ , t ₂ ,...t _n ) and the corresponding probability P (p ₁ , p ₂ ,...p _n ) that generate the patch code, and combine the identifier library T and the corresponding The probability P is mapped to the search space;

The pruning module is used to check whether the sampled identifier _ti (1≤i≤n) in the identifier library T hits the content in the memory, and prunes the search space according to the situation;

When the identifier _ti does not hit the memory, the completion engine obtains the memory query result and performs a completion operation on the identifier string. If the completion result is not unknown and the completion result is empty, the pruning module is triggered. Perform search space pruning, otherwise identifier _ti is adopted;

The completion engine 2 actively completes the identifier string to provide a series of candidate completion identifiers as a continuation of the currently hit identifier string candidate, and generates a completion result.