
WO2025092447A1 - Parallelization strategy optimization method and system, device, and medium - Google Patents

Parallelization strategy optimization method and system, device, and medium

Info

Publication number
WO2025092447A1
WO2025092447A1 (PCT/CN2024/125585, CN2024125585W)
Authority
WO
WIPO (PCT)
Prior art keywords
input data
parallelization strategy
strategy optimization
optimization method
softmax function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/125585
Other languages
French (fr)
Chinese (zh)
Inventor
颜深根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Infinigence Ai Intelligent Technology Co Ltd
Original Assignee
Shanghai Infinigence Ai Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Infinigence Ai Intelligent Technology Co Ltd filed Critical Shanghai Infinigence Ai Intelligent Technology Co Ltd
Publication of WO2025092447A1
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F 7/556 Logarithmic or exponential functions

Definitions

  • the present invention belongs to the field of deep learning technology, and specifically relates to a parallelization strategy optimization method, system, device and medium.
  • a large language model is a deep learning model trained with a large amount of text data that can generate natural language text or understand the meaning of language text.
  • Large language models can handle a variety of natural language tasks, such as text classification, question answering, and dialogue, and are an important path to artificial intelligence.
  • a large language model composed of Transformer can be divided into two stages: Prefill and Decode.
  • the main difference between the two stages is the size of the input matrix Q.
  • the executed data flow is similar, both consisting of multiple Transformer layers, each of which can be divided into linear operations and attention mechanism operations, where the attention mechanism operation includes two general matrix multiplications and one softmax operation.
  • in the current large language model inference calculation flow, the problem with the attention mechanism calculation pipeline is as follows: the common attention mechanism calculation pipeline uses a partial softmax operation, which computes results from partial matrix data. Because the data obtained for each part is different, information synchronization and result updates are required between the calculation results of the parts; this synchronous update of the partial softmax calculations causes nearly 20% additional overhead.
  • the present invention provides a parallelization strategy optimization method, system, device and medium, which at least partially solve the problems existing in the prior art.
  • an embodiment of the present disclosure provides a parallelization strategy optimization method, comprising the following steps:
  • the optimized softmax function is processed by exponential operation processing and sequence-sum operation processing in parallel, wherein matrix multiplication operation processing is performed after the exponential operation processing is completed, and the sequence-sum operation processing result is used to correct the matrix multiplication operation processing result, completing the parallelization strategy optimization.
  • the softmax function in the parallelization strategy is (with the scaling factor denoted φ): softmax(x)_i = e^{x_i - φ} / Σ_d e^{x_d - φ}, x ∈ R^d
  • x is the input data; φ is the scaling factor, which is the maximum preset fixed value; R denotes the real numbers; i is the index of the input data; x_i is the i-th input data; e is the Napier constant; and x_d is the d-th input data, with the sum taken over all input data.
  • the process of obtaining the maximum preset fixed value is:
  • the majority of the input data counted for the model refers to 99.99% of the input data.
  • the value range of the maximum preset fixed value is: -100 < maximum preset fixed value.
  • an inner loop operation processing of optimizing the softmax function result and the feature matrix is performed.
  • the inner loop operation processing is to perform an optimized softmax function operation processing on the feature vector of each sample in the feature matrix to obtain the probability distribution of the sample.
  • the input data of the optimized softmax function and the feature matrix are processed asynchronously separately.
  • the characteristic matrix is a V matrix
  • the process of the inner loop operation processing is: softmax(x) · V = ( Σ_{j=1}^{p} Σ_{i=1}^{d/p} e^{x_i^{(j)} - φ} · v_i^{(j)} ) / ( Σ_{j=1}^{p} Σ_{i=1}^{d/p} e^{x_i^{(j)} - φ} )
  • x is the input data; φ is the scaling factor, which is the maximum preset fixed value; R denotes the real numbers; x_i^{(j)} is the i-th dimension of the input data vector x^{(j)}; x_i is the i-th dimension of the input vector; e is the Napier constant; x_d is the d-th dimension of the input vector; p is the number of input data vectors x^{(j)}; j indexes the j-th vector of the input data; d/p is the number of dimensions of each vector x^{(j)}; v_i^{(j)} is the i-th dimension of the j-th column vector in the V matrix; and e^{x_i^{(j)} - φ} is the result of the scaling and exponential operation on the input data x_i^{(j)}.
  • an embodiment of the present disclosure provides a parallelization strategy optimization system, the system comprising:
  • the preprocessing unit is configured to replace the scaling factor in the softmax function in the parallelization strategy with a maximum preset fixed value to obtain an optimized softmax function.
  • the output unit is configured to perform exponential operation processing and sequence-sum operation processing on the optimized softmax function in parallel, wherein matrix multiplication operation processing is performed after the exponential operation processing is completed, and the sequence-sum operation processing result is used to correct the matrix multiplication operation processing result, completing the parallelization strategy optimization.
  • the present disclosure also provides an electronic device, the electronic device comprising:
  • the memory stores instructions that can be executed by the at least one processor.
  • the at least one processor executes the method for parallelization strategy optimization in the aforementioned first aspect or any implementation of the first aspect.
  • an embodiment of the present disclosure further provides a non-transitory computer-readable storage medium, which stores computer instructions, and when the computer instructions are executed by at least one processor, the at least one processor executes the method for parallelization strategy optimization in the aforementioned first aspect or any implementation of the first aspect.
  • the embodiments of the present disclosure also provide a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions.
  • when the program instructions are executed by a computer, the computer executes the method for parallelization strategy optimization in the aforementioned first aspect or any implementation of the first aspect.
  • the present invention has the following beneficial technical effects:
  • the present invention provides a parallelization strategy optimization method, system, device and medium, comprising the following steps: replacing the scaling factor in the softmax function in the parallelization strategy with a maximum preset fixed value to obtain an optimized softmax function; performing exponential operation processing and sequence sum operation processing on the optimized softmax function in parallel, wherein matrix multiplication operation processing is performed after the exponential operation processing is completed, and the matrix multiplication operation processing result is corrected using the sequence sum operation processing result to complete the parallelization strategy optimization; the application can improve the parallelization strategy operation efficiency of a large language model.
  • FIG1 is a schematic diagram of a flow chart of a parallelization strategy optimization method according to an embodiment of the present disclosure
  • FIG2 is a schematic diagram of a large language model reasoning calculation process in the prior art
  • FIG3 is a schematic diagram showing a comparison of different softmax calculation methods in the prior art
  • FIG4 is a schematic diagram of a method flow chart of a maximum preset fixed value according to an embodiment of the present disclosure
  • FIG5 is a schematic diagram of a process of recalculating all partial vectors according to an embodiment of the present disclosure
  • FIG6 is a diagram showing the effect of asynchronous softmax improvement in the Prefill stage according to an embodiment of the present disclosure
  • FIG7 is a diagram showing the effect of asynchronous softmax improvement in the Decode stage according to an embodiment of the present disclosure
  • FIG8 is a parallelization strategy optimization system according to an embodiment of the present disclosure.
  • FIG. 9 is a diagram of a parallelization strategy optimization device according to an embodiment of the present disclosure.
  • the present invention proposes a method for modular circuit behavior simulation.
  • the method for modular circuit behavior simulation proposed in the present invention includes two parts: a software interface for hardware behavior modeling and a program for executing the simulation process, wherein the software interface for hardware behavior modeling takes tasks as basic units, and each task includes a start event and an end event.
  • FIG1 shows a parallelization strategy optimization method 100 of the present embodiment.
  • the method includes the following steps: At step S101 , the scaling factor in the softmax function in the parallelization strategy is replaced with a maximum preset fixed value to obtain an optimized softmax function.
  • the softmax function in the parallelization strategy is (with the scaling factor denoted φ): softmax(x)_i = e^{x_i - φ} / Σ_d e^{x_d - φ}, x ∈ R^d
  • x is the input data; φ is the scaling factor, which is the maximum preset fixed value; R denotes the real numbers; i is the index of the input data; x_i is the i-th input data; e is the Napier constant; and x_d is the d-th input data, with the sum taken over all input data.
  • a method 200 for obtaining the maximum preset fixed value is shown in FIG4 , and includes the following steps:
  • the model inference record preprocessing stage is executed multiple times to obtain input data of the softmax function.
  • at step S220, the statistical distribution of the input data is analyzed to obtain the maximum preset fixed value, which satisfies: for the majority of the input data counted for the model, neither the case input data x_i >> maximum preset fixed value nor the case input data x_i << maximum preset fixed value holds.
  • the majority of the input data counted for the model refers to 99.99% of the input data.
  • the maximum preset fixed value has a value range of: -100 < maximum preset fixed value.
  • the optimized softmax function is subjected to exponential operation processing and sequence-sum operation processing in parallel, wherein matrix multiplication operation processing is performed after the exponential operation processing is completed, and the sequence-sum operation processing result is used to correct the matrix multiplication operation processing result.
  • an inner loop operation processing of optimizing the softmax function result and the feature matrix is performed.
  • the inner loop operation processing is to perform an optimized softmax function operation processing on the feature vector of each sample in the feature matrix to obtain the probability distribution of the sample.
  • the input data of the optimized softmax function and the feature matrix are processed asynchronously separately;
  • the asynchronous processing is a processing method that does not need to wait for the processing to be completed before performing subsequent operations.
  • the core logic of asynchronous processing is that it does not block the current thread to wait for the processing to be completed, but allows subsequent operations to proceed until another thread completes the processing and notifies this thread through a callback.
  • This processing method is similar to SMS communication: after sending a message, there is no need to remain in a waiting state. For example, in programming, asynchronous processing can be used for long-running operations, such as network requests or file I/O operations, to improve the program's responsiveness and concurrency.
  • asynchronous processing can be used for computationally intensive tasks such as training language models to make full use of computing resources and improve training efficiency.
  • the outer accumulation performs external accumulation processing after all partial vectors are processed;
  • the outer accumulation usually refers to an accumulation operation on an external variable during a loop or iteration process, and this external variable is usually used to calculate a certain accumulation sum, or is used to calculate the sum of all loop iterations after the loop ends; in each iteration of the loop, the external variable will be accumulated with an internal variable in the loop, thereby gradually increasing the value of the external variable; when the loop ends, the value of the external variable is the accumulation sum of all loop iterations; the outer accumulation is usually used to count the number of times an event occurs, or to calculate the sum of a variable in the loop.
  • This accumulation operation can easily calculate the sum of the results of all iterations in the loop, and can avoid repeated accumulation operations within the loop, thereby improving the efficiency and readability of the code.
  • the characteristic matrix is a V matrix
  • the inner loop operation processing is: softmax(x) · V = ( Σ_{j=1}^{p} Σ_{i=1}^{d/p} e^{x_i^{(j)} - φ} · v_i^{(j)} ) / ( Σ_{j=1}^{p} Σ_{i=1}^{d/p} e^{x_i^{(j)} - φ} )
  • x is the input data; φ is the scaling factor, which is the maximum preset fixed value; R denotes the real numbers; x_i^{(j)} is the i-th dimension of the input data vector x^{(j)}; x_i is the i-th dimension of the input vector; e is the Napier constant; x_d is the d-th dimension of the input vector; p is the number of input data vectors x^{(j)}; j indexes the j-th vector of the input data; d/p is the number of dimensions of each vector x^{(j)}; v_i^{(j)} is the i-th dimension of the j-th column vector in the V matrix; and e^{x_i^{(j)} - φ} is the result of the scaling and exponential operation on the input data x_i^{(j)}.
  • the purpose of optimizing the softmax function and the V matrix for inner loop operations is to obtain a set of probability distributions.
  • the softmax function is a commonly used function that can map any real number to a value between [0,1], and the sum of these values is equal to 1, so it can be interpreted as a probability distribution
  • the V matrix is usually a feature matrix, each row represents the feature vector of a sample; therefore, the inner loop operation of the softmax function and the V matrix can be understood as: performing a softmax function operation on the feature vector of each sample to obtain the probability distribution of the sample, and this probability distribution can be used to represent the probability of the sample belonging to each category, thereby providing a basis for subsequent tasks such as classification or clustering.
  • FIG. 5 shows an example of a parallelization strategy optimization method
  • the two vectors x and y are computed from Q·K^T and each is divided into two partial vectors; the process from Q·K^T to these partial vectors is omitted here.
  • each x_i lies within the preset bounds, so the first partial vector of x is processed with the scaled exponentials and the corresponding elements of V. There are two asynchronous threads, and each thread performs the corresponding calculation: one accumulates the products of the scaled exponentials with the corresponding elements of V, and the other accumulates the sequence sum of the scaled exponentials.
  • the two threads synchronize after processing all partial vectors and perform the final division operation. For y, the first partial vector is processed similarly; however, when an element of y falls outside the preset bounds and the scaled exponential overflows, both threads are terminated and the first thread recalculates all partial vectors based on the calculation results.
  • FIG5 shows the process of recalculating all partial vectors in this embodiment:
  • Figure 5(a) shows that each partial softmax result is processed separately without synchronous update
  • Figure 5(b) shows that when overflow occurs, all partial softmax calculations need to be recalculated.
  • the optimization of the softmax function can also be called an asynchronous softmax solution, which can be applied to both the Prefill stage and the Decode stage.
  • the proposed solution is tested against state-of-the-art attention implementation solutions.
  • the test results on an NVIDIA GPU are shown in Figures 6 and 7.
  • in the Prefill stage, the proposed scheme achieves average speedups of 1.52 times and 1.19 times compared with xformers [5] and FlashAttention2, respectively.
  • in the Decode stage, the proposed scheme outperforms the xformers implementation customized for decoding, denoted as xformers-decoder in Figure 8, and is 2.02 times faster than the prior-art FlashDecoding in the long-context case.
  • FIG8 shows a parallelization strategy optimization system 300 provided by the present invention.
  • the system 300 includes: a preprocessing unit 310 and an output unit 320
  • a preprocessing unit 310 is configured to replace the scaling factor in the softmax function in the parallelization strategy with a maximum preset fixed value to obtain an optimized softmax function
  • the output unit 320 is configured to perform exponential operation processing and sequence sum operation processing on the optimized softmax function in parallel, wherein matrix multiplication operation processing is performed after the exponential operation processing is completed, and the matrix multiplication operation processing result is corrected using the sequence sum operation processing result to complete the parallelization strategy optimization.
  • FIG9 shows a schematic diagram of an electronic device 1000 that can implement a method or implement an embodiment of the present invention, which may include more or fewer electronic devices than shown in the figure in some embodiments. In some embodiments, it can be implemented using a single or multiple electronic devices. In some embodiments, it can be implemented using cloud or distributed electronic devices.
  • the electronic device 1000 includes a processor 1001, which can perform various appropriate operations and processes according to the programs and/or data stored in the read-only memory (ROM) 1002 or the programs and/or data loaded from the storage part 1008 to the random access memory (RAM) 1003.
  • the processor 1001 can be a multi-core processor or can include multiple processors.
  • the processor 1001 can include a general main processor and one or more special coprocessors, such as a central processing unit (CPU), a graphics processing unit (GPU), a neural network processor (NPU), a digital signal processor (DSP), etc.
  • Various programs and data required for the operation of the electronic device 1000 are also stored in the RAM 1003.
  • Processor 1001, ROM 1002 and RAM 1003 are connected to each other via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
  • I/O input/output
  • the processor and the memory are used together to execute the program stored in the memory.
  • when the program is executed by the computer, the methods, steps or functions described in the above embodiments can be implemented.
  • the following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, a touch screen, etc.; an output section 1007 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 1008 including a hard disk, etc.; and a communication section 1009 including a network interface card such as a LAN card, a modem, etc.
  • the communication section 1009 performs communication processing via a network such as the Internet.
  • a drive 1010 is also connected to the I/O interface 1005 as needed.
  • a removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 1010 as needed, so that a computer program read therefrom is installed into the storage section 1008 as needed.
  • FIG. 9 only schematically shows some components, which does not mean that the computer system 1000 only includes the components shown in FIG. 9.
  • the systems, devices, modules or units described in the above embodiments may be implemented by a computer or its associated components.
  • the computer may be, for example, a mobile terminal, a smart phone, a personal computer, a laptop computer, a vehicle-mounted human-computer interaction device, a personal digital assistant, a media player, a navigation device, a game console, a tablet computer, a wearable device, a smart TV, an Internet of Things system, a smart home, an industrial computer, a server or a combination thereof.
  • a storage medium is provided, the storage medium storing a computer program, the computer program being configured to execute any parallelization strategy optimization method of the embodiments of the present invention when executed.
  • Storage media in embodiments of the present invention include permanent and non-permanent, removable and non-removable items that can be used to store information by any method or technology.
  • Examples of storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
  • PRAM phase change memory
  • SRAM static random access memory
  • DRAM dynamic random access memory
  • RAM random access memory
  • ROM read-only memory
  • EEPROM electrically erasable programmable read-only memory
  • flash memory or other memory technology
  • the methods, programs, systems, devices, etc. of the embodiments of the present invention may be executed or implemented in a single or multiple networked computers, or may be practiced in a distributed computing environment.
  • tasks may be performed by remote processing devices connected via a communication network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The present invention provides a parallelization strategy optimization method and system, a device, and a medium. The method comprises the following steps: replacing a scaling factor in a softmax function in a parallelization strategy with a maximum preset fixed value, to obtain an optimized softmax function; and performing exponential operation processing and sequence sum operation processing in parallel on the optimized softmax function, performing matrix multiplication processing after the exponential operation processing is completed, and correcting a matrix multiplication processing result by using a sequence sum operation processing result, so as to complete parallelization strategy optimization. The present application can improve the parallelization strategy operation efficiency of a large language model.

Description

A parallelization strategy optimization method, system, device and medium

Cross-references

This application claims priority to Chinese patent application 202311456221.5, filed on November 3, 2023, the entire contents of which are incorporated herein by reference.

Technical Field

The present invention belongs to the field of deep learning technology, and specifically relates to a parallelization strategy optimization method, system, device and medium.

Background Art

A large language model is a deep learning model trained on a large amount of text data that can generate natural language text or understand the meaning of language text. Large language models can handle a variety of natural language tasks, such as text classification, question answering and dialogue, and are an important path toward artificial intelligence.

As large language models become increasingly important in various fields, the performance of large language model inference is crucial for large-scale applications. Much work has been done to optimize large language model inference. As shown in FIG. 2, a large language model composed of Transformers can be divided into two stages, Prefill and Decode. The main difference between the two stages is the size of the input matrix Q; the executed data flow is similar, both stages consisting of multiple Transformer layers, each of which can be divided into linear operations and attention mechanism operations, where the attention mechanism operation includes two general matrix multiplications and one softmax operation.

During large language model inference, in order to improve the parallelism of the computation and reduce the overhead of reading and writing data back, the existing work FlashAttention changed the original holistic calculation method of the attention mechanism shown in FIG. 3(a): it splits the attention matrix and performs a partial softmax calculation on each part, as shown in FIG. 3(b). The calculation process therefore needs to synchronize the current information with the past information and update the existing results.

In the current large language model inference calculation flow, the problem with the attention mechanism calculation pipeline is as follows: the common attention mechanism calculation pipeline uses a partial softmax operation, which computes results from partial matrix data. Because the data obtained for each part is different, information synchronization and result updates are required between the calculation results of the parts; this synchronous update of the partial softmax calculations causes nearly 20% additional overhead.

Therefore, a parallelization strategy optimization method that can improve the operational efficiency of the parallelization strategy of large language models is desired.

Summary of the invention

In view of the problems existing in the prior art, the present invention provides a parallelization strategy optimization method, system, device and medium, which at least partially solve the problems existing in the prior art.

In a first aspect, an embodiment of the present disclosure provides a parallelization strategy optimization method, comprising the following steps:

replacing the scaling factor in the softmax function in the parallelization strategy with a maximum preset fixed value to obtain an optimized softmax function; and

performing exponential operation processing and sequence-sum operation processing on the optimized softmax function in parallel, wherein matrix multiplication operation processing is performed after the exponential operation processing is completed, and the sequence-sum operation processing result is used to correct the matrix multiplication operation processing result, completing the parallelization strategy optimization.

According to a specific implementation of the embodiment of the present disclosure, the softmax function in the parallelization strategy is (with the scaling factor denoted φ):

softmax(x)_i = e^{x_i - φ} / Σ_d e^{x_d - φ}, x ∈ R^d

where x is the input data; φ is the scaling factor, which is the maximum preset fixed value; R denotes the real numbers; i is the index of the input data; x_i is the i-th input data; e is the Napier constant; and x_d is the d-th input data, with the sum taken over all input data.

According to a specific implementation of the embodiment of the present disclosure, the process of obtaining the maximum preset fixed value is:

executing model inference multiple times and recording the input data of the softmax function in the preprocessing stage; and

analyzing the statistical distribution of the input data to obtain the maximum preset fixed value, where the maximum preset fixed value satisfies:

for the majority of the input data counted for the model, neither the case input data x_i >> maximum preset fixed value nor the case input data x_i << maximum preset fixed value holds.

According to a specific implementation of the embodiment of the present disclosure, the majority of the input data counted for the model refers to 99.99% of the input data.

According to a specific implementation of the embodiment of the present disclosure, the value range of the maximum preset fixed value is:

-100 < maximum preset fixed value

According to a specific implementation of the embodiment of the present disclosure, after the matrix multiplication operation result is corrected using the sequence-sum operation result, an inner loop operation is performed on the optimized softmax function result and the feature matrix.

According to a specific implementation of the embodiment of the present disclosure, the inner loop operation performs the optimized softmax function operation on the feature vector of each sample in the feature matrix to obtain the probability distribution of that sample.

According to a specific implementation of the embodiment of the present disclosure, during the inner loop operation, the input data of the optimized softmax function and of the feature matrix are each processed asynchronously.

According to a specific implementation of the embodiment of the present disclosure, there is an outer accumulation in the inner loop operation, and the outer accumulation is performed after all partial vectors have been processed.

According to a specific implementation of the embodiment of the present disclosure, the feature matrix is the V matrix, and the inner loop operation processing is:

softmax(x) · V = ( Σ_{j=1}^{p} Σ_{i=1}^{d/p} e^{x_i^{(j)} - φ} · v_i^{(j)} ) / ( Σ_{j=1}^{p} Σ_{i=1}^{d/p} e^{x_i^{(j)} - φ} )

where x is the input data; φ is the scaling factor, which is the maximum preset fixed value; R denotes the real numbers; x_i^{(j)} is the i-th dimension of the input data vector x^{(j)}; x_i is the i-th dimension of the input vector; e is the Napier constant; x_d is the d-th dimension of the input vector; p is the number of input data vectors x^{(j)}; j indexes the j-th vector of the input data; d/p is the number of dimensions of each vector x^{(j)}; v_i^{(j)} is the i-th dimension of the j-th column vector in the V matrix; and e^{x_i^{(j)} - φ} is the result of the scaling and exponential operation on the input data x_i^{(j)}.

According to a specific implementation of the embodiment of the present disclosure, during the inner loop operation processing, and without loss of generality, it is assumed for each x_i that if x_i is far greater than or far less than the maximum preset fixed value, so that the scaled exponential would overflow or underflow, the asynchronous partial softmax calculation of the vector x to which x_i belongs is terminated, and the value of the optimized softmax function is then recalculated using the synchronous softmax method.

In a second aspect, an embodiment of the present disclosure provides a parallelization strategy optimization system, the system comprising:

a preprocessing unit, configured to

replace the scaling factor in the softmax function in the parallelization strategy with a maximum preset fixed value to obtain an optimized softmax function; and

an output unit, configured to

perform exponential operation processing and sequence-sum operation processing on the optimized softmax function in parallel, wherein matrix multiplication operation processing is performed after the exponential operation processing is completed, and the sequence-sum operation processing result is used to correct the matrix multiplication operation processing result, completing the parallelization strategy optimization.

An embodiment of the present disclosure further provides an electronic device, the electronic device comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and when the instructions are executed by the at least one processor, the at least one processor executes the parallelization strategy optimization method in the aforementioned first aspect or any implementation of the first aspect.

In a fourth aspect, an embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing computer instructions which, when executed by at least one processor, cause the at least one processor to execute the parallelization strategy optimization method in the aforementioned first aspect or any implementation of the first aspect.

In a fifth aspect, an embodiment of the present disclosure further provides a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to execute the parallelization strategy optimization method in the aforementioned first aspect or any implementation of the first aspect.

Other optional features and technical effects of the embodiments of the present invention are described in part below and in part will become apparent from reading this document.

Compared with the prior art, the present invention has the following beneficial technical effects:

The present invention provides a parallelization strategy optimization method, system, device and medium, comprising the following steps: replacing the scaling factor in the softmax function in the parallelization strategy with a maximum preset fixed value to obtain an optimized softmax function; and performing exponential operation processing and sequence-sum operation processing on the optimized softmax function in parallel, wherein matrix multiplication operation processing is performed after the exponential operation processing is completed, and the sequence-sum operation processing result is used to correct the matrix multiplication operation processing result, completing the parallelization strategy optimization. The present application can improve the parallelization strategy operation efficiency of a large language model.

BRIEF DESCRIPTION OF THE DRAWINGS

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The elements shown are not limited to the proportions shown in the drawings, and the same or similar reference numerals in the drawings denote the same or similar elements, wherein:

FIG. 1 is a schematic flow chart of a parallelization strategy optimization method according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a large language model inference calculation process in the prior art;

FIG. 3 is a schematic diagram comparing different softmax calculation methods in the prior art;

FIG. 4 is a schematic flow chart of a method for obtaining the maximum preset fixed value according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of the process of recalculating all partial vectors according to an embodiment of the present disclosure;

FIG. 6 is a diagram showing the effect of the asynchronous softmax improvement in the Prefill stage according to an embodiment of the present disclosure;

FIG. 7 is a diagram showing the effect of the asynchronous softmax improvement in the Decode stage according to an embodiment of the present disclosure;

FIG. 8 shows a parallelization strategy optimization system according to an embodiment of the present disclosure; and

FIG. 9 shows a parallelization strategy optimization device according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

The embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

The following describes the embodiments of the present disclosure through specific examples, and those skilled in the art can easily understand other advantages and effects of the present disclosure from the contents disclosed in this specification. Obviously, the described embodiments are only a part of the embodiments of the present disclosure, rather than all of them. The present disclosure can also be implemented or applied through other different specific embodiments, and the details in this specification can also be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present disclosure. It should be noted that, in the absence of conflict, the following embodiments and the features in the embodiments can be combined with each other. Based on the embodiments in the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present disclosure.

It should be noted that various aspects of the embodiments within the scope of the appended claims are described below. It should be apparent that the aspects described herein may be embodied in a wide variety of forms, and any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, those skilled in the art should understand that an aspect described herein may be implemented independently of any other aspect, and two or more of these aspects may be combined in various ways. For example, any number of the aspects set forth herein may be used to implement a device and/or practice a method. In addition, structures and/or functionalities other than one or more of the aspects set forth herein may be used to implement such a device and/or practice such a method.

It should also be noted that the illustrations provided in the following embodiments only schematically illustrate the basic concept of the present disclosure. The drawings only show components related to the present disclosure rather than being drawn according to the number, shape and size of the components in an actual implementation; in an actual implementation, the type, quantity and proportion of each component may be changed arbitrarily, and the component layout may also be more complicated.

In addition, in the following description, specific details are provided to facilitate a thorough understanding of the examples. However, those skilled in the art will appreciate that the described aspects may be practiced without these specific details.

In order to solve the problem that existing circuit simulation methods cannot take both simulation accuracy and simulation speed into consideration, the present invention proposes a method for modular circuit behavior simulation.

The method for modular circuit behavior simulation proposed in the present invention includes two parts: a software interface for hardware behavior modeling and a program for executing the simulation process, wherein the software interface for hardware behavior modeling takes tasks as basic units, and each task includes a start event and an end event.

Next, the parallelization strategy optimization method, system, device and medium according to the embodiments of the present disclosure will be described with reference to FIG. 1 to FIG. 9.

FIG. 1 shows a parallelization strategy optimization method 100 of the present embodiment. As shown in FIG. 1, the method includes the following steps. At step S101, the scaling factor in the softmax function in the parallelization strategy is replaced with a maximum preset fixed value to obtain an optimized softmax function.

In an embodiment of the present invention, the softmax function in the parallelization strategy is (with the scaling factor denoted φ):

softmax(x)_i = e^{x_i - φ} / Σ_d e^{x_d - φ}, x ∈ R^d

where x is the input data; φ is the scaling factor, which is the maximum preset fixed value; R denotes the real numbers; i is the index of the input data; x_i is the i-th input data; e is the Napier constant; and x_d is the d-th input data, with the sum taken over all input data.
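By way of a non-limiting sketch (the preset value, function names and shapes below are assumptions made for illustration and are not prescribed by the present disclosure), replacing the per-vector maximum with a fixed scaling value leaves the softmax result unchanged while removing the dependency on the whole vector:

```python
import numpy as np

PHI = 3.0  # assumed preset fixed scaling value; the disclosure derives it statistically

def softmax_max_scaled(x):
    # conventional "safe" softmax: the scaling factor is the vector maximum,
    # which must be known for the whole vector before any exponential is taken
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_fixed_scaled(x, phi=PHI):
    # optimized softmax: the scaling factor is a preset fixed value, so each
    # part of x can be exponentiated independently, without synchronization
    e = np.exp(x - phi)
    return e / e.sum()

x = np.random.randn(8)
assert np.allclose(softmax_max_scaled(x), softmax_fixed_scaled(x))
```

Because e^{x_i - φ} / Σ_d e^{x_d - φ} differs from the conventional form only by the common factor e^{max(x) - φ} in the numerator and denominator, the two functions return the same probabilities; only the numerical range of the intermediate exponentials changes.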

In an embodiment of the present invention, a method 200 for obtaining the maximum preset fixed value is shown in FIG. 4 and includes the following steps.

At step S210, model inference is executed multiple times and the input data of the softmax function is recorded in the preprocessing stage.

Next, the method proceeds to step S220.

At step S220, the statistical distribution of the input data is analyzed to obtain the maximum preset fixed value, where the maximum preset fixed value satisfies:

for the majority of the input data counted for the model, neither the case input data x_i >> maximum preset fixed value nor the case input data x_i << maximum preset fixed value holds.

In an embodiment of the present invention, the majority of the input data counted for the model refers to 99.99% of the input data.

In an embodiment of the present invention, the value range of the maximum preset fixed value is:

-100 < maximum preset fixed value
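One possible realization of this statistical selection is sketched below; the recording interface, the percentile rule and the use of 99.99% as a coverage threshold are illustrative assumptions rather than a prescription of the present disclosure:

```python
import numpy as np

def record_softmax_inputs(model_runs):
    # model_runs: an iterable yielding the softmax input vectors observed
    # during repeated inference runs of the preprocessing (Prefill) stage
    return np.concatenate([np.asarray(x).ravel() for x in model_runs])

def choose_fixed_scale(samples, coverage=0.9999):
    # pick phi so that `coverage` of the recorded inputs are neither far above
    # nor far below it; here we simply take a high percentile of the samples
    return float(np.quantile(samples, coverage))

runs = [np.random.randn(1024) for _ in range(100)]   # stand-in for recorded inputs
phi = choose_fixed_scale(record_softmax_inputs(runs))
print("maximum preset fixed value:", phi)
```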

Next, the method proceeds to step S120.

At step S120, exponential operation processing and sequence-sum operation processing are performed on the optimized softmax function in parallel, wherein matrix multiplication operation processing is performed after the exponential operation processing is completed, and the sequence-sum operation processing result is used to correct the matrix multiplication operation processing result, completing the parallelization strategy optimization. It should be noted that, in matrix multiplication, the sequence-sum operation result is usually used to correct the matrix multiplication result. This process can be regarded as first performing a conventional matrix multiplication and then correcting the product according to the result of the sequence-sum operation. Specifically, suppose there are two matrices A and B and we want to compute A·B; after performing the conventional matrix multiplication we obtain a preliminary result C, and C is then corrected with the result of a sequence-sum operation to obtain the final result D. This correction process can effectively improve the accuracy and stability of the matrix multiplication, especially when processing large-scale, complex data.
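A minimal sketch of this step for a single attention row is given below; the two-thread layout, the shapes and the preset value are assumptions made for illustration, not the kernel implementation of the present disclosure. One worker computes the exponentials and the matrix product with V (the preliminary result C), the other accumulates the sequence sum, and the final division applies the correction (the result D):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

PHI = 3.0  # illustrative preset scaling value

def exp_and_matmul(x, V, phi=PHI):
    # exponential operation followed by the matrix multiplication (preliminary result C)
    return np.exp(x - phi) @ V

def sequence_sum(x, phi=PHI):
    # sequence-sum operation; runs in parallel with the path above
    return np.exp(x - phi).sum()

x = np.random.randn(128)       # one row of attention scores
V = np.random.randn(128, 64)   # value matrix

with ThreadPoolExecutor(max_workers=2) as pool:
    fut_c = pool.submit(exp_and_matmul, x, V)
    fut_s = pool.submit(sequence_sum, x)
    C, s = fut_c.result(), fut_s.result()

D = C / s  # correction of the matrix multiplication result with the sequence sum

ref = (np.exp(x - x.max()) / np.exp(x - x.max()).sum()) @ V
assert np.allclose(D, ref)
```

The exponentials are recomputed in each worker here purely to keep the two paths independent in the sketch; a real kernel would share them.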

In an embodiment of the present invention, after the matrix multiplication operation result is corrected using the sequence-sum operation result, an inner loop operation is performed on the optimized softmax function result and the feature matrix.

In an embodiment of the present invention, the inner loop operation performs the optimized softmax function operation on the feature vector of each sample in the feature matrix to obtain the probability distribution of that sample.

In an embodiment of the present invention, during the inner loop operation, the input data of the optimized softmax function and of the feature matrix are each processed asynchronously. Asynchronous processing is a processing method that does not need to wait for the processing to be completed before performing subsequent operations. Its core logic is that it does not block the current thread to wait for the processing to finish, but allows subsequent operations to proceed until another thread completes the processing and notifies this thread through a callback. This processing method is similar to SMS communication: after sending a message, there is no need to remain in a waiting state. For example, in programming, asynchronous processing can be used for long-running operations such as network requests or file I/O to improve the responsiveness and concurrency of a program; in natural language processing, asynchronous processing can be used for computationally intensive tasks such as training language models, so as to make full use of computing resources and improve training efficiency.
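As a generic, non-limiting illustration of such non-blocking processing (the task and its duration are invented for the example and are unrelated to the attention kernel), a long-running operation can be submitted to another thread and its result collected only when actually needed:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def long_running_task(n):
    # stand-in for a network request, file I/O, or a heavy computation
    time.sleep(0.1)
    return n * n

with ThreadPoolExecutor() as pool:
    future = pool.submit(long_running_task, 7)   # does not block this thread
    print("submitted; doing other work while the task runs")
    print("result:", future.result())            # collect the result when needed
```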

In an embodiment of the present invention, there is an outer accumulation in the inner loop operation, and the outer accumulation is performed after all partial vectors have been processed. It should be noted that the outer accumulation usually refers to an accumulation operation on an external variable during a loop or iteration process. This external variable is usually used to compute a running sum, or to compute the total over all loop iterations after the loop ends. In each iteration of the loop, the external variable is accumulated with an internal variable of the loop, gradually increasing its value; when the loop ends, the value of the external variable is the accumulated sum over all iterations. The outer accumulation is usually used to count how many times an event occurs, or to compute the sum of a variable over the loop. This accumulation operation makes it easy to compute the sum of the results of all iterations and avoids repeated accumulation operations inside the loop, thereby improving the efficiency and readability of the code.

In an embodiment of the present invention, the feature matrix is the V matrix, and the inner loop operation processing is:

softmax(x) · V = ( Σ_{j=1}^{p} Σ_{i=1}^{d/p} e^{x_i^{(j)} - φ} · v_i^{(j)} ) / ( Σ_{j=1}^{p} Σ_{i=1}^{d/p} e^{x_i^{(j)} - φ} )

where x is the input data; φ is the scaling factor, which is the maximum preset fixed value; R denotes the real numbers; x_i^{(j)} is the i-th dimension of the input data vector x^{(j)}; x_i is the i-th dimension of the input vector; e is the Napier constant; x_d is the d-th dimension of the input vector; p is the number of input data vectors x^{(j)}; j indexes the j-th vector of the input data; d/p is the number of dimensions of each vector x^{(j)}; v_i^{(j)} is the i-th dimension of the j-th column vector in the V matrix; and e^{x_i^{(j)} - φ} is the result of the scaling and exponential operation on the input data x_i^{(j)}.

In an embodiment of the present invention, during the inner loop operation processing, and without loss of generality, it is assumed for each x_i that if x_i is far greater than or far less than the maximum preset fixed value, so that the scaled exponential would overflow or underflow, the asynchronous partial softmax calculation of the vector x to which x_i belongs is terminated, and the value of the optimized softmax function is then recalculated using the synchronous softmax method.

It should be noted that the purpose of performing the inner loop operation on the optimized softmax function and the V matrix is to obtain a set of probability distributions. The softmax function is a commonly used function that maps any real numbers to values in [0, 1] that sum to 1, so the result can be interpreted as a probability distribution, while the V matrix is usually a feature matrix in which each row represents the feature vector of a sample. Therefore, the inner loop operation of the softmax function and the V matrix can be understood as performing the softmax function operation on the feature vector of each sample to obtain the probability distribution of that sample; this probability distribution can represent the probability that the sample belongs to each category, providing a basis for subsequent tasks such as classification or clustering.
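A compact sketch of the inner loop described above is given below, under the assumption that the score vector is split into p partial vectors; the outer accumulation of the exponential-weighted products with V and of the sequence sum stays outside the partition loop, and the single division at the end plays the role of the correction (names, shapes and the preset value are illustrative):

```python
import numpy as np

def inner_loop_with_v(x, V, phi=3.0, p=4):
    # x: softmax input for one row, shape (d,); V: feature (value) matrix, shape (d, h)
    d = x.shape[0]
    chunk = d // p
    acc = np.zeros(V.shape[1])   # outer accumulation of e^(x - phi) multiplied into V
    seq_sum = 0.0                # outer accumulation of the sequence sum
    for j in range(p):           # loop over the p partial vectors x^(j)
        xj = x[j * chunk:(j + 1) * chunk]
        Vj = V[j * chunk:(j + 1) * chunk]
        ej = np.exp(xj - phi)    # scaling and exponential operation
        acc += ej @ Vj           # matrix multiplication after the exponential
        seq_sum += ej.sum()      # sequence-sum operation
    return acc / seq_sum         # single correction after all partial vectors

x, V = np.random.randn(64), np.random.randn(64, 16)
ref = (np.exp(x - x.max()) / np.exp(x - x.max()).sum()) @ V
assert np.allclose(inner_loop_with_v(x, V), ref)   # matches softmax(x) @ V
```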

In an embodiment of the present invention, FIG. 5 shows an example of the parallelization strategy optimization method. The bounds a = -3 and b = 3 are preset, and the two vectors x and y are computed from Q·K^T, each being divided into two partial vectors; the process from Q·K^T to these partial vectors is omitted. Each x_i lies within the preset bounds, so the first partial vector of x is processed with the scaled exponentials and the corresponding elements of V. There are two asynchronous threads, and each thread performs the corresponding calculation: one accumulates the products of the scaled exponentials with the corresponding elements of V, and the other accumulates the sequence sum of the scaled exponentials.

The two threads synchronize after processing all partial vectors and perform the final division operation. For y, the first partial vector is processed in a similar way; however, when an element of y falls outside the preset bounds and the scaled exponential overflows, both threads are terminated, and the first thread recalculates all partial vectors based on the calculation results.
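A hedged sketch of this example and of the fallback of FIG. 5(b) is given below; the bounds a = -3 and b = 3 are taken from the example, while the preset scaling value, the two-way split and the data are illustrative assumptions (the two accumulation paths are shown sequentially rather than as real threads):

```python
import numpy as np

A, B, PHI = -3.0, 3.0, 3.0   # preset bounds from the example; PHI is an assumed scaling value

def attention_row(x, V, a=A, b=B, phi=PHI):
    acc, seq_sum = np.zeros(V.shape[1]), 0.0
    for xj, Vj in zip(np.split(x, 2), np.split(V, 2)):   # two partial vectors
        if np.any(xj < a) or np.any(xj > b):
            # an element leaves the preset range (FIG. 5(b)): terminate the
            # asynchronous partial softmax and recompute the whole row with
            # the synchronous, max-scaled softmax
            e = np.exp(x - x.max())
            return (e / e.sum()) @ V
        ej = np.exp(xj - phi)
        acc += ej @ Vj           # path 1: exponential-weighted products with V
        seq_sum += ej.sum()      # path 2: sequence sum
    return acc / seq_sum         # final division after both partial vectors

x = np.clip(np.random.randn(32), -2.9, 2.9)    # stays inside [a, b]
y = np.concatenate([x[:16], x[16:] + 10.0])    # second partial vector overflows
V = np.random.randn(32, 8)
out_x = attention_row(x, V)   # asynchronous path, no recomputation
out_y = attention_row(y, V)   # falls back to the synchronous recomputation
```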

FIG. 5 shows the process of recalculating all partial vectors in this embodiment:

FIG. 5(a) shows that each partial softmax result is processed separately, without synchronized updates; FIG. 5(b) shows that when an overflow occurs, all partial softmax computations need to be recalculated.

The experimental results of the embodiments of the present invention are as follows:

In this embodiment, the optimized softmax function, which may also be referred to as the asynchronous softmax scheme, can be applied to both the Prefill stage and the Decode stage. The proposed scheme was benchmarked against state-of-the-art attention implementations; the results on NVIDIA™ GPUs are shown in FIG. 6 and FIG. 7. In the Prefill stage, the scheme achieves average speedups of 1.52× and 1.19× over xformers [5] and FlashAttention2, respectively. In the Decode stage, the scheme outperforms the customized decoding implementation of xformers, denoted xformers-decoder in FIG. 8, and is 2.02× faster than the prior-art FlashDecoding in the long-context case.

FIG. 8 shows a parallelization strategy optimization system 300 provided by the present invention; the system 300 includes a preprocessing unit 310 and an output unit 320.

The preprocessing unit 310 is configured to replace the scaling factor in the softmax function of the parallelization strategy with a maximum preset fixed value to obtain an optimized softmax function;

The output unit 320 is configured to perform the exponential operation processing and the sequence-sum operation processing of the optimized softmax function in parallel, wherein the matrix multiplication operation processing is performed after the exponential operation processing is completed, and the result of the matrix multiplication operation processing is corrected using the result of the sequence-sum operation processing, completing the parallelization strategy optimization.
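A compact sketch of how the two units could cooperate, with invented class names and NumPy arrays; it only illustrates the exponential processing, the independent sequence-sum processing, and the later correction of the matrix product, not the actual implementation of the units.

import numpy as np

class PreprocessingUnit:
    def __init__(self, phi):
        self.phi = phi  # maximum preset fixed value replacing the scaling factor

    def optimized_exp(self, scores):
        # exponential operation of the optimized softmax, no per-row maximum
        return np.exp(scores - self.phi)

class OutputUnit:
    def forward(self, exp_scores, V):
        # the exponential result feeds the matrix multiplication ...
        unnormalized = exp_scores @ V
        # ... while the sequence sum is computed independently and is
        # used afterwards to correct the matrix multiplication result
        seq_sum = exp_scores.sum(axis=-1, keepdims=True)
        return unnormalized / seq_sum

scores = np.random.randn(4, 8)
V = np.random.randn(8, 16)
pre = PreprocessingUnit(phi=2.0)
out = OutputUnit().forward(pre.optimized_exp(scores), V)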

FIG. 9 shows a schematic diagram of an electronic device 1000 that can carry out a method of an embodiment of the present invention; in some embodiments, more or fewer devices than shown may be included. In some embodiments, the method may be implemented with a single electronic device or with multiple electronic devices; in some embodiments, it may be implemented with cloud-based or distributed electronic devices.

As shown in FIG. 9, the electronic device 1000 includes a processor 1001, which can perform various appropriate operations and processes according to programs and/or data stored in a read-only memory (ROM) 1002 or loaded from a storage portion 1008 into a random access memory (RAM) 1003. The processor 1001 may be a multi-core processor or may include multiple processors. In some embodiments, the processor 1001 may include a general-purpose main processor and one or more special-purpose coprocessors, for example a central processing unit (CPU), a graphics processing unit (GPU), a neural network processing unit (NPU), a digital signal processor (DSP), and so on. The RAM 1003 also stores various programs and data required for the operation of the electronic device 1000. The processor 1001, the ROM 1002 and the RAM 1003 are connected to one another through a bus 1004, and an input/output (I/O) interface 1005 is also connected to the bus 1004.

The processor and the memory are jointly used to execute a program stored in the memory; when the program is executed by the computer, the methods, steps or functions described in the above embodiments can be implemented.

The following components are connected to the I/O interface 1005: an input portion 1006 including a keyboard, a mouse, a touch screen and the like; an output portion 1007 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, as well as a speaker; a storage portion 1008 including a hard disk and the like; and a communication portion 1009 including a network interface card such as a LAN card or a modem. The communication portion 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1010 as needed, so that a computer program read from it can be installed into the storage portion 1008 as needed. FIG. 9 only schematically shows some of the components, which does not mean that the computer system 1000 includes only the components shown in FIG. 9.

The systems, devices, modules or units set forth in the above embodiments may be implemented by a computer or its associated components. The computer may be, for example, a mobile terminal, a smartphone, a personal computer, a laptop computer, an in-vehicle human-computer interaction device, a personal digital assistant, a media player, a navigation device, a game console, a tablet computer, a wearable device, a smart TV, an Internet of Things system, a smart home, an industrial computer, a server, or a combination thereof.

Although not shown, in an embodiment of the present invention a storage medium is provided, the storage medium storing a computer program configured to execute, when run, the parallelization strategy optimization method of any embodiment of the present invention.

The storage media of the embodiments of the present invention include permanent and non-permanent, removable and non-removable articles in which information storage can be achieved by any method or technology. Examples of storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.

The methods, programs, systems, devices and the like of the embodiments of the present invention may be executed or implemented on a single computer or on multiple networked computers, and may also be practiced in distributed computing environments. In the embodiments of this specification, in such distributed computing environments, tasks may be performed by remote processing devices connected through a communication network.

Those skilled in the art should understand that the embodiments of this specification may be provided as a method, a system or a computer program product. Accordingly, those skilled in the art will appreciate that the functional modules/units or controllers and the related method steps set forth in the above embodiments may be implemented in software, in hardware, or in a combination of software and hardware.

Unless explicitly stated otherwise, the actions or steps of the methods and programs described in the embodiments of the present invention do not have to be performed in a particular order and can still achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.

Multiple embodiments of the present invention are described herein, but for the sake of brevity the description of each embodiment is not exhaustive, and identical or similar features or parts shared by the embodiments may be omitted. Herein, "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that the description applies to at least one embodiment or example according to the present invention, rather than to all embodiments; these terms do not necessarily refer to the same embodiment or example. Where no contradiction arises, those skilled in the art may combine the different embodiments or examples described in this specification, as well as the features of those different embodiments or examples.

The exemplary systems and methods of the present invention have been specifically shown and described with reference to the above embodiments, which are merely examples of the best modes for implementing the systems and methods. It will be appreciated by those skilled in the art that various changes may be made to the embodiments of the systems and methods described herein when implementing the systems and/or methods, without departing from the spirit and scope of the present invention as defined in the appended claims.

Claims (15)

1. A parallelization strategy optimization method, characterized by comprising the following steps:
replacing the scaling factor in the softmax function of the parallelization strategy with a maximum preset fixed value to obtain an optimized softmax function;
performing exponential operation processing and sequence-sum operation processing on the optimized softmax function in parallel, wherein matrix multiplication operation processing is performed after the exponential operation processing is completed, and the result of the matrix multiplication operation processing is corrected using the result of the sequence-sum operation processing, completing the parallelization strategy optimization.

2. The parallelization strategy optimization method according to claim 1, characterized in that the softmax function in the parallelization strategy is:

where x is the input data; the scaling factor is the maximum preset fixed value; R denotes the real numbers; i is the number of the input data; x_i is the i-th input data; e is Napier's constant; and x_d is the d-th input data.

3. The parallelization strategy optimization method according to claim 1, characterized in that the process of obtaining the maximum preset fixed value is:
executing model inference multiple times and recording the input data of the softmax function obtained in the preprocessing stage;
analyzing the statistical distribution of the input data to obtain the maximum preset fixed value, wherein the maximum preset fixed value satisfies the following: for the majority of the input data counted by the model, neither the case of input data x_i >> the maximum preset fixed value nor the case of input data x_i << the maximum preset fixed value holds.

4. The parallelization strategy optimization method according to claim 3, characterized in that the majority of the input data counted by the model is 99.99% of the input data.

5. The parallelization strategy optimization method according to claim 1, characterized in that the value range of the maximum preset fixed value is:
-100 < maximum preset fixed value

6. The parallelization strategy optimization method according to claim 1, characterized in that, after the result of the matrix multiplication operation processing is corrected using the result of the sequence-sum operation processing, an inner loop operation processing of the optimized softmax function result and a feature matrix is performed.

7. The parallelization strategy optimization method according to claim 6, characterized in that the inner loop operation processing is to perform the optimized softmax function operation processing on the feature vector of each sample in the feature matrix to obtain the probability distribution of the sample.

8. The parallelization strategy optimization method according to claim 6, characterized in that, during the inner loop operation processing, the input data of the optimized softmax function and of the feature matrix are each processed asynchronously and separately.

9. The parallelization strategy optimization method according to claim 6, characterized in that there is an outer accumulation in the inner loop operation processing, and the outer accumulation performs the external accumulation processing after all partial vectors have been processed.

10. The parallelization strategy optimization method according to claim 6, characterized in that the feature matrix is a V matrix, and the inner loop operation processing is:

where x is the input data; the scaling factor is the maximum preset fixed value; R denotes the real numbers; x_i^(j) is the i-th dimension of the j-th partial vector x^(j) of the input data; x_i is the i-th dimension of the input vector; e is Napier's constant; x_d is the d-th dimension of the input vector; p is the number of partial vectors x^(j) of the input data; j is the index of the j-th vector of the input data; d/p is the number of dimensions of the vector x^(j); v_i^(j) is the i-th dimension of the j-th column vector of the V matrix; and the exponential term is the result of applying the scaling and exponential operations to the input data x_i.

11. The parallelization strategy optimization method according to claim 10, characterized in that, during the inner loop operation processing, without loss of generality, it is assumed for each x_i that, when x_i exceeds the preset upper bound or falls below the preset lower bound, the asynchronous partial softmax computation of the vector x to which x_i belongs is terminated, and the value of the optimized softmax function is then recalculated using the synchronous softmax method.

12. A parallelization strategy optimization system based on the parallelization strategy optimization method according to any one of claims 1 to 11, characterized by comprising:
a preprocessing unit configured to replace the scaling factor in the softmax function of the parallelization strategy with a maximum preset fixed value to obtain an optimized softmax function; and
an output unit configured to perform exponential operation processing and sequence-sum operation processing on the optimized softmax function in parallel, wherein matrix multiplication operation processing is performed after the exponential operation processing is completed, and the result of the matrix multiplication operation processing is corrected using the result of the sequence-sum operation processing, completing the parallelization strategy optimization.

13. A computer device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to execute the parallelization strategy optimization method according to any one of claims 1 to 11.

14. A non-transitory computer-readable storage medium storing computer instructions which, when executed by at least one processor, cause the at least one processor to execute the parallelization strategy optimization method according to any one of claims 1 to 11.

15. A computer program product, comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to execute the parallelization strategy optimization method according to any one of claims 1 to 11.
PCT/CN2024/125585 2023-11-03 2024-10-17 Parallelization strategy optimization method and system, device, and medium Pending WO2025092447A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202311456221.5 2023-11-03
CN202311456221.5A CN117407793B (en) 2023-11-03 2023-11-03 A parallel strategy optimization method, system, device and medium for large language model

Publications (1)

Publication Number Publication Date
WO2025092447A1 true WO2025092447A1 (en) 2025-05-08

Family

ID=89499750

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/125585 Pending WO2025092447A1 (en) 2023-11-03 2024-10-17 Parallelization strategy optimization method and system, device, and medium

Country Status (2)

Country Link
CN (1) CN117407793B (en)
WO (1) WO2025092447A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407793B (en) * 2023-11-03 2024-05-28 上海无问芯穹智能科技有限公司 A parallel strategy optimization method, system, device and medium for large language model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10909461B1 (en) * 2020-05-08 2021-02-02 Google Llc Attention neural networks with locality-sensitive hashing
US20220067513A1 (en) * 2020-08-28 2022-03-03 Nvidia Corp. Efficient softmax computation
CN115222033A (en) * 2022-08-22 2022-10-21 南京大学 A method and device for approximate calculation of softmax function
US20220383077A1 (en) * 2021-05-18 2022-12-01 Aptiv Technologies Limited Computer-Implemented Method of Executing SoftMax
US20230138659A1 (en) * 2021-10-29 2023-05-04 Samsung Electronics Co., Ltd. Device and method with transformer model implementation
CN117407793A (en) * 2023-11-03 2024-01-16 上海无问芯穹智能科技有限公司 Parallelization strategy optimization method, system, equipment and medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10678508B2 (en) * 2018-03-23 2020-06-09 Amazon Technologies, Inc. Accelerated quantized multiply-and-add operations
US11568238B2 (en) * 2019-06-28 2023-01-31 Amazon Technologies, Inc. Dynamic processing element array expansion
CN112669809A (en) * 2019-10-16 2021-04-16 百度(美国)有限责任公司 Parallel neural text to speech conversion
US20230244869A1 (en) * 2020-06-25 2023-08-03 Kpn Innovations, Llc. Systems and methods for classification of textual works
CN112883149B (en) * 2021-01-20 2024-03-26 华为技术有限公司 Natural language processing method and device
CN113377332B (en) * 2021-05-28 2023-08-22 南京大学 A Hardware Implementation Method of Softmax Based on Linear Segmentation
US20230133305A1 (en) * 2021-10-28 2023-05-04 Kwai Inc. Methods and devices for accelerating a transformer with a sparse attention pattern
CN116258172A (en) * 2021-12-09 2023-06-13 北京图森智途科技有限公司 Data processing method, related computing device and storage medium
CN115221846B (en) * 2022-06-08 2025-09-16 华为技术有限公司 Data processing method and related equipment
CN116312539A (en) * 2023-03-14 2023-06-23 上海数字大脑科技研究院有限公司 Chinese dialogue round correction method and system based on large model
CN116578699A (en) * 2023-04-10 2023-08-11 广东工业大学 Transformer-based sequence classification prediction method and system
CN116909532B (en) * 2023-09-12 2024-01-05 深圳须弥云图空间科技有限公司 Code generation and defect repair method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10909461B1 (en) * 2020-05-08 2021-02-02 Google Llc Attention neural networks with locality-sensitive hashing
US20220067513A1 (en) * 2020-08-28 2022-03-03 Nvidia Corp. Efficient softmax computation
US20220383077A1 (en) * 2021-05-18 2022-12-01 Aptiv Technologies Limited Computer-Implemented Method of Executing SoftMax
US20230138659A1 (en) * 2021-10-29 2023-05-04 Samsung Electronics Co., Ltd. Device and method with transformer model implementation
CN115222033A (en) * 2022-08-22 2022-10-21 南京大学 A method and device for approximate calculation of softmax function
CN117407793A (en) * 2023-11-03 2024-01-16 上海无问芯穹智能科技有限公司 Parallelization strategy optimization method, system, equipment and medium

Also Published As

Publication number Publication date
CN117407793A (en) 2024-01-16
CN117407793B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
EP3970012A1 (en) Scheduling operations on a computation graph
CN111831805A (en) A model creation method, apparatus, electronic device and readable storage device
CN109033301B (en) A database transaction execution method based on graphics processor
WO2025092447A1 (en) Parallelization strategy optimization method and system, device, and medium
CN117407643B (en) A general matrix multiplication optimization method, system, device and medium
CN118535332A (en) Deep learning reasoning method and device based on operator selection and fine granularity fusion
CN118626612A (en) Model training method, text generation method, device, equipment and medium
CN115150471A (en) Data processing method, device, equipment, storage medium and program product
CN118550708A (en) Task execution method, device, equipment and storage medium for large language model
CN111241408B (en) Recommendation model construction system and method
CN110111837A (en) The searching method and system of protein similarity based on two stages structure alignment
US20240160925A1 (en) Method, apparatus, device, and medium for determining update gradient for contrastive learning model
CN118690837A (en) Operator fusion method, device and electronic equipment
CN118709739A (en) A dynamic data selection method for improving model training efficiency
CN118821850A (en) Large language model training method, device, electronic device and medium
CN118607588A (en) A large language model compression method, system, device and storage medium
CN118865946A (en) Model training method, speech processing method and corresponding device
CN118521666A (en) Method for embedding queue distillation diffusion model based on online pixels
CN116257174A (en) Heterogeneous space optimizer based on tensor asynchronous hard disk read-write
CN117520403A (en) Data query method based on synchronous pipeline Gremlin language
Ahn et al. Common kernels and convolutions in binary-and ternary-weight neural networks
WO2020034752A1 (en) Floating income calculation method, apparatus and device, and computer-readable storage medium
CN116306393B (en) Logic synthesis size adjustment method, device, storage medium and electronic device
US20240386241A1 (en) Generating graph embeddings on distributed processors
CN118536554B (en) A reasoning method, system, device and storage medium for a diffusion transformer model based on a window attention layer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24884451

Country of ref document: EP

Kind code of ref document: A1