WO2025185502A9 - Data processing method and apparatus
- Publication number
- WO2025185502A9 (PCT/CN2025/079222)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- numerical
- input
- machine learning
- quantization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- This application relates to the field of artificial intelligence, and more particularly, to a data processing method and apparatus.
- Artificial intelligence (AI) comprises the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results.
- AI is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence.
- AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.
- the machine learning model may include an attention layer.
- This attention layer performs attention calculations on the input data (e.g., data represented by tokens; for convenience, we will refer to this as tokens below).
- the attention layer obtains intermediate results that need to be reused in subsequent attention calculations on the tokens.
- these intermediate results can be key-value (KV) data, i.e., a KV cache.
- the intermediate results that are reusable when processing a new token can be stored in memory, so that when attention calculations are needed for other tokens later, the intermediate results can be read from memory and used as the basis for the attention calculations of those subsequent tokens.
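- The caching pattern described above can be sketched as follows (a minimal, hypothetical single-head attention loop in Python/NumPy; the projection matrices Wq, Wk, Wv, the dimension d, and the token values are illustrative assumptions, not part of this application):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                      # embedding / head dimension (illustrative)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # the "KV cache": reusable intermediate results

def attend(token):
    """Process one token, reusing the cached K/V of all earlier tokens."""
    q = token @ Wq
    k_cache.append(token @ Wk)   # store this token's K for later reuse
    v_cache.append(token @ Wv)   # store this token's V for later reuse
    K = np.stack(k_cache)        # read back all cached K data
    V = np.stack(v_cache)        # read back all cached V data
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V           # attention output for this token

tokens = rng.standard_normal((5, d))
outputs = [attend(t) for t in tokens]
```

Each call only computes the projections of the new token; the projections of earlier tokens are read from the cache, which is exactly the data this application proposes to compress.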
- this application provides a data processing method, the method comprising: acquiring first data, the first data being obtained by a machine learning model based on first input data; performing non-uniform quantization processing on the first data to obtain first compressed data, and storing the first compressed data in a memory; reading the first compressed data from the memory, and performing, on the first compressed data, inverse quantization processing corresponding to the non-uniform quantization processing to obtain second data, the second data and second input data being used as inputs to the machine learning model, the second input data being data input into the machine learning model after the first input data.
- the second input data can be data input into the machine learning model after the first input data.
- the second input data can be data input into the machine learning model after and adjacent to the first input data, or data input into the machine learning model after the first input data but separated by multiple input data.
- the first data can be obtained by linearly transforming the input token by the intermediate network layer (e.g., attention layer) of the machine learning model.
- the first data can be K data and/or V data.
- current KV caching and compression methods all use uniform quantization.
- the distribution of KV data output by large language models generally exhibits a non-uniform distribution similar to a Gaussian distribution.
- Uniform quantization cannot adapt well to this non-uniform distribution of KV data, leading to a larger overall quantization error and a significant impact on small numerical values: because the data units of KV data are concentrated in a small interval, uniform quantization converts these concentrated data units into the same quantized value or a small number of quantized values, resulting in a significant loss of accuracy.
- non-uniform quantization is performed on the KV data, which better matches the distribution characteristics of KV data, resulting in a quantization result with smaller overall or average error, thereby improving the processing accuracy of the model.
- finer-grained quantization can be performed on denser numerical intervals in the first data, that is, more quantized values and corresponding intervals can be inserted (since there are more quantized values, the numerical width of the intervals will be lower), so that the quantization can better conform to the distribution characteristics of KV data and obtain a quantization result with smaller overall error or average error.
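- One way to realize this finer-grained placement of quantized values (a sketch only; this application does not prescribe a particular algorithm) is to start from uniformly spaced quantized values and refine them Lloyd-style, so that the quantized values crowd into the denser numerical intervals and the overall error shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 10_000)      # KV-like, roughly Gaussian data

def quantize(x, levels):
    """Map each value to the nearest quantized value in the codebook."""
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

# Start from 16 uniformly spaced quantized values (uniform quantization).
levels = np.linspace(data.min(), data.max(), 16)
mse_uniform = np.mean((data - quantize(data, levels)) ** 2)

# Lloyd-style refinement: move each quantized value to the mean of the
# data it covers, so levels crowd into the denser interval near zero.
for _ in range(30):
    q = quantize(data, levels)
    levels = np.array([data[q == lv].mean() if np.any(q == lv) else lv
                       for lv in levels])

mse_nonuniform = np.mean((data - quantize(data, levels)) ** 2)
assert mse_nonuniform <= mse_uniform     # overall error does not increase
```

After refinement the gaps between adjacent quantized values are narrow near zero (the dense range) and wide at the tails, matching the interval-width property described above.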
- the first data includes multiple data units
- the first compressed data includes quantized values corresponding to each data unit
- the data units of the first data include data in a first numerical range and data in a second numerical range
- the data in the first numerical range is denser than the data in the second numerical range
- the quantized value corresponding to the data interval in which each data unit is located is used as the quantized value corresponding to the data unit
- the numerical interval includes a first numerical interval and a second numerical interval
- the first numerical interval belongs to the first numerical range
- the second numerical interval belongs to the second numerical range
- the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
- the non-uniform quantization processing of the first data to obtain the first compressed data includes: converting the data units of the first data into corresponding quantized values based on a preset mapping relationship (e.g., in the form of a table) to obtain the first compressed data; wherein, the mapping relationship includes multiple numerical intervals and quantized values corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the multiple numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
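- A preset mapping relationship in table form can be sketched as follows (the interval boundaries and quantized values are made-up illustrative numbers; the narrow intervals sit around zero, where the data is assumed dense, and the wide intervals at the tails):

```python
import numpy as np

# Illustrative preset mapping relationship: interval boundaries packed
# tightly around zero and spaced widely at the tails.
boundaries = np.array([-3.0, -1.0, -0.4, -0.1, 0.1, 0.4, 1.0, 3.0])
# One quantized value per numerical interval (9 intervals, 8 boundaries).
quant_values = np.array([-4.0, -2.0, -0.7, -0.25, 0.0, 0.25, 0.7, 2.0, 4.0])

def table_quantize(x):
    """Convert each data unit to the quantized value of its interval."""
    return quant_values[np.digitize(x, boundaries)]

data = np.array([-2.5, -0.05, 0.03, 0.2, 0.55, 1.7])
compressed = table_quantize(data)
# compressed -> [-2.0, 0.0, 0.0, 0.25, 0.7, 2.0]
```

The two central data units (-0.05 and 0.03) fall into the narrow interval around zero and map to 0.0, while the tail value -2.5 lands in a much wider interval, illustrating the first/second numerical interval widths described above.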
- the step of performing non-uniform quantization processing on the first data to obtain first compressed data includes: performing a non-linear transformation on the first data to obtain transformed first data; and performing uniform quantization processing on the transformed first data.
- the function type can be a power function, a logarithmic function, a gamma mapping function, an A-law curve, a μ-law curve, or a piecewise linear (PWL) curve, etc.
- the logarithmic function is the inverse function of the exponential function.
- the A-law curve is a non-linear coding curve, often used for audio signal compression.
- the μ-law curve is also a non-linear coding curve, often used for audio signal compression.
- a PWL curve is a piecewise linear function composed of a series of line segments.
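- For example, the standard μ-law compressor and its exact inverse (the parameter μ = 255 is the conventional telephony value, chosen here only for illustration) can be written as:

```python
import numpy as np

MU = 255.0  # standard mu-law parameter (illustrative choice)

def mu_compress(x):
    """mu-law compressor: stretches resolution near zero for |x| <= 1."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_expand(y):
    """Exact inverse of mu_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.linspace(-1, 1, 101)
roundtrip = mu_expand(mu_compress(x))
assert np.allclose(roundtrip, x)   # the inverse transform recovers x
```

Because the compressor is steep near zero, small values occupy a large share of the transformed range, so a subsequent uniform quantization effectively places many quantized values in the dense small-value region.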
- the method further includes: determining the type of nonlinear function or the parameter values included in the nonlinear transformation based on the position of the network layer containing the attention layer in the machine learning model.
- the first data is obtained from the first input data through the target head in the attention layer of the machine learning model; the method further includes: determining the type of nonlinear function or the parameter values included in the nonlinear transformation based on the position of the target head in the attention layer.
- the type of nonlinear function or the parameter values used when performing the nonlinear transformation can be determined based on the generation interval between the first data and the latest data obtained by the machine learning model (the generation order is determined by the storage order in memory, and can be called the sequence number).
- the non-uniform mapping relationship in non-uniform quantization can be adaptively determined based on data grouping information (e.g., at least one of the position of the network layer containing the attention layer in the machine learning model, the position of the target head in the attention layer, and the generation order), reducing the overall quantization error while maintaining the compression benefits of quantization.
- the method further includes: determining the category of the nonlinear function or the parameter values included in the nonlinear transformation based on the data distribution of the first data; the data distribution is indicated by the distribution statistics of the first data or by an identifier, wherein different identifiers correspond to data distributions with different characteristics.
- the non-uniform mapping relationship in non-uniform quantization can be adaptively determined based on data distribution information, further reducing quantization error. This allows for a reduction in overall quantization error while maintaining the compression gains brought by quantization.
- this application provides a data processing method, the method comprising:
- first data which is calculated by a machine learning model based on first input data
- the first data is transformed by a nonlinear function to obtain the transformed first data.
- the transformed first data is subjected to uniform quantization to obtain first compressed data, and the first compressed data is stored in a memory.
- the first compressed data is read from the memory, and the first compressed data is subjected to inverse transformation processing corresponding to the nonlinear transformation and inverse quantization processing corresponding to the uniform quantization processing to obtain second data.
- the second data and the second input data are used as inputs to the machine learning model.
- the second input data is the data input into the machine learning model after the first input data.
- the category of the nonlinear function or the numerical values of its included parameters are determined based on at least one of the following:
- the data distribution of the first data.
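- The pipeline of this aspect (nonlinear transformation, uniform quantization, storage, then inverse quantization and inverse transformation) can be sketched end to end as follows, using the μ-law curve as a stand-in nonlinear function and 4-bit uniform quantization; the bit width, μ value, and data are illustrative assumptions:

```python
import numpy as np

MU, BITS = 255.0, 4
LEVELS = 2 ** BITS          # 16 uniform quantized values

def compress(first_data):
    """Nonlinear transform, then uniform quantization to small integers."""
    y = np.sign(first_data) * np.log1p(MU * np.abs(first_data)) / np.log1p(MU)
    codes = np.clip(np.round((y + 1) / 2 * (LEVELS - 1)), 0, LEVELS - 1)
    return codes.astype(np.uint8)          # the "first compressed data"

def decompress(codes):
    """Inverse quantization, then the inverse nonlinear transform."""
    y = codes.astype(np.float64) / (LEVELS - 1) * 2 - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

rng = np.random.default_rng(0)
first_data = np.clip(rng.normal(0, 0.2, 1000), -1, 1)   # KV-like values

memory = compress(first_data)        # store the compressed data in "memory"
second_data = decompress(memory)     # read it back and reconstruct

assert memory.dtype == np.uint8      # 4-bit codes in place of float64 data
```

The reconstruction (second data) approximates the first data with a bounded error while the stored codes occupy a fraction of the original memory, which is the trade-off this aspect targets.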
- the first data includes multiple data units
- the first compressed data includes a quantized value corresponding to each data unit
- the data units of the first data include data within a first numerical range and data within a second numerical range, the data within the first numerical range being more densely packed than the data within the second numerical range
- the quantized value corresponding to the data interval in which each data unit is located is used as the quantized value corresponding to the data unit
- the numerical interval includes a first numerical interval and a second numerical interval, the first numerical interval belonging to the first numerical range, the second numerical interval belonging to the second numerical range, and the numerical width of the first numerical interval being smaller than the numerical width of the second numerical interval.
- the step of performing non-uniform quantization on the first data to obtain the first compressed data includes:
- the data units of the first data are converted into corresponding quantized values to obtain the first compressed data; wherein, the mapping relationship includes multiple numerical intervals and the quantized value corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the multiple numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
- the step of performing non-uniform quantization on the first data to obtain the first compressed data includes:
- the first data is subjected to a nonlinear transformation to obtain the transformed first data.
- the transformed first data is then subjected to uniform quantization.
- this application provides a data processing apparatus, the apparatus comprising:
- An acquisition module is used to acquire first data, which is obtained by a machine learning model based on first input data; and to read first compressed data from the memory.
- the processing module is used to perform non-uniform quantization processing on the first data to obtain the first compressed data, store the first compressed data in a memory, and perform inverse quantization processing on the first compressed data corresponding to the non-uniform quantization processing to obtain the second data.
- the second data and the second input data are used as inputs to the machine learning model.
- the second input data is the data input into the machine learning model after the first input data.
- the processing module can be further divided into finer-grained parts.
- the processing module may include a compression module and a decompression module.
- the compression module can perform non-uniform quantization processing on the first data to obtain the first compressed data.
- the decompression module can perform inverse transformation processing corresponding to the nonlinear transformation and inverse quantization processing corresponding to the uniform quantization processing on the first compressed data to obtain the second data.
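- The module split described above can be sketched as a small class (a hypothetical structure; the interval table and the dict standing in for the memory are illustrative, not part of this application):

```python
import numpy as np

class DataProcessingApparatus:
    """Sketch of the apparatus: a compression module, a decompression
    module, and a dict standing in for the memory."""

    def __init__(self, boundaries, quant_values):
        self.boundaries = boundaries      # non-uniform interval edges
        self.quant_values = quant_values  # one quantized value per interval
        self.memory = {}

    def compress(self, key, first_data):
        """Compression module: non-uniform quantization, then store."""
        codes = np.digitize(first_data, self.boundaries).astype(np.uint8)
        self.memory[key] = codes          # store the first compressed data

    def decompress(self, key):
        """Decompression module: read from memory and inverse-quantize."""
        return self.quant_values[self.memory[key]]

apparatus = DataProcessingApparatus(
    boundaries=np.array([-1.0, -0.3, -0.1, 0.1, 0.3, 1.0]),
    quant_values=np.array([-1.5, -0.6, -0.2, 0.0, 0.2, 0.6, 1.5]),
)
apparatus.compress("layer0.head0", np.array([-0.5, 0.05, 0.25]))
second_data = apparatus.decompress("layer0.head0")
# second_data -> [-0.6, 0.0, 0.2]
```

Keying the memory by layer and head also matches the idea, described earlier, of choosing the mapping relationship per network layer or per attention head.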
- the first data includes multiple data units
- the first compressed data includes a quantized value corresponding to each data unit
- the data units of the first data include data within a first numerical range and data within a second numerical range, the data within the first numerical range being more densely packed than the data within the second numerical range
- the quantized value corresponding to the data interval in which each data unit is located is used as the quantized value corresponding to the data unit
- the numerical interval includes a first numerical interval and a second numerical interval, the first numerical interval belonging to the first numerical range, the second numerical interval belonging to the second numerical range, and the numerical width of the first numerical interval being smaller than the numerical width of the second numerical interval.
- the processing module is specifically used for:
- the data units of the first data are converted into corresponding quantized values to obtain the first compressed data; wherein, the mapping relationship includes multiple numerical intervals and the quantized value corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the multiple numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
- the processing module is specifically used for:
- the first data is subjected to a nonlinear transformation to obtain the transformed first data.
- the transformed first data is then subjected to uniform quantization.
- the processing module is further configured to: determine the type of nonlinear function or the parameter values included in the nonlinear transformation based on the position of the network layer containing the attention layer in the machine learning model.
- the processing module is further configured to: determine the type of nonlinear function or the parameter values included in the nonlinear transformation based on the generation interval between the first data and the latest data obtained by the machine learning model.
- the first data is obtained from the first input data through the target head in the attention layer of a machine learning model; the processing module is further configured to: determine the type of nonlinear function or the parameter values included in the nonlinear transformation based on the position of the target head in the attention layer.
- the processing module is further configured to: determine the category of the nonlinear function or the parameter values included in the nonlinear transformation based on the data distribution of the first data; the data distribution is indicated by the distribution statistics of the first data or by an identifier, wherein different identifiers correspond to data distributions with different characteristics.
- this application provides a data processing apparatus, the apparatus comprising:
- An acquisition module is used to acquire first data, which is calculated based on first input data through a machine learning model, and to read the first compressed data from the memory;
- the processing module is configured to perform a nonlinear transformation on the first data using a nonlinear function to obtain the transformed first data; perform uniform quantization processing on the transformed first data to obtain the first compressed data, store the first compressed data in a memory, and perform inverse transformation processing corresponding to the nonlinear transformation and inverse quantization processing corresponding to the uniform quantization processing on the first compressed data to obtain second data.
- the second data and the second input data are used as inputs to the machine learning model, and the second input data is the data input into the machine learning model after the first input data.
- the category of the nonlinear function or the numerical values of its included parameters are determined based on at least one of the following:
- the data distribution of the first data.
- the first data includes multiple data units
- the first compressed data includes a quantized value corresponding to each data unit
- the data units of the first data include data within a first numerical range and data within a second numerical range, the data within the first numerical range being more densely packed than the data within the second numerical range
- the quantized value corresponding to the data interval in which each data unit is located is used as the quantized value corresponding to the data unit
- the numerical interval includes a first numerical interval and a second numerical interval, the first numerical interval belonging to the first numerical range, the second numerical interval belonging to the second numerical range, and the numerical width of the first numerical interval being smaller than the numerical width of the second numerical interval.
- the processing module is specifically used for:
- the data units of the first data are converted into corresponding quantized values to obtain the first compressed data; wherein, the mapping relationship includes multiple numerical intervals and the quantized value corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the multiple numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
- embodiments of this application provide a data processing apparatus, which may include a memory, a processor, and a bus system, wherein the memory is used to store a program, and the processor is used to execute the program in the memory to perform the methods described in the first aspect above and any optional methods thereof, or the methods described in the second aspect above and any optional methods thereof.
- embodiments of this application provide a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the methods described in the first aspect and any optional methods thereof, or the methods described in the second aspect and any optional methods thereof.
- embodiments of this application provide a computer program that, when run on a computer, causes the computer to perform the methods described in the first aspect and any optional methods thereof, or the methods described in the second aspect and any optional methods thereof.
- this application provides a chip system including a processor for supporting a data processing device in implementing the functions involved in the foregoing aspects, for example, transmitting or processing the data or information involved in the foregoing methods.
- the chip system further includes a memory for storing program instructions and data necessary for the execution device or training device.
- This chip system may be composed of chips or may include chips and other discrete devices.
- Figure 1 is a schematic diagram of a structural framework for artificial intelligence.
- Figures 2 to 4 are schematic diagrams of the application system framework of this application.
- Figure 5 is a flowchart of a data processing method provided in an embodiment of this application.
- Figures 6A and 6B are schematic diagrams of a network structure provided in an embodiment of this application.
- Figures 6C and 6D are schematic diagrams of data distributions.
- Figure 6E is a schematic diagram of a nonlinear function.
- Figures 7A to 7D are schematic diagrams of a data processing method provided in an embodiment of this application.
- Figures 7E and 7F are schematic diagrams of effects provided by an embodiment of this application.
- Figure 8 is a schematic diagram of a data processing device provided in an embodiment of this application.
- Figure 10 is a schematic diagram of a training device provided in an embodiment of this application.
- Figure 11 is a schematic diagram of a chip structure provided in an embodiment of this application.
- Figure 1 is a structural diagram of the main framework of artificial intelligence.
- the framework is elaborated below along two dimensions: the "Intelligent Information Chain" (horizontal axis) and the "IT Value Chain" (vertical axis).
- the "Intelligent Information Chain" reflects a series of processes from data acquisition to processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output.
- the "IT Value Chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (the technologies that provide and process it) through to the system's industrial ecosystem.
- Infrastructure provides computing power to support artificial intelligence systems, enabling communication with the external world and providing support through a basic platform.
- This communication occurs through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); and the basic platform includes distributed computing frameworks and related platform guarantees and support, which may include cloud storage and computing, interconnected networks, etc.
- sensors communicate with the outside world to acquire data, and this data is provided to intelligent chips in the distributed computing system provided by the basic platform for computation.
- the layer above the infrastructure is data, which represents the data sources in the field of artificial intelligence.
- the data involves graphics, images, voice, text, and IoT data from traditional devices, including business data from existing systems and sensor data such as force, displacement, liquid level, temperature, and humidity.
- Data processing typically includes methods such as data training, machine learning, deep learning, search, reasoning, and decision-making.
- machine learning and deep learning can perform intelligent information modeling, extraction, preprocessing, and training on data, including symbolization and formalization.
- Reasoning refers to the process in which, in a computer or intelligent system, the machine thinks and solves problems by simulating human intelligent reasoning, based on reasoning control strategies and using formalized information. Typical functions include search and matching.
- Decision-making refers to the process of making decisions based on intelligent information after reasoning, and it typically provides functions such as classification, sorting, and prediction.
- the results of the data processing can be used to form some general capabilities, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
- Intelligent products and industry applications refer to products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. Their application areas mainly include: intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, etc.
- FIG. 2 is a schematic diagram of the system architecture provided in an embodiment of this application.
- the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data acquisition system 560.
- the execution device 510 includes a calculation module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514.
- the calculation module 511 may include a target model/rule 501, while the preprocessing modules 513 and 514 are optional.
- the data acquisition device 560 is used to collect training samples. After collecting the training samples, the data acquisition device 560 stores these training samples in the database 530.
- the training device 520 can train the neural network to be trained (e.g., the machine learning model in the embodiments of this application) based on the training samples maintained in the database 530 to obtain the target model/rule 501.
- the training device 520 can perform a pre-training process on the neural network to be trained based on the training samples maintained in the database 530, or fine-tune the model based on the pre-training.
- training samples maintained in database 530 may not all come from the data acquisition device 560; they may also be received from other devices.
- training device 520 may not necessarily train the target model/rule 501 entirely based on the training samples maintained in database 530; it may also obtain training samples from the cloud or other sources for model training. The above description should not be construed as limiting the embodiments of this application.
- the target model/rule 501 trained by the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in Figure 2.
- the execution device 510 can be a terminal, such as a mobile terminal, tablet computer, laptop computer, augmented reality (AR)/virtual reality (VR) device, vehicle terminal, etc., or it can be a server, etc.
- the training device 520 can transfer the trained model to the execution device 510.
- the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with external devices. Users can input data to the I/O interface 512 through the client device 540.
- Preprocessing modules 513 and 514 are used to preprocess the input data received from the I/O interface 512. It should be understood that preprocessing modules 513 and 514 may be absent, or only one preprocessing module may be used. When preprocessing modules 513 and 514 are absent, the calculation module 511 can be used directly to process the input data.
- the execution device 510 can call data, code, etc. in the data storage system 550 for corresponding processing, or store the data, instructions, etc. obtained from the corresponding processing into the data storage system 550.
- the I/O interface 512 provides the processing result to the client device 540, thereby providing it to the user.
- the user can manually provide input data, which can be done through the interface provided by I/O interface 512.
- the client device 540 can automatically send input data to I/O interface 512. If user authorization is required for the client device 540 to automatically send input data, the user can set the corresponding permissions in the client device 540.
- the user can view the output results of the execution device 510 on the client device 540, which can be presented in various forms such as display, sound, or animation.
- the client device 540 can also act as a data acquisition terminal, collecting the input data fed to the I/O interface 512 and the output results of the I/O interface 512 as shown in the figure, and storing them as new sample data in database 530.
- alternatively, the I/O interface 512 can directly store the input data fed to the I/O interface 512 and the output results of the I/O interface 512, as shown in the figure, as new sample data in database 530, without collection via the client device 540.
- Figure 2 is merely a schematic diagram of a system architecture provided in an embodiment of this application.
- the positional relationships between the devices, components, modules, etc., shown in the figure do not constitute any limitation.
- the data storage system 550 is an external memory relative to the execution device 510. In other cases, the data storage system 550 can also be placed in the execution device 510. It should be understood that the aforementioned execution device 510 can be deployed in the client device 540.
- the computing module 511 of the execution device 510 can obtain the code stored in the data storage system 550 to implement the steps related to the model reasoning process in this embodiment.
- the computing module 511 of the execution device 510 may include hardware circuits (such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), general-purpose processors, digital signal processors (DSPs), microprocessors or microcontrollers, etc.) or combinations of these hardware circuits.
- the training device 520 may be a hardware system with instruction execution capabilities, such as a CPU or DSP, or a hardware system without instruction execution capabilities, such as an ASIC or FPGA, or a combination of the aforementioned hardware systems without instruction execution capabilities and hardware systems with instruction execution capabilities.
- the computing module 511 of the execution device 510 can be a hardware system with the function of executing instructions.
- the steps related to the model inference process provided in this application embodiment can be software code stored in the memory.
- the computing module 511 of the execution device 510 can obtain the software code from the memory and execute the obtained software code to implement the steps related to the model inference process provided in this application embodiment.
- the computing module 511 of the execution device 510 can be a combination of a hardware system without the function of executing instructions and a hardware system with the function of executing instructions. Some steps related to the model reasoning process provided in the embodiments of this application can also be implemented by the hardware system in the computing module 511 of the execution device 510 without the function of executing instructions, which is not limited here.
- the training device 520 can obtain the code stored in the memory (not shown in Figure 2, which can be integrated into the training device 520 or deployed separately from the training device 520) to implement the steps related to model training in this embodiment.
- the training device 520 may include hardware circuits (such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), general-purpose processors, digital signal processors (DSPs), microprocessors or microcontrollers, etc.) or combinations of these hardware circuits.
- the training device 520 may be a hardware system with instruction execution capabilities, such as a CPU or DSP, or a hardware system without instruction execution capabilities, such as an ASIC or FPGA, or a combination of the aforementioned hardware systems without instruction execution capabilities and hardware systems with instruction execution capabilities.
- the training device 520 can be a combination of a hardware system without the function of executing instructions and a hardware system with the function of executing instructions. Some steps related to model training provided in the embodiments of this application can also be implemented by the hardware system in the training device 520 without the function of executing instructions, which is not limited here.
- the forward propagation process of the model is involved, which can be executed by the execution device 510 or the training device 520 described in the above embodiments.
- the execution device 510 or training device 520 can process the input data using a machine learning model.
- This machine learning model may include an attention layer, which performs attention calculations on the input tokens.
- the attention layer can obtain intermediate results that can be reused in subsequent attention calculations on the same tokens.
- these intermediate results can be K-data or V-data, and the K-data and V-data stored in memory constitute a KV cache.
- reusable intermediate results can be stored in memory so that they can be retrieved from memory and used as a basis for attention calculations on other tokens.
- the amount of reusable intermediate results that need to be stored grows rapidly as inference progresses, leading to a large storage requirement.
- excessively large intermediate results can severely slow down the inference process; therefore, compressing reusable intermediate results is particularly important.
- the compression process can be performed by a compression module, which can be centrally deployed with the execution device 510 or training device 520, for example, belonging to the same chip or other granular computing units, or it can be deployed separately, for example, belonging to different chips.
- the execution device 510 or training device 520 can be an AI chip, and the compression module can belong to the CPU.
- the model running module can obtain intermediate results by running a machine learning model, the compression module can compress the intermediate results and write the compressed data into the memory, and the compression module can read the compressed data from the memory and perform decompression to obtain the decompression result, which is then transmitted to the model running module.
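The division of labor just described can be sketched as follows. The class and method names are illustrative assumptions, and lossless zlib compression stands in for whatever compression scheme the compression module actually applies:

```python
import json
import zlib

class CompressionModule:
    """Illustrative compression module: compresses intermediate results,
    writes them to 'memory', and reads them back for decompression."""

    def __init__(self):
        self.memory = {}  # stands in for the memory shared with the model runner

    def compress_and_store(self, key, intermediate_result):
        raw = json.dumps(intermediate_result).encode("utf-8")
        self.memory[key] = zlib.compress(raw)  # write compressed data to memory

    def read_and_decompress(self, key):
        raw = zlib.decompress(self.memory[key])  # read compressed data back
        return json.loads(raw)  # decompression result, handed to the model runner

# The model running module produces an intermediate result (e.g. K/V data),
# hands it to the compression module, and later asks for it back.
module = CompressionModule()
module.compress_and_store("token_0", {"K": [0.1, 0.2], "V": [0.3, 0.4]})
restored = module.read_and_decompress("token_0")
```

When the two modules sit on different chips, the dict standing in for `memory` would instead be a shared memory region accessible to both.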
- the compression module and the model running module are deployed separately on different chips, while in Figure 4, the compression module and the model running module are centrally deployed on the same chip.
- a neural network can be composed of neural units, which can be defined as a computational unit that takes xs (i.e., input data) and an intercept of 1 as input.
- the output of this computational unit can be: h(x) = f(Σ_{s=1..n} W_s·x_s + b), where W_s is the weight of x_s, b is the bias of the neural unit, and n is the number of inputs.
- f is the activation function of the neural unit, used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
- the output signal of this activation function can be used as the input of the next convolutional layer, and the activation function can be the sigmoid function.
- a neural network is a network formed by connecting multiple of the above-mentioned individual neural units together, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, which can be a region composed of several neural units.
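The neural unit described above can be sketched in a few lines; the function name is an illustrative assumption, and the sigmoid is used as the activation function f:

```python
import math

def neural_unit(xs, weights, bias):
    """Computes f(sum_s(W_s * x_s) + b), with f the sigmoid activation."""
    s = sum(w * x for w, x in zip(weights, xs)) + bias
    return 1.0 / (1.0 + math.exp(-s))  # sigmoid maps the sum into (0, 1)

# With weights 0.5 and -0.25, inputs 1.0 and 2.0 sum to exactly 0,
# and sigmoid(0) = 0.5.
out = neural_unit([1.0, 2.0], [0.5, -0.25], 0.0)
```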
- a convolutional neural network is a deep neural network with a convolutional structure.
- a CNN contains a feature extractor consisting of convolutional layers and subsampling layers, which can be viewed as a filter.
- a convolutional layer refers to the layer of neurons in a CNN that performs convolutional processing on the input signal.
- a neuron can be connected to only some of the neurons in its neighboring layers.
- a convolutional layer typically contains several feature planes, each composed of a number of neural units arranged in a rectangular pattern. Neural units on the same feature plane share weights, and the shared weights here are the convolution kernel. Shared weights can be understood as meaning that the way features are extracted is independent of position.
- the convolution kernel can be initialized in the form of a matrix of random size, and during the training of the CNN, the convolution kernel can obtain reasonable weights through learning. A direct benefit of shared weights is reducing the number of connections between the layers of the CNN while also reducing the risk of overfitting.
- CNN is a very common type of neural network.
- the following is a detailed introduction to the structure of CNN.
- a convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning architecture; a deep learning architecture refers to performing multiple levels of learning at different levels of abstraction using machine learning algorithms.
- CNN is a feed-forward artificial neural network, where each neuron can respond to the input image.
- deep neural networks (DNNs), also known as multilayer neural networks, can be understood as neural networks with many hidden layers; there is no specific metric for "many".
- the layers inside a DNN can be categorized into three types based on their position: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, meaning that any neuron in the i-th layer is necessarily connected to any neuron in the (i+1)-th layer.
- although DNNs appear complex, the operation of each layer is actually quite simple, resembling the linear relationship: y = α(W·x + b).
- x is the input vector and y is the output vector.
- b is the offset vector (bias).
- W is the weight matrix (also called coefficients).
- α() is the activation function.
- each layer simply performs this operation on the input vector x to obtain the output vector y.
- because DNNs have many layers, the number of coefficient matrices W and offset vectors b is large.
- these parameters are defined in a DNN as follows, taking the coefficient W as an example: assuming a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W resides, and the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer.
- in general, the coefficient from the k-th neuron in layer (L-1) to the j-th neuron in layer L is defined as W^L_{jk}. It's important to note that the input layer does not have a W parameter. In deep neural networks, more hidden layers allow the network to better represent complex real-world situations. Theoretically, the more parameters a model has, the higher its complexity and "capacity", meaning it can perform more complex learning tasks. Training a deep neural network is essentially the process of learning the weight matrix, with the ultimate goal of obtaining the weight matrices of all layers of the trained deep neural network (weight matrices formed by the vectors W of many layers).
- Backpropagation can be used during training to correct the parameters in the initial model, thereby reducing the model's error loss. Specifically, forward propagation of the input signal to the output generates error loss; this error loss information is then propagated back to update the parameters in the initial model, leading to convergence of the error loss.
- the backpropagation algorithm is an error-loss-driven backpropagation process aimed at obtaining optimal model parameters, such as the weight matrix.
- a large language model is a natural language processing model trained on large-scale data, typically with billions or tens of billions of parameters. These models learn the general features of language by studying a large amount of text data during the pre-training stage, and can then be fine-tuned on downstream tasks to adapt to the needs of specific tasks.
- Transformer is a deep learning model architecture originally used for sequence-to-sequence tasks, such as machine translation. It uses a self-attention mechanism to process input sequences and has achieved great success in the field of natural language processing. Most large language models, such as BERT, GPT, and T5, are based on the Transformer architecture.
- a key-value cache is a cache structure that stores key-value pairs.
- key-value caches are often used to store intermediate results or other useful information generated while the model is processing text, in order to improve efficiency.
- the model can avoid redundant calculations when processing text.
- Key-Value Cache Quantization refers to quantizing the values in the key-value cache to reduce storage space and computational overhead. In some large language models, to adapt the model to limited resources, the values in the key-value cache can be quantized to reduce the model's storage and computational costs.
- PPL Perplexity: PPL is a metric used to evaluate the performance of a language model, representing the model's ability to predict a given text sequence. PPL is a positive real number, which can be understood as the average difficulty the model has in predicting the observed data sequence. The lower the PPL, the better the model performance.
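As a brief illustration of the metric (assuming per-token probabilities are available from the model; the function name is a hypothetical stand-in):

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/N) * sum(log p_i)): the lower, the better the model
    predicts the observed token sequence."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that assigns probability 1/4 to every observed token has PPL 4:
ppl = perplexity([0.25, 0.25, 0.25])
```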
- Non-uniform quantization is a quantization method in which the numerical range is divided into intervals of different sizes to better adapt to the data distribution. Unlike uniform quantization, non-uniform quantization can assign intervals of different numerical widths according to the distribution of the data.
- Token: In natural language processing, a "token" is the basic unit for segmenting a text string. This can be a word, a character, or a fragment of a word. Large language models typically need to segment the input text into tokens and then convert these tokens into numerical representations (such as word vectors) that the model can understand.
- Sequence: In the context of large language models, a "sequence" refers to a sequence of elements with a certain order relationship. Multiple tokens make up a sequence.
- Incremental inference allows the model to process only newly added parts of the input, rather than reprocessing the entire sequence each time. This is achieved by maintaining contextual information in the model's internal state, allowing the model to respond quickly when it receives new input. Incremental inference is particularly useful in interactive applications, such as chatbots or real-time translation, as it can significantly reduce latency and computational resource usage.
- the data processing method provided in this application may include steps 501 to 503, which will be described in detail below.
- first data which is obtained from the first input data through a machine learning model
- K cache and V cache represent historically generated intermediate results.
- the Q, K, and V data of the latest input data can be obtained.
- the K data and K cache of the latest input data can be concatenated and transposed, and the V data and V cache of the latest input data can be concatenated.
- Subsequent operations can be described in the relevant introduction to attention operations in existing technologies, which will not be repeated here.
- the K and V data of the latest input data can be stored in memory and retrieved from memory when the machine learning model performs operations on subsequent input data.
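The concatenate-and-attend step described above can be sketched as follows (single attention head, pure Python, illustrative shapes only; softmax(q·Kᵀ/√d)·V is computed for the newest token only):

```python
import math

def attention_step(q, k_new, v_new, k_cache, v_cache):
    """Appends the newest token's K/V data to the cache, then computes
    softmax(q . K^T / sqrt(d)) . V for the newest query only."""
    k_cache.append(k_new)  # concatenate K data with the K cache
    v_cache.append(v_new)  # concatenate V data with the V cache
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_cache]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, v_cache)) for j in range(len(v_new))]

k_cache, v_cache = [], []  # the KV cache starts empty
out1 = attention_step([1.0, 0.0], [1.0, 0.0], [1.0, 2.0], k_cache, v_cache)
out2 = attention_step([0.0, 1.0], [0.0, 1.0], [3.0, 4.0], k_cache, v_cache)
```

With a single cached entry the attention weight is 1, so `out1` equals the first V vector; the second step attends over both cached positions without recomputing the first token's K or V data.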
- the machine learning model can be a language model.
- the attention layer can perform self-attention calculations on the input data.
- the machine learning model could be a language model.
- the transformer is a model architecture based on an attention mechanism, with self-attention being its core component.
- Each transformer layer contains two main parts: a multi-head self-attention layer and a feedforward neural network (FNN).
- the self-attention structure allows the model to dynamically focus on information at different positions when processing sequential data. It consists of three main parts: query data, key data, and value data. For an input sequence, by calculating the linear transformations of Q, K, and V, and then performing a softmax operation, the attention distribution of each position to other positions is obtained. Finally, these distributions are weighted to obtain the output at the current position.
- the self-attention structure is used to process information at different positions in the input data.
- the representation of each position is obtained by a weighted average of all other positions in the sequence, with the weights determined by calculating the relationship between the query, key, and value at the current position. This allows the model to dynamically focus on different parts of the input sequence.
- During inference, the self-attention structure generates a key-value (KV) pair for each position. These KV pairs are used in the attention mechanism to calculate the weights at different positions.
- KV cache refers to the set of KV pairs generated at a certain time step. Storing historical KV pairs can avoid repeated calculations, thereby accelerating the model's inference.
- the first data can be the data that the machine learning model described above needs to reuse during the computation of subsequent input data.
- the first data can be K data or V data. That is, the first data can be K data, the first data can be V data, or the first data can be both K data and V data.
- the first data may be obtained by compressing the intermediate result.
- the first data is specifically obtained by compressing and decompressing the intermediate result.
- the first data can be obtained by a machine learning model based on the latest input data (first input data), or it can be the result obtained by compressing and decompressing the intermediate result obtained by a machine learning model based on the latest input data.
- the first data can be obtained through a machine learning model based on the latest input data, or it can be the result obtained by compressing and decompressing the intermediate results obtained through a machine learning model based on the latest input data.
- the first data may not be obtained from the latest input data (the first input data), but rather from compressed data retrieved from memory, or it may be obtained by decompressing compressed data retrieved from memory.
- this compression-decompression-recompression process only occurs when determining the non-uniform quantization correspondence (or the type and parameters of the nonlinear mapping function) based on the sequence number. That is, the K or V generated by the same token needs to undergo different quantization processes at different times. If the quantization parameters change at a certain time, decompression and requantization are required.
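The decompress-and-requantize step can be sketched as follows; the scalar scale parameters are hypothetical stand-ins for whatever quantization parameters change between time steps:

```python
def requantize(q_values, old_scale, new_scale):
    """Decompresses stored quantized values with the old parameters,
    then recompresses (requantizes) them with the new parameters."""
    dequantized = [v * old_scale for v in q_values]     # decompression
    return [round(x / new_scale) for x in dequantized]  # requantization

# K/V values stored at scale 0.5 are rebased to a coarser scale of 1.0:
q_new = requantize([10, 20, 30], old_scale=0.5, new_scale=1.0)
```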
- When the first data is obtained based on the latest input data, it needs to be compressed and stored in memory.
- When the first data is obtained from memory instead of from the latest input data, it may be necessary to compress the first data with a different strength (i.e., compression with a different degree of precision loss) and store it in memory.
- after obtaining the first data, the first data can be compressed to obtain the first compressed data, and the first compressed data can be stored in a memory, for example, a cache.
- in existing technologies, KV cache compression methods generally use uniform quantization.
- the distribution of KV data output by large language models generally exhibits a non-uniform distribution similar to a Gaussian distribution.
- the key and value cache data output by a certain layer of the Llama-7B large language model exhibit a non-uniform distribution.
- Uniform quantization cannot adapt well to this non-uniform distribution of KV data, leading to a larger overall quantization error and a greater impact on small numerical values (mainly because KV data units are concentrated in a small interval; if uniform quantization is used, these concentrated data units will be converted into the same quantized value or a small number of quantized values, resulting in a significant loss of accuracy).
- non-uniform quantization of KV data can better conform to the distribution characteristics of KV data, resulting in quantization results with smaller overall or average errors, thereby improving the processing accuracy of the model.
- finer-grained quantization can be performed on denser numerical intervals in the first data, that is, more quantized values and corresponding intervals can be inserted (since there are more quantized values, the numerical width of each interval will be smaller).
- the first data includes multiple data units
- the first compressed data includes quantized values corresponding to each data unit
- the data units of the first data include data in a first numerical range and data in a second numerical range
- the data in the first numerical range is denser than the data in the second numerical range
- the quantized value corresponding to the data interval where each data unit is located is used as the quantized value corresponding to the data unit
- the numerical interval includes a first numerical interval and a second numerical interval
- the first numerical interval belongs to the first numerical range
- the second numerical interval belongs to the second numerical range
- the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
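A minimal sketch of such a mapping relationship, assuming hypothetical interval boundaries that are narrow where (Gaussian-like) KV data is dense and wide in the sparse tails:

```python
import bisect

# Hypothetical interval boundaries: narrow near zero, wide in the tails.
edges = [-2.0, -0.5, -0.2, 0.0, 0.2, 0.5, 2.0]

def quantize(x):
    """Returns the quantized value (interval index) of the interval
    in which the data unit x falls."""
    return bisect.bisect_right(edges, x)

# Dense values near zero fall into different narrow intervals ...
a, b = quantize(-0.1), quantize(0.1)
# ... while sparse tail values share one wide interval:
c, d = quantize(1.0), quantize(1.9)
```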
- the first data can be nonlinearly transformed to obtain transformed first data, and then the transformed first data can be uniformly quantized.
- the processing flow shown in Figure 7A can be referred to.
- the data is mapped and transformed according to a predetermined non-uniform mapping function (or non-linear transformation function).
- the non-uniform mapping function can be a power function, a logarithmic function, an A-law curve, a μ-law curve, or a piecewise linear curve, etc.
- see Figure 6E, which is a schematic diagram of a non-linear function; the mapped data is then uniformly quantized.
- the quantization coefficient of uniform quantization can be a preset value or can be adaptively determined by online statistical analysis of the mapped data.
- nonlinear mapping and uniform quantization can be combined into a table lookup step: according to the pre-set nonlinear transformation function and uniform quantization method, a correspondence table between floating-point value range and quantized value can be constructed. Quantization can be completed by matching the K data or V data with the table, and the quantized K data or V data can be obtained.
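As a hedged sketch of the nonlinear-mapping-plus-uniform-quantization scheme (μ-law as the mapping function and 4-bit precision are illustrative choices, not values fixed by this application):

```python
import math

MU = 255.0    # illustrative mu-law parameter
BITS = 4      # illustrative quantization precision
LEVELS = 2 ** BITS - 1

def mu_law(x):
    """Non-uniform mapping: expands the dense small-magnitude region."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def quantize(x):
    """Nonlinear mapping on [-1, 1] followed by uniform quantization."""
    y = mu_law(max(-1.0, min(1.0, x)))
    return round((y + 1.0) / 2.0 * LEVELS)

# Small dense values get distinct codes, large sparse values collapse:
small_distinct = (quantize(0.01), quantize(0.02))
large_merged = (quantize(0.80), quantize(0.81))
```

As the text notes, the two steps can also be folded into a single table mapping floating-point value ranges to quantized values, built once from a function like `quantize` and then used for lookups.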
- the data units of the first data can be converted into corresponding quantized values based on a preset mapping relationship to obtain the first compressed data; wherein, the mapping relationship includes multiple numerical intervals and quantized values corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the multiple numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
- the type of nonlinear function or the parameter values included in the nonlinear transformation can be determined based on the quantization precision of the first data.
- Different data to be quantized, such as key-value cache data output from different layers of a large language model, key-value cache data at different head dimensions, and key-value cache data at different time points during model inference, may use different quantization precisions.
- the quantization mapping relationship can be adaptively determined based on the quantization precision (e.g., the number of quantization bits), and quantization can be performed according to the determined quantization mapping relationship.
- the K or V data to be cached, and the quantization precision (number of quantization bits) of the current K or V data can be obtained.
- the type and parameters of the nonlinear transformation function are determined based on the quantization precision.
- the numerical mapping relationship can be determined from the type and parameters of the nonlinear transformation function.
- the function type can be a power function, logarithmic function, gamma mapping function, A-law curve, μ-law curve, or PWL curve, etc.
- Nonlinear mapping is then performed on the KV cached data to be quantized. Uniform quantization is then performed on the transformed KV cached data, and the quantized K or V data is output.
- Nonlinear mapping and uniform quantization can be combined into a single lookup step: given a defined nonlinear transformation function and uniform quantization method, a table mapping floating-point value ranges to quantized values can be constructed. Quantization can then be completed by matching the values of K-data or V-data with the table, yielding the quantized K-data or V-data.
- the determination of the type and parameters of the nonlinear transformation function based on the quantization precision can be achieved through various methods: for example, it can be determined based on the correspondence between a preset quantization precision and the type and parameters of the nonlinear transformation function; alternatively, one quantization precision can correspond to multiple preset types and parameters of nonlinear transformation functions, and the type and parameters of the nonlinear transformation function corresponding to different quantization precisions can be determined by testing on a validation set. Specifically, the processing flow shown in Figure 7A can be referenced.
- the non-uniform mapping relationship in non-uniform quantization can be adaptively determined according to the quantization accuracy, further reducing quantization error. This reduces the overall quantization error while maintaining the compression gains brought by quantization. Specifically, refer to the processing flow shown in Figure 7B.
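One simple realization of the precision-adaptive choice is a preset correspondence table; the entries below are purely hypothetical, standing in for choices calibrated offline (e.g. on a validation set as described above):

```python
# Hypothetical table: quantization precision (bits) -> (transform type, parameter).
TRANSFORM_BY_PRECISION = {
    2: ("mu_law", 511.0),  # very low precision: stronger companding
    4: ("mu_law", 255.0),
    8: ("power", 0.5),     # higher precision tolerates a milder transform
}

def select_transform(bits):
    """Returns the nonlinear transform type and parameter for a precision."""
    return TRANSFORM_BY_PRECISION[bits]

kind, param = select_transform(4)
```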
- the type of nonlinear function or the parameter values used when performing the nonlinear transformation can be determined based on the position of the network layer containing the attention layer in the machine learning model.
- the first data is K data and V data obtained from the target head in the attention layer of the machine learning model based on the first input data; the type of nonlinear function or the parameter values included in the nonlinear transformation can be determined based on the position of the target head in the attention layer.
- the type of nonlinear function or the parameter values used when performing the nonlinear transformation can be determined based on the generation interval between the first data and the latest data obtained by the machine learning model (the generation order, which can be determined by the storage order in memory, can be called the sequence number).
- the quantization mapping relationship also considers the grouping information of the data. For example, different quantization mapping relationships can be used for KV cached data output from different layers; different quantization mapping relationships can also be used for KV cached data with different head dimensions or sequence dimensions. Specifically, the required K or V data to be cached, the quantization precision (number of quantization bits) of the current K or V data, and the grouping information are obtained.
- the grouping information can be the sequence number of the layer outputting the current K or V data, the head dimension sequence number of the current K or V data, the sequence number of the current K or V data, the group sequence number of the current K or V data, or the statistical information of the validation set in the current group.
- the type and parameters of the nonlinear transformation function are determined based on the quantization precision and data grouping information.
- the numerical mapping relationship can be determined from the type and parameters of the nonlinear transformation function.
- the function type can be a power function, logarithmic function, gamma mapping function, A-law curve, μ-law curve, or PWL curve, etc.
- Nonlinear mapping is then performed on the KV buffer data to be quantized.
- Uniform quantization is then performed on the transformed KV buffer data, outputting the quantized K or V data.
- Nonlinear mapping and uniform quantization can be combined into a single lookup step: based on the determined nonlinear transformation function and uniform quantization method, a table mapping floating-point value ranges to quantized values can be constructed.
- Quantization can then be completed by matching the K or V data values to the table, yielding the quantized K or V data.
- Determining the type and parameters of the nonlinear transformation function based on the quantization precision and data grouping information can be achieved through various methods: for example, it can be determined based on a preset correspondence between the quantization precision, grouping information, and the type and parameters of the nonlinear transformation function; alternatively, one quantization precision plus grouping information can correspond to multiple preset types and parameters of nonlinear transformation functions, and the type and parameters of the nonlinear transformation function corresponding to different quantization precisions and grouping information can be determined by testing on a validation set.
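The grouping-aware variant extends the same idea to a table keyed on both precision and grouping information; the keys and entries here are hypothetical placeholders for calibrated choices:

```python
# Hypothetical table: (bits, layer index, head index) -> transform choice.
TRANSFORM_TABLE = {
    (4, 0, 0): ("mu_law", 255.0),
    (4, 1, 0): ("power", 0.5),  # a later layer with a different distribution
}
DEFAULT_TRANSFORM = ("mu_law", 127.0)

def select_transform(bits, layer_idx, head_idx):
    """Looks up the transform for a (precision, layer, head) group,
    falling back to a default for groups not calibrated explicitly."""
    return TRANSFORM_TABLE.get((bits, layer_idx, head_idx), DEFAULT_TRANSFORM)

chosen = select_transform(4, 1, 0)
fallback = select_transform(4, 7, 3)  # an uncalibrated group
```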
- the non-uniform mapping relationship in non-uniform quantization can be adaptively determined based on data grouping information (e.g., the position of the network layer containing the attention layer in the machine learning model, the position of the target head in the attention layer, and the generation order, at least one of these), further reducing quantization error.
- This reduces the overall quantization error while maintaining the compression benefits of quantization. Specifically, the processing flow shown in Figure 7C can be referenced.
- the type of nonlinear function or the parameter values included in the nonlinear transformation can be determined based on the data distribution of the first data; the data distribution is indicated by the distribution statistics of the first data or by an identifier, wherein different identifiers correspond to data distributions with different characteristics.
- the quantization mapping relationship changes not only with the quantization precision but also considers the distribution information of the current data to be quantized.
- the quantization mapping relationship can be determined based on the quantization precision and the dispersion of the current quantized data.
- the process involves obtaining the K or V data to be cached, the quantization precision (number of quantization bits) of the current K or V data, and the distribution information of the current K or V data.
- the data distribution information consists of statistical values from the current K or V data, such as mean, variance, maximum, minimum, and range.
- the type and parameters of the nonlinear transformation function are determined.
- the type and parameters of the nonlinear transformation function determine the numerical mapping relationship.
- the function type can be a power function, logarithmic function, gamma mapping function, A-law curve, μ-law curve, or piecewise linear (PWL) curve, etc.
- Nonlinear mapping is then performed on the KV cache data to be quantized.
- Uniform quantization is then performed on the transformed KV cache data, outputting the quantized K or V data.
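The flow above (obtain the data and its statistics, apply the nonlinear mapping, then uniformly quantize) can be sketched as follows; the μ-law transform and the min/max normalization are illustrative choices among the function types the text lists, not this application's prescribed combination:

```python
import numpy as np

def mu_law(x, mu=255.0):
    # Nonlinear mapping: compresses large magnitudes, expands small ones.
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def quantize(kv, num_bits=4):
    """Nonlinear transform followed by uniform quantization."""
    x_min, x_max = float(kv.min()), float(kv.max())   # distribution statistics to save
    x = (kv - x_min) / (x_max - x_min) * 2 - 1        # normalize to [-1, 1]
    t = mu_law(x)                                     # nonlinear mapping
    levels = 2 ** num_bits - 1
    q = np.round((t + 1) / 2 * levels).astype(np.uint8)  # uniform quantization
    return q, (x_min, x_max)                          # stats are saved for dequantization

kv = np.random.randn(8, 16).astype(np.float32)        # stand-in K or V cache block
q, stats = quantize(kv, num_bits=4)
```

The returned statistics are exactly the data distribution information the next bullet says must be saved for the dequantization stage.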
- the data distribution information is saved for use in the dequantization stage. For details, refer to the processing flow shown in Figure 7D.
- Nonlinear mapping and uniform quantization can be combined into a single lookup table step: given a defined nonlinear transformation function and uniform quantization method, a table mapping floating-point value ranges to quantized values can be constructed. Quantization can then be completed by matching the K or V data values to this table, yielding the quantized K or V data.
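A minimal sketch of folding the nonlinear mapping and uniform quantization into one lookup, assuming a μ-law transform over values normalized to [-1, 1]; inverting the transform numerically by interpolation is one convenient way to build the table of value-range edges:

```python
import numpy as np

def mu_law(x, mu=255.0):
    # Assumed nonlinear transform (mu-law) over values normalized to [-1, 1].
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def build_quant_edges(transform, num_bits=4, num_probe=4096):
    """Fold nonlinear mapping + uniform quantization into bin edges in the
    original domain: one np.digitize call then assigns each value its code."""
    levels = 2 ** num_bits
    t_edges = np.linspace(-1, 1, levels + 1)[1:-1]       # uniform edges, transformed domain
    probe = np.linspace(-1, 1, num_probe)
    return np.interp(t_edges, transform(probe), probe)   # numeric inverse of the transform

edges = build_quant_edges(mu_law, num_bits=4)
codes = np.digitize(np.array([-0.9, -0.01, 0.01, 0.9]), edges)  # one lookup per value
```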
- the type and parameters of the nonlinear transformation function can be determined based on quantization precision and data distribution information, which can be achieved through various methods. For example, testing on a validation set can obtain calibrated distribution information. During the quantization and dequantization phases, the difference between the current data distribution information and the calibrated distribution information, along with the quantization precision, can then be used to determine the type and parameters of the nonlinear transformation function.
- the non-uniform mapping relationship in non-uniform quantization can be adaptively determined based on data distribution information, further reducing quantization error. It can significantly reduce overall quantization error while maintaining the compression gains brought by quantization.
- the second data and the second input data are used as inputs to the machine learning model.
- the second input data is the data input into the machine learning model after the first input data.
- dequantization can be performed in, but is not limited to, the following ways.
- the quantized KV Cache data needs to be dequantized.
- the specific steps of the dequantization stage are illustrated as follows: Obtain the quantized K data or V data; perform uniform dequantization on the quantized KV cache data using the same uniform quantization method as in the quantization stage; perform inverse transformation on the dequantized data using the inverse nonlinear transformation function corresponding to the nonlinear transformation in the quantization stage; obtain the decompressed KV Cache data for inference.
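The dequantization steps above can be sketched as follows, assuming the quantization stage used a μ-law transform on data normalized to [-1, 1] with saved min/max statistics; these function choices are illustrative, not the application's required ones:

```python
import numpy as np

def mu_law_inverse(t, mu=255.0):
    # Inverse of a mu-law mapping: expands the compressed transformed values.
    return np.sign(t) * np.expm1(np.abs(t) * np.log1p(mu)) / mu

def dequantize(q, stats, num_bits=4):
    """Uniform dequantization followed by the inverse nonlinear transform."""
    x_min, x_max = stats
    levels = 2 ** num_bits - 1
    t = q.astype(np.float64) / levels * 2 - 1      # uniform dequantization to [-1, 1]
    x = mu_law_inverse(t)                          # inverse nonlinear mapping
    return (x + 1) / 2 * (x_max - x_min) + x_min   # undo normalization with saved stats

# The saved distribution statistics (here min/max) come from the quantization stage.
recovered = dequantize(np.array([0, 15], dtype=np.uint8), stats=(-2.0, 3.0))
```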
- uniform dequantization and nonlinear inverse mapping can be combined into a table lookup step: based on the uniform dequantization method and the pre-set nonlinear inverse transformation function, a correspondence table between quantized values and floating-point values can be constructed. Dequantization can be completed by matching the values of the quantized K data or V data with the table, and the dequantized K data or V data can be obtained.
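For this table-lookup variant, the code-to-float table can be precomputed once per (precision, statistics) pair; the inverse μ-law function here is an assumed example of the nonlinear inverse transformation:

```python
import numpy as np

def build_dequant_lut(inverse_transform, stats, num_bits=4):
    """Precompute a quantized-code -> floating-point table so that a single
    lookup replaces uniform dequantization plus the nonlinear inverse mapping."""
    x_min, x_max = stats
    levels = 2 ** num_bits - 1
    t = np.arange(levels + 1) / levels * 2 - 1     # uniform dequantization grid
    x = inverse_transform(t)                        # nonlinear inverse mapping
    return (x + 1) / 2 * (x_max - x_min) + x_min    # table in the original domain

# Assumed inverse mu-law transform; any monotonic inverse function works here.
inv = lambda t: np.sign(t) * np.expm1(np.abs(t) * np.log1p(255.0)) / 255.0
lut = build_dequant_lut(inv, stats=(-2.0, 3.0), num_bits=4)
q = np.array([0, 7, 15], dtype=np.uint8)
dequantized = lut[q]  # one table lookup completes dequantization
```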
- the quantized KV Cache data needs to be dequantized.
- the specific steps of the dequantization stage are as follows: Obtain the quantized K data or V data, obtain its quantization precision, and perform uniform dequantization on the quantized KV cache data using the same uniform quantization method as in the quantization stage; determine the type and parameters of the nonlinear inverse transform function based on the quantization precision, and the function is the inverse function corresponding to the function used in the quantization stage; perform inverse transform on the dequantized data using the nonlinear inverse transform function; obtain the decompressed KV Cache data for inference.
- Nonlinear inverse mapping and uniform dequantization can be combined into a single table lookup step: based on the uniform dequantization method and a determined nonlinear inverse transformation function, a table corresponding to quantized values and floating-point values can be constructed. Dequantization can then be completed by matching the values of the quantized K-data or V-data with the table, yielding the dequantized K-data or V-data.
- the quantized KV Cache data needs to be dequantized.
- the specific steps of the dequantization stage are as follows: Obtain the quantized K data or V data, obtain the quantization precision (number of quantized bits) of the current KV quantized data, and the grouping information.
- the grouping information can be the layer number, head dimension number, sequence number, or group number of the current K data or V data; perform uniform dequantization on the quantized KV cache data using the same uniform quantization method as in the quantization stage; determine the type and parameters of the nonlinear inverse transform function based on the quantization precision and data grouping information obtained in step 1, where the function is the inverse function corresponding to the function used in the quantization stage; perform inverse transform on the dequantized data using the nonlinear inverse transform function; obtain the decompressed KV Cache data for inference.
- Nonlinear inverse mapping and uniform dequantization can be combined into a single table lookup step: based on the uniform dequantization method and the nonlinear inverse transformation function determined in step 3, a correspondence table between quantized values and floating-point values can be constructed. Dequantization can be completed by matching the values of the quantized K data or V data with the table, thus obtaining the dequantized K data or V data.
- When the KV Cache data is needed, the quantized KV Cache data must be dequantized.
- the specific steps of the dequantization stage are as follows: Obtain the quantized K or V data, obtain the quantization precision (number of quantized bits) of the current KV quantized data, and the distribution information of the current K or V data.
- the data distribution information consists of statistical values from the current K or V data, such as mean, variance, maximum value, minimum value, and value range; perform uniform dequantization on the quantized KV cache data using the same uniform quantization method as in the quantization stage; determine the type and parameters of the nonlinear inverse transform function based on the obtained quantization precision and distribution information, where the function is the inverse function corresponding to the function used in the quantization stage; perform an inverse transform on the dequantized data using the nonlinear inverse transform function; and obtain the decompressed KV Cache data for inference.
- Nonlinear inverse mapping and uniform dequantization can be combined into a single table lookup step: based on the uniform dequantization method and the determined nonlinear inverse transformation function, a correspondence table between quantized values and floating-point values can be constructed. Dequantization can then be completed by matching the values of the quantized K-data or V-data with the table, yielding the dequantized K-data or V-data.
- the Llama-7B model was tested on the wikitext2 validation set.
- the KV cache data generated during inference was quantized using the non-uniform quantization method proposed in this invention.
- the compression ratio and perplexity (PPL) curves are shown in Figure 7E when the sequence length is 2048. As can be seen from the figure, compared with uniform quantization, non-uniform quantization achieves better inference performance while maintaining the same compression ratio (lower PPL indicates better inference performance).
- the Llama-7B model was tested on the wikitext2 validation set.
- the KV cache data generated during inference was quantized using the method proposed in this invention, which "determines the quantization mapping relationship based on quantization precision".
- Historical cached K or V data was dequantized when needed during inference.
- the compression ratio and test PPL curves are shown in Figure 7F when the sequence length is 2048.
- the quantization method that determines the quantization mapping relationship based on quantization precision achieves better inference performance while maintaining the same compression ratio (lower PPL indicates better inference performance).
- this application provides a data processing method, the method comprising: acquiring first data, the first data being calculated by a machine learning model based on first input data; performing a nonlinear transformation on the first data using a nonlinear function to obtain transformed first data; performing uniform quantization processing on the transformed first data to obtain first compressed data, and storing the first compressed data in a memory; reading the first compressed data from the memory, and performing an inverse transformation processing corresponding to the nonlinear transformation and an inverse quantization processing corresponding to the uniform quantization processing on the first compressed data to obtain second data, the second data and the second input data being used as inputs to the machine learning model, the second input data being data input into the machine learning model after the first input data; wherein, the category of the nonlinear function or the parameter values included therein are determined based on at least one of the following: the quantization precision of the first data; or, the position of the network that generated the first data in the machine learning model; or, the generation order of the first data in the machine learning model; or, the data distribution of the first data.
- the first data includes multiple data units
- the first compressed data includes a quantized value corresponding to each data unit
- the data units of the first data include data within a first numerical range and data within a second numerical range, the data within the first numerical range being more densely packed than the data within the second numerical range
- the quantized value corresponding to the data interval in which each data unit is located is used as the quantized value corresponding to the data unit
- the numerical interval includes a first numerical interval and a second numerical interval, the first numerical interval belonging to the first numerical range, the second numerical interval belonging to the second numerical range, and the numerical width of the first numerical interval being smaller than the numerical width of the second numerical interval.
- the data units of the first data can be converted into corresponding quantized values based on a preset mapping relationship to obtain the first compressed data; wherein, the mapping relationship includes multiple numerical intervals and quantized values corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the multiple numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
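The interval structure described here can be sketched with a handful of illustrative bin edges, narrow where the data is dense (near zero) and wide where it is sparse; the specific edge values are assumptions, not values from this application:

```python
import numpy as np

# Illustrative bin edges: narrow intervals in the dense range near zero,
# wide intervals in the sparse range; these values are assumptions, not
# edges taken from the application.
edges = np.array([-1.0, -0.5, -0.1, -0.05, 0.0, 0.05, 0.1, 0.5, 1.0])

def quantize_by_intervals(data, edges):
    """Each data unit takes the quantized value (interval index) of the
    numerical interval in which it falls."""
    return np.clip(np.digitize(data, edges) - 1, 0, len(edges) - 2)

data = np.array([-0.07, 0.02, 0.8])   # dense-range and sparse-range samples
codes = quantize_by_intervals(data, edges)
```

Note that the interval [-0.1, -0.05) (the "first numerical interval", width 0.05) is narrower than [0.5, 1.0] (the "second numerical interval", width 0.5), mirroring the width relationship the text specifies.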
- the mapping relationship includes multiple numerical intervals and quantized values corresponding to each numerical interval
- the data units of the first data include data in a first numerical range and data in a second numerical range
- the data in the first numerical range is denser than the data in the second numerical range
- the multiple numerical intervals include a first numerical interval and a second numerical interval
- the first numerical interval belongs to the first numerical range
- when performing non-uniform quantization on the first data, the first data can first be non-linearly transformed to obtain transformed first data; the transformed first data can then be subjected to uniform quantization.
- the data processing apparatus 800 provided in this embodiment of the application includes:
- the acquisition module 801 is used to acquire first data, which is obtained by a machine learning model based on first input data, and reads first compressed data from the memory;
- step 501 for a detailed description of the acquisition module 801, please refer to the description of step 501 in the above embodiment, which will not be repeated here.
- the processing module 802 is used to perform non-uniform quantization processing on the first data to obtain the first compressed data, store the first compressed data in a memory, and perform inverse quantization processing on the first compressed data corresponding to the non-uniform quantization processing to obtain the second data.
- the second data and the second input data are used as inputs to the machine learning model.
- the second input data is the data input into the machine learning model after the first input data.
- the first data includes multiple data units
- the first compressed data includes a quantized value corresponding to each data unit
- the data units of the first data include data within a first numerical range and data within a second numerical range, the data within the first numerical range being more densely packed than the data within the second numerical range
- the quantized value corresponding to the data interval in which each data unit is located is used as the quantized value corresponding to the data unit
- the numerical interval includes a first numerical interval and a second numerical interval, the first numerical interval belonging to the first numerical range, the second numerical interval belonging to the second numerical range, and the numerical width of the first numerical interval being smaller than the numerical width of the second numerical interval.
- processing module 802 is specifically used for:
- the data units of the first data are converted into corresponding quantized values to obtain the first compressed data; wherein, the mapping relationship includes multiple numerical intervals and the quantized value corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the multiple numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
- processing module 802 is specifically used for:
- the first data is subjected to a nonlinear transformation to obtain the transformed first data.
- the transformed first data is then subjected to uniform quantization.
- processing module 802 is further configured to:
- the type of nonlinear function or the parameter values included in the nonlinear transformation are determined.
- the first data is obtained from the first input data through the target head in the attention layer of a machine learning model; the processing module 802 is further configured to:
- processing module 802 is further configured to:
- the category of the nonlinear function or the parameter values included in the nonlinear transformation are determined.
- processing module 802 is further configured to:
- the category of the nonlinear function or the parameter values included in the nonlinear transformation are determined; the data distribution is indicated by the distribution statistics of the first data or by an identifier, wherein different identifiers correspond to data distributions with different characteristics.
- This application embodiment also provides a data processing apparatus, the apparatus comprising:
- An acquisition module is used to acquire first data, which is calculated based on first input data through a machine learning model, and to read the first compressed data from the memory;
- the processing module is configured to perform a nonlinear transformation on the first data using a nonlinear function to obtain transformed first data; perform uniform quantization on the transformed first data to obtain first compressed data, store the first compressed data in a memory, and perform inverse transformation processing corresponding to the nonlinear transformation and inverse quantization processing corresponding to the uniform quantization on the first compressed data to obtain second data.
- the second data and the second input data are used as inputs to the machine learning model, and the second input data is the data input into the machine learning model after the first input data.
- the category of the nonlinear function or the numerical values of its included parameters are determined based on at least one of the following:
- the data distribution of the first data.
- the first data includes multiple data units
- the first compressed data includes a quantized value corresponding to each data unit
- the data units of the first data include data within a first numerical range and data within a second numerical range, the data within the first numerical range being more densely packed than the data within the second numerical range
- the quantized value corresponding to the data interval in which each data unit is located is used as the quantized value corresponding to the data unit
- the numerical interval includes a first numerical interval and a second numerical interval, the first numerical interval belonging to the first numerical range, the second numerical interval belonging to the second numerical range, and the numerical width of the first numerical interval being smaller than the numerical width of the second numerical interval.
- the processing module is specifically used for:
- the data units of the first data are converted into corresponding quantized values to obtain the first compressed data; wherein, the mapping relationship includes multiple numerical intervals and the quantized value corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the multiple numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
- the terminal device 900 can specifically be a virtual reality (VR) device, a mobile phone, a tablet, a laptop computer, a smart wearable device, etc., and is not limited here.
- the terminal device 900 includes: a receiver 901, a transmitter 902, a processor 903, and a memory 904 (the number of processors 903 in the terminal device 900 can be one or more; Figure 9 shows one processor as an example).
- the processor 903 may include an application processor 9031 and a communication processor 9032.
- the receiver 901, transmitter 902, processor 903, and memory 904 can be connected via a bus or other means.
- Memory 904 may include read-only memory and random access memory, and provides instructions and data to processor 903. A portion of memory 904 may also include non-volatile random access memory (NVRAM). Memory 904 stores operation instructions executable by the processor, executable modules, or data structures, or subsets or extended sets thereof, wherein the operation instructions may include various operation instructions for implementing various operations.
- Processor 903 controls the operation of the execution device.
- the various components of the execution device are coupled together through a bus system, which may include not only the data bus, but also power buses, control buses, and status signal buses. However, for clarity, all buses in the diagram are referred to as the bus system.
- the methods disclosed in the embodiments of this application can be applied to or implemented by the processor 903.
- the processor 903 can be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by the integrated logic circuitry in the hardware of the processor 903 or by instructions in software form.
- the processor 903 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
- the processor 903 can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application.
- the general-purpose processor can be a microprocessor or any conventional processor.
- the steps of the methods disclosed in the embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
- the software module can reside in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory 904.
- the processor 903 reads the information from memory 904 and, in conjunction with its hardware, completes the steps involved in the model training or model inference process described above.
- Receiver 901 can be used to receive input digital or character information, and to generate signal inputs related to the settings and function control of the execution device.
- Transmitter 902 can be used to output digital or character information through the first interface; transmitter 902 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; transmitter 902 may also include a display device such as a display screen.
- the server 1000 can vary significantly due to different configurations or performance. It may include one or more central processing units (CPUs) 1010 (e.g., one or more processors) and memory 1032, and one or more storage media 1030 (e.g., one or more mass storage devices) for storing application programs 1042 or data 1044.
- the memory 1032 and storage media 1030 can be temporary or persistent storage.
- the program stored in the storage media 1030 may include one or more modules (not shown in the figure), each module may include a series of instruction operations on the server.
- the CPU 1010 may be configured to communicate with the storage media 1030 and execute the series of instruction operations in the storage media 1030 on the server 1000.
- Server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
- the central processing unit 1010 is used to perform actions related to model training or model inference in the above embodiments.
- This application also provides a computer program product that, when run on a computer, causes the computer to perform steps as performed by the aforementioned execution device, or causes the computer to perform steps as performed by the aforementioned training device.
- This application also provides a computer-readable storage medium storing a program for signal processing, which, when run on a computer, causes the computer to perform steps as performed by the aforementioned execution device, or causes the computer to perform steps as performed by the aforementioned training device.
- the execution device, training device, or terminal device provided in this application embodiment can specifically be a chip.
- the chip includes a processing unit and a communication unit.
- the processing unit can be, for example, a processor, and the communication unit can be, for example, an input/output interface, pins, or circuits.
- the processing unit can execute computer execution instructions stored in the storage unit to cause the chip within the execution device to execute the data processing method described in the above embodiments, or to cause the chip within the training device to execute the data processing method described in the above embodiments.
- the storage unit is a storage unit within the chip, such as a register or cache.
- the storage unit can also be a storage unit located outside the chip within the wireless access device, such as read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions.
- FIG. 11 is a schematic diagram of a chip structure provided in an embodiment of this application.
- the chip can be represented as a neural network processor (NPU) 1100.
- the NPU 1100 is mounted as a coprocessor on the host CPU, and tasks are allocated by the host CPU.
- the core part of the NPU is the arithmetic circuit 1103, which is controlled by the controller 1104 to extract matrix data from the memory and perform multiplication operations.
- the arithmetic circuit 1103 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 1103 is a two-dimensional systolic array; it can also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1103 is a general-purpose matrix processor.
- the arithmetic circuit retrieves the corresponding data of matrix B from the weight memory 1102 and caches it in each PE of the arithmetic circuit.
- the arithmetic circuit retrieves the data of matrix A from the input memory 1101 and performs matrix operations with matrix B.
- the partial result or the final result of the obtained matrix is stored in the accumulator 1108.
- Unified memory 1106 is used to store input and output data. Weight data is directly transferred to weight memory 1102 via Direct Memory Access Controller (DMAC) 1105. Input data is also transferred to unified memory 1106 via DMAC.
- BIU stands for Bus Interface Unit, which is used for the interaction between the AXI bus and the DMAC and the Instruction Fetch Buffer (IFB) 1109.
- the Bus Interface Unit (BIU) 1110 is used by the instruction fetch memory 1109 to fetch instructions from external memory, and also by the memory access controller 1105 to fetch the original data of the input matrix A or the weight matrix B from external memory.
- the DMAC is mainly used to move input data from external memory (DDR) to unified memory 1106, to move weight data to weight memory 1102, or to move input data to input memory 1101.
- the vector computation unit 1107 includes multiple processing units that further process the output of the computation circuit 1103 when needed, such as vector multiplication, vector addition, exponential operations, logarithmic operations, size comparisons, etc. It is mainly used for computation in non-convolutional/fully connected layers of neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
- vector computation unit 1107 can store the processed output vector in unified memory 1106.
- vector computation unit 1107 can apply a linear or nonlinear function to the output of computation circuit 1103, for example performing linear interpolation on feature planes extracted by convolutional layers, or applying a function to a vector of accumulated values to generate activation values.
- vector computation unit 1107 generates normalized values, pixel-level summed values, or both.
- the processed output vector can be used as activation input to computation circuit 1103, for example, for use in subsequent layers of the neural network.
- the instruction fetch buffer 1109 connected to the controller 1104 is used to store the instructions used by the controller 1104;
- Unified memory 1106, input memory 1101, weight memory 1102, and instruction fetch memory 1109 are all on-chip memories; external memory is memory outside this NPU hardware architecture.
- the processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above program.
- the device embodiments described above are merely illustrative.
- the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units.
- Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
- the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.
- This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk, or optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, training equipment, or network device, etc.) to execute the methods described in the various embodiments of this application.
- implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof.
- When implemented in software, it can be implemented, in whole or in part, as a computer program product.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
- The computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
- The computer-readable storage medium may be any available medium that a computer can store data on, or a data storage device, such as a training device or data center, that integrates one or more available media.
- The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).
Description
This application claims priority to Chinese Patent Application No. 202410245713.8, filed with the China National Intellectual Property Administration on March 4, 2024 and entitled "A Data Processing Method and Apparatus", the entire contents of which are incorporated herein by reference.
This application relates to the field of artificial intelligence, and in particular to a data processing method and apparatus.
Artificial intelligence (AI) comprises the theories, methods, techniques, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, AI is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling the machines to perceive, reason, and make decisions.
When input data is processed by a machine learning model, the model may include an attention layer. The attention layer performs attention computation on the input data (for example, data represented as tokens; for convenience, tokens are used in the description below). In doing so, the attention layer produces intermediate results that need to be reused in subsequent attention computations on later tokens. For example, these intermediate results may be key-value (KV) data, that is, a KV cache.
In this process, when a new token is processed, the reusable intermediate results can be stored in a memory, so that when they are needed for attention computation on other tokens later, they can be read from the memory and used as the basis for the attention computation on the subsequent tokens.
However, as the size of the input data keeps growing, the amount of reusable intermediate results that needs to be stored grows rapidly as inference proceeds, resulting in a large storage requirement. In addition, excessively large intermediate results make the inference process extremely slow. The reusable intermediate results therefore need to be compressed.
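To make the storage pressure concrete, the growth of the reusable intermediate results can be estimated with simple arithmetic. The sketch below uses made-up model dimensions (layer count, head count, head width) purely for illustration; none of these values come from this application:

```python
# Rough KV-cache size estimate for a transformer decoder.
# All dimensions below are illustrative assumptions, not values from this application.
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_value):
    # Per token and per layer, the cache holds one K and one V vector per head.
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_value

# Example: a 32-layer model, 32 heads of width 128, an 8192-token context.
fp16 = kv_cache_bytes(32, 32, 128, seq_len=8192, bytes_per_value=2)
int4 = kv_cache_bytes(32, 32, 128, seq_len=8192, bytes_per_value=0.5)
print(f"fp16 cache: {fp16 / 2**30:.1f} GiB, 4-bit cache: {int4 / 2**30:.1f} GiB")
# → fp16 cache: 4.0 GiB, 4-bit cache: 1.0 GiB
```

The cache scales linearly with the sequence length, which is why compression becomes unavoidable for long contexts.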
In a first aspect, this application provides a data processing method. The method includes: acquiring first data, the first data being obtained by a machine learning model based on first input data; performing non-uniform quantization on the first data to obtain first compressed data, and storing the first compressed data in a memory; reading the first compressed data from the memory, and performing dequantization corresponding to the non-uniform quantization on the first compressed data to obtain second data, where the second data and second input data are used as inputs to the machine learning model, and the second input data is data input into the machine learning model after the first input data.
It should be understood that, after the non-uniform quantization is performed on the first data, other processing (for example, entropy coding) may further be performed to obtain the first compressed data; the embodiments of this application impose no limitation here. Similarly, after the inverse transform corresponding to the nonlinear transform and the dequantization corresponding to the uniform quantization are performed on the first compressed data, other processing may further be performed to obtain the second data; the embodiments of this application impose no limitation here either.
The second input data may be data input into the machine learning model after the first input data. For example, the second input data may be the data input into the machine learning model immediately after the first input data, or data input into the machine learning model after the first input data but separated from it by multiple pieces of input data.
The first data may be obtained by an intermediate network layer of the machine learning model (for example, an attention layer) applying a linear transformation to an input token. For example, the first data may be K data and/or V data.
The quantization strategies used in current cache compression methods are all uniform quantization. However, the data output by large language models generally follows a non-uniform, roughly Gaussian-like distribution. Uniform quantization cannot adapt well to this non-uniform distribution, which leads to a large overall quantization error and has a particularly strong impact on data with small values (mainly because the data units cluster within a small interval; under uniform quantization, the data units clustered in that interval are mapped to the same quantized value or to very few quantized values, causing a large loss of precision). In the embodiments of this application, non-uniform quantization is applied based on the non-uniform distribution of the data, which better matches its distribution characteristics and yields quantization results with a smaller overall or average error, thereby improving the processing precision of the model.
Taking KV data as an example, the quantization strategies used in current KV cache compression methods are all uniform quantization. However, the KV data output by large language models generally follows a non-uniform, roughly Gaussian-like distribution. Uniform quantization cannot adapt well to this non-uniform distribution of the KV data, which leads to a large overall quantization error and has a particularly strong impact on data with small values (mainly because the data units of the KV data cluster within a small interval; under uniform quantization, the data units clustered in that interval are mapped to the same quantized value or to very few quantized values, causing a large loss of precision). In the embodiments of this application, non-uniform quantization is applied to the KV data based on its non-uniform distribution, which better matches the distribution characteristics of the KV data and yields quantization results with a smaller overall or average error, thereby improving the processing precision of the model.
Specifically, in the embodiments of this application, finer-grained quantization can be applied to the numerical intervals of the first data in which the data is more densely distributed; that is, more quantized values and corresponding intervals are inserted there (because there are more quantized values, the numerical width of each interval becomes smaller), so that the quantization better matches the distribution characteristics of the data and yields quantization results with a smaller overall or average error. For example, the first data includes multiple data units, and the first compressed data includes a quantized value corresponding to each data unit. The data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range being denser than the data in the second numerical range. The quantized value corresponding to the numerical interval in which each data unit falls is used as the quantized value corresponding to that data unit, where the numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
In one possible implementation, performing the non-uniform quantization on the first data to obtain the first compressed data includes: converting the data units of the first data into corresponding quantized values based on a preset mapping relationship (for example, in the form of a table) to obtain the first compressed data, where the mapping relationship includes multiple numerical intervals and a quantized value corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the multiple numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
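The interval-based mapping described above can be sketched in a few lines. The interval boundaries and quantized values below are illustrative assumptions, chosen so that the intervals are narrower where Gaussian-like data is denser (near zero):

```python
import numpy as np

# Hypothetical non-uniform quantization table: narrow intervals near 0
# (where Gaussian-like data is dense), wide intervals in the tails.
edges  = np.array([-8.0, -2.0, -0.5, 0.0, 0.5, 2.0, 8.0])  # interval boundaries
levels = np.array([-4.0, -1.0, -0.25, 0.25, 1.0, 4.0])     # one quantized value per interval

def quantize(x):
    # Map each data unit to the index of the interval it falls in.
    idx = np.clip(np.digitize(x, edges) - 1, 0, len(levels) - 1)
    return idx.astype(np.uint8)  # stored compactly (here 6 levels, i.e. 3 bits in principle)

def dequantize(q):
    # Look up the quantized value corresponding to each stored index.
    return levels[q]

x = np.array([-3.0, -0.1, 0.3, 1.5])
print(dequantize(quantize(x)))  # → [-4.   -0.25  0.25  1.  ]
```

Note how the two values near zero keep a finer resolution (0.5-wide intervals) than the tail values, which matches the density-driven design described in the text.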
In one possible implementation, performing the non-uniform quantization on the first data to obtain the first compressed data includes: performing a nonlinear transform on the first data to obtain transformed first data; and performing uniform quantization on the transformed first data.
The function type may be a power function, a logarithmic function, a gamma mapping function, an A-law curve, a μ-law curve, or a PWL curve, among others.
About: power function
Definition: x raised to the power n.
Mathematical expression: y = x^n, where n is a real number.
About: logarithmic function
Definition: the logarithmic function is the inverse of the exponential function.
Mathematical expression: y = log_b(x), where b is the base of the logarithm.
About: gamma correction function
Definition: similar to a power function, and typically applied to normalized data.
Mathematical expression: y = x^γ, where x is the input pixel value and γ is the gamma value, usually a real number greater than 0.
About: A-law curve
Definition: the A-law curve is a nonlinear coding curve, commonly used for audio signal compression.
Mathematical expression: y = ln(1 + A·|x|) / ln(1 + A), where x is the input signal and A is a constant.
About: μ-law curve
Definition: the μ-law curve is also a nonlinear coding curve, commonly used for audio signal compression.
Mathematical expression: y = sgn(x)·(ln(1 + μ·|x|) / ln(1 + μ)), where x is the input signal and μ is a constant.
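As a minimal sketch, the μ-law expression above and its exact inverse (the kind of inverse transform used on the dequantization path) can be written directly in code. The input is assumed to be normalized to [−1, 1], and μ = 255 is a common telephony choice rather than a value fixed by this application:

```python
import numpy as np

MU = 255.0  # common telephony value; the application does not fix a particular constant

def mu_law(x):
    # y = sgn(x) * ln(1 + mu*|x|) / ln(1 + mu), for x in [-1, 1]
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_inverse(y):
    # Solving the forward formula for x: |x| = ((1 + mu)^|y| - 1) / mu
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

x = np.linspace(-1, 1, 5)
y = mu_law(x)
assert np.allclose(mu_law_inverse(y), x)  # the transform is exactly invertible
```

The transform expands small |x| and compresses large |x|, so a subsequent uniform quantizer effectively places more levels near zero, which is exactly the non-uniform behavior the text describes.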
About: PWL curve (piecewise linear curve)
Definition: a PWL curve is a piecewise linear function composed of a series of line segments.
Mathematical expression: y = yᵢ + mᵢ·(x − xᵢ) for xᵢ ≤ x < xᵢ₊₁, where (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) are control points on the curve and m₁, m₂, …, mₙ are the slopes between adjacent control points, i.e., mᵢ = (yᵢ₊₁ − yᵢ) / (xᵢ₊₁ − xᵢ).
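A PWL curve of this kind amounts to linear interpolation between control points, which `numpy.interp` performs directly. The control points below are illustrative assumptions:

```python
import numpy as np

# Hypothetical control points (x_i, y_i); the slopes m_i between neighbors are implicit.
xs = np.array([0.0, 0.1, 0.5, 1.0])
ys = np.array([0.0, 0.4, 0.8, 1.0])  # steep near 0: finer resolution for small values

def pwl(x):
    # Linear interpolation between control points; clamps outside [xs[0], xs[-1]].
    return np.interp(x, xs, ys)

print(pwl(0.05))  # halfway along the first segment → 0.2
```

Because the first segment has slope 4 while the last has slope 0.4, a uniform quantizer applied after this curve spends roughly ten times more resolution on small inputs than on large ones.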
In one possible implementation, the method further includes: determining, based on the quantization precision of the first data, the category of the nonlinear function used for the nonlinear transform or the values of the parameters it includes.
In this way, the non-uniform mapping relationship in the non-uniform quantization can be determined adaptively according to the quantization precision, further reducing the quantization error. The overall quantization error can thus be reduced while the compression gains brought by quantization are preserved.
In one possible implementation, the method further includes: determining, based on the position, in the machine learning model, of the network layer in which the attention layer is located, the category of the nonlinear function used for the nonlinear transform or the values of the parameters it includes.
In one possible implementation, the first data is obtained based on the first input data through a target head in an attention layer of the machine learning model; the method further includes: determining, based on the position of the target head within the attention layer, the category of the nonlinear function used for the nonlinear transform or the values of the parameters it includes.
In one possible implementation, the category of the nonlinear function used for the nonlinear transform, or the values of the parameters it includes, may be determined based on the generation interval between the first data and the latest data obtained by the machine learning model (that is, its position in the generation order; the generation order can be determined from the storage order in the memory, and this information may be called the sequence number).
In this way, the non-uniform mapping relationship in the non-uniform quantization can be determined adaptively according to data grouping information (for example, at least one of the position, in the machine learning model, of the network layer in which the attention layer is located, the position of the target head within the attention layer, and the generation order), further reducing the quantization error. The overall quantization error can thus be reduced while the compression gains brought by quantization are preserved.
In one possible implementation, the method further includes: determining, based on the data distribution of the first data, the category of the nonlinear function used for the nonlinear transform or the values of the parameters it includes, where the data distribution is indicated by distribution statistics of the first data or by an identifier, and different identifiers correspond to data distributions with different characteristics.
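As one hypothetical way to realize such distribution-driven selection (the application does not prescribe a specific rule, so the candidate set and error criterion here are pure assumptions), the parameter of a power-function transform could be chosen by measuring reconstruction error on the observed data:

```python
import numpy as np

def choose_gamma(data, candidates=(0.25, 0.5, 0.75, 1.0)):
    """Pick the power-function exponent that minimizes uniform-quantization
    error on |data| normalized to [0, 1]. Purely illustrative heuristic."""
    x = np.abs(data) / (np.abs(data).max() + 1e-12)
    best, best_err = None, np.inf
    for g in candidates:
        y = x ** g                    # nonlinear transform
        q = np.round(y * 15) / 15     # 4-bit uniform quantization in [0, 1]
        err = np.mean((q ** (1.0 / g) - x) ** 2)  # error after the inverse transform
        if err < best_err:
            best, best_err = g, err
    return best

rng = np.random.default_rng(0)
data = rng.normal(0.0, 0.1, size=10_000)  # Gaussian-like, clustered near zero
print(choose_gamma(data))
```

The same search could be keyed off precomputed distribution statistics or a distribution identifier rather than the raw data, matching the indication mechanisms described above.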
The non-uniform mapping relationship in the non-uniform quantization can be determined adaptively based on the data distribution information, further reducing the quantization error. The overall quantization error can thus be reduced while the compression gains brought by quantization are preserved.
In a second aspect, this application provides a data processing method. The method includes:
acquiring first data, the first data being computed by a machine learning model based on first input data;
performing a nonlinear transform on the first data through a nonlinear function to obtain transformed first data; and
performing uniform quantization on the transformed first data to obtain first compressed data, and storing the first compressed data in a memory; reading the first compressed data from the memory, and performing, on the first compressed data, the inverse transform corresponding to the nonlinear transform and the dequantization corresponding to the uniform quantization, to obtain second data, where the second data and second input data are used as inputs to the machine learning model, and the second input data is data input into the machine learning model after the first input data.
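The round trip of this second aspect (nonlinear transform, uniform quantization, storage, dequantization, inverse transform) can be sketched as follows. The 8-bit width, the μ-law-style transform, and the array shapes are all illustrative assumptions rather than choices made by the application:

```python
import numpy as np

BITS = 8      # assumed storage precision
MU = 255.0    # assumed companding constant

def compress(first_data):
    """Nonlinear (mu-law style) transform, then uniform quantization to uint8."""
    y = np.sign(first_data) * np.log1p(MU * np.abs(first_data)) / np.log1p(MU)
    q = np.round((y + 1.0) / 2.0 * (2**BITS - 1)).astype(np.uint8)
    return q  # the "first compressed data" that would be stored in memory

def decompress(q):
    """Dequantization, then the inverse of the nonlinear transform."""
    y = q.astype(np.float64) / (2**BITS - 1) * 2.0 - 1.0
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

first_data = np.random.default_rng(1).normal(0, 0.05, size=(4, 16)).clip(-1, 1)
stored = compress(first_data)      # 1 byte per value instead of 4 or 8
second_data = decompress(stored)   # fed back into the model alongside new input
print(np.max(np.abs(second_data - first_data)))  # small reconstruction error
```

Storing `stored` instead of `first_data` is what realizes the compression; `second_data` plays the role of the second data that re-enters the model together with the second input data.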
In one possible implementation, the category of the nonlinear function, or the values of the parameters it includes, is determined based on at least one of the following:
the quantization precision of the first data; or
the position, within the machine learning model, of the network that generates the first data; or
the order in which the first data is generated in the machine learning model; or
the data distribution of the first data.
It should be understood that the above method for determining the nonlinear function, and the non-uniform quantization method that combines the nonlinear function, can also be applied to the compression and decompression of data other than KV data; such "other data" may be, but is not limited to, data that needs to be reused during the inference process of a machine learning model.
In one possible implementation, the first data includes multiple data units, and the first compressed data includes a quantized value corresponding to each data unit. The data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range being denser than the data in the second numerical range. The quantized value corresponding to the numerical interval in which each data unit falls is used as the quantized value corresponding to that data unit, where the numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
In one possible implementation, performing the non-uniform quantization on the first data to obtain the first compressed data includes:
converting the data units of the first data into corresponding quantized values based on a preset mapping relationship to obtain the first compressed data, where the mapping relationship includes multiple numerical intervals and a quantized value corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the multiple numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
In one possible implementation, performing the non-uniform quantization on the first data to obtain the first compressed data includes:
performing a nonlinear transform on the first data to obtain transformed first data; and
performing uniform quantization on the transformed first data.
In a third aspect, this application provides a data processing apparatus. The apparatus includes:
an acquisition module, configured to acquire first data, the first data being obtained by a machine learning model based on first input data, and to read first compressed data from a memory; and
a processing module, configured to perform non-uniform quantization on the first data to obtain the first compressed data, store the first compressed data in the memory, and perform dequantization corresponding to the non-uniform quantization on the first compressed data to obtain second data, where the second data and second input data are used as inputs to the machine learning model, and the second input data is data input into the machine learning model after the first input data.
The processing module may be split at a finer granularity. For example, the processing module may include a compression module and a decompression module: the compression module may perform the non-uniform quantization on the first data to obtain the first compressed data, and the decompression module may perform, on the first compressed data, the inverse transform corresponding to the nonlinear transform and the dequantization corresponding to the uniform quantization, to obtain the second data.
In one possible implementation, the first data includes multiple data units, and the first compressed data includes a quantized value corresponding to each data unit. The data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range being denser than the data in the second numerical range. The quantized value corresponding to the numerical interval in which each data unit falls is used as the quantized value corresponding to that data unit, where the numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
In one possible implementation, the processing module is specifically configured to:
convert the data units of the first data into corresponding quantized values based on a preset mapping relationship to obtain the first compressed data, where the mapping relationship includes multiple numerical intervals and a quantized value corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the multiple numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
In one possible implementation, the processing module is specifically configured to:
perform a nonlinear transform on the first data to obtain transformed first data; and
perform uniform quantization on the transformed first data.
In one possible implementation, the processing module is further configured to:
determine, based on the quantization precision of the first data, the category of the nonlinear function used for the nonlinear transform or the values of the parameters it includes.
In one possible implementation, the processing module is further configured to:
determine, based on the position, in the machine learning model, of the network layer in which the attention layer is located, the category of the nonlinear function used for the nonlinear transform or the values of the parameters it includes.
In one possible implementation, the first data is obtained based on the first input data through a target head in an attention layer of the machine learning model; the processing module is further configured to:
determine, based on the position of the target head within the attention layer, the category of the nonlinear function used for the nonlinear transform or the values of the parameters it includes.
In one possible implementation, the processing module is further configured to:
determine, based on the generation interval between the first data and the latest data obtained by the machine learning model, the category of the nonlinear function used for the nonlinear transform or the values of the parameters it includes.
In one possible implementation, the processing module is further configured to:
determine, based on the data distribution of the first data, the category of the nonlinear function used for the nonlinear transform or the values of the parameters it includes, where the data distribution is indicated by distribution statistics of the first data or by an identifier, and different identifiers correspond to data distributions with different characteristics.
In a fourth aspect, this application provides a data processing apparatus. The apparatus includes:
an acquisition module, configured to acquire first data, the first data being computed by a machine learning model based on first input data, and to read first compressed data from a memory; and
a processing module, configured to perform a nonlinear transform on the first data through a nonlinear function to obtain transformed first data; perform uniform quantization on the transformed first data to obtain the first compressed data, and store the first compressed data in the memory; and perform, on the first compressed data, the inverse transform corresponding to the nonlinear transform and the dequantization corresponding to the uniform quantization, to obtain second data, where the second data and second input data are used as inputs to the machine learning model, and the second input data is data input into the machine learning model after the first input data.
In one possible implementation, the category of the nonlinear function, or the values of the parameters it includes, is determined based on at least one of the following:
the quantization precision of the first data; or
the position, within the machine learning model, of the network that generates the first data; or
the order in which the first data is generated in the machine learning model; or
the data distribution of the first data.
在一种可能的实现中,所述第一数据包括多个数据单元,所述第一压缩数据包括每个所述数据单元对应的量化值,所述第一数据的数据单元包括第一数值范围的数据和第二数值范围的数据,所述第一数值范围的数据比所述第二数值范围的数据更密集,每个所述数据单元所在的数据区间对应的量化值用于作为所述数据单元对应的量化值,其中,数值区间包括第一数值区间和第二数值区间,所述第一数值区间属于所述第一数值范围,所述第二数值区间属于所述第二数值范围,所述第一数值区间的数值宽度小于所述第二数值区间的数值宽度。In one possible implementation, the first data includes multiple data units, the first compressed data includes a quantized value corresponding to each data unit, the data units of the first data include data within a first numerical range and data within a second numerical range, the data within the first numerical range being more densely packed than the data within the second numerical range, and the quantized value corresponding to the data interval in which each data unit is located is used as the quantized value corresponding to the data unit, wherein the numerical interval includes a first numerical interval and a second numerical interval, the first numerical interval belonging to the first numerical range, the second numerical interval belonging to the second numerical range, and the numerical width of the first numerical interval being smaller than the numerical width of the second numerical interval.
In one possible implementation, the processing module is specifically configured to:
convert the data units of the first data into corresponding quantized values based on a preset mapping relationship, to obtain the first compressed data; where the mapping relationship includes a plurality of numerical intervals and the quantized value corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the plurality of numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
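A minimal sketch of such a preset mapping: the interval edges below are hypothetical, chosen narrow near zero (where the data is assumed dense) and wide in the sparse tails, and each interval's quantized value is taken as its midpoint, which is one choice among many.

```python
import bisect

# Hypothetical interval edges: narrow intervals near zero, wide at the tails.
edges = [-1.0, -0.2, -0.1, -0.05, 0.0, 0.05, 0.1, 0.2, 1.0]
# Quantized (reconstruction) value for each interval = its midpoint.
reconstruction = [(edges[i] + edges[i + 1]) / 2 for i in range(len(edges) - 1)]

def quantize(x):
    """Map a value to the index of the numerical interval containing it."""
    i = bisect.bisect_right(edges, x) - 1
    return min(max(i, 0), len(reconstruction) - 1)

def dequantize(i):
    """Recover the quantized value associated with interval i."""
    return reconstruction[i]

values = [0.03, -0.07, 0.5]
codes = [quantize(v) for v in values]
restored = [dequantize(c) for c in codes]
```

Note the width asymmetry: the interval [0.0, 0.05) near the dense region is far narrower than the tail interval [0.2, 1.0), so dense data is represented with finer resolution.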
In a fifth aspect, an embodiment of this application provides a data processing apparatus that may include a memory, a processor, and a bus system, where the memory is configured to store a program and the processor is configured to execute the program in the memory, so as to perform the method of the first aspect and any optional method thereof, or the method of the second aspect and any optional method thereof.
In a sixth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the method of the first aspect and any optional method thereof, or the method of the second aspect and any optional method thereof.
In a seventh aspect, an embodiment of this application provides a computer program that, when run on a computer, causes the computer to perform the method of the first aspect and any optional method thereof, or the method of the second aspect and any optional method thereof.
In an eighth aspect, this application provides a chip system. The chip system includes a processor configured to support a data processing apparatus in implementing the functions involved in the foregoing aspects, for example, sending or processing the data or information involved in the foregoing methods. In one possible design, the chip system further includes a memory configured to store the program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete components.
Figure 1 is a schematic structural diagram of the main artificial intelligence framework;
Figures 2 to 4 are schematic diagrams of application system frameworks of the present invention;
Figure 5 is a schematic flowchart of a data processing method provided in an embodiment of this application;
Figures 6A and 6B are schematic diagrams of a network structure provided in an embodiment of this application;
Figures 6C and 6D are schematic diagrams of data distributions;
Figure 6E is a schematic diagram of a nonlinear function;
Figures 7A to 7D are processing diagrams of a data processing method provided in an embodiment of this application;
Figures 7E and 7F are schematic diagrams of effects provided in an embodiment of this application;
Figure 8 is a schematic structural diagram of a data processing apparatus provided in an embodiment of this application;
Figure 9 is a schematic structural diagram of an execution device provided in an embodiment of this application;
Figure 10 is a schematic structural diagram of a training device provided in an embodiment of this application;
Figure 11 is a schematic structural diagram of a chip provided in an embodiment of this application.
The embodiments of the present invention are described below with reference to the accompanying drawings. The terminology used in the description of the embodiments is intended only to explain specific embodiments of the present invention, and is not intended to limit the present invention.
The embodiments of this application are described below with reference to the accompanying drawings. Those of ordinary skill in the art will recognize that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of this application remain applicable to similar technical problems.
The terms "first," "second," and the like in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the manner of distinguishing objects of the same attribute adopted when describing the embodiments of this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that comprises a series of elements is not necessarily limited to those elements, but may include other elements not explicitly listed or inherent to the process, method, product, or device.
The terms "substantially," "about," and similar terms used herein are terms of approximation rather than terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art. Furthermore, the use of "may" in describing embodiments of the present invention refers to "one or more possible embodiments." The terms "use," "using," and "used" herein may be considered synonymous with "utilize," "utilizing," and "utilized," respectively. In addition, the term "exemplary" is intended to refer to an example or illustration.
First, the overall workflow of an artificial intelligence system is described with reference to Figure 1, which is a schematic structural diagram of the main artificial intelligence framework. The framework is elaborated below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The intelligent information chain reflects the series of processes from data acquisition to data processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a refinement process from data to information to knowledge to wisdom. The IT value chain reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (technologies for providing and processing information) up to the industrial ecosystem of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the external world, and provides support through a base platform. Communication with the outside is performed through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); and the base platform includes related platform assurance and support such as a distributed computing framework and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside world to acquire data, and the data is provided to the intelligent chips in the distributed computing system provided by the base platform for computation.
(2) Data
The data at the layer above the infrastructure indicates the data sources in the field of artificial intelligence. The data involves graphics, images, speech, and text, as well as Internet-of-Things data from conventional devices, including business data from existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, search, reasoning, decision-making, and the like.
Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on data.
Reasoning refers to the process in which a computer or intelligent system simulates human intelligent reasoning and, based on a reasoning control strategy, uses formalized information to perform machine thinking and problem solving. Typical functions are search and matching.
Decision-making refers to the process of making decisions on intelligent information after reasoning, and typically provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data undergoes the data processing mentioned above, the results can further be used to form some general capabilities, for example, an algorithm or a general system, such as translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Smart products and industry applications
Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The main application fields include smart terminals, smart transportation, smart healthcare, autonomous driving, and smart cities.
The system architecture provided in the embodiments of this application is described in detail below with reference to Figure 2.
Figure 2 is a schematic diagram of a system architecture provided in an embodiment of this application. As shown in Figure 2, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data acquisition system 560.
The execution device 510 includes a computing module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514. The computing module 511 may include a target model/rule 501; the preprocessing module 513 and the preprocessing module 514 are optional.
The data acquisition device 560 is configured to collect training samples. After collecting the training samples, the data acquisition device 560 stores them in the database 530.
The training device 520 can train the neural network to be trained (for example, the machine learning model in the embodiments of this application) based on the training samples maintained in the database 530, to obtain the target model/rule 501.
It should be understood that the training device 520 may perform a pre-training process on the neural network to be trained based on the training samples maintained in the database 530, or fine-tune the model on the basis of pre-training.
It should be noted that, in practical applications, the training samples maintained in the database 530 do not necessarily all come from the data acquisition device 560; they may also be received from other devices. It should also be noted that the training device 520 does not necessarily train the target model/rule 501 entirely based on the training samples maintained in the database 530; it may also obtain training samples from the cloud or elsewhere for model training. The foregoing description should not be construed as limiting the embodiments of this application.
The target model/rule 501 obtained through training by the training device 520 can be applied to different systems or devices, for example, to the execution device 510 shown in Figure 2. The execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server or the like.
Specifically, the training device 520 can transfer the trained model to the execution device 510.
In Figure 2, the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with external devices. A user can input data to the I/O interface 512 through the client device 540.
The preprocessing module 513 and the preprocessing module 514 are configured to preprocess the input data received by the I/O interface 512. It should be understood that the preprocessing module 513 and the preprocessing module 514 may be absent, or there may be only one preprocessing module. When the preprocessing module 513 and the preprocessing module 514 are absent, the computing module 511 can process the input data directly.
When the execution device 510 preprocesses the input data, or when the computing module 511 of the execution device 510 performs computation or other related processing, the execution device 510 can call data, code, and the like in the data storage system 550 for the corresponding processing, and can also store the data, instructions, and the like obtained from the corresponding processing into the data storage system 550.
Finally, the I/O interface 512 provides the processing result to the client device 540, and thereby to the user.
In the case shown in Figure 2, the user can manually provide input data, and this "manually providing input data" can be performed through an interface provided by the I/O interface 512. Alternatively, the client device 540 can automatically send input data to the I/O interface 512; if user authorization is required for the client device 540 to send input data automatically, the user can set the corresponding permission in the client device 540. The user can view the results output by the execution device 510 on the client device 540, and the specific presentation may take forms such as display, sound, or action. The client device 540 can also serve as a data collection terminal, collecting the input data fed into the I/O interface 512 and the output results of the I/O interface 512, as shown in the figure, as new sample data, and storing them in the database 530. Of course, the collection may also bypass the client device 540: the I/O interface 512 directly stores the input data fed into the I/O interface 512 and the output results of the I/O interface 512, as shown in the figure, into the database 530 as new sample data.
It is worth noting that Figure 2 is merely a schematic diagram of a system architecture provided in an embodiment of this application, and the positional relationships among the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in Figure 2 the data storage system 550 is external memory relative to the execution device 510; in other cases, the data storage system 550 may instead be placed inside the execution device 510. It should be understood that the execution device 510 may be deployed in the client device 540.
From the inference side of the model:
In the embodiments of this application, the computing module 511 of the execution device 510 can obtain the code stored in the data storage system 550 to implement the steps related to the model inference process in the embodiments of this application.
In the embodiments of this application, the computing module 511 of the execution device 510 may include hardware circuits (such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller) or a combination of these hardware circuits. For example, the computing module 511 may be a hardware system with an instruction execution function, such as a CPU or DSP, or a hardware system without an instruction execution function, such as an ASIC or FPGA, or a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function.
Specifically, the computing module 511 of the execution device 510 may be a hardware system with an instruction execution function, and the steps related to the model inference process provided in the embodiments of this application may be software code stored in a memory; the computing module 511 of the execution device 510 can obtain the software code from the memory and execute it to implement those steps.
It should be understood that the computing module 511 of the execution device 510 may be a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function, and some of the steps related to the model inference process provided in the embodiments of this application may also be implemented by the hardware system without an instruction execution function in the computing module 511; this is not limited here.
From the training side of the model:
In the embodiments of this application, the training device 520 can obtain code stored in a memory (not shown in Figure 2; the memory may be integrated into the training device 520 or deployed separately from it) to implement the steps related to model training in the embodiments of this application.
In the embodiments of this application, the training device 520 may include hardware circuits (such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller) or a combination of these hardware circuits. For example, the training device 520 may be a hardware system with an instruction execution function, such as a CPU or DSP, or a hardware system without an instruction execution function, such as an ASIC or FPGA, or a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function.
It should be understood that the training device 520 may be a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function, and some of the steps related to model training provided in the embodiments of this application may also be implemented by the hardware system without an instruction execution function in the training device 520; this is not limited here.
The embodiments of this application involve the forward propagation process of the model, which can be executed by the execution device 510 or the training device 520 described in the foregoing embodiments.
In addition, the execution device 510 or the training device 520 can process input data through a machine learning model. The machine learning model may include an attention layer, which performs attention computation on the input tokens; when computing attention for a token, the attention layer can produce intermediate results that need to be reused in later attention computations for subsequent tokens. For example, the intermediate results may be K data or V data, and the K data and V data stored in the memory form the KV cache. In this process, when a new token is processed, the reusable intermediate results obtained can be stored in the memory, so that when they need to be reused in attention computation for other tokens later, they can be read from the memory and the attention computation for the subsequent tokens can be performed based on them. However, as the size of the input data keeps growing, the amount of reusable intermediate results that must be stored increases rapidly as inference proceeds, creating a large storage demand. Moreover, excessively large intermediate results also make the inference process extremely slow, so compressing the reusable intermediate results is particularly important.
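To make the growth concrete, here is a back-of-the-envelope estimate of KV cache size; the model shape (32 layers, 32 heads of dimension 128) and the byte widths are illustrative assumptions, not figures from this application.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_elem):
    # Factor 2 covers both the K data and the V data kept for each token.
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model: 32 layers, 32 attention heads of dimension 128.
fp16_size = kv_cache_bytes(32, 32, 128, seq_len=4096, bytes_per_elem=2)
int4_size = kv_cache_bytes(32, 32, 128, seq_len=4096, bytes_per_elem=0.5)
```

For this hypothetical shape, a 4096-token sequence already needs 2 GiB of KV cache at 16 bits per element, and the cache grows linearly with sequence length; quantizing to 4 bits per element cuts it by a factor of four, which is the motivation for compressing the reusable intermediate results.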
In one implementation, the compression process can be performed by a compression module. The compression module can be deployed together with the execution device 510 or the training device 520, for example, on the same chip or within the same computing unit of another granularity, or deployed separately, for example, on different chips: the execution device 510 or the training device 520 may be an AI chip, and the compression module may belong to a CPU.
For example, refer to Figures 3 and 4, which are schematic diagrams of architectures of embodiments of this application. The model running module obtains intermediate results by running the machine learning model; the compression module compresses the intermediate results and writes the compressed data into the memory; and the compression module reads the compressed data from the memory and decompresses it, obtaining a decompression result that is passed to the model running module. In Figure 3, the compression module and the model running module are deployed separately on different chips; in Figure 4, they are deployed together on the same chip.
Since the embodiments of this application involve extensive application of neural networks, for ease of understanding, the relevant terms and concepts such as neural networks involved in the embodiments of this application are introduced first.
(1) Neural network
A neural network can be composed of neural units. A neural unit can be an operation unit that takes xs (that is, the input data) and an intercept of 1 as inputs, and the output of the operation unit can be:

h_{W,b}(x) = f(W^T x + b) = f(\sum_{s=1}^{n} W_s x_s + b)

where s = 1, 2, ..., n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce nonlinearity into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function can serve as the input of the next convolutional layer, and the activation function can be the sigmoid function. A neural network is a network formed by joining many such single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract features of the local receptive field; the local receptive field can be a region composed of several neural units.
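The neural unit above can be written out directly; the sigmoid activation is chosen here only for illustration, and the input and weight values are arbitrary.

```python
import math

def sigmoid(z):
    """Sigmoid activation: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(xs, ws, b, f=sigmoid):
    # Output of one neural unit: f(sum over s of Ws * xs + b).
    return f(sum(w * x for w, x in zip(ws, xs)) + b)

out = neural_unit([1.0, 2.0], [0.5, -0.25], b=0.0)
```

With these inputs the weighted sum is exactly zero, so the sigmoid output is 0.5, illustrating how the activation maps the weighted input signal to the output signal.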
(2) A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor consisting of convolutional layers and subsampling layers, and the feature extractor can be regarded as a filter. A convolutional layer is a layer of neurons in a convolutional neural network that performs convolution processing on the input signal. In a convolutional layer of a convolutional neural network, a neuron may be connected to only some neurons of the adjacent layers. A convolutional layer usually contains several feature planes, and each feature plane can be composed of neural units arranged in a rectangle. Neural units on the same feature plane share weights, and the shared weights here are the convolution kernel. Weight sharing can be understood as extracting features in a manner independent of position. The convolution kernel can be initialized in the form of a matrix of random size, and during the training of the convolutional neural network, the kernel can learn reasonable weights. In addition, a direct benefit of weight sharing is reducing the connections between the layers of the convolutional neural network while also reducing the risk of overfitting.
CNN is a very common neural network, and its structure is introduced in detail below. As stated in the introduction of basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron can respond to the image input into it.
(3) Deep neural network
深度神经网络(Deep Neural Network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:y = α(Wx + b),其中,x是输入向量,y是输出向量,b是偏移向量,W是权重矩阵(也称系数),α()是激活函数。每一层仅仅是对输入向量x经过如此简单的操作得到输出向量y。由于DNN层数多,则系数W和偏移向量b的数量也就很多了。这些参数在DNN中的定义如下所述:以系数W为例:假设在一个三层的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为W^3_{24},上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是:第L-1层的第k个神经元到第L层的第j个神经元的系数定义为W^L_{jk}。需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的过程也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。Deep Neural Networks (DNNs), also known as multilayer neural networks, can be understood as neural networks with many hidden layers; there is no specific threshold for "many." By layer position, a DNN can be divided into three kinds of layers: the input layer, hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. Adjacent layers are fully connected, meaning that any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer. Although a DNN looks complex, the work of each individual layer is actually quite simple and can be expressed by the linear relationship y = α(Wx + b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply applies this operation to its input vector x to obtain the output vector y. Because a DNN has many layers, there are correspondingly many coefficient matrices W and offset vectors b. These parameters are defined in the DNN as follows, taking the coefficients W as an example: in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as W^3_{24}. The superscript 3 denotes the layer in which the coefficient W resides, and the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer. In general, the coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as W^L_{jk}. Note that the input layer has no W parameters. In deep neural networks, more hidden layers allow the network to better represent complex real-world situations. Theoretically, the more parameters a model has, the higher its complexity and "capacity," meaning it can perform more complex learning tasks. Training a deep neural network is the process of learning the weight matrices, with the ultimate goal of obtaining the weight matrices of all layers of the trained network (the weight matrices formed by the vectors W of many layers).
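The per-layer computation described above can be sketched in a few lines of NumPy (an illustrative example, not part of the claimed embodiments; the layer sizes and the tanh activation are assumptions):

```python
import numpy as np

def dense_layer(x, W, b, activation=np.tanh):
    """One fully connected layer: y = activation(W @ x + b)."""
    return activation(W @ x + b)

# Illustrative three-layer DNN: 5 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
x  = rng.standard_normal(5)          # input vector
W2 = rng.standard_normal((4, 5))     # weights into layer 2 (hidden)
b2 = rng.standard_normal(4)          # offset vector of layer 2
W3 = rng.standard_normal((2, 4))     # weights into layer 3 (output)
b3 = rng.standard_normal(2)

h = dense_layer(x, W2, b2)           # hidden-layer output
y = dense_layer(h, W3, b3)           # network output
# W3[1, 3] is the coefficient from the 4th neuron of layer 2 to the
# 2nd neuron of layer 3 (0-based indices here; superscript 3,
# subscripts 2 and 4 in the text's notation).
```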
(4)损失函数(4) Loss Function
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。In training a deep neural network, to ensure the output closely approximates the desired predicted value, we compare the network's prediction with the target value. Based on the difference, we update the weight vector of each layer (usually pre-configuring parameters before the initial update). For example, if the prediction is too high, the weight vector is adjusted to predict a lower value. This process continues until the deep neural network predicts the target value or a value very close to it. Therefore, we need to predefine "how to compare the difference between the predicted and target values," which is the loss function or objective function. These are crucial equations used to measure the difference between the predicted and target values. Taking the loss function as an example, a higher output value (loss) indicates a greater difference, and training a deep neural network becomes a process of minimizing this loss.
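As a concrete (illustrative) instance of such a loss function, the mean squared error compares predictions and targets directly:

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error: one common choice of loss/objective function."""
    return np.mean((pred - target) ** 2)

target = np.array([1.0, 0.0])
loss_far  = mse_loss(np.array([3.0, -2.0]), target)  # prediction far from target
loss_near = mse_loss(np.array([1.1,  0.1]), target)  # prediction close to target
# A larger loss indicates a larger difference; training drives this value down.
```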
(5)反向传播算法(5) Backpropagation algorithm
可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始模型中参数的大小,使得模型的误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始模型中的参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的模型参数,例如权重矩阵。Backpropagation (BP) can be used during training to correct the parameters in the initial model, thereby reducing the model's error loss. Specifically, forward propagation of the input signal to the output generates error loss; this error loss information is then propagated back to update the parameters in the initial model, leading to convergence of the error loss. The backpropagation algorithm is an error-loss-driven backpropagation process aimed at obtaining optimal model parameters, such as the weight matrix.
(6)大语言模型(Large Language Model):大语言模型是指在大规模数据上进行训练的自然语言处理模型,通常拥有数十亿或数百亿个参数。这些模型在预训练阶段通过学习大量文本数据来捕捉语言的通用特征,然后可以在下游任务上进行微调,适应特定任务的需求。(6) Large Language Model: A large language model is a natural language processing model trained on large-scale data, typically with billions or tens of billions of parameters. These models learn the general features of language by studying a large amount of text data during the pre-training stage, and can then be fine-tuned on downstream tasks to adapt to the needs of specific tasks.
(7)transformer:transformer是一种深度学习模型架构,最初用于序列到序列的任务,如机器翻译。它使用自注意力机制来处理输入序列,取得了在自然语言处理领域的巨大成功。大多数大型语言模型,如BERT、GPT和T5,都基于Transformer架构。(7) Transformer: The transformer is a deep learning model architecture originally used for sequence-to-sequence tasks, such as machine translation. It uses a self-attention mechanism to process input sequences and has achieved great success in the field of natural language processing. Most large language models, such as BERT, GPT, and T5, are based on the Transformer architecture.
(8)KV缓存(key-value cache):KV缓存是指存储键-值对的缓存结构。在大语言模型中,KV缓存通常用于存储模型在处理文本时的中间结果或其他有用的信息,以便提高效率。通过使用KV缓存,模型可以在处理文本时避免重复计算。(8) Key-Value Cache: A key-value cache is a cache structure that stores key-value pairs. In large language models, a KV cache is typically used to store intermediate results or other useful information produced while the model processes text, in order to improve efficiency. By using a KV cache, the model can avoid redundant calculations when processing text.
(9)KV缓存量化(key-Value cache quantization):KV缓存量化是指对KV缓存中的值进行量化,以减少存储空间和计算开销。在一些大型语言模型中,为了使模型适应有限的资源,可以对KV缓存的值进行量化,降低模型的存储和计算成本。(9) Key-Value Cache Quantization: Key-value cache quantization refers to quantizing the values in the key-value cache to reduce storage space and computational overhead. In some large language models, to adapt the model to limited resources, the values in the key-value cache can be quantized to reduce the model's storage and computational costs.
(10)PPL(Perplexity,困惑度):PPL是用于评估语言模型性能的指标,表示模型对给定文本序列的预测能力。PPL是一个正实数,可以理解为模型对观察到的数据序列进行预测的平均困难程度。PPL越低,表示模型性能越好。(10) PPL (Perplexity): PPL is a metric used to evaluate the performance of a language model, representing the model's ability to predict a given text sequence. PPL is a positive real number, which can be understood as the average difficulty the model has in predicting the observed data sequence. The lower the PPL, the better the model performance.
(11)非均匀量化(Non-Uniform Quantization):非均匀量化是一种量化方法,其中数值范围被分割成不同大小的区间,以更好地适应数据分布。与均匀量化不同,非均匀量化可以根据数据的分布对每个区间分配不同数量的数值范围。(11) Non-uniform quantization: Non-uniform quantization is a quantization method in which the numerical range is divided into intervals of different sizes to better fit the data distribution. Unlike uniform quantization, non-uniform quantization can, according to the distribution of the data, allocate a different share of the numerical range to each interval.
(12)词(Token):在自然语言处理中,"token"是文本字符串分割的基本单位。这可以是一个词、一个字符或一个子词片段。大型语言模型通常需要将输入文本分割成tokens,然后将这些tokens转换为模型能理解的数字表示(如词向量)。(12) Token: In natural language processing, a "token" is the basic unit for segmenting a text string. This can be a word, a character, or a fragment of a word. Large language models typically need to segment the input text into tokens and then convert these tokens into numerical representations (such as word vectors) that the model can understand.
(13)序列(sequence):在大语言模型的上下文中,"sequence"是指具有一定顺序关系的元素序列,多个token组成了sequence。(13) Sequence: In the context of large language models, a "sequence" refers to a sequence of elements with a certain order relationship; multiple tokens make up a sequence.
(14)增量式推理(Incremental Inference):增量式推理允许模型仅处理新加入的输入部分,而不是每次都重新处理整个序列。这是通过在模型的内部状态中维护上下文信息来实现的,允许模型在接收到新输入时快速响应。增量式推理在交互式应用中特别有用,例如聊天机器人或实时翻译,因为它可以显著减少延迟和计算资源的使用。(14) Incremental Inference: Incremental inference allows the model to process only newly added parts of the input, rather than reprocessing the entire sequence each time. This is achieved by maintaining contextual information in the model's internal state, allowing the model to respond quickly when it receives new input. Incremental inference is particularly useful in interactive applications, such as chatbots or real-time translation, as it can significantly reduce latency and computational resource usage.
本申请实施例提供了一种数据处理方法。下面结合附图对本申请实施例的数据处理方法进行详细的介绍。The embodiments of this application provide a data processing method. The data processing method of the embodiments of this application is described in detail below with reference to the accompanying drawings.
参照图5,图5为本申请实施例提供的一种数据处理方法的流程示意,如图5所示,本申请实施例提供的一种数据处理方法,可以包括步骤501至503,下面分别对这些步骤进行详细的描述。Referring to Figure 5, which is a flowchart of a data processing method provided in an embodiment of this application, the data processing method provided in this application may include steps 501 to 503, which will be described in detail below.
501、获取第一数据,所述第一数据为根据第一输入数据,通过机器学习模型得到的;501. Obtain first data, which is obtained from the first input data through a machine learning model;
上述过程中,每次通过机器学习模型对最新的输入数据进行处理时,需要获取到历史上已经生成的中间结果(例如K数据或V数据),并输入到机器学习模型中,以对最新的输入数据进行注意力运算(也就是,基于注意力机制的运算),在该过程中仍然会得到中间结果(也就是对后续的输入数据进行处理时仍然需要复用的数据),新生成的中间结果和获取到的历史上已经生成的中间结果可以进行拼接并进行注意力权重的运算。In the above process, each time the machine learning model processes the latest input data, it is necessary to obtain the intermediate results that have been generated in the past (such as K data or V data) and input them into the machine learning model to perform attention operations on the latest input data (that is, operations based on attention mechanisms). In this process, intermediate results will still be obtained (that is, data that still needs to be reused when processing subsequent input data). The newly generated intermediate results and the obtained intermediate results generated in the past can be concatenated and attention weights can be calculated.
例如,参照图6B,K cache和V cache为历史上已经生成的中间结果,对最新的输入数据的token进行计算(例如,通过Wq矩阵、Wk矩阵和Wv矩阵进行线性变换)后,可以得到最新的输入数据的Q数据、K数据和V数据,之后可以将最新的输入数据的K数据和K cache进行拼接和转置,将最新的输入数据的V数据和V cache进行拼接,之后的运算可以参照现有技术中的注意力运算的相关介绍,这里不再赘述。最新的输入数据的K数据和V数据可以存储至存储器中,进而在机器学习模型对之后的输入数据进行运算时可以从存储器中获取。For example, referring to Figure 6B, the K cache and V cache are the historically generated intermediate results. After computing the token of the latest input data (e.g., through linear transformations by the Wq, Wk, and Wv matrices), the Q, K, and V data of the latest input data can be obtained. The K data of the latest input data can then be concatenated with the K cache and transposed, and the V data of the latest input data can be concatenated with the V cache. For the subsequent operations, refer to the description of attention operations in the prior art, which is not repeated here. The K and V data of the latest input data can be stored in the memory and retrieved from it when the machine learning model processes subsequent input data.
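The concatenate-and-attend flow around the K cache and V cache can be sketched as follows (a single-head, single-token illustration; the dimensions and the random stand-ins for the Wq/Wk/Wv projection outputs are assumptions):

```python
import numpy as np

def attend(q, K, V):
    """Single-head attention for one new token: softmax(K q / sqrt(d)) @ V."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)                 # one score per cached position
    w = np.exp(scores - scores.max())
    w /= w.sum()                                # attention weights (sum to 1)
    return w @ V                                # weighted sum of values

d = 8
rng = np.random.default_rng(1)
k_cache = rng.standard_normal((5, d))           # K from 5 earlier tokens
v_cache = rng.standard_normal((5, d))           # V from 5 earlier tokens

# New token: compute its q/k/v (random stand-ins for the projection
# outputs), append k/v to the caches, then attend over the full history.
q_new, k_new, v_new = rng.standard_normal((3, d))
k_cache = np.concatenate([k_cache, k_new[None]])  # reuse + extend K cache
v_cache = np.concatenate([v_cache, v_new[None]])  # reuse + extend V cache
out = attend(q_new, k_cache, v_cache)
```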
其中,机器学习模型可以为语言模型。Among them, the machine learning model can be a language model.
其中,注意力层可以对输入的数据进行自注意力(self attention)计算。The attention layer can perform self-attention calculations on the input data.
以机器学习模型为基于注意力机制的transformer模型为例,例如,参照图6A机器学习模型可以为语言模型,transformer是一种注意力机制(attention mechanism)的模型架构,其中的self-attention结构是其核心组件。每个transformer层包含两个主要部分:多头自注意力层(multi-head self attention)和前馈神经网络(feedforward neural network,FNN)。self-attention结构允许模型在处理序列数据时动态地关注不同位置的信息。它由三个主要部分组成:查询(query数据)、键(key数据)、和值(value数据)。对于一个输入序列,通过计算Q、K、V的线性变换,然后执行softmax操作,得到每个位置对其他位置的注意力分布,最后将这些分布加权得到当前位置的输出。自注意力结构用于处理输入数据中不同位置的信息。在自注意力中,每个位置的表示是由序列中所有其他位置的加权平均得到的,权重通过计算当前位置的查询、键和值之间的关系来确定。这使得模型能够对输入序列中的不同部分进行动态关注。Taking a machine learning model based on an attention mechanism, such as the transformer model (see Figure 6A), as an example, the machine learning model could be a language model. The transformer is a model architecture based on an attention mechanism, with self-attention being its core component. Each transformer layer contains two main parts: a multi-head self-attention layer and a feedforward neural network (FNN). The self-attention structure allows the model to dynamically focus on information at different positions when processing sequential data. It consists of three main parts: query data, key data, and value data. For an input sequence, by calculating the linear transformations of Q, K, and V, and then performing a softmax operation, the attention distribution of each position to other positions is obtained. Finally, these distributions are weighted to obtain the output at the current position. The self-attention structure is used to process information at different positions in the input data. In self-attention, the representation of each position is obtained by a weighted average of all other positions in the sequence, with the weights determined by calculating the relationship between the query, key, and value at the current position. This allows the model to dynamically focus on different parts of the input sequence.
在推理过程中,self-attention结构会为每个位置生成一个键-值(KV)对。这些KV对在注意力机制中用于计算不同位置的权重。KV cache是指在某个时间步上产生的KV对的集合,存储历史KV可以避免重复计算,进而加速模型的推理。During inference, the self-attention structure generates a key-value (KV) pair for each position. These KV pairs are used in the attention mechanism to calculate the weights at different positions. The KV cache refers to the set of KV pairs generated at a certain time step. Storing historical KV pairs can avoid repeated calculations, thereby accelerating the model's inference.
在一种可能的实现中,所述第一数据可以为上述介绍的机器学习模型在对之后的输入数据的运算过程中需要复用的数据。In one possible implementation, the first data can be the data that the machine learning model described above needs to reuse during the computation of subsequent input data.
在一种可能的实现中,所述第一数据可以为K数据或者V数据。也就是说,所述第一数据可以为K数据,所述第一数据可以为V数据,或者,所述第一数据可以为K数据和V数据。In one possible implementation, the first data can be K data or V data. That is, the first data can be K data, the first data can be V data, or the first data can be both K data and V data.
随着输入的数据尺寸的不断变大,需要存储的可以复用的中间结果的量随着推理的进行会迅速增长,导致对于存储的需求量很大。此外,过大的中间结果也会使推理过程变得极为缓慢,因此对可以复用的中间结果的压缩就显得尤为重要,因此,在将进行注意力计算时需要复用的中间结果(也就是第一数据)进行存储前,需要进行压缩处理,并存储压缩后的数据,从而降低存储开销,且在使用压缩后的数据时,需要读取压缩后的数据并对其进行解压缩,这样可以确保在计算注意力分布时保持模型的准确性。As the size of the input data grows, the amount of reusable intermediate results that need to be stored increases rapidly as inference progresses, leading to a large storage requirement. Furthermore, excessively large intermediate results also make the inference process extremely slow, so compressing the reusable intermediate results is particularly important. Therefore, before storing the intermediate results that need to be reused in attention computation (i.e., the first data), compression is performed and the compressed data is stored, thereby reducing storage overhead; when the compressed data is used, it is read and decompressed, which ensures that the model's accuracy is maintained when computing the attention distribution.
在一种可能的实现中,所述第一数据可以为对所述中间结果进行压缩后得到的。In one possible implementation, the first data may be obtained by compressing the intermediate result.
在一种可能的实现中,所述第一数据具体为对所述中间结果进行压缩以及解压缩后得到的。In one possible implementation, the first data is specifically obtained by compressing and decompressing the intermediate result.
具体的,在一种可能的实现中,第一数据可以为根据最新的输入数据(第一输入数据)通过机器学习模型得到的,也可以是对根据最新的输入数据通过机器学习模型得到的中间结果进行压缩并解压缩后得到的结果。Specifically, in one possible implementation, the first data can be obtained by a machine learning model based on the latest input data (first input data), or it can be the result obtained by compressing and decompressing the intermediate result obtained by a machine learning model based on the latest input data.
具体的,在一种可能的实现中,第一数据可以不是通过最新的输入数据(第一输入数据)得到的,而是从存储器中获取到的已经经过压缩后的数据,也可以是对从存储器中获取到的压缩数据通过解压缩得到的。应理解,压缩-解压-再次压缩的这个流程可以在根据sequence序号确定非均匀量化对应关系(或者是非线性映射函数的类型和参数)的时候才会出现。即同一个token产生的K或V,在不同时刻需要做不同的量化处理,如果在某一时刻,量化参数改变了,则需要解压缩---再次量化。Specifically, in one possible implementation, the first data may not be obtained from the latest input data (the first input data), but rather from compressed data retrieved from memory, or it may be obtained by decompressing compressed data retrieved from memory. It should be understood that this compression-decompression-recompression process only occurs when determining the non-uniform quantization correspondence (or the type and parameters of the nonlinear mapping function) based on the sequence number. That is, the K or V generated by the same token needs to undergo different quantization processes at different times. If the quantization parameters change at a certain time, decompression and requantization are required.
在第一数据为根据最新的输入数据得到的时,需要将其进行压缩并存储到存储器中。When the first data is obtained based on the latest input data, it needs to be compressed and stored in the memory.
在第一数据为根据不是最新的输入数据得到的,而是从存储器中获取到的时,可能需要对第一数据进行其他强度的压缩(也就是不同精度损失的压缩)并存储到存储器中。When the first data is obtained from memory instead of the latest input data, it may be necessary to compress the first data with different strengths (i.e., compression with different degrees of precision loss) and store it in memory.
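The decompress-then-recompress flow described above can be sketched with a simple uniform quantizer standing in for the compression step (the scales and values below are illustrative assumptions):

```python
import numpy as np

def quantize(x, scale, bits=8):
    """Uniform quantization to signed integers with the given scale."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

def dequantize(q, scale):
    """Invert quantization back to floating point."""
    return q.astype(np.float32) * scale

# A cached K/V entry was stored with one set of quantization parameters ...
kv = np.array([0.31, -0.07, 0.52, -0.48], dtype=np.float32)
stored = quantize(kv, scale=0.01)

# ... later the parameters for this entry change (illustrative values),
# so it is decompressed and compressed again with the new parameters.
recovered = dequantize(stored, scale=0.01)
restored = quantize(recovered, scale=0.005)
```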
502、对所述第一数据进行非均匀量化处理,得到第一压缩数据,并将所述第一压缩数据存储至存储器;502. Perform non-uniform quantization processing on the first data to obtain first compressed data, and store the first compressed data in the memory;
本申请实施例中,在获取到第一数据之后,可以对所述第一数据进行压缩处理,得到第一压缩数据,并将所述第一压缩数据存储至存储器,例如,存储器为内存cache。In this embodiment of the application, after obtaining the first data, the first data can be compressed to obtain the first compressed data, and the first compressed data can be stored in a memory, for example, a memory cache.
当前KV缓存压缩方法中使用的量化策略均为均匀量化。而大语言模型输出的KV数据的分布一般呈现出类似于高斯分布的非均匀分布。如图6C和图6D所示,大语言模型Llama-7B某层输出的Key缓存数据和Value缓存数据呈现出了非均匀分布的状态。均匀量化无法很好地适应KV数据的这种非均匀分布特性,会导致整体量化误差较大,对数值小的数据带来较大的影响(主要原因在于,KV数据的数据单元会集中分布在某一个小区间内,如果使用均匀量化,那么集中分布在某一个小区间内的数据单元会被转换为同一个或者很少数量的量化值,这会导致精度的损失很大)。Current key-value (KV) caching compression methods all use uniform quantization. However, the distribution of KV data output by large language models generally exhibits a non-uniform distribution similar to a Gaussian distribution. As shown in Figures 6C and 6D, the key and value cache data output by a certain layer of the Llama-7B large language model exhibit a non-uniform distribution. Uniform quantization cannot well adapt to this non-uniform distribution characteristic of KV data, leading to a larger overall quantization error and a greater impact on small numerical values (mainly because KV data units are concentrated in a small interval; if uniform quantization is used, these concentrated data units will be converted into the same or a small number of quantized values, resulting in a significant loss of accuracy).
本申请实施例中,基于KV数据的非均匀分布特性,对KV数据进行非均匀量化,可以更符合KV数据的分布特征,得到整体误差或者平均误差更小的量化结果,从而提升模型的处理精度。In this embodiment, based on the non-uniform distribution characteristics of KV data, non-uniform quantization of KV data can better conform to the distribution characteristics of KV data, resulting in quantization results with smaller overall or average errors, thereby improving the processing accuracy of the model.
具体的,本申请实施例中可以对第一数据中分布更密集的数值区间内进行更细粒度的量化,也就是插入更多的量化值以及对应的区间(由于量化值更多,因此区间的数值宽度会变低),例如,所述第一数据包括多个数据单元,所述第一压缩数据包括每个所述数据单元对应的量化值,所述第一数据的数据单元包括第一数值范围的数据和第二数值范围的数据,所述第一数值范围的数据比所述第二数值范围的数据更密集,每个所述数据单元所在的数据区间对应的量化值用于作为所述数据单元对应的量化值,其中,数值区间包括第一数值区间和第二数值区间,所述第一数值区间属于所述第一数值范围,所述第二数值区间属于所述第二数值范围,所述第一数值区间的数值宽度小于所述第二数值区间的数值宽度。Specifically, in this embodiment, finer-grained quantization can be performed on denser numerical intervals in the first data, that is, more quantized values and corresponding intervals can be inserted (since there are more quantized values, the numerical width of the intervals will be lower). For example, the first data includes multiple data units, the first compressed data includes quantized values corresponding to each data unit, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the quantized value corresponding to the data interval where each data unit is located is used as the quantized value corresponding to the data unit, wherein the numerical interval includes a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
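A minimal sketch of such interval-based non-uniform quantization, with narrower intervals where the data units concentrate (the interval edges below are illustrative assumptions):

```python
import numpy as np

# Illustrative non-uniform quantizer: interval edges are denser near zero,
# where KV values concentrate, and wider in the tails.
edges  = np.array([-1.0, -0.5, -0.2, -0.1, 0.0, 0.1, 0.2, 0.5, 1.0])
levels = (edges[:-1] + edges[1:]) / 2        # one quantized value per interval

def nonuniform_quantize(x):
    """Map each data unit to the code of the interval containing it."""
    idx = np.clip(np.searchsorted(edges, x) - 1, 0, len(levels) - 1)
    return idx.astype(np.uint8)

def nonuniform_dequantize(idx):
    """Replace each code by its interval's quantized value."""
    return levels[idx]

x = np.array([-0.83, -0.15, 0.03, 0.4])
codes = nonuniform_quantize(x)
x_hat = nonuniform_dequantize(codes)
# Values in narrow intervals near zero incur smaller quantization error
# than values in the wide tail intervals.
```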
在一种可能的实现中,可以对所述第一数据进行非线性变换,得到变换后的第一数据,对所述变换后的第一数据进行均匀量化处理。具体的,可以参照图7A所示的处理流程。In one possible implementation, the first data can be nonlinearly transformed to obtain transformed first data, and then the transformed first data can be uniformly quantized. Specifically, the processing flow shown in Figure 7A can be referred to.
例如,根据预定设置的非均匀映射函数(或者可以称之为非线性变换函数)对数据进行映射变换,非均匀映射函数可以是次方函数、对数函数、A律曲线、μ律曲线或分段线性曲线等,例如可以参照图6E所示的,图6E为非线性函数的一个示意;然后对映射后的数据进行均匀量化,均匀量化的量化系数可以为预设值,也可以是通过在线统计映射后数据的信息后自适应确定。For example, the data is mapped and transformed according to a predetermined non-uniform mapping function (or non-linear transformation function). The non-uniform mapping function can be a power function, a logarithmic function, an A-law curve, a μ-law curve, or a piecewise linear curve, etc. For example, as shown in Figure 6E, which is a schematic diagram of a non-linear function; then the mapped data is uniformly quantized. The quantization coefficient of uniform quantization can be a preset value or can be adaptively determined by online statistical analysis of the mapped data.
其中,上述非线性变换和均匀量化的步骤可以合并,此时非线性映射和均匀量化可合并成一个查表步骤:根据预先设定的非线性变换函数和均匀量化方式,可以构建出一个浮点数值区间和量化值的对应表,通过K数据或V数据的数值和表即可完成量化,得到量化后的K数据或V数据。Among them, the above-mentioned nonlinear transformation and uniform quantization steps can be combined. In this case, nonlinear mapping and uniform quantization can be combined into a table lookup step: according to the pre-set nonlinear transformation function and uniform quantization method, a correspondence table between floating-point value range and quantized value can be constructed. Quantization can be completed by matching the K data or V data with the table, and the quantized K data or V data can be obtained.
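The nonlinear-transform-plus-uniform-quantization composition can be sketched with a μ-law curve, one of the function types listed above (the μ value and bit width are illustrative assumptions):

```python
import numpy as np

MU = 255.0  # μ-law parameter (illustrative)

def mu_law(x):
    """Non-linear companding: expands resolution near zero."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_inv(y):
    """Inverse companding."""
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

def quantize(x, bits=4):
    """μ-law transform followed by uniform quantization over [-1, 1]."""
    levels = 2 ** bits - 1
    y = mu_law(np.clip(x, -1, 1))
    return np.round((y + 1) / 2 * levels).astype(np.uint8)

def dequantize(q, bits=4):
    levels = 2 ** bits - 1
    y = q.astype(np.float64) / levels * 2 - 1
    return mu_law_inv(y)

x = np.array([-0.5, -0.02, 0.0, 0.01, 0.6])
q = quantize(x)
x_hat = dequantize(q)
# The two steps can be fused into a single lookup table mapping
# floating-point intervals directly to codes, as described above.
```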
具体的,在一种可能的实现中,可以基于预设的映射关系将所述第一数据的数据单元转换为对应的量化值,得到第一压缩数据;其中,所述映射关系包括多个数值区间以及每个数值区间对应的量化值,所述第一数据的数据单元包括第一数值范围的数据和第二数值范围的数据,所述第一数值范围的数据比所述第二数值范围的数据更密集,所述多个数值区间包括第一数值区间和第二数值区间,所述第一数值区间属于所述第一数值范围,所述第二数值区间属于所述第二数值范围,所述第一数值区间的数值宽度小于所述第二数值区间的数值宽度。Specifically, in one possible implementation, the data units of the first data can be converted into corresponding quantized values based on a preset mapping relationship to obtain the first compressed data; wherein, the mapping relationship includes multiple numerical intervals and quantized values corresponding to each numerical interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range is denser than the data in the second numerical range, the multiple numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belongs to the first numerical range, the second numerical interval belongs to the second numerical range, and the numerical width of the first numerical interval is smaller than the numerical width of the second numerical interval.
在一种可能的实现中,可以根据所述第一数据的量化精度,确定进行所述非线性变换时所采用的非线性函数的类别或者包括的参数数值。不同的待量化数据,例如大语言模型不同层输出的KV缓存数据、不同head维度上的KV缓存数据、模型推理的不同时刻下的KV缓存数据,可能会使用不同的量化精度,可以根据量化精度(例如量化比特数)来自适应地确定量化映射关系,根据确定的量化映射关系进行量化。In one possible implementation, the type of nonlinear function or the parameter values included in the nonlinear transformation can be determined based on the quantization precision of the first data. Different data to be quantized, such as key-value cache data output from different layers of a large language model, key-value cache data at different head dimensions, and key-value cache data at different time points during model inference, may use different quantization precisions. The quantization mapping relationship can be adaptively determined based on the quantization precision (e.g., the number of quantization bits), and quantization can be performed according to the determined quantization mapping relationship.
示例性的,可以获取需要缓存的K数据或V数据,以及当前K数据或V数据的量化精度(量化比特数),根据量化精度确定非线性变换函数的类型和参数,由非线性变换函数类型和参数可以确定数值映射关系,函数类型可以是次方函数、对数函数、gamma映射函数、A律曲线、μ律曲线或PWL曲线等;对待量化的KV缓存数据进行非线性映射;对变换后的KV缓存数据执行均匀量化,输出量化后的K数据或V数据。For example, the K or V data to be cached and the quantization precision (number of quantization bits) of the current K or V data can be obtained. The type and parameters of the nonlinear transformation function are determined based on the quantization precision, and the numerical mapping relationship can be determined from them. The function type can be a power function, logarithmic function, gamma mapping function, A-law curve, μ-law curve, or PWL curve, etc. Nonlinear mapping is then performed on the KV cache data to be quantized, uniform quantization is performed on the transformed KV cache data, and the quantized K or V data is output.
非线性映射和均匀量化可合并成一个查表步骤:由确定的非线性变换函数和均匀量化方式,可以构建出一个浮点数值区间和量化值的对应表,通过K数据或V数据的数值和表即可完成量化,得到量化后的K数据或V数据。Nonlinear mapping and uniform quantization can be combined into a single lookup step: given a defined nonlinear transformation function and uniform quantization method, a table mapping floating-point value ranges to quantized values can be constructed. Quantization can then be completed by matching the values of K-data or V-data with the table, yielding the quantized K-data or V-data.
其中,根据量化精度确定非线性变换函数的类型和参数,可以通过多种方法实现:例如可以根据预设的量化精度和非线性变换函数的类型和参数的对应关系确定;也可以是一个量化精度对应多个预设的非线性变换函数的类型和参数,通过在验证集上进行测试进行挑选确定不同量化精度对应的非线性变换函数的类型和参数。具体的,可以参照图7A所示的处理流程。The determination of the type and parameters of the nonlinear transformation function based on the quantization precision can be achieved through various methods: for example, it can be determined based on the correspondence between a preset quantization precision and the type and parameters of the nonlinear transformation function; alternatively, one quantization precision can correspond to multiple preset types and parameters of nonlinear transformation functions, and the type and parameters of the nonlinear transformation function corresponding to different quantization precisions can be determined by testing on a validation set. Specifically, the processing flow shown in Figure 7A can be referenced.
通过上述方式,非均匀量化中的非均匀映射关系可以根据量化精度自适应确定,进一步减小了量化误差,能够在保持量化所带来的压缩收益的前提下,降低整体量化误差。具体的,可以参照图7B所示的处理流程。Using the above method, the non-uniform mapping relationship in non-uniform quantization can be adaptively determined according to the quantization precision, further reducing the quantization error: the overall quantization error is lowered while the compression gains brought by quantization are preserved. Specifically, refer to the processing flow shown in Figure 7B.
在一种可能的实现中,可以根据所述注意力层所在的网络层在所述机器学习模型中所处的位置,确定进行所述非线性变换时所采用的非线性函数的类别或者包括的参数数值。In one possible implementation, the type of nonlinear function or the parameter values used when performing the nonlinear transformation can be determined based on the position of the network layer containing the attention layer in the machine learning model.
在一种可能的实现中,所述第一数据为根据第一输入数据,通过机器学习模型的注意力层中的目标head得到的K数据和V数据;可以根据所述目标head在所述注意力层中所处的位置,确定进行所述非线性变换时所采用的非线性函数的类别或者包括的参数数值。In one possible implementation, the first data is K data and V data obtained from the target head in the attention layer of the machine learning model based on the first input data; the type of nonlinear function or the parameter values included in the nonlinear transformation can be determined based on the position of the target head in the attention layer.
在一种可能的实现中,可以根据所述第一数据与所述机器学习模型得到的最新数据之间的生成间隔(也就是第几个生成的,生成顺序可以通过在存储器中的存储顺序确定,该信息可以称之为sequence序号),确定进行所述非线性变换时所采用的非线性函数的类别或者包括的参数数值。In one possible implementation, the type of nonlinear function or the parameter values used when performing the nonlinear transformation can be determined based on the generation interval (i.e., which generation order is determined by the storage order in memory, which can be called the sequence number) between the first data and the latest data obtained by the machine learning model.
量化映射关系除了会随量化精度变化而改变之外,还会考虑数据的分组信息。例如对于不同层输出的KV缓存数据,可以使用不同的量化映射关系;对于不同head维度或sequence维度的KV缓存数据,也可以使用不同的量化映射关系。具体的,获取需要缓存的K数据或V数据,当前K数据或V数据的量化精度(量化比特数),以及分组信息。其中分组信息可以是输出当前K数据或V数据的层的序号、当前K数据或V数据的head维度序号、或者当前K数据或V数据的sequence序号、或者当前K数据或V数据的组序号,或者是验证集在当前分组的统计信息。Besides changing with quantization precision, the quantization mapping relationship also considers the grouping information of the data. For example, different quantization mapping relationships can be used for KV cached data output from different layers; different quantization mapping relationships can also be used for KV cached data with different head dimensions or sequence dimensions. Specifically, the required K or V data to be cached, the quantization precision (number of quantization bits) of the current K or V data, and the grouping information are obtained. The grouping information can be the sequence number of the layer outputting the current K or V data, the head dimension sequence number of the current K or V data, the sequence number of the current K or V data, the group sequence number of the current K or V data, or the statistical information of the validation set in the current group.
根据量化精度和数据分组信息确定非线性变换函数的类型和参数,由非线性变换函数类型和参数可以确定数值映射关系,函数类型可以是次方函数、对数函数、gamma映射函数、A律曲线、μ律曲线或PWL曲线等;对待量化的KV缓存数据进行非线性映射;对变换后的KV缓存数据执行均匀量化,输出量化后的K数据或V数据,非线性映射和均匀量化可合并成一个查表步骤:由确定的非线性变换函数和均匀量化方式,可以构建出一个浮点数值区间和量化值的对应表,通过K数据或V数据的数值和表即可完成量化,得到量化后的K数据或V数据。根据量化精度和数据分组信息确定非线性变换函数的类型和参数,可以通过多种方法实现:例如可以根据预设的量化精度、分组信息和非线性变换函数的类型和参数的对应关系确定;也可以是一个量化精度+分组信息对应多个预设的非线性变换函数的类型和参数,通过在验证集上进行测试确定不同量化精度和分组信息对应的非线性变换函数的类型和参数。The type and parameters of the nonlinear transformation function are determined based on the quantization precision and data grouping information. The numerical mapping relationship can be determined from the type and parameters of the nonlinear transformation function. The function type can be a power function, logarithmic function, gamma mapping function, A-law curve, μ-law curve, or PWL curve, etc. Nonlinear mapping is then performed on the KV buffer data to be quantized. Uniform quantization is then performed on the transformed KV buffer data, outputting the quantized K or V data. Nonlinear mapping and uniform quantization can be combined into a single lookup step: based on the determined nonlinear transformation function and uniform quantization method, a table mapping floating-point value ranges to quantized values can be constructed. Quantization can then be completed by matching the K or V data values to the table, yielding the quantized K or V data. 
Determining the type and parameters of the nonlinear transformation function based on the quantization precision and data grouping information can be achieved through various methods: for example, it can be determined based on a preset correspondence between the quantization precision, grouping information, and the type and parameters of the nonlinear transformation function; alternatively, one quantization precision plus grouping information can correspond to multiple preset types and parameters of nonlinear transformation functions, and the type and parameters of the nonlinear transformation function corresponding to different quantization precisions and grouping information can be determined by testing on a validation set.
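Selecting the transform type and parameters from the quantization precision and grouping information can be sketched as a simple table lookup (the table entries, function names, and parameter values below are assumptions for illustration only):

```python
import numpy as np

# Illustrative mapping from (quantization bits, layer index) to the
# nonlinear transform and its parameter; entries are assumptions.
TRANSFORM_TABLE = {
    (4, 0): ("power", 0.5),     # coarse precision, first layer
    (4, 1): ("mu_law", 255.0),
    (8, 0): ("power", 0.75),    # finer precision tolerates a milder curve
}
DEFAULT = ("mu_law", 255.0)

def select_transform(bits, layer_idx):
    """Pick the transform type and parameter for this precision and group."""
    return TRANSFORM_TABLE.get((bits, layer_idx), DEFAULT)

def apply_transform(x, kind, param):
    if kind == "power":
        return np.sign(x) * np.abs(x) ** param
    if kind == "mu_law":
        return np.sign(x) * np.log1p(param * np.abs(x)) / np.log1p(param)
    raise ValueError(kind)

kind, param = select_transform(4, 0)
y = apply_transform(np.array([-0.25, 0.04, 0.81]), kind, param)
```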
通过上述方式,非均匀量化中的非均匀映射关系可以根据数据分组信息(例如包括所述注意力层所在的网络层在所述机器学习模型中所处的位置、目标head在所述注意力层中所处的位置、生成顺序中的至少一种)自适应确定,进一步减小了量化误差,能够在保持量化所带来的压缩收益的前提下,降低整体量化误差。具体的,可以参照图7C所示的处理流程。In this way, the non-uniform mapping relationship in non-uniform quantization can be adaptively determined based on the data grouping information (e.g., at least one of the position of the network layer containing the attention layer in the machine learning model, the position of the target head in the attention layer, and the generation order), further reducing the quantization error: the overall quantization error is lowered while the compression gains brought by quantization are preserved. Specifically, the processing flow shown in Figure 7C can be referenced.
在一种可能的实现中,还可以根据所述第一数据的数据分布,确定进行所述非线性变换时所采用的非线性函数的类别或者包括的参数数值;所述数据分布通过所述第一数据的分布统计信息指示,或者通过标识来指示,其中,不同的所述标识对应不同特征的数据分布。例如,量化映射关系除了会随量化精度变化而改变之外,还会考虑当前待量化数据的分布信息。例如可根据量化精度和当前量化数据的离散程度确定量化映射关系。In one possible implementation, the type of nonlinear function or the parameter values included in the nonlinear transformation can be determined based on the data distribution of the first data; the data distribution is indicated by the distribution statistics of the first data or by an identifier, wherein different identifiers correspond to data distributions with different characteristics. For example, the quantization mapping relationship changes not only with the quantization precision but also considers the distribution information of the current data to be quantized. For example, the quantization mapping relationship can be determined based on the quantization precision and the dispersion of the current quantized data.
Specifically, the K data or V data to be cached is obtained, together with its quantization precision (number of quantization bits) and its distribution information. The distribution information consists of statistics computed from the current K or V data, such as the mean, variance, maximum, minimum, and value range. The type and parameters of the nonlinear transformation function are determined from the quantization precision and the distribution information; the function type and parameters in turn determine the numerical mapping relationship. The function type can be a power function, a logarithmic function, a gamma mapping function, an A-law curve, a μ-law curve, a PWL (piecewise-linear) curve, or similar. Nonlinear mapping is applied to the KV cache data to be quantized, uniform quantization is then applied to the transformed data, and the quantized K or V data is output. The distribution information is saved for use in the dequantization stage. For details, refer to the processing flow shown in Figure 7D.
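A minimal sketch of this write path, assuming μ-law as the chosen nonlinear function (any of the listed function types could be substituted); the `quantize_kv` name, the per-tensor normalization, and the defaults are illustrative:

```python
import numpy as np

def quantize_kv(x, bits=4, mu=255.0):
    """Write path sketch: record distribution statistics, apply the assumed
    mu-law map, then uniform-quantize to 2**bits levels. The statistics
    are kept alongside the codes for the dequantization stage."""
    stats = {"min": float(x.min()), "max": float(x.max())}
    scale = max(abs(stats["min"]), abs(stats["max"])) or 1.0
    xn = x / scale                                        # normalize to [-1, 1]
    warped = np.sign(xn) * np.log1p(mu * np.abs(xn)) / np.log1p(mu)
    levels = 2 ** bits - 1
    q = np.round((warped + 1.0) / 2.0 * levels).astype(np.uint8)
    meta = {"scale": scale, "mu": mu, "bits": bits, **stats}
    return q, meta                # q is stored; meta rides along for reads
```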
Nonlinear mapping and uniform quantization can be merged into a single table-lookup step: from the chosen nonlinear transformation function and uniform quantization scheme, a table mapping floating-point value intervals to quantized values can be constructed, and quantization is completed by looking up the K or V data values in this table, yielding the quantized K or V data. Determining the function type and parameters from the quantization precision and distribution information can be done in various ways: for example, calibrated distribution information can be obtained by testing on a validation set, and during the quantization and dequantization stages the function type and parameters are determined from the difference between the current distribution information and the calibrated distribution information, together with the quantization precision.
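The fold-into-a-table step might look like the following sketch, with μ-law again assumed as the nonlinear function; the function names are hypothetical:

```python
import numpy as np

def build_quant_table(bits=4, mu=255.0):
    """Interval edges in the float domain, one code per interval: uniform
    edges in the warped domain are pulled back through the inverse of the
    assumed mu-law map."""
    levels = 2 ** bits
    warped_edges = np.linspace(-1.0, 1.0, levels + 1)[1:-1]
    return np.sign(warped_edges) * ((1 + mu) ** np.abs(warped_edges) - 1) / mu

def quantize_by_table(x, edges):
    # each value gets the code of the float-domain interval it falls into
    return np.searchsorted(edges, x).astype(np.uint8)
```

Quantization then costs one binary search per value, with no transcendental functions on the hot path.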
The non-uniform mapping relationship in non-uniform quantization can thus be adaptively determined from the data distribution information, further reducing quantization error: the overall quantization error is substantially reduced while the compression gains of quantization are preserved.
503. Read the first compressed data from the memory and perform, on the first compressed data, the dequantization processing corresponding to the non-uniform quantization processing to obtain second data. The second data and second input data are used as input to the machine learning model, the second input data being data input to the machine learning model after the first input data.
Depending on how the nonlinear function was determined, dequantization can be performed in, but is not limited to, the following ways.
When the KV Cache data is needed, the quantized KV Cache data must be dequantized. The dequantization stage proceeds as follows: obtain the quantized K or V data; apply uniform dequantization to the quantized KV cache data using the same uniform quantization scheme as in the quantization stage; apply, to the dequantized data, the nonlinear inverse transformation function corresponding to the nonlinear transformation used in the quantization stage; and obtain the decompressed KV Cache data for inference.
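A sketch of this read path, matching an assumed μ-law write path (so the inverse transform is the μ-law expansion); the `dequantize_kv` name and defaults are illustrative:

```python
import numpy as np

def dequantize_kv(q, scale, bits=4, mu=255.0):
    """Read path sketch: uniform dequantization back to [-1, 1], then the
    inverse of the assumed mu-law transform, then undo the normalization."""
    levels = 2 ** bits - 1
    warped = q.astype(np.float64) / levels * 2.0 - 1.0            # uniform dequant
    xn = np.sign(warped) * ((1 + mu) ** np.abs(warped) - 1) / mu  # inverse map
    return xn * scale
```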
Uniform dequantization and the nonlinear inverse mapping can be merged into a single table-lookup step: from the uniform dequantization scheme and the preset nonlinear inverse transformation function, a table mapping quantized values to floating-point values can be constructed, and dequantization is completed by looking up the quantized K or V data values in this table, yielding the dequantized K or V data.
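Precomputing that table takes one reconstruction value per code; the sketch below again assumes the μ-law inverse as the preset inverse function:

```python
import numpy as np

def build_dequant_table(bits=4, mu=255.0):
    """One floating-point reconstruction value per quantized code: uniform
    dequantization composed with the assumed inverse mu-law map."""
    codes = np.arange(2 ** bits, dtype=np.float64)
    warped = codes / (2 ** bits - 1) * 2.0 - 1.0
    return np.sign(warped) * ((1 + mu) ** np.abs(warped) - 1) / mu

table = build_dequant_table()
stored = np.array([0, 8, 15], dtype=np.uint8)
recovered = table[stored]      # dequantization is one index lookup per value
```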
When the KV Cache data is needed, the quantized KV Cache data must be dequantized. The dequantization stage proceeds as follows: obtain the quantized K or V data and its quantization precision, and apply uniform dequantization to the quantized KV cache data using the same uniform quantization scheme as in the quantization stage; determine the type and parameters of the nonlinear inverse transformation function from the quantization precision, the function being the inverse of the function used in the quantization stage; apply the nonlinear inverse transformation function to the dequantized data; and obtain the decompressed KV Cache data for inference.
The nonlinear inverse mapping and uniform dequantization can be merged into a single table-lookup step: from the uniform dequantization scheme and the nonlinear inverse transformation function determined above, a table mapping quantized values to floating-point values can be constructed, and dequantization is completed by looking up the quantized K or V data values in this table, yielding the dequantized K or V data.
When the KV Cache data is needed, the quantized KV Cache data must be dequantized. The dequantization stage proceeds as follows: obtain the quantized K or V data, the quantization precision (number of quantization bits) of the current quantized KV data, and the grouping information, where the grouping information can be the index of the layer that output the current K or V data, its head-dimension index, its sequence index, or its group index; apply uniform dequantization to the quantized KV cache data using the same uniform quantization scheme as in the quantization stage; determine the type and parameters of the nonlinear inverse transformation function from the quantization precision and data grouping information obtained in step 1, the function being the inverse of the function used in the quantization stage; apply the nonlinear inverse transformation function to the dequantized data; and obtain the decompressed KV Cache data for inference.
The nonlinear inverse mapping and uniform dequantization can be merged into a single table-lookup step: from the uniform dequantization scheme and the nonlinear inverse transformation function determined in step 3, a table mapping quantized values to floating-point values can be constructed, and dequantization is completed by looking up the quantized K or V data values in this table, yielding the dequantized K or V data.
When the KV Cache data is needed, the quantized KV Cache data must be dequantized. The dequantization stage proceeds as follows: obtain the quantized K or V data, the quantization precision (number of quantization bits) of the current quantized KV data, and the distribution information of the current K or V data, where the distribution information consists of statistics computed from the current K or V data, such as the mean, variance, maximum, minimum, and value range; apply uniform dequantization to the quantized KV cache data using the same uniform quantization scheme as in the quantization stage; determine the type and parameters of the nonlinear inverse transformation function from the obtained quantization precision and distribution information, the function being the inverse of the function used in the quantization stage; apply the nonlinear inverse transformation function to the dequantized data; and obtain the decompressed KV Cache data for inference.
The nonlinear inverse mapping and uniform dequantization can be merged into a single table-lookup step: from the uniform dequantization scheme and the determined nonlinear inverse transformation function, a table mapping quantized values to floating-point values can be constructed, and dequantization is completed by looking up the quantized K or V data values in this table, yielding the dequantized K or V data.
The beneficial effects of the embodiments of this application are described next with reference to experiments:
The wikitext2 validation set was evaluated on the Llama-7B model: the KV Cache data generated during inference was quantized with the non-uniform quantization method proposed in this application, and the cached K or V data was dequantized whenever inference needed it. With a sequence length of 2048, the compression-ratio versus test-perplexity (PPL) curves are shown in Figure 7E. As the figure shows, compared with uniform quantization, non-uniform quantization achieves better inference performance at the same compression ratio (a lower PPL indicates better inference performance).
The wikitext2 validation set was evaluated on the Llama-7B model: the KV Cache data generated during inference was quantized with the scheme proposed in this application that determines the quantization mapping relationship from the quantization precision, and the cached K or V data was dequantized whenever inference needed it. With a sequence length of 2048, the compression-ratio versus test-PPL curves are shown in Figure 7F. As the figure shows, compared with a preset non-uniform quantization, determining the quantization mapping relationship from the quantization precision achieves better inference performance at the same compression ratio (a lower PPL indicates better inference performance).
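The direction of these results can be reproduced in miniature: for data concentrated near zero, as KV activations typically are, companded 4-bit quantization has a lower reconstruction error than plain uniform 4-bit quantization. The Laplace distribution and μ-law function below are stand-ins chosen for illustration, not the experimental setup of the application:

```python
import numpy as np

rng = np.random.default_rng(0)
# heavy-tailed data concentrated near zero, a rough stand-in for KV values
x = np.clip(rng.laplace(scale=0.1, size=100_000), -1.0, 1.0)
bits, mu = 4, 255.0
levels = 2 ** bits - 1

# plain uniform 4-bit quantization over [-1, 1]
qu = np.round((x + 1) / 2 * levels)
xu = qu / levels * 2 - 1

# mu-law companding, then the same uniform 4-bit quantization
w = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
qn = np.round((w + 1) / 2 * levels)
wn = qn / levels * 2 - 1
xn = np.sign(wn) * ((1 + mu) ** np.abs(wn) - 1) / mu

mse_uniform = float(np.mean((x - xu) ** 2))
mse_mulaw = float(np.mean((x - xn) ** 2))
```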
It should be understood that the above methods for determining the nonlinear function, and the non-uniform quantization performed by combining that function with uniform quantization, can also be applied to compressing and decompressing data other than KV data; such other data can be, but is not limited to, data that the machine learning model's inference process needs to reuse.
Specifically, this application provides a data processing method, the method comprising: acquiring first data, the first data being computed through a machine learning model from first input data; performing a nonlinear transformation on the first data using a nonlinear function to obtain transformed first data; performing uniform quantization on the transformed first data to obtain first compressed data, and storing the first compressed data in a memory; reading the first compressed data from the memory and performing, on the first compressed data, the inverse transformation corresponding to the nonlinear transformation and the dequantization corresponding to the uniform quantization to obtain second data, the second data and second input data being used as input to the machine learning model, the second input data being data input to the machine learning model after the first input data; wherein the type of the nonlinear function, or the parameter values it includes, is determined based on at least one of the following: the quantization precision of the first data; the position, within the machine learning model, of the network that generates the first data; the generation order of the first data in the machine learning model; or the data distribution of the first data.
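The method of the paragraph above can be sketched end to end; μ-law is an assumed instance of the nonlinear function, and the `kv_roundtrip` name and 4-bit default are illustrative:

```python
import numpy as np

def kv_roundtrip(x, bits=4, mu=255.0):
    """End-to-end sketch of the claimed method with mu-law as the assumed
    nonlinear function: transform, uniform-quantize, store, then invert."""
    levels = 2 ** bits - 1
    scale = float(np.abs(x).max()) or 1.0
    xn = x / scale
    # write path: nonlinear transform + uniform quantization
    w = np.sign(xn) * np.log1p(mu * np.abs(xn)) / np.log1p(mu)
    q = np.round((w + 1.0) / 2.0 * levels).astype(np.uint8)  # stored form
    # read path: uniform dequantization + inverse transform
    w2 = q.astype(np.float64) / levels * 2.0 - 1.0
    x2 = np.sign(w2) * ((1 + mu) ** np.abs(w2) - 1) / mu * scale
    return q, x2
```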
In one possible implementation, the first data includes multiple data units, and the first compressed data includes a quantized value corresponding to each data unit. The data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range being denser than the data in the second numerical range. The quantized value corresponding to the numerical interval in which each data unit falls is used as that data unit's quantized value, where the numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belonging to the first numerical range, the second numerical interval belonging to the second numerical range, and the numerical width of the first numerical interval being smaller than that of the second numerical interval.
In one possible implementation, the data units of the first data can be converted into corresponding quantized values based on a preset mapping relationship to obtain the first compressed data. The mapping relationship includes multiple numerical intervals and a quantized value for each interval; the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range being denser than the data in the second numerical range; and the multiple numerical intervals include a first numerical interval belonging to the first numerical range and a second numerical interval belonging to the second numerical range, the numerical width of the first numerical interval being smaller than that of the second numerical interval.
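A toy instance of such a mapping; the edge values below are hypothetical, chosen only to exhibit the narrower-interval-where-denser structure:

```python
import numpy as np

# Four intervals over [-1, 1]. Values are assumed to cluster near zero, so
# the two inner intervals are narrow (width 0.2) and the two outer intervals
# wide (width 0.8). These edges are illustrative, not preset values from
# the specification.
interior_edges = np.array([-0.2, 0.0, 0.2])

def quantize(x):
    # the code is the index of the interval the value falls into
    return int(np.searchsorted(interior_edges, x))
```

Nearby values in the dense region land in different intervals and stay distinguishable, while each side of the sparse outer region shares one wide interval.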
In one possible implementation, when non-uniform quantization is applied to the first data, the first data can be nonlinearly transformed to obtain transformed first data, and uniform quantization is then applied to the transformed first data.
Referring to Figure 8, a schematic structural diagram of a data processing apparatus provided in an embodiment of this application: as shown in Figure 8, the data processing apparatus 800 includes:
an acquisition module 801, configured to acquire first data, the first data being obtained through a machine learning model from first input data, and to read first compressed data from the memory;
For a detailed description of the acquisition module 801, refer to the description of step 501 in the above embodiment, which is not repeated here.
a processing module 802, configured to perform non-uniform quantization on the first data to obtain the first compressed data, store the first compressed data in a memory, and perform, on the first compressed data, the dequantization corresponding to the non-uniform quantization to obtain second data, the second data and second input data being used as input to the machine learning model, the second input data being data input to the machine learning model after the first input data.
For a detailed description of the processing module 802, refer to the descriptions of steps 502 and 503 in the above embodiment, which are not repeated here.
In one possible implementation, the first data includes multiple data units, and the first compressed data includes a quantized value corresponding to each data unit. The data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range being denser than the data in the second numerical range. The quantized value corresponding to the numerical interval in which each data unit falls is used as that data unit's quantized value, where the numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belonging to the first numerical range, the second numerical interval belonging to the second numerical range, and the numerical width of the first numerical interval being smaller than that of the second numerical interval.
In one possible implementation, the processing module 802 is specifically configured to:
convert the data units of the first data into corresponding quantized values based on a preset mapping relationship to obtain the first compressed data, wherein the mapping relationship includes multiple numerical intervals and a quantized value for each interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range being denser than the data in the second numerical range, and the multiple numerical intervals include a first numerical interval belonging to the first numerical range and a second numerical interval belonging to the second numerical range, the numerical width of the first numerical interval being smaller than that of the second numerical interval.
In one possible implementation, the processing module 802 is specifically configured to:
perform a nonlinear transformation on the first data to obtain transformed first data; and
perform uniform quantization on the transformed first data.
In one possible implementation, the processing module 802 is further configured to:
determine, based on the quantization precision of the first data, the type of the nonlinear function used in the nonlinear transformation or the parameter values it includes.
In one possible implementation, the processing module 802 is further configured to:
determine, based on the position within the machine learning model of the network layer containing the attention layer, the type of the nonlinear function used in the nonlinear transformation or the parameter values it includes.
In one possible implementation, the first data is obtained from the first input data through a target head in an attention layer of the machine learning model; the processing module 802 is further configured to:
determine, based on the position of the target head within the attention layer, the type of the nonlinear function used in the nonlinear transformation or the parameter values it includes.
In one possible implementation, the processing module 802 is further configured to:
determine, based on the generation interval between the first data and the latest data obtained by the machine learning model, the type of the nonlinear function used in the nonlinear transformation or the parameter values it includes.
In one possible implementation, the processing module 802 is further configured to:
determine, based on the data distribution of the first data, the type of the nonlinear function used in the nonlinear transformation or the parameter values it includes; the data distribution is indicated by distribution statistics of the first data, or by an identifier, where different identifiers correspond to data distributions with different characteristics.
An embodiment of this application further provides a data processing apparatus, the apparatus comprising:
an acquisition module, configured to acquire first data, the first data being computed through a machine learning model from first input data, and to read the first compressed data from the memory; and
a processing module, configured to perform a nonlinear transformation on the first data using a nonlinear function to obtain transformed first data; perform uniform quantization on the transformed first data to obtain first compressed data, and store the first compressed data in a memory; and perform, on the first compressed data, the inverse transformation corresponding to the nonlinear transformation and the dequantization corresponding to the uniform quantization to obtain second data, the second data and second input data being used as input to the machine learning model, the second input data being data input to the machine learning model after the first input data.
In one possible implementation, the type of the nonlinear function, or the parameter values it includes, is determined based on at least one of the following:
the quantization precision of the first data; or
the position, within the machine learning model, of the network that generates the first data; or
the generation order of the first data in the machine learning model; or
the data distribution of the first data.
In one possible implementation, the first data includes multiple data units, and the first compressed data includes a quantized value corresponding to each data unit. The data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range being denser than the data in the second numerical range. The quantized value corresponding to the numerical interval in which each data unit falls is used as that data unit's quantized value, where the numerical intervals include a first numerical interval and a second numerical interval, the first numerical interval belonging to the first numerical range, the second numerical interval belonging to the second numerical range, and the numerical width of the first numerical interval being smaller than that of the second numerical interval.
In one possible implementation, the processing module is specifically configured to:
convert the data units of the first data into corresponding quantized values based on a preset mapping relationship to obtain the first compressed data, wherein the mapping relationship includes multiple numerical intervals and a quantized value for each interval, the data units of the first data include data in a first numerical range and data in a second numerical range, the data in the first numerical range being denser than the data in the second numerical range, and the multiple numerical intervals include a first numerical interval belonging to the first numerical range and a second numerical interval belonging to the second numerical range, the numerical width of the first numerical interval being smaller than that of the second numerical interval.
The following describes a terminal device provided in an embodiment of this application. Referring to Figure 9, a schematic structural diagram of the terminal device: the terminal device 900 may be, without limitation, a virtual reality (VR) device, a mobile phone, a tablet, a laptop computer, a smart wearable device, or the like. Specifically, the terminal device 900 includes a receiver 901, a transmitter 902, a processor 903, and a memory 904 (the terminal device 900 may contain one or more processors 903; Figure 9 takes one processor as an example), where the processor 903 may include an application processor 9031 and a communication processor 9032. In some embodiments of this application, the receiver 901, transmitter 902, processor 903, and memory 904 may be connected via a bus or by other means.
The memory 904 may include read-only memory and random access memory, and provides instructions and data to the processor 903. A portion of the memory 904 may also include non-volatile random access memory (NVRAM). The memory 904 stores operation instructions, executable modules, or data structures, or a subset or an extended set thereof, where the operation instructions may include various instructions for implementing various operations.
The processor 903 controls the operation of the execution device. In a specific application, the components of the execution device are coupled together through a bus system, which may include, in addition to a data bus, a power bus, a control bus, and a status signal bus. For clarity, however, the various buses are all referred to as the bus system in the figure.
The methods disclosed in the above embodiments of this application can be applied to, or implemented by, the processor 903. The processor 903 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above methods can be completed by integrated logic circuits in hardware within the processor 903 or by instructions in software form. The processor 903 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 903 can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 904; the processor 903 reads the information in the memory 904 and, in combination with its hardware, completes the steps of the above methods that involve the model training or model inference process.
接收器901可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器902可用于通过第一接口输出数字或字符信息;发射器902还可用于通过第一接口向磁盘组发送指令,以修改磁盘组中的数据;发射器902还可以包括显示屏等显示设备。Receiver 901 can be used to receive input digital or character information, and to generate signal inputs related to the settings and function control of the execution device. Transmitter 902 can be used to output digital or character information through the first interface; transmitter 902 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; transmitter 902 may also include a display device such as a display screen.
本申请实施例还提供了一种服务器,请参阅图10,图10是本申请实施例提供的服务器一种结构示意图,服务器1000可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1010(例如,一个或一个以上处理器)和存储器1032,一个或一个以上存储应用程序1042或数据1044的存储介质1030(例如一个或一个以上海量存储设备)。其中,存储器1032和存储介质1030可以是短暂存储或持久存储。存储在存储介质1030的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器1010可以设置为与存储介质1030通信,在服务器1000上执行存储介质1030中的一系列指令操作。This application embodiment also provides a server. Referring to Figure 10, which is a schematic diagram of a server structure provided in this application embodiment, the server 1000 can vary significantly due to different configurations or performance. It may include one or more central processing units (CPUs) 1010 (e.g., one or more processors) and memory 1032, and one or more storage media 1030 (e.g., one or more mass storage devices) for storing application programs 1042 or data 1044. The memory 1032 and storage media 1030 can be temporary or persistent storage. The program stored in the storage media 1030 may include one or more modules (not shown in the figure), each module may include a series of instruction operations on the server. Furthermore, the CPU 1010 may be configured to communicate with the storage media 1030 and execute the series of instruction operations in the storage media 1030 on the server 1000.
服务器1000还可以包括一个或一个以上电源1026,一个或一个以上有线或无线网络接口1050,一个或一个以上输入输出接口1058;或,一个或一个以上操作系统1041,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。Server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058; or, one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
本申请实施例中,中央处理器1010,用于执行上述实施例中和模型训练或者模型推理相关的动作。In this embodiment, the central processing unit 1010 is used to perform actions related to model training or model inference in the above embodiments.
本申请实施例中还提供一种包括计算机程序产品,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。This application also provides a computer program product that, when run on a computer, causes the computer to perform steps as performed by the aforementioned execution device, or causes the computer to perform steps as performed by the aforementioned training device.
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于进行信号处理的程序,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。This application also provides a computer-readable storage medium storing a program for signal processing, which, when run on a computer, causes the computer to perform steps as performed by the aforementioned execution device, or causes the computer to perform steps as performed by the aforementioned training device.
本申请实施例提供的执行设备、训练设备或终端设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使执行设备内的芯片执行上述实施例描述的数据处理方法,或者,以使训练设备内的芯片执行上述实施例描述的数据处理方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。The execution device, training device, or terminal device provided in this application embodiment can specifically be a chip. The chip includes a processing unit and a communication unit. The processing unit can be, for example, a processor, and the communication unit can be, for example, an input/output interface, pins, or circuits. The processing unit can execute computer execution instructions stored in the storage unit to cause the chip within the execution device to execute the data processing method described in the above embodiments, or to cause the chip within the training device to execute the data processing method described in the above embodiments. Optionally, the storage unit is a storage unit within the chip, such as a register or cache. The storage unit can also be a storage unit located outside the chip within the wireless access device, such as read-only memory (ROM) or other types of static storage devices capable of storing static information and instructions, such as random access memory (RAM).
具体的,请参阅图11,图11为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 1100,NPU 1100作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路1103,通过控制器1104控制运算电路1103提取存储器中的矩阵数据并进行乘法运算。Specifically, please refer to Figure 11, which is a schematic diagram of a chip structure provided in an embodiment of this application. The chip can be represented as a neural network processor (NPU) 1100. The NPU 1100 is mounted as a coprocessor on the host CPU, and tasks are allocated by the host CPU. The core part of the NPU is the arithmetic circuit 1103, which is controlled by the controller 1104 to extract matrix data from the memory and perform multiplication operations.
在一些实现中,运算电路1103内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路1103是二维脉动阵列。运算电路1103还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1103是通用的矩阵处理器。In some implementations, the arithmetic circuit 1103 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 1103 is a two-dimensional systolic array. The arithmetic circuit 1103 can also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1103 is a general-purpose matrix processor.
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器1102中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器1101中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1108中。For example, suppose we have an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit retrieves the corresponding data of matrix B from the weight memory 1102 and caches it in each PE of the arithmetic circuit. The arithmetic circuit retrieves the data of matrix A from the input memory 1101 and performs matrix operations with matrix B. The partial result or the final result of the obtained matrix is stored in the accumulator 1108.
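The data flow just described (matrix B cached in the PEs, matrix A streamed through, partial results collected in accumulator 1108) can be sketched as a short simulation. This is an illustrative model only, not the embodiment's implementation; the function name and the NumPy-based rank-1-update formulation are assumptions made for clarity:

```python
import numpy as np

def npu_matmul_sketch(a, b):
    """Illustrative model of the arithmetic circuit's matrix multiply:
    B is held stationary (as if cached in the PEs), A streams through
    one column per step, and each step's partial product is added into
    an accumulator playing the role of accumulator 1108."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    acc = np.zeros((m, n), dtype=np.int64)  # the accumulator
    for step in range(k):
        # One "wavefront": a rank-1 update from column `step` of A
        # and row `step` of B; partial results build up in `acc`.
        acc += np.outer(a[:, step], b[step, :])
    return acc  # final result of C = A x B

a = np.arange(6).reshape(2, 3)   # input matrix A
b = np.arange(12).reshape(3, 4)  # weight matrix B
c = npu_matmul_sketch(a, b)      # output matrix C
```

The point of the sketch is that the full product never has to be formed in one step: the accumulator holds a valid partial result after every wavefront, which matches the description of partial or final results being stored in accumulator 1108.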
统一存储器1106用于存放输入数据以及输出数据。权重数据通过存储单元访问控制器(Direct Memory Access Controller,DMAC)1105被搬运到权重存储器1102中。输入数据也通过DMAC被搬运到统一存储器1106中。The unified memory 1106 is used to store input data and output data. Weight data is moved into the weight memory 1102 by the direct memory access controller (DMAC) 1105, and input data is likewise moved into the unified memory 1106 via the DMAC.
BIU为Bus Interface Unit,即总线接口单元1110,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer,IFB)1109的交互。BIU stands for Bus Interface Unit, i.e., the bus interface unit 1110, which handles the interaction of the AXI bus with the DMAC and the instruction fetch buffer (IFB) 1109.
总线接口单元1110(Bus Interface Unit,简称BIU),用于取指存储器1109从外部存储器获取指令,还用于存储单元访问控制器1105从外部存储器获取输入矩阵A或者权重矩阵B的原数据。The bus interface unit (BIU) 1110 is used by the instruction fetch buffer 1109 to fetch instructions from the external memory, and by the memory-unit access controller 1105 to fetch the raw data of the input matrix A or the weight matrix B from the external memory.
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器1106,或将权重数据搬运到权重存储器1102中,或将输入数据搬运到输入存储器1101中。The DMAC is mainly used to move input data from the external memory (DDR) into the unified memory 1106, to move weight data into the weight memory 1102, or to move input data into the input memory 1101.
向量计算单元1107包括多个运算处理单元,在需要的情况下,对运算电路1103的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。The vector computation unit 1107 includes multiple arithmetic processing units that, when needed, further process the output of the arithmetic circuit 1103 with operations such as vector multiplication, vector addition, exponentiation, logarithms, and magnitude comparison. It is mainly used for non-convolutional/fully-connected layer computation in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
在一些实现中,向量计算单元1107能将经处理的输出的向量存储到统一存储器1106。例如,向量计算单元1107可以将线性函数或非线性函数应用到运算电路1103的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1107生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路1103的激活输入,例如用于在神经网络中的后续层中的使用。In some implementations, the vector computation unit 1107 can store the processed output vector in the unified memory 1106. For example, the vector computation unit 1107 can apply a linear or nonlinear function to the output of the arithmetic circuit 1103, such as linearly interpolating the feature planes extracted by a convolutional layer, or accumulating a vector of values, in order to generate activation values. In some implementations, the vector computation unit 1107 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1103, for example for use in subsequent layers of the neural network.
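The vector unit's role after the matrix multiply can be sketched as follows: normalize the raw accumulator output, apply a scale and shift, then a nonlinearity to produce activation values. The function name, the choice of ReLU as the nonlinearity, and the parameter defaults are illustrative assumptions, not taken from the embodiment:

```python
import numpy as np

def vector_postprocess(acc_out, gamma=1.0, beta=0.0, eps=1e-5):
    """Illustrative post-processing in the style of vector computation
    unit 1107: normalize the raw matmul output across the batch axis
    (batch normalization), apply a scale/shift, then a ReLU
    nonlinearity to produce activation values for a subsequent layer."""
    mean = acc_out.mean(axis=0)
    var = acc_out.var(axis=0)
    normed = (acc_out - mean) / np.sqrt(var + eps)  # normalized values
    return np.maximum(gamma * normed + beta, 0.0)   # ReLU activation

raw = np.array([[1.0, -2.0],
                [3.0,  4.0]])  # raw output from the arithmetic circuit
act = vector_postprocess(raw)  # activation values, all non-negative
```

Because the result is itself a well-formed activation tensor, it can feed straight back as input to the next layer's matrix multiply, matching the note above that the processed output vector can serve as an activation input to the arithmetic circuit 1103.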
控制器1104连接的取指存储器(instruction fetch buffer)1109,用于存储控制器1104使用的指令;The instruction fetch buffer 1109 connected to the controller 1104 is used to store the instructions used by the controller 1104;
统一存储器1106,输入存储器1101,权重存储器1102以及取指存储器1109均为On-Chip存储器。外部存储器私有于该NPU硬件架构。The unified memory 1106, input memory 1101, weight memory 1102, and instruction fetch buffer 1109 are all on-chip memories. The external memory is private to this NPU hardware architecture.
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述程序执行的集成电路。The processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above program.
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the device embodiment drawings provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, software program implementation is more often the preferred implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk, or optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, training equipment, or network device, etc.) to execute the methods described in the various embodiments of this application.
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium may be any available medium that a computer can store or a data storage device such as a training device or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).
Claims (26)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410245713.8 | 2024-03-04 | ||
| CN202410245713.8A CN120597956A (en) | 2024-03-04 | 2024-03-04 | A data processing method and device thereof |
Publications (3)
| Publication Number | Publication Date |
|---|---|
| WO2025185502A1 WO2025185502A1 (en) | 2025-09-12 |
| WO2025185502A8 WO2025185502A8 (en) | 2025-10-02 |
| WO2025185502A9 true WO2025185502A9 (en) | 2025-11-20 |
Family
ID=96894625
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2025/079222 Pending WO2025185502A1 (en) | 2024-03-04 | 2025-02-26 | Data processing method and apparatus |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN120597956A (en) |
| WO (1) | WO2025185502A1 (en) |
-
2024
- 2024-03-04 CN CN202410245713.8A patent/CN120597956A/en active Pending
-
2025
- 2025-02-26 WO PCT/CN2025/079222 patent/WO2025185502A1/en active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| CN120597956A (en) | 2025-09-05 |
| WO2025185502A1 (en) | 2025-09-12 |
| WO2025185502A8 (en) | 2025-10-02 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112257858A (en) | Model compression method and device | |
| CN111368993A (en) | A data processing method and related equipment | |
| US20240185086A1 (en) | Model distillation method and related device | |
| CN113259665A (en) | Image processing method and related equipment | |
| CN111105017A (en) | Neural network quantization method and device and electronic equipment | |
| CN115081588A (en) | Neural network parameter quantification method and device | |
| CN114925320B (en) | Data processing method and related device | |
| WO2022088063A1 (en) | Method and apparatus for quantizing neural network model, and method and apparatus for processing data | |
| CN112532251B (en) | A method and device for data processing | |
| CN115022637A | Image encoding method, image decompression method, and apparatus | |
| CN116312489A | Model training method and related device | |
| WO2024179485A1 (en) | Image processing method and related device thereof | |
| WO2024140630A1 (en) | Model training method and related device | |
| WO2024114659A1 (en) | Summary generation method and related device | |
| WO2023185541A1 (en) | Model training method and related device | |
| CN119376686A (en) | A data processing method and related equipment | |
| CN114298289A (en) | Data processing method, data processing equipment and storage medium | |
| WO2024239927A1 (en) | Model training method and related device | |
| CN113065638B (en) | A neural network compression method and related equipment | |
| WO2025185502A9 (en) | Data processing method and apparatus | |
| WO2023231796A1 (en) | Visual task processing method and related device thereof | |
| WO2023045949A1 (en) | Model training method and related device | |
| WO2025185500A9 (en) | Data processing method and apparatus therefor | |
| CN115409150A (en) | A data compression method, a data decompression method, and related equipment | |
| CN114330634A (en) | Neural network processing method and related equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25767242; Country of ref document: EP; Kind code of ref document: A1 |