Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by one of ordinary skill in the art based on the described embodiments of the present disclosure without inventive effort fall within the scope of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises," and the like, means that the element or item preceding the word encompasses the elements or items listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "coupled," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper," "lower," "left," "right," and the like are used merely to indicate relative positional relationships, which may change when the absolute position of the object being described changes. In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits a detailed description of some known functions and known components.
A floating point number (FP) is mainly used to represent a real number and is typically composed of three parts, namely a sign bit, an exponent part (which may also be referred to as a step code part), and a mantissa part. For example, a floating point number V may generally be expressed in the form:

V = (-1)^s × M × 2^E
Here, the sign bit s may be 1 bit and determines whether the floating point number V is negative or positive; M represents the mantissa part, which may comprise a plurality of bits in binary fraction form and defines the precision of the floating point number; and E represents the exponent (also called the step code value), which is used to weight the floating point number, reflects the position of the decimal point in the floating point number V, and defines the value range of the floating point number.
Conventional floating point numbers typically include three formats, namely half-precision floating point (FP16), single-precision floating point (FP32), and double-precision floating point (FP64), whose exponent and mantissa portions have different numbers of bits.
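As a purely illustrative aside (not part of the disclosed hardware), the following Python sketch decodes the three fields of a normalized FP16 value and recombines them according to the expression above; the field widths (1 sign bit, 5 exponent bits, 10 mantissa bits) and the exponent bias of 15 are the standard IEEE 754 half-precision values.

```python
def decode_fp16(bits: int) -> float:
    """Split a 16-bit pattern into sign s, exponent E, and mantissa M (normalized case only),
    then recombine them as V = (-1)**s * M * 2**E."""
    s = (bits >> 15) & 0x1          # 1 sign bit
    e_field = (bits >> 10) & 0x1F   # 5 exponent bits, biased by 15
    m_field = bits & 0x3FF          # 10 mantissa bits
    E = e_field - 15                # remove the exponent bias
    M = 1 + m_field / 2**10         # implicit leading 1 for normalized numbers
    return (-1) ** s * M * 2.0 ** E

print(decode_fp16(0x3C00))  # 0x3C00 is the FP16 bit pattern of 1.0
print(decode_fp16(0xC500))  # 0xC500 is the FP16 bit pattern of -5.0
```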
AI accelerators and the like have been widely used for deep learning model training. For the convolution operations that are common in deep learning models, special optimizations are made in software and hardware design to accelerate computation. For example, various floating point data formats have been developed for optimization in the fields of artificial intelligence and deep learning, such as BF16 (brain floating point, 16-bit width), BF24 (brain floating point, 24-bit width), TF32 (Tensor Float 32, 19-bit width), and the like. These data formats can greatly reduce the computation resources and power consumption required by matrix multiplication or convolution operations. In addition, processors support some conventional floating point types, such as half-precision floating point (FP16, 16-bit width) or single-precision floating point (FP32, 32-bit width), etc.
Low-precision matrix multiplication is increasingly widely used in the training and inference of large AI models because of the huge performance benefits it can bring with an acceptable loss of precision. In GPUs (graphics processing units) or GPGPUs, matrix multiplication is typically performed in hardware by tensor cores. The computing power of a low-precision tensor core is several times that of a high-precision tensor core, so higher computation efficiency is achieved. In addition, the data volume of a low-precision tensor is a fraction of that of a high-precision tensor, so higher data transmission efficiency is achieved. Therefore, the end-to-end efficiency of low-precision tensor computation also increases by roughly the same multiple. For example, the tensor computing power of FP4 may be 2-8 times that of FP8, 4-16 times that of FP16/BF16, or even higher, while the data volume of FP4 is 1/2 that of FP8 and 1/4 that of FP16/BF16.
To reduce the loss of precision, low precision matrix multiplication introduces a Scaling Factor (Scaling Factor) to maximize the numerical expressive power of the low precision tensor.
For the matrix multiplication D = A × B + C, A, B, C, and D are all high-precision tensors. The general matrix multiplication operation using scaling factors for this matrix multiplication can be described as:

D = (α ⊙ A') × (β ⊙ B') + σ ⊙ C'

D → D', γ
Here, ⊙ denotes element-wise multiplication (each scaling factor parameter being broadcast, via an outer product with an all-ones vector, over the group of tensor elements that shares it), ⊗ denotes the outer product, × denotes matrix multiplication, α is the scaling factor of tensor A, β is the scaling factor of tensor B, σ is the scaling factor of tensor C, γ is the scaling factor of tensor D, and the quantized tensor A', quantized tensor B', quantized tensor C', and quantized tensor D' are the low-precision tensors obtained by floating point quantization of tensor A, tensor B, tensor C, and tensor D, respectively. For example, the data size of each parameter is as follows:
A is [m, s×k], B is [s×k, n], C is [m, n], D is [m, n], α is [m, s], β is [s, n], σ is [m, r] or [r, n], and γ is [m, r] or [r, n].
Here, m, n, s, k, and r are positive integers. For example, for any row in tensor A, every k tensor elements in that row share one scaling factor parameter; for any column in tensor B, every k tensor elements in that column share one scaling factor parameter.
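A minimal NumPy sketch of this generic scaled matrix multiplication is given below; it is an illustrative software model only (not the disclosed tensor-core hardware), it assumes r = 1 for σ, and it models the broadcast of each scaling factor parameter over the k tensor elements that share it with np.repeat.

```python
import numpy as np

def scaled_gemm(A_q, alpha, B_q, beta, C_q, sigma):
    """Model of D = (alpha ⊙ A') × (beta ⊙ B') + sigma ⊙ C'.
    A_q: [m, s*k] quantized tensor, alpha: [m, s] (one scaling factor per k-element block);
    B_q: [s*k, n] quantized tensor, beta:  [s, n];
    C_q: [m, n]   quantized tensor, sigma: [m, 1] (r = 1 assumed here)."""
    k = A_q.shape[1] // alpha.shape[1]
    # Broadcast each scaling factor parameter over its group of k consecutive elements
    # (equivalent to an outer product with an all-ones vector of length k).
    alpha_full = np.repeat(alpha, k, axis=1)              # [m, s*k]
    beta_full = np.repeat(beta, k, axis=0)                # [s*k, n]
    return (alpha_full * A_q) @ (beta_full * B_q) + sigma * C_q   # high-precision D

# Tiny example with m=2, s=2, k=3, n=4 (values are arbitrary).
rng = np.random.default_rng(0)
m, s, k, n = 2, 2, 3, 4
A_q = rng.integers(-3, 4, size=(m, s * k)).astype(np.float32)     # stands in for FP4/FP8 data
B_q = rng.integers(-3, 4, size=(s * k, n)).astype(np.float32)
C_q = rng.integers(-3, 4, size=(m, n)).astype(np.float32)
alpha = rng.uniform(0.5, 2.0, size=(m, s)).astype(np.float32)
beta = rng.uniform(0.5, 2.0, size=(s, n)).astype(np.float32)
sigma = rng.uniform(0.5, 2.0, size=(m, 1)).astype(np.float32)
print(scaled_gemm(A_q, alpha, B_q, beta, C_q, sigma).shape)        # (2, 4)
```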
Fig. 1 is a schematic structural diagram of a General Purpose Graphics Processor (GPGPU).
As shown in Fig. 1, the general-purpose graphics processor is essentially an array of programmable multiprocessors. For example, a programmable multiprocessor may be a streaming processor cluster (Streaming Processor Cluster, SPC), such as the streaming processor cluster 1 shown in Fig. 1. In a general-purpose graphics processor, one streaming processor cluster may process one computing task, or a plurality of streaming processor clusters may process one computing task, and data is shared among the plurality of streaming processor clusters through a global cache or a global memory.
As shown in Fig. 1, taking the streaming processor cluster 1 as an example, this streaming processor cluster includes a plurality of computing units, for example, computing unit 1, computing unit 2, ..., computing unit N in Fig. 1, where N is a positive integer. One computing unit includes a plurality of cores (also referred to as compute cores, not shown in Fig. 1), each of which includes an arithmetic logic unit (ALU), a floating point computing unit, etc., for performing a specific computing task.
As shown in Fig. 1, each computing unit is further provided with a tensor core (Tensor Core) for performing tensor-related computations, for example, matrix multiplication via a GEMM operator. Tensors are very important data structures in deep learning and are the high-dimensional generalization of scalars, vectors, and matrices; tensor operations are routine operations in the training and inference of models such as deep learning models, and tensor cores can accelerate matrix multiplication operations. The tensor cores in a plurality of computing units can be uniformly scheduled and controlled.
As shown in Fig. 1, each computing unit is also provided with a vector computing core (Vector Core Engine). The vector computing core is used to perform vector-related computations, such as vector-related arithmetic logic operations, e.g., accumulation, reduction, and conventional addition, subtraction, multiplication, and division.
The computing unit further comprises a register file (not shown), a shared memory and tensor data storage unit (Tensor Core Memory) for storing source and destination data associated with the computing task. The shared memory in a computing unit is used to share data between the cores of the computing unit. The tensor data storage unit is a storage resource closely related to the tensor core, and is used for storing intermediate data when the tensor core performs tensor operation (such as matrix multiplication), and may perform data format processing on tensor data to be subjected to tensor operation, so that data loaded from the outside meets the data format requirement of the tensor core.
In parallel computing, computing tasks are typically performed by multiple threads. Before execution in a general-purpose graphics processor (also referred to as a parallel computing processor), these threads are divided into thread blocks, and the thread blocks are then distributed to individual computing units via a thread block distribution module (not shown in Fig. 1). All threads in a thread block must be allocated to the same computing unit for execution. At the same time, the thread block is split into minimum execution units called thread bundles (warps), each of which contains a fixed number of threads (or fewer than the fixed number), e.g., 32 threads. Multiple thread blocks may be executed in the same computing unit or in different computing units.
In each computing unit, a thread bundle scheduling/dispatching module (not shown in Fig. 1) schedules, dispatches, and distributes thread bundles so that the multiple compute cores of the computing unit run the thread bundles. The multiple thread bundles in a thread block may be executed simultaneously or in a time-sharing manner, depending on the number of compute cores in the computing unit. The multiple threads in each thread bundle execute the same instruction. Memory access instructions may be issued to the shared memory in the computing unit, or further to a mid-level cache or a global memory, to perform read/write operations and the like.
FIG. 2 is a schematic diagram of a data flow of a scaling factor in a processor.
As shown in fig. 2, currently, the calculation of the scaling factor for each tensor needs to be performed in a vector calculation core.
For example, as shown in Fig. 2, high-precision tensors A, B, and C (e.g., in BF16, FP16, etc.) loaded into the computing unit from the memory or the like first enter the vector computing core. The scaling factor α of tensor A, the scaling factor β of tensor B, and the scaling factor σ of tensor C are calculated in the vector computing core, floating point quantization is performed on tensor A, tensor B, and tensor C, and they are quantized into low-precision quantized tensors A', B', and C' (e.g., in FP4, FP8, etc.). Then, the scaling factor α, the scaling factor β, the scaling factor σ, and the quantized tensors A', B', and C' are transferred into the shared memory through the tensor data storage unit along the path shown by the solid line in Fig. 2, and then enter the tensor core, where the general matrix multiplication operation using the scaling factors is performed to obtain a high-precision calculation result, namely tensor D.
Then, tensor D enters the vector computing core along the path shown by the dotted line in Fig. 2, the scaling factor γ of tensor D is calculated in the vector computing core, floating point quantization is performed on tensor D to obtain the quantized tensor D', and finally the vector computing core outputs the scaling factor γ of tensor D and the quantized tensor D'.
As shown in Fig. 2, currently the vector computing core is required to complete the calculation of the scaling factors and the floating point quantization of the tensors, while the vector computing core is also required to execute other vector-related operators; calculating the scaling factors therefore preempts the computing resources of the other vector-related operators, degrading performance. In addition, the path for transmitting the scaling factors to the tensor core can block the transmission of other tensor data; for example, it occupies the cache of the shared memory and must contend with other data for the transmission bandwidth on the path, so the overall throughput is reduced and the delay is increased.
At least one embodiment of the present disclosure provides a scaling factor processing module configured to determine a scaling factor of each tensor related to a matrix multiplication operation in a processor, cache the scaling factors, and quantize each tensor from a first floating point format to a second floating point format, where the scaling factor is used to scale the data expression range when the corresponding tensor is converted from the first floating point format to the second floating point format, and the floating point precision of the first floating point format is higher than the floating point precision of the second floating point format. The processor includes a computing unit and a memory, the computing unit includes a tensor core configured to perform the matrix multiplication operation using the scaling factors, at least one storage module is disposed on a data path between the tensor core and the memory, the at least one storage module is exclusively used by the tensor core when the tensor core performs a tensor operation, and the scaling factor processing module is disposed on the at least one storage module.
In at least one embodiment, an additional hardware module, namely a scaling factor processing module, is arranged in a storage module of the computing unit to improve the transmission efficiency of the scaling factors, reduce delay caused by insufficient cache capacity, relieve the computing pressure on the vector computing core, improve the efficiency of other vector operators, and allow the computing power of the vector computing core to be reduced to save hardware area, thereby improving the overall execution efficiency of low-precision matrix multiplication using scaling factors.
Embodiments of the present disclosure will be described in detail below with reference to the attached drawings, but the present disclosure is not limited to these specific embodiments. Fig. 3 is a schematic block diagram of a processor provided in at least one embodiment of the present disclosure.
For example, as shown in Fig. 3, the processor 100 includes a computing unit 110 and a memory 120, the computing unit 110 includes a tensor core 102, and the tensor core 102 is configured to perform a matrix multiplication operation using scaling factors. As previously described, the matrix multiplication operation using scaling factors may be expressed as D = (α ⊙ A') × (β ⊙ B') + σ ⊙ C' or D = (α ⊙ A') × (β ⊙ B').
The description of the computing unit 110, the memory 120, and the tensor core 102 may refer to the description of fig. 1, and will not be repeated here.
The computing unit further comprises a scaling factor processing module 101 for determining the scaling factor of each tensor related to the matrix multiplication operation in the processor 100, for caching the scaling factors of the respective tensors, and for performing floating point quantization on the respective tensors to quantize them from the first floating point format to the second floating point format.
The scaling factor is used to scale the expression range of the data when the corresponding tensor is converted from the first floating point format to the second floating point format, and the scaling factor may extend the expression capability of the low precision data.
For example, the floating point accuracy of the first floating point format is higher than the floating point accuracy of the second floating point format. For example, the bit width of the first floating point format is greater than the bit width of the second floating point format, e.g., the first floating point format is BF16 and the second floating point format is FP4.
In dequantization, A' × α ≈ A, that is, A' multiplied by the scaling factor α is approximately equal to the original high-precision tensor A.
As shown in fig. 3, at least one storage module 103 is disposed on a data path between the tensor core 102 and the memory 120, and the at least one storage module 103 is located inside the computing unit 110. It should be noted that fig. 3 shows one memory module, but those skilled in the art will appreciate that a plurality of memory modules may be further disposed on the data path, and the description thereof will not be repeated here.
The at least one storage module 103 is exclusively used by the tensor core 102 when the tensor core 102 performs a tensor-related operation, and the scaling factor processing module 101 is disposed on the at least one storage module 103. For example, the tensor-related operation may be any tensor-related operation, such as a matrix multiplication operation.
For example, the storage module 103 may be a storage module associated with the tensor core 102, such as a storage module that the tensor core 102 can directly use.
For example, the storage module 103 may comprise a storage module dedicated to the tensor core 102, such as a tensor data storage unit dedicated to the preprocessing of data associated with the tensor core 102.
For example, the storage module 103 may not be dedicated to the tensor core 102, but may be exclusive to the tensor core 102 when the tensor core 102 performs tensor operations, i.e., may not be available to other computing modules (e.g., vector computing cores, etc.) when the tensor core 102 performs tensor-related operations. For example, the storage module 103 includes a shared memory, which is a storage area shared by all threads in the computing unit, and is dedicated to the tensor core when the tensor core is performing the tensor-related operation, and the vector computing core cannot use the shared memory.
The description of the tensor data storage unit and the shared memory may refer to the related description of fig. 1, and will not be repeated here.
Fig. 4 is a schematic diagram of an internal structure of a scaling factor processing module according to at least one embodiment of the present disclosure.
As shown in Fig. 4, the scaling factor processing module includes an arithmetic logic unit (Arithmetic Logic Unit, ALU) 1011 and a buffer 1012.
The arithmetic logic unit 1011 is configured to receive each tensor in the first floating point number format, perform a calculation operation of a scaling factor of each tensor, and perform floating point number quantization on each tensor to obtain a quantized tensor after floating point number quantization corresponding to each tensor, where the quantized tensor is in the second floating point number format.
The buffer 1012 is configured to buffer the scaling factors of the individual tensors.
As shown in Fig. 4, the arithmetic logic unit 1011 receives the high-precision tensors in the first floating point format, that is, the first tensor A, the second tensor B, and the third tensor C, performs the calculation of the scaling factors of the first tensor A, the second tensor B, and the third tensor C to obtain the scaling factor α of the first tensor A, the scaling factor β of the second tensor B, and the scaling factor σ of the third tensor C, and caches the scaling factors α, β, and σ in the buffer 1012.
In addition, the arithmetic logic unit 1011 performs floating point quantization on each tensor according to the scaling factor to obtain a first quantized tensor a ', a second quantized tensor B ', and a third quantized tensor C '.
The scaling factor α, scaling factor β, and scaling factor σ, together with the first quantized tensor A', the second quantized tensor B', and the third quantized tensor C', are transmitted to the tensor core 102, and the matrix multiplication operation using the scaling factors is performed by the tensor core 102, for example described as D = (α ⊙ A') × (β ⊙ B') + σ ⊙ C' or D = (α ⊙ A') × (β ⊙ B').
After performing the matrix multiplication operation, the tensor core 102 transmits the calculation result of the matrix multiplication operation using the scaling factor, that is, the fourth tensor D, to the arithmetic logic unit. The fourth tensor D is a high-precision tensor.
For example, the fourth tensor D may have the same precision as the first tensor A, the second tensor B, and the third tensor C, or a different precision. That is, the fourth tensor D may correspond to a pair of first and second floating point formats different from those of the first tensor and so on. For example, in one embodiment, the floating point format of the first tensor A is FP16, the floating point format of the first quantized tensor A' is FP4, the floating point format of the fourth tensor D is BF16, and the floating point format of the fourth quantized tensor D' is FP8; in another embodiment, the floating point formats of the first tensor A, the fourth tensor D, etc. are FP16, and the floating point formats of the first quantized tensor A', the fourth quantized tensor D', etc. are FP4. This can be set by those skilled in the art as needed, and the present disclosure is not particularly limited in this regard.
The arithmetic logic unit 1011 performs floating point number quantization on the fourth tensor D to obtain a fourth quantized tensor D ', and the fourth quantized tensor D' is a low-precision tensor. The scaling factor γ of the fourth tensor D is calculated and stored in the buffer 1012.
Of course, in other embodiments, the fourth tensor D obtained by the tensor core calculation may be output to subsequent other modules without performing the related operations by the scaling factor processing module. That is, the floating point quantization and scaling factor calculation for the fourth tensor D are optional; when they are not needed, the fourth tensor D is directly transferred to the other computing unit or the memory through the transmission path between the tensor core and the memory.
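To make the division of work between the arithmetic logic unit 1011 and the buffer 1012 concrete, the following Python sketch models the data flow only; it is a simplified software stand-in (per-tensor rather than per-block scaling factors, and FP4's q_max = 6 assumed), not a description of the hardware implementation.

```python
import numpy as np

class ScalingFactorProcessingModuleModel:
    """Behavioral model: the ALU computes scaling factors and quantizes tensors,
    and the buffer caches the scaling factors for later use by the tensor core."""

    def __init__(self, q_max=6.0, t=1e-12):
        self.q_max, self.t = q_max, t
        self.buffer = {}                                  # models buffer 1012

    def _scaling_factor(self, T):
        # Simplified: one scaling factor for the whole tensor (the text uses per-block factors).
        return (np.abs(T).max() + self.t) * (1.0 / self.q_max)

    def quantize_inputs(self, A, B, C):
        """ALU path for the first, second, and third tensors: first floating point format in,
        quantized tensors out, scaling factors cached in the buffer."""
        out = []
        for name, T in (("alpha", A), ("beta", B), ("sigma", C)):
            S = self._scaling_factor(T)
            self.buffer[name] = S                         # cache the scaling factor
            out.append(np.clip(T / S, -self.q_max, self.q_max))   # stand-in for FP4/FP8 conversion
        return out                                        # A', B', C' sent on to the tensor core

    def quantize_output(self, D):
        """Optional handling of the fourth tensor D returned by the tensor core."""
        S = self._scaling_factor(D)
        self.buffer["gamma"] = S
        return np.clip(D / S, -self.q_max, self.q_max)    # D'

mod = ScalingFactorProcessingModuleModel()
A = np.array([[1.0, -4.0]]); B = np.array([[2.0], [0.5]]); C = np.array([[3.0]])
A_q, B_q, C_q = mod.quantize_inputs(A, B, C)
print(mod.buffer)                                         # cached alpha, beta, sigma
```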
In at least one embodiment of the present disclosure, by providing an arithmetic logic unit dedicated to performing the scaling factor calculation and the floating point quantization, the related calculation operations of the vector computing core are transferred to the arithmetic logic unit provided on the storage module in the computing unit, occupation of the computing resources of the vector computing core by scaling factor calculation is avoided, the efficiency of other vector operators is improved, and the computing power of the vector computing core can be reduced to save hardware area.
In addition, by setting additional buffer areas to buffer the scaling factors of each tensor, for example, the buffer areas can be dedicated to buffer the scaling factors, so that occupation of originally limited storage resources of the storage module on a transmission path is avoided, delay caused by insufficient buffer capacity is reduced, transmission efficiency of the scaling factors is improved, and overall throughput rate and overall execution efficiency of low-precision matrix multiplication operation using the scaling factors by tensor cores are improved.
For example, for any one of the tensors in the first floating-point number format received by the scaling factor processing module 101, any one of the tensors is divided into a plurality of tensor blocks, each tensor block including a plurality of tensor elements, each tensor block corresponding to a scaling factor parameter, the plurality of tensor elements sharing the scaling factor parameter.
For example, taking tensor A in the first floating point format as an example, the shape of A is [m, s×k], and the shape of the scaling factor of tensor A is [m, s]. For example, the s×k elements of each row in A are divided into s groups, and each group of k consecutive elements is taken as a tensor block; that is, each tensor block includes k tensor elements, and these k tensor elements correspond to one scaling factor parameter.
For example, the arithmetic logic unit 1011 receiving each tensor and performing the calculation of the scaling factor of each tensor includes the following operations: for each tensor block, determining an addition result of adding the maximum of the absolute values of the plurality of tensor elements to a preset floating point number; determining a multiplication result of multiplying the addition result by a preset constant, where the preset constant is the quotient of 1 and the maximum value of the precision expression range of the second floating point format; and converting the multiplication result to the precision specified for the scaling factor of the tensor, thereby obtaining the scaling factor parameter corresponding to the tensor block.
For example, the above operations are specifically described by taking one tensor block in the tensor A described above as an example. For example, the tensor block X includes the 1st to k-th tensor elements in a certain row of tensor A, i.e., X = [X0, X1, ..., Xk-1], where X0, X1, ..., Xk-1 represent the k tensor elements, respectively.
The addition result is first determined with reference to the following formula:

X_max = max(abs(X)) + t

where X_max represents the addition result, abs() represents the absolute value function, max() represents the maximum value function, and t is a very small preset floating point number (e.g., t = 1e-12) used to prevent a division-by-zero scenario; even when max(abs(X)) equals 0, X_max equals t rather than 0.
The scaling factor parameter S corresponding to the tensor block X is then determined with reference to the following formula:

S = sType(X_max × (1/q_max))

where q_max is known in advance, 1/q_max is a pre-calculated preset constant, q_max represents the maximum value of the precision expression range of the second floating point format (e.g., for FP4, q_max may be 6), and sType() represents converting X_max × (1/q_max) to the precision specified for the scaling factor, e.g., the precision of the scaling factor is specified as FP8. The precision of the scaling factor is not required to be the same as that of the second floating point format; for example, the precision of the result of X_max × (1/q_max) may be higher than that of the scaling factor.
For example, the scaling factor parameter corresponding to each tensor block is obtained with reference to the above process, so as to obtain the scaling factor of the first tensor A; the repeated description is omitted.
The process of determining the scaling factors of the second tensor B, the third tensor C, and the fourth tensor D is similar to the process of determining the scaling factor of the first tensor A, and is not repeated here.
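A minimal NumPy sketch of the per-block scaling factor computation described above is given below (illustrative only; the function name is hypothetical, FP4's q_max = 6 and t = 1e-12 are taken from the examples in the text, and float16 is used merely as a stand-in for an FP8-like sType()).

```python
import numpy as np

def block_scaling_factors(T, k, q_max=6.0, t=1e-12):
    """Compute one scaling factor parameter per block of k consecutive elements in each row.
    T: high-precision tensor of shape [m, s*k]; returns scaling factors of shape [m, s]."""
    m, sk = T.shape
    blocks = T.reshape(m, sk // k, k)            # group every k consecutive elements
    X_max = np.abs(blocks).max(axis=2) + t       # X_max = max(abs(X)) + t, avoids dividing by zero
    S = X_max * (1.0 / q_max)                    # multiply by the preset constant 1/q_max
    return S.astype(np.float16)                  # sType(): convert to the scaling-factor precision

A = np.array([[0.5, -2.0, 1.0, 0.0, 0.0, 3.0]], dtype=np.float32)   # m=1, s=2, k=3
print(block_scaling_factors(A, k=3))             # approximately [[0.3333, 0.5]]
```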
After obtaining the scaling factors, the arithmetic logic unit performs floating point quantization on each tensor to obtain the quantized tensor corresponding to each tensor, which includes the following operations: determining a division lookup table based on the scaling factor parameters; determining the quotients of the plurality of tensor elements and the scaling factor parameter based on the division lookup table; and quantizing the quotients into the second floating point format to obtain the quantized tensor elements in the second floating point format respectively corresponding to the tensor elements.
For example, the floating point quantization may be performed with reference to the following equation:

Xq = qType(X × (1/S))

where Xq represents the quantized tensor block corresponding to the tensor block X, which includes k quantized tensor elements in the second floating point format in one-to-one correspondence with the k tensor elements in the tensor block X, (1/S) represents the reciprocal of the scaling factor parameter determined based on the division lookup table, and qType() represents quantizing the parameter to the target precision of the matrix multiplication operation, i.e., the second floating point format.
For example, floating point quantization is performed on each tensor block in the first tensor A with reference to the above process, so as to obtain the first quantized tensor A'; the repeated description is omitted.
The process of floating point quantization of the second tensor B, the third tensor C, and the fourth tensor D is similar to that of the first tensor A, and will not be repeated here.
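A matching sketch of the floating point quantization step follows (again illustrative: NumPy has no FP4 type, so qType() is modeled by clipping to the example FP4 range [-6, 6], and the division lookup table is modeled simply by multiplying with the precomputed reciprocal 1/S).

```python
import numpy as np

def quantize_block(X, S, q_max=6.0):
    """Xq = qType(X * (1/S)): scale each element by the reciprocal of its block's
    scaling factor parameter, then map to the low-precision target range."""
    inv_S = 1.0 / S                              # reciprocal, e.g. read from a division lookup table
    return np.clip(X * inv_S, -q_max, q_max)     # stand-in for conversion to the second FP format

X = np.array([0.5, -2.0, 1.0], dtype=np.float32) # one tensor block, k = 3
S = np.float32((2.0 + 1e-12) / 6.0)              # its scaling factor parameter (see sketch above)
Xq = quantize_block(X, S)
print(Xq)                                        # quantized block Xq
print(Xq * S)                                    # dequantization: Xq * S ≈ X
```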
For example, in some embodiments, arithmetic logic unit 1011 and cache 1012 are located on the same memory module 103. For example, both the arithmetic logic unit 1011 and the buffer 1012 are provided on a shared memory, or both the arithmetic logic unit 1011 and the buffer 1012 are provided on a tensor data storage unit.
For example, in other embodiments, the arithmetic logic unit and the buffer are provided on different memory modules.
For example, the matrix multiplication operation using the scaling factors includes: performing matrix multiplication of the first tensor A and the second tensor B in combination with the scaling factor α of the first tensor A and the scaling factor β of the second tensor B; and determining the result of adding the result of the matrix multiplication and the third tensor C in combination with the scaling factor σ of the third tensor C, to obtain the fourth tensor D as the operation result of the matrix multiplication operation.
For example, in some embodiments, the buffers include a first buffer configured to buffer a scaling factor α of a first tensor a, a scaling factor β of a second tensor B, and a scaling factor σ of a third tensor C, and a second buffer configured to buffer a scaling factor γ of a fourth tensor D.
For example, the at least one memory module includes a first memory module and a second memory module, the first memory module being closer to the tensor kernel than the second memory module.
For example, in some embodiments, the first storage module is a shared memory and the second storage module is a tensor data storage unit. For example, in other embodiments, the first storage module is a tensor data storage unit and the second storage module is a shared memory. The first storage module and the second storage module may be determined according to a hardware architecture, which is not particularly limited by the present disclosure.
For example, the second buffer and the first buffer may be provided on different memory modules, and the second buffer and the arithmetic logic unit are provided on the same memory module. For example, the arithmetic logic unit and the second buffer are disposed on the second memory module, and the first buffer is disposed on the first memory module.
Fig. 5 is a schematic structural diagram of a computing unit provided in an embodiment of the present disclosure.
As shown in fig. 5, a plurality of storage modules are disposed on a data path between the memory 120 and the tensor core 102, where the storage modules include a first storage module and a second storage module, and the first storage module and the second storage module are located inside the computing unit and are exclusive to the tensor core when the tensor core performs a tensor operation, and the first storage module is closer to the tensor core 102. Reference may be made to the foregoing for the first storage module and the second storage module, and the description thereof will not be repeated here.
As shown in fig. 5, the second buffer area and the arithmetic logic unit are disposed on the second memory module, and the first buffer area is disposed on the first memory module.
As shown in fig. 5, the first tensor a, the second tensor B, and the third tensor C are high-precision tensors, such as BF16, FP16, etc., which may be from a memory or other computing unit or a previous output of a current computing unit, etc., which is not particularly limited by the present disclosure.
The first tensor A, the second tensor B and the third tensor C enter an arithmetic logic unit, the arithmetic logic unit determines the scaling factor alpha of the first tensor A, the scaling factor beta of the second tensor B and the scaling factor sigma of the third tensor C, and quantizes the first tensor A, the second tensor B and the third tensor C to obtain a first quantized tensor A ', a second quantized tensor B ' and a third quantized tensor C '. The specific process of determining the scaling factor and the floating point number quantization may refer to the foregoing, and will not be described herein.
The first quantized tensor A', the second quantized tensor B', and the third quantized tensor C', which are obtained by floating point quantization of the first tensor A, the second tensor B, and the third tensor C, respectively, are then transmitted by the arithmetic logic unit through the data path between the second storage module and the first storage module and cached in the first storage module. The quantized tensors A', B', and C' are low-precision tensors, for example in FP4 or FP8 format.
Therefore, the first quantized tensor A', the second quantized tensor B', and the third quantized tensor C' are still stored in the shared memory, while the scaling factor α of the first tensor A, the scaling factor β of the second tensor B, and the scaling factor σ of the third tensor C are cached in the first buffer; the capacity of the added buffer and the hardware area it brings are thus kept small, the data transmission efficiency can be improved, and the delay caused by insufficient cache capacity is reduced.
Thereafter, the scaling factor α of the first tensor A, the scaling factor β of the second tensor B, and the scaling factor σ of the third tensor C, together with the first quantized tensor A', the second quantized tensor B', and the third quantized tensor C', are transferred via the data path between the first storage module and the tensor core into the tensor core for the matrix multiplication operation using the scaling factors, e.g., performing D = (α ⊙ A') × (β ⊙ B') + σ ⊙ C' or D = (α ⊙ A') × (β ⊙ B'); the present disclosure is not limited to a particular implementation of the matrix multiplication operation using the scaling factors.
The tensor kernel obtains a fourth tensor D after performing the matrix multiplication operation using the scaling factor. As shown in fig. 5, the tensor core transmits the fourth tensor D to the first memory module via a data path between the first memory module and the tensor core, and to the arithmetic logic unit of the second memory module via a data path between the first memory module and the second memory module.
The arithmetic logic unit is further configured to determine a scaling factor gamma of the fourth tensor according to the fourth tensor D, buffer the scaling factor gamma in the second buffer, and perform floating point number quantization on the fourth tensor D to obtain a fourth quantized tensor D' in a second floating point number format after the floating point number quantization corresponding to the fourth tensor. The fourth quantized tensor D' is also in a low precision format, such as FP4 or FP8 format, etc.
The fourth quantized tensor D' and the scaling factor γ of the fourth tensor are then transmitted to other relevant modules, for example, to other computing units, to the memory, or back into the current computing unit for the next round of operation.
In an embodiment not shown in fig. 5, after determining the fourth tensor D, the tensor core directly outputs the fourth tensor via the first storage module and the second storage module, without performing scaling factor calculation and floating point number quantization of the fourth tensor D, where the second buffer is still disposed in the second storage module.
For example, in other embodiments, quantization of the fourth tensor is not required, and the second buffer may not be set at this time, and the fourth tensor D may be directly transmitted to other related modules via the first storage module and the second storage module, which is not described herein.
For example, in other embodiments, the matrix multiplication operation may not perform the addition with the third tensor C and may only include calculating A×B. The specific process is similar to the matrix multiplication operation described above, except that the floating point quantization and scaling factor calculation for the third tensor C need not be performed, and is not described here again.
In at least one embodiment of the present disclosure, an additional hardware module, that is, a scaling factor processing module, is set in the computing unit to improve the transmission efficiency of the scaling factor, reduce the delay caused by insufficient buffer capacity, relieve the computing pressure of the vector computing core, improve the efficiency of other vector operators, reduce the computing power of the vector computing core to save the hardware area, thereby improving the overall execution efficiency of low-precision matrix multiplication using the scaling factor.
In at least one embodiment of the present disclosure, the processor may be a processor of any architecture, such as a graphics processor, tensor processor, data processor, or the like. A schematic structure of a graphics processor provided in at least one embodiment of the present disclosure is described below using a graphics processor as an example.
Fig. 6 is a schematic block diagram of a graphics processor provided in at least one embodiment of the present disclosure. As shown in fig. 6, the graphics processor 200 includes a plurality of streaming processor clusters, each of which includes a plurality of computing units, and memory. The description of the streaming processor cluster, the computing unit and the memory may refer to the description of fig. 1, and will not be repeated here.
As shown in fig. 6, each computing unit comprises a tensor core, a storage module, which may comprise at least one of a shared memory, a tensor data processing unit, for example.
As shown in Fig. 6, a scaling factor processing module 101 is further disposed on the storage module of each computing unit. The scaling factor processing module 101 is configured to determine the scaling factor of each tensor associated with a matrix multiplication operation in the processor, cache the scaling factors, and quantize each tensor from the first floating point format to the second floating point format. The scaling factor is used to scale the expression range of the data when the corresponding tensor is converted from the first floating point format to the second floating point format, and the floating point precision of the first floating point format is higher than the floating point precision of the second floating point format. For a more detailed description of the scaling factor processing module 101, reference may be made to the related description of the scaling factor processing module 101 above, and the repetition is omitted.
For example, the arithmetic logic unit and the buffer are arranged on the same memory module, for example, both on a shared memory, or both on the tensor data processing unit.
For example, the arithmetic logic unit and the buffer are provided on different memory modules.
For example, the buffer comprises a first buffer configured to buffer the scaling factor of the first tensor, the scaling factor of the second tensor, and the scaling factor of the third tensor, and a second buffer configured to buffer the scaling factor of the fourth tensor.
The at least one memory module includes a first memory module and a second memory module, the first memory module being closer to the tensor kernel than the second memory module. The arithmetic logic unit and the second buffer area are arranged on the second storage module, and the first buffer area is arranged on the first storage module.
For example, in one embodiment, the first storage module is a shared memory, the second storage module is a tensor data processing unit, where the first buffer is disposed on the shared memory, and the second buffer and the arithmetic logic unit are disposed on the tensor data processing unit. For example, in another embodiment, the first storage module is a tensor data processing unit, and the second storage module is a shared memory, where the first buffer is disposed on the tensor data processing unit, and the second buffer and the arithmetic logic unit are disposed on the shared memory.
For more details of the scaling factor processing module and interactions with other units in the graphics processor, reference may be made to the foregoing description of the scaling factor processing module, and details are not repeated.
In at least one embodiment, by setting an additional hardware module, namely a scaling factor processing module, in a computing unit of the graphics processor, the transmission efficiency of the scaling factor is improved, delay caused by insufficient buffer capacity is reduced, computing pressure of a vector computing core is relieved, efficiency of other vector operators is improved, computing power of the vector computing core is reduced, and hardware area is saved, so that overall execution efficiency of low-precision matrix multiplication using the scaling factor is improved.
Further, it should be noted that the components of the graphics processor 200 shown in FIG. 6 are exemplary only and not limiting, and that the graphics processor 200 may have other components as desired for practical applications.
Fig. 7 is a schematic diagram of an electronic device according to at least one embodiment of the present disclosure.
For example, as shown in fig. 7, the electronic device 300 includes a processor 200. For example, the processor 200 may be implemented using the architecture shown in FIG. 6. For example, the electronic device 300 may be any electronic device including computing functionality, such as a notebook computer, tablet computer, desktop computer, web server, etc., to which embodiments of the present disclosure are not limited.
For example, the electronic device may also include a central processing unit (Central Processing Unit, CPU), other forms of processing units having data processing and/or instruction execution capabilities such as a digital signal processor (DSP), storage units, and the like, and may also have an operating system, an application programming interface (e.g., OpenGL (Open Graphics Library), Metal, etc.), and the like installed thereon. For example, the electronic device may further include an output component, such as a display component, e.g., a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (OLED) display, or a quantum dot light-emitting diode (QLED) display, to which embodiments of the present disclosure are not limited.
It should be noted that, for clarity and brevity, not all of the constituent elements of the electronic device 300 are provided in the embodiments of the present disclosure. Other constituent elements not shown may be provided, set up, etc. as required by the specific needs of those skilled in the art in order to achieve the necessary functions of the electronic device 300, and the embodiments of the present disclosure are not limited thereto.
Referring now to Fig. 8, there is illustrated a specific structural diagram of an electronic device 300 (e.g., a terminal device or server) that includes a processor of the embodiments of the present disclosure and is suitable for implementing the embodiments of the present disclosure.
The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. For example, the electronic device may be in the form of a server for deep learning and artificial intelligence, scientific computing, graphic rendering and video editing, virtual reality and game development, cloud service, and other various application scenarios, for example, the electronic device may be a dedicated server for data center, cloud computing, and the like deployed with tasks such as deep learning training, large-scale data analysis, high-performance computing, and the like.
The electronic device shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 8, the electronic device 300 may include a processing means 301, the processing means 301 including, for example, the aforementioned processor 200, which may perform various suitable actions and processes in accordance with non-transitory computer readable instructions stored in a memory to implement various functions. The processing means 301 may also comprise a Central Processing Unit (CPU), tensor Processor (TPU) or the like having instruction optimization capabilities and/or program execution capabilities. The Central Processing Unit (CPU) may be an X86, ARM, RISC-V architecture, or the like. The GPU may be integrated directly into the SOC, directly onto the motherboard, or built into the north bridge chip of the motherboard.
As shown in fig. 8, for example, the memory may comprise any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random Access Memory (RAM) 303 and/or cache memory (cache) or the like, and computer readable instructions may be loaded from storage 308 into Random Access Memory (RAM) 303 to execute the computer readable instructions. The non-volatile memory may include, for example, read-only memory (ROM) 302, a hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like. Various applications and various data, such as style images, and various data used and/or generated by the applications, may also be stored in the computer readable storage medium.
For example, a processing device 301, a Read Only Memory (ROM) 302, and a Random Access Memory (RAM) 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
In general, the following devices may be connected to the input/output (I/O) interface 305: input devices 306 such as a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 307 including a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 308 including magnetic tape, hard disk, flash memory, etc.; and communication devices 309. The communication devices 309 may allow the electronic device 300 to communicate wirelessly or by wire with other electronic devices to exchange data. While Fig. 8 shows the electronic device 300 with various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided; the electronic device 300 may alternatively be implemented or provided with more or fewer devices. For example, the processing device 301 may control other components in the electronic device 300 to perform desired functions.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, fiber optic cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the clients, servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol ), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the internet (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be included in the electronic device or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to obtain at least two internet protocol addresses, send a node evaluation request including the at least two internet protocol addresses to a node evaluation device, wherein the node evaluation device selects an internet protocol address from the at least two internet protocol addresses and returns the internet protocol address, receive the internet protocol address returned by the node evaluation device, wherein the obtained internet protocol address indicates an edge node in a content distribution network.
Or the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to receive a node evaluation request comprising at least two internet protocol addresses, select an internet protocol address from the at least two internet protocol addresses, and return the selected internet protocol address, wherein the received internet protocol address indicates an edge node in the content distribution network.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in this disclosure is not limited to the specific combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by substituting the above features with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.
For the purposes of this disclosure, the following points are also noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures related to the embodiments of the present disclosure, and other structures may refer to the general design.
(2) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely a specific embodiment of the disclosure, but the scope of the disclosure is not limited thereto and should be determined by the scope of the claims.