
CN120803395B - Processor and electronic device - Google Patents

Processor and electronic device

Info

Publication number
CN120803395B
CN120803395B (application CN202511299546.6A)
Authority
CN
China
Prior art keywords
tensor
scaling factor
storage module
floating
quantized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202511299546.6A
Other languages
Chinese (zh)
Other versions
CN120803395A (en)
Inventor
Name withheld upon request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202511299546.6A priority Critical patent/CN120803395B/en
Publication of CN120803395A publication Critical patent/CN120803395A/en
Application granted granted Critical
Publication of CN120803395B publication Critical patent/CN120803395B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Complex Calculations (AREA)
  • Image Processing (AREA)

Abstract


A processor and electronic device are disclosed. The processor includes a computing unit and memory. The computing unit includes a tensor core configured to perform matrix multiplication operations using scaling factors. The computing unit also includes a scaling factor processing module configured to determine and cache scaling factors for each tensor related to the matrix multiplication operation, and to quantize each tensor from a first floating-point format to a second floating-point format. The computing unit further includes at least one storage module disposed on the data path between the tensor core and memory. This storage module is exclusively used by the tensor core when it performs tensor-related operations. The scaling factor processing module is disposed on the at least one storage module. At present, scaling factor computation and floating-point quantization are performed by the vector computing core, leading to performance degradation and increased latency. The scaling factor processing module provided by this invention, disposed on a storage module within the computing unit, can improve the overall execution efficiency of low-precision matrix multiplication using scaling factors.

Description

Processor and electronic device
Technical Field
Embodiments of the present disclosure relate to a processor and an electronic device.
Background
Floating-point number quantization refers to converting a high-precision floating-point format (e.g., FP16, BF16, FP32, etc.) into a low-precision floating-point format (e.g., FP4, FP8, etc.). Low-precision floating-point formats can greatly improve computational efficiency, especially for the General Matrix Multiplication (GEMM) operations required by Artificial Intelligence (AI) models. However, low-precision matrix multiplication incurs a loss of precision, degrading the prediction and generation quality of the AI model. Therefore, in the low-precision computation of an AI model, a corresponding Scaling Factor is generally calculated from the current data during floating-point quantization to scale the expression range of the data, thereby reducing the precision loss of the calculation result.
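As an illustrative aside (not part of the patent text), the following Python/NumPy sketch shows the general quantize-then-dequantize idea, assuming np.float16 as a stand-in for the low-precision format and Q_MAX = 6.0 in analogy to the FP4 example given later in the description:

```python
import numpy as np

# Sketch of scaling-factor quantization; float16 is only a stand-in
# for the "second, low-precision floating point format" of the text.
Q_MAX = 6.0   # assumed largest magnitude of the low-precision range

def quantize_block(x_fp32: np.ndarray):
    """Quantize one tensor block; returns (low-precision block, scaling factor)."""
    t = 1e-12                           # tiny constant preventing division by zero
    x_max = np.max(np.abs(x_fp32)) + t  # largest magnitude in the block
    scale = x_max / Q_MAX               # scaling factor parameter
    xq = (x_fp32 / scale).astype(np.float16)  # values now fit in [-Q_MAX, Q_MAX]
    return xq, scale

def dequantize_block(xq: np.ndarray, scale: float) -> np.ndarray:
    """Dequantize: the quantized block times its scaling factor approximates the original."""
    return xq.astype(np.float32) * scale

x = (np.random.randn(16) * 100.0).astype(np.float32)
xq, s = quantize_block(x)
print("max abs error:", np.max(np.abs(x - dequantize_block(xq, s))))
```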
At present, scaling factor computation and floating-point quantization are completed by the vector computing core, which must also execute other vector-related operators; computing scaling factors preempts the computing resources of those operators, degrading performance. In addition, the paths that transmit the scaling factors to the tensor cores block the transmission of other tensor data: for example, they occupy the cache of the shared memory, and the transmission bandwidth on these paths must be contended for with other data, reducing overall throughput and increasing latency.
Disclosure of Invention
The present invention provides a processor, wherein the processor includes a computing unit and a memory, the computing unit includes a tensor core configured to perform a matrix multiplication operation using a scaling factor, the computing unit further includes a scaling factor processing module configured to determine and buffer scaling factors of respective tensors related to the matrix multiplication operation, and quantize the respective tensors from a first floating point format to a second floating point format, the scaling factors are used to scale an expression range of data when the corresponding tensors are converted from the first floating point format to the second floating point format, a floating point precision of the first floating point format is higher than a floating point precision of the second floating point format, the computing unit further includes at least one storage module disposed on a data path between the tensor core and the memory, the at least one storage module is exclusively occupied by the tensor core when the tensor core performs tensor-related operations, and the scaling factor processing module is disposed on the at least one storage module.
For example, in the processor provided by the present application, the scaling factor processing module includes an arithmetic logic unit and a buffer, where the arithmetic logic unit is configured to receive each tensor in the first floating point format, perform a calculation operation of a scaling factor of each tensor, and perform floating point number quantization on each tensor to obtain a quantized tensor after the floating point number quantization corresponding to each tensor, where the quantized tensor is in the second floating point format, and the buffer is configured to buffer the scaling factor of each tensor.
For example, in the processor provided by the application, for any one tensor in the first floating point format received by the scaling factor processing module, that tensor is divided into a plurality of tensor blocks, each tensor block comprises a plurality of tensor elements, each tensor block corresponds to one scaling factor parameter, and the plurality of tensor elements share that scaling factor parameter; when the arithmetic logic unit receives each tensor and executes the calculation operation of the scaling factor of each tensor, the operation comprises the following steps: for each tensor block, determining the summation result of the maximum value of the absolute values of the plurality of tensor elements and a preset floating point number; determining the product result of the summation result and a preset constant, wherein the preset constant is the quotient of 1 and the maximum value of the precision expression range of the second floating point format; and converting the product result into the precision specified for the scaling factor of the tensor, to obtain the scaling factor parameter corresponding to the tensor block.
For example, in the processor provided by the application, when the arithmetic logic unit performs floating point number quantization on each tensor to obtain the quantized tensors corresponding to each tensor, the operation includes: for each tensor block, determining a division lookup table based on the scaling factor parameter; respectively determining the quotients of the tensor elements and the scaling factor parameter based on the division lookup table; and quantizing the quotients into the second floating point format to obtain quantized tensor elements respectively corresponding to the tensor elements.
For example, in the processor provided by the present application, the arithmetic logic unit and the buffer area are disposed on the same memory module.
For example, in the processor provided by the present application, the matrix multiplication operation using the scaling factor includes performing matrix multiplication of a first tensor and a second tensor in combination with the scaling factor of the first tensor and the scaling factor of the second tensor, and determining, in combination with the scaling factor of a third tensor, a summation result of the matrix multiplication result and the third tensor, to obtain a fourth tensor as the result of the matrix multiplication operation, where the buffer includes a first buffer and a second buffer, the first buffer is configured to buffer the scaling factor of the first tensor, the scaling factor of the second tensor, and the scaling factor of the third tensor, the second buffer is configured to buffer the scaling factor of the fourth tensor, the at least one storage module includes a first storage module and a second storage module, the first storage module is closer to the tensor core than the second storage module, the arithmetic logic unit and the second buffer are disposed on the second storage module, and the first buffer is disposed on the first storage module.
For example, in the processor provided by the present application, the scaling factor of the first tensor, the scaling factor of the second tensor, and the scaling factor of the third tensor calculated by the arithmetic logic unit are transmitted through a data path between the second storage module and the first storage module and buffered in the first buffer; the first quantized tensor, the second quantized tensor, and the third quantized tensor obtained by the arithmetic logic unit through floating point number quantization of the first tensor, the second tensor, and the third tensor, respectively, are transmitted through the same data path and buffered in the first storage module; and the scaling factors and the quantized tensors are transmitted to the tensor core through a data path between the first storage module and the tensor core, so that the tensor core performs the matrix multiplication operation using the scaling factors.
For example, in the processor provided by the present application, the fourth tensor is transmitted to the first storage module through a data path between the first storage module and the tensor core, and is transmitted to the arithmetic logic unit of the second storage module through a data path between the first storage module and the second storage module, where the arithmetic logic unit is further configured to determine, according to the fourth tensor, a scaling factor of the fourth tensor and buffer the scaling factor in the second buffer, and perform the floating point quantization on the fourth tensor, to obtain a fourth quantized tensor in the second floating point format after the floating point quantization corresponding to the fourth tensor.
For example, in the processor provided by the present application, the matrix multiplication operation using the scaling factor includes performing matrix multiplication of a first tensor and a second tensor by combining the scaling factor of the first tensor and the scaling factor of the second tensor, and determining a summation result of an operation result of the matrix multiplication and the third tensor by combining the scaling factor of the third tensor, to obtain a fourth tensor as an operation result of the matrix multiplication operation, where the fourth tensor is directly transmitted to other computing units or the memory via a transmission path between the tensor core and the memory.
For example, in the processor provided by the present application, the at least one storage module includes at least one of a tensor core data processing unit in the computing unit and a shared memory, where the tensor core data processing unit is dedicated to relevant data preprocessing of the tensor core, and the shared memory is a storage area shared by all threads in the computing unit.
The present application provides an electronic device comprising a processor as described in an embodiment of the present application.
In at least one embodiment, by setting an additional hardware module, namely a scaling factor processing module, in the computing unit, the transmission efficiency of the scaling factor is improved, delay caused by insufficient buffer capacity is reduced, computing pressure of a vector computing core is relieved, efficiency of other vector operators is improved, computing power of the vector computing core is reduced to save hardware area, and therefore overall execution efficiency of low-precision matrix multiplication using the scaling factor is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
FIG. 1 is a schematic block diagram of a General Purpose Graphics Processor (GPGPU);
FIG. 2 is a schematic diagram of a data flow of a scaling factor in a processor;
FIG. 3 is a schematic block diagram of a processor provided in accordance with at least one embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating an internal structure of a scaling factor processing module according to at least one embodiment of the present disclosure;
FIG. 5 is a schematic block diagram of a computing unit provided by an embodiment of the present disclosure;
FIG. 6 is a schematic block diagram of a graphics processor provided in accordance with at least one embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of an electronic device provided in an embodiment of the present disclosure;
Fig. 8 is a schematic structural diagram of another electronic device according to at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed. In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits a detailed description of some known functions and known components.
Floating Point (FP) numbers are mainly used to represent fractions and are typically composed of three parts: a sign bit, an exponent part (which may also be referred to as the step code part), and a mantissa part. For example, a floating point number V may generally be expressed in the form:

V = (-1)^s × M × 2^E

The sign bit s may be 1 bit and determines whether the floating point number V is negative or positive; M represents the mantissa part, which may comprise a plurality of bits in binary fraction form and defines the precision of the floating point number; E represents the exponent (also called the step code value), which weights the floating point number, reflects the position of the binary point in V, and determines the value range of the floating point number.
Conventional floating point numbers typically include three formats, namely, half-precision floating point number (FP 16), single-precision floating point number (FP 32), and double-precision floating point number (FP 64), with exponent and mantissa portions having different numbers of bits.
AI accelerators and the like have been widely used for deep learning model training. Common operations in deep learning models, such as convolution, are specially optimized in software and hardware design to accelerate computation. For example, various floating point data formats have been developed for artificial intelligence and deep learning, such as BF16 (brain floating point, 16-bit width), BF24 (brain floating point, 24-bit width), and TF32 (Tensor Float 32, 19-bit width); these data formats can greatly reduce the computing resources and power consumption required by computation, especially matrix multiplication and convolution operations. In addition, processors support some conventional floating point types, such as half-precision floating point (FP16, 16-bit width) or single-precision floating point (FP32, 32-bit width).
Low-precision matrix multiplication is increasingly widely used in the training and inference of large AI models because of the huge performance benefits it can bring at an acceptable loss of precision. In GPUs (graphics processors) or GPGPUs, matrix multiplication is typically performed in hardware by tensor cores. The compute throughput of a tensor core on low-precision tensors is several times that on high-precision tensors, giving higher computational efficiency. In addition, the data volume of a low-precision tensor is a fraction of that of a high-precision tensor, giving higher data transmission efficiency. Therefore, the end-to-end efficiency of low-precision tensor computation also increases almost proportionally. For example, the tensor compute throughput of FP4 may be 2-8 times that of FP8, 4-16 times that of FP16/BF16, or even higher, while its data volume is 1/2 that of FP8 and 1/4 that of FP16/BF16.
To reduce the loss of precision, low precision matrix multiplication introduces a Scaling Factor (Scaling Factor) to maximize the numerical expressive power of the low precision tensor.
For the matrix multiplication D = A × B + C, A, B, C, and D are all high-precision tensors. The general matrix multiplication operation using scaling factors for this matrix multiplication can be described as:

D = (expand(α) ⊙ A′) × (expand(β) ⊙ B′) + expand(σ) ⊙ C′,  D → D′, γ

where expand(·) broadcasts each scaling factor parameter (by outer product ⊗ with an all-ones vector) across the tensor elements that share it, ⊙ denotes element-wise multiplication, × denotes matrix multiplication, α is the scaling factor of tensor A, β is the scaling factor of tensor B, σ is the scaling factor of tensor C, γ is the scaling factor of tensor D, and the quantized tensors A′, B′, C′, and D′ are the low-precision tensors obtained by floating point quantization of tensor A, tensor B, tensor C, and tensor D, respectively. For example, the shape of each parameter is as follows:

A is [m, s×k], B is [s×k, n], C is [m, n], D is [m, n], α is [m, s], β is [s, n], σ is [m, r] or [r, n], γ is [m, r] or [r, n].
Here m, n, s, k, r are positive integers. For example, for any row in tensor A, every k consecutive tensor elements in the row share one scaling factor parameter; for any column in tensor B, every k consecutive tensor elements in the column share one scaling factor parameter.
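For illustration only, the following NumPy sketch (not part of the patented hardware) instantiates these shapes and broadcasts the block-wise scaling factor parameters; the random factors here are placeholders rather than factors computed as described later, and σ and γ handling is omitted:

```python
import numpy as np

m, n, s, k = 4, 8, 3, 2
A = np.random.randn(m, s * k).astype(np.float32)   # shape [m, s*k]
B = np.random.randn(s * k, n).astype(np.float32)   # shape [s*k, n]
C = np.random.randn(m, n).astype(np.float32)       # shape [m, n]

alpha = np.random.rand(m, s).astype(np.float32) + 0.5  # one parameter per row block
beta = np.random.rand(s, n).astype(np.float32) + 0.5   # one parameter per column block

# Broadcast each scaling factor parameter across the k consecutive
# elements that share it (the outer product with an all-ones vector).
alpha_full = np.repeat(alpha, k, axis=1)   # [m, s] -> [m, s*k]
beta_full = np.repeat(beta, k, axis=0)     # [s, n] -> [s*k, n]

# D = (expand(alpha) ⊙ A') × (expand(beta) ⊙ B') + C, with A', B' played by A, B.
D = (alpha_full * A) @ (beta_full * B) + C
assert D.shape == (m, n)
```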
Fig. 1 is a schematic structural diagram of a General Purpose Graphics Processor (GPGPU).
As shown in fig. 1, the general-purpose graphics processor is essentially an array of programmable multiprocessors; for example, a programmable multiprocessor may be a Streaming Processor Cluster (SPC), such as the stream processor cluster 1 shown in fig. 1. In a general-purpose graphics processor, one stream processor cluster processes one computing task, or a plurality of stream processor clusters process one computing task, and data is shared among the plurality of stream processor clusters through a global cache or global memory.
As shown in fig. 1, taking stream processor cluster 1 as an example, one stream processor cluster includes a plurality of computing units, for example computing unit 1, computing unit 2, ..., computing unit N in fig. 1, where N is a positive integer. One computing unit includes a plurality of cores (also referred to as compute cores, not shown in fig. 1), each of which includes an Arithmetic Logic Unit (ALU), a floating point computing unit, etc., for performing specific computing tasks.
As shown in fig. 1, each computing unit is further provided with a Tensor Core for performing tensor-related calculations, for example matrix multiplication by a GEMM operator. Tensors are very important data structures in deep learning, being the high-dimensional generalization of scalars, vectors, and matrices; tensor operations are routine in the training and inference of deep learning models, and tensor cores can accelerate matrix multiplication operations. The tensor cores in a plurality of computing units can be uniformly scheduled and controlled.
As shown in fig. 1, each computing unit is also provided with a vector computing core (Vector Core Engine). Vector computation cores are used to perform vector-dependent computations, such as vector-dependent arithmetic logic operations, e.g., accumulation, reduction, conventional addition, subtraction, multiplication, division, etc.
The computing unit further comprises a register file (not shown), a shared memory and tensor data storage unit (Tensor Core Memory) for storing source and destination data associated with the computing task. The shared memory in a computing unit is used to share data between the cores of the computing unit. The tensor data storage unit is a storage resource closely related to the tensor core, and is used for storing intermediate data when the tensor core performs tensor operation (such as matrix multiplication), and may perform data format processing on tensor data to be subjected to tensor operation, so that data loaded from the outside meets the data format requirement of the tensor core.
In parallel computing, computing tasks are typically performed by multiple threads. Before execution in a general-purpose graphics processor (also referred to as a parallel computing processor), these threads are divided into thread blocks, and the thread blocks are then distributed to individual computing units via a thread block distribution module (not shown in fig. 1). All threads in a thread block must be allocated to the same computing unit for execution. At the same time, each thread block is split into minimum execution units called thread bundles (warps), each of which contains a fixed number of threads (or fewer), e.g., 32 threads. Multiple thread blocks may be executed in the same computing unit or in different computing units.
In each computing unit, a thread bundle scheduling/dispatching module (not shown in fig. 1) schedules, dispatches, and distributes thread bundles so that the multiple compute cores of the computing unit run them. The multiple thread bundles in a thread block may be executed simultaneously or in a time-sharing manner, depending on the number of compute cores in the computing unit. All threads in a thread bundle execute the same instruction. Memory access instructions may be transmitted to the shared memory in the computing unit, or further to a mid-level cache or global memory, to perform read/write operations.
FIG. 2 is a schematic diagram of a data flow of a scaling factor in a processor.
As shown in fig. 2, currently, the calculation of the scaling factor for each tensor needs to be performed in a vector calculation core.
For example, as shown in fig. 2, high-precision tensors A, B, and C (e.g., BF16, FP16, etc.) loaded into the computing unit from memory or elsewhere first enter the vector computing core, where the scaling factor α of tensor A, the scaling factor β of tensor B, and the scaling factor σ of tensor C are calculated and floating point number quantization is performed on tensors A, B, and C, quantizing them into low-precision quantized tensors A′, B′, and C′ (e.g., FP4, FP8, etc.). Then, the scaling factors α, β, σ and the quantized tensors A′, B′, C′ are transferred into the shared memory through the tensor data storage unit along the path shown by the solid line in fig. 2, and then enter the tensor core, where the general matrix multiplication operation using the scaling factors is performed to obtain a high-precision calculation result, namely tensor D.
Then, the tensor D enters the vector computing core along the path shown by the dotted line in fig. 2; the scaling factor γ of tensor D is calculated in the vector computing core, floating point number quantization is performed on D to obtain the quantized tensor D′, and finally the vector computing core outputs the scaling factor γ of tensor D and the quantized tensor D′.
As shown in fig. 2, currently, a vector calculation core is required to complete the calculation of the scaling factor and the floating point number quantization of the tensor, and the vector calculation core is also required to execute other operators related to the vector, and calculating the scaling factor can preempt the calculation resources of the other operators related to the vector, so that the performance is reduced. In addition, the paths for transmitting the scaling factors to the tensor cores can block the transmission of other tensor data, for example, the paths occupy the cache of the shared memory, and the transmission bandwidth on the paths needs to be contended with other data, so that the overall throughput rate is reduced, and the delay is increased.
At least one embodiment of the present disclosure provides a scaling factor processing module configured to determine and cache the scaling factor of each tensor related to a matrix multiplication operation in a processor, and to quantize each tensor from a first floating point format to a second floating point format, where the scaling factor is used to scale the expression range of data when the corresponding tensor is converted from the first floating point format to the second floating point format, and the floating point precision of the first floating point format is higher than that of the second floating point format. The processor includes a computing unit and a memory, the computing unit includes a tensor core configured to perform the matrix multiplication operation using the scaling factor, at least one storage module is disposed on a data path between the tensor core and the memory, the at least one storage module is exclusively occupied by the tensor core when the tensor core performs tensor-related operations, and the scaling factor processing module is disposed on the at least one storage module.
In at least one embodiment, an additional hardware module, namely a scaling factor processing module, is arranged in a storage module of the computing unit to improve the transmission efficiency of the scaling factor, reduce delay caused by insufficient buffer capacity, relieve the computing pressure of a vector computing core, improve the efficiency of other vector operators, reduce the computing power of the vector computing core and save the hardware area, thereby improving the overall execution efficiency of low-precision matrix multiplication using the scaling factor.
Embodiments of the present disclosure will be described in detail below with reference to the attached drawings, but the present disclosure is not limited to these specific embodiments. Fig. 3 is a schematic block diagram of a processor provided in at least one embodiment of the present disclosure.
For example, as shown in fig. 3, the processor 100 includes a computing unit 110 and a memory 120, the computing unit 110 including a tensor core 102 configured to perform a matrix multiplication operation using a scaling factor. As previously described, the matrix multiplication operation using the scaling factor may be expressed as D = (expand(α) ⊙ A′) × (expand(β) ⊙ B′) + expand(σ) ⊙ C′, or as the variant without the expand(σ) ⊙ C′ term.
The description of the computing unit 110, the memory 120, and the tensor core 102 may refer to the description of fig. 1, and will not be repeated here.
The calculation unit further comprises a scaling factor processing module 101 for determining scaling factors of the respective tensors related to the matrix multiplication operation in the processor 100, and for buffering the scaling factors of the respective tensors, and furthermore for performing floating point number quantization on the respective tensors, and for quantizing the respective tensors from the first floating point number format to the second floating point number format.
The scaling factor is used to scale the expression range of the data when the corresponding tensor is converted from the first floating point format to the second floating point format, and the scaling factor may extend the expression capability of the low precision data.
For example, the floating point accuracy of the first floating point format is higher than the floating point accuracy of the second floating point format. For example, the bit width of the first floating point format is greater than the bit width of the second floating point format, e.g., the first floating point format is BF16 and the second floating point format is FP4.
In dequantization, A′ × α ≈ A; that is, A′ multiplied by the scaling factor α is approximately equal to the original high-precision tensor A.
As shown in fig. 3, at least one storage module 103 is disposed on a data path between the tensor core 102 and the memory 120, and the at least one storage module 103 is located inside the computing unit 110. It should be noted that fig. 3 shows one memory module, but those skilled in the art will appreciate that a plurality of memory modules may be further disposed on the data path, and the description thereof will not be repeated here.
The at least one memory module 103 is exclusively occupied by the tensor core 102 when the tensor core 102 performs tensor-related operations, and the scaling factor processing module 101 is disposed on the at least one memory module 103. For example, the tensor-related operation may be any operation related to tensors, such as a matrix multiplication operation.
For example, the storage module 103 is associated with the tensor core 102, e.g., a storage module that the tensor core 102 can use directly.
For example, the storage module 103 may comprise a storage module dedicated to the tensor core 102, such as a tensor data storage unit dedicated to the preprocessing of data associated with the tensor core 102.
For example, the storage module 103 may not be dedicated to the tensor core 102, but may be exclusive to the tensor core 102 when the tensor core 102 performs tensor operations, i.e., may not be available to other computing modules (e.g., vector computing cores, etc.) when the tensor core 102 performs tensor-related operations. For example, the storage module 103 includes a shared memory, which is a storage area shared by all threads in the computing unit, and is dedicated to the tensor core when the tensor core is performing the tensor-related operation, and the vector computing core cannot use the shared memory.
The description of the tensor data storage unit and the shared memory may refer to the related description of fig. 1, and will not be repeated here.
Fig. 4 is a schematic diagram of an internal structure of a scaling factor processing module according to at least one embodiment of the present disclosure.
As shown in fig. 4, the scaling factor processing module includes an Arithmetic Logic Unit (ALU) 1011 and a buffer 1012.
The arithmetic logic unit 1011 is configured to receive each tensor in the first floating point number format, perform a calculation operation of a scaling factor of each tensor, and perform floating point number quantization on each tensor to obtain a quantized tensor after floating point number quantization corresponding to each tensor, where the quantized tensor is in the second floating point number format.
The buffer 1012 is configured to buffer the scaling factors of the individual tensors.
As shown in fig. 4, the arithmetic logic unit 1011 receives the high-precision tensors in the first floating-point number format, that is, the first tensor a, the second tensor B, and the third tensor C, performs the calculation operation of the scaling factors of the first tensor a, the second tensor B, and the third tensor C, obtains the scaling factor α of the first tensor a, the scaling factor β of the second tensor B, and the scaling factor σ of the third tensor C, and buffers the scaling factors α, β, and σ to the buffer 1012.
In addition, the arithmetic logic unit 1011 performs floating point quantization on each tensor according to the scaling factor to obtain a first quantized tensor a ', a second quantized tensor B ', and a third quantized tensor C '.
The scaling factor α, scaling factor β, and scaling factor σ and the first quantized tensor A′, the second quantized tensor B′, and the third quantized tensor C′ are transmitted to the tensor core 102, and a matrix multiplication operation using the scaling factors is performed by the tensor core 102, for example D = (expand(α) ⊙ A′) × (expand(β) ⊙ B′) + expand(σ) ⊙ C′, or the variant without the expand(σ) ⊙ C′ term.
After performing the matrix multiplication operation, the tensor core 102 transmits the calculation result of the matrix multiplication operation using the scaling factor, that is, the fourth tensor D, to the arithmetic logic unit. The fourth tensor D is a high-precision tensor.
For example, the fourth tensor D may have the same precision as the first tensor A, the second tensor B, and the third tensor C, or a different precision. That is, the fourth tensor D may correspond to a set of first and second floating point formats different from that of the first tensor, etc. For example, in one embodiment, the floating point format of the first tensor A is FP16, the floating point format of the first quantized tensor A′ is FP4, the floating point format of the fourth tensor D is BF16, and the floating point format of the fourth quantized tensor D′ is FP8; in another embodiment, the floating point formats of the first tensor A, the fourth tensor D, etc. are all FP16, and the floating point formats of the first quantized tensor A′, the fourth quantized tensor D′, etc. are all FP4. This can be set by those skilled in the art as needed, and the disclosure is not particularly limited in this respect.
The arithmetic logic unit 1011 performs floating point number quantization on the fourth tensor D to obtain a fourth quantized tensor D ', and the fourth quantized tensor D' is a low-precision tensor. The scaling factor γ of the fourth tensor D is calculated and stored in the buffer 1012.
Of course, in other embodiments, the fourth tensor D obtained by the tensor kernel calculation may be output to the subsequent other modules, without performing the correlation operation by using the scaling factor processing module. That is, the floating point number quantization and scaling factor calculation for the fourth tensor D is optional, and the fourth tensor D is directly transferred to the other computing unit or the memory through the transmission path between the tensor core and the memory when the floating point number quantization and scaling factor calculation for the fourth tensor D are not needed.
In at least one embodiment of the present disclosure, by providing an arithmetic logic unit dedicated to performing the scaling factor calculation and floating point number quantization, the related calculation operations are transferred from the vector computing core to the arithmetic logic unit disposed on the storage module in the computing unit, occupation of the vector computing core's computing resources by scaling factor calculation is avoided, the efficiency of other vector operators is improved, and the compute capability of the vector computing core can be reduced to save hardware area.
In addition, by setting additional buffer areas to buffer the scaling factors of each tensor, for example, the buffer areas can be dedicated to buffer the scaling factors, so that occupation of originally limited storage resources of the storage module on a transmission path is avoided, delay caused by insufficient buffer capacity is reduced, transmission efficiency of the scaling factors is improved, and overall throughput rate and overall execution efficiency of low-precision matrix multiplication operation using the scaling factors by tensor cores are improved.
For example, for any one of the tensors in the first floating-point number format received by the scaling factor processing module 101, any one of the tensors is divided into a plurality of tensor blocks, each tensor block including a plurality of tensor elements, each tensor block corresponding to a scaling factor parameter, the plurality of tensor elements sharing the scaling factor parameter.
For example, taking tensor A in the first floating point format as an example, the shape of A is [m, s×k], and the shape of the scaling factor of tensor A is [m, s]. For example, the s×k elements of each row in A are divided into s groups, and each group of k consecutive elements is taken as a tensor block; i.e., each tensor block includes k tensor elements, and these k tensor elements correspond to one scaling factor parameter.
For example, when the arithmetic logic unit 1011 receives each tensor and performs the calculation operation of the scaling factor of each tensor, the operation includes: for each tensor block, determining the summation result of the maximum value of the absolute values of the plurality of tensor elements and a preset floating point number; determining the product result of the summation result and a preset constant, where the preset constant is the quotient of 1 and the maximum value of the precision expression range of the second floating point format; and converting the product result to the precision specified for the scaling factor of the tensor, thereby obtaining the scaling factor parameter corresponding to the tensor block.
For example, the above operation is specifically described taking one tensor block in the tensor A described above as an example. For example, the tensor block X includes the 1st to k-th tensor elements in a certain row of tensor A, i.e., X = [X_0, X_1, ..., X_{k-1}], where X_0, X_1, ..., X_{k-1} represent the k tensor elements.
The summation result is first determined with reference to the following formula:

X_max = max(abs(X)) + t

where X_max represents the summation result, abs() is the absolute value function, max() is the maximum function, and t is a minimal floating point number (e.g., t = 1e-12) used to prevent a division-by-zero scenario (equivalently, t may be added only when max(abs(X)) equals 0).

The scaling factor parameter S corresponding to tensor block X is then determined with reference to the following formula:

S = sType(X_max × (1/Q_max))

where Q_max is known in advance, 1/Q_max is the pre-calculated preset constant, Q_max represents the maximum value of the precision expression range of the second floating point format (e.g., for FP4, Q_max may be 6), and sType() converts X_max × (1/Q_max) to the specified precision of the scaling factor (e.g., the precision of the scaling factor may be specified as FP8).
The precision of the scaling factor is not required to be the same as that of the second floating point format; for example, the precision of the result of X_max × (1/Q_max) may be higher than that of the scaling factor.
For example, the scaling factor parameter corresponding to each tensor block is obtained with reference to the above process, so as to obtain the scaling factor of the first tensor A; the repeated details are omitted here.
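A minimal NumPy sketch of this per-block computation follows (an illustration only; float16 is an assumed stand-in for the scaling factor precision sType, and Q_max = 6 follows the FP4 example above):

```python
import numpy as np

def block_scaling_factors(A: np.ndarray, k: int, q_max: float = 6.0,
                          t: float = 1e-12) -> np.ndarray:
    """Per-block scaling factor parameters for an [m, s*k] tensor.

    Implements S = sType(X_max * (1/q_max)) with X_max = max(abs(X)) + t
    for each block X of k consecutive row elements; sType is modeled
    here as a cast to float16."""
    m, sk = A.shape
    blocks = A.reshape(m, sk // k, k)              # split each row into blocks of k
    x_max = np.max(np.abs(blocks), axis=2) + t     # [m, s] summation results
    inv_q_max = 1.0 / q_max                        # pre-calculated preset constant
    return (x_max * inv_q_max).astype(np.float16)  # scaling factor precision

A = np.random.randn(4, 6).astype(np.float32)
S = block_scaling_factors(A, k=2)   # shape [4, 3]
print(S.shape, S.dtype)
```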
The process of determining the scaling factors of the second tensor B, the third tensor C, and the fourth tensor D is similar to that of the first tensor A and will not be repeated here.
After the scaling factors are obtained, the arithmetic logic unit performs floating point number quantization on each tensor to obtain the quantized tensors corresponding to each tensor, which includes the following operations: determining a division lookup table based on the scaling factor parameter; determining the quotients of the plurality of tensor elements and the scaling factor parameter based on the division lookup table; and quantizing the quotients into the second floating point format to obtain quantized tensor elements in the second floating point format respectively corresponding to the tensor elements.
For example, floating point number quantization may be performed with reference to the following formula:

Xq = qType(X × LUT(1/S))

where Xq represents the quantized tensor block corresponding to tensor block X, which includes k quantized tensor elements in the second floating point format in one-to-one correspondence with the k tensor elements in tensor block X; LUT(1/S) represents the reciprocal of the scaling factor parameter obtained from the division lookup table determined based on the scaling factor parameter, so that the quotient X/S is evaluated as a multiplication; and qType() quantizes the result to the target precision of the matrix multiplication operation, i.e., the second floating point format.
For example, floating point number quantization is performed on each tensor block in the first tensor A with reference to the above process, so as to obtain the first quantized tensor A′; the repeated details are omitted here.
The process of floating point number quantization of the second tensor B, the third tensor C, and the fourth tensor D is similar to that of the first tensor A and will not be repeated here.
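The quotient-as-multiplication step can be sketched as follows (illustrative only; a precomputed reciprocal models the division lookup table, and a float16 cast models qType):

```python
import numpy as np

def quantize_blocks(A: np.ndarray, S: np.ndarray, k: int) -> np.ndarray:
    """Quantize each k-element block of A by its scaling factor parameter.

    The hardware division lookup table is modeled by pre-computing the
    reciprocal 1/S, so the quotient X / S becomes a multiplication;
    qType() is modeled as a cast to float16 (an assumed stand-in for
    the second floating point format)."""
    m, sk = A.shape
    blocks = A.reshape(m, sk // k, k)
    recip = 1.0 / S.astype(np.float32)   # "division lookup table" output, [m, s]
    return (blocks * recip[:, :, None]).astype(np.float16).reshape(m, sk)

A = np.random.randn(4, 6).astype(np.float32)
S = (np.max(np.abs(A.reshape(4, 3, 2)), axis=2) + 1e-12) / 6.0  # per-block scales
Aq = quantize_blocks(A, S, k=2)
print(Aq.shape, Aq.dtype)
```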
For example, in some embodiments, arithmetic logic unit 1011 and cache 1012 are located on the same memory module 103. For example, both the arithmetic logic unit 1011 and the buffer 1012 are provided on a shared memory, or both the arithmetic logic unit 1011 and the buffer 1012 are provided on a tensor data storage unit.
For example, in other embodiments, the arithmetic logic unit and the buffer are provided on different memory modules.
For example, the matrix multiplication operation using the scaling factors includes performing matrix multiplication of the first tensor A and the second tensor B in combination with the scaling factor α of the first tensor A and the scaling factor β of the second tensor B, and determining the summation result of the matrix multiplication result and the third tensor C in combination with the scaling factor σ of the third tensor C, resulting in the fourth tensor D as the result of the matrix multiplication operation.
For example, in some embodiments, the buffers include a first buffer configured to buffer the scaling factor α of the first tensor A, the scaling factor β of the second tensor B, and the scaling factor σ of the third tensor C, and a second buffer configured to buffer the scaling factor γ of the fourth tensor D.
For example, the at least one memory module includes a first memory module and a second memory module, the first memory module being closer to the tensor kernel than the second memory module.
For example, in some embodiments, the first storage module is a shared memory and the second storage module is a tensor data storage unit. For example, in other embodiments, the first storage module is a tensor data storage unit and the second storage module is a shared memory. The first storage module and the second storage module may be determined according to a hardware architecture, which is not particularly limited by the present disclosure.
For example, the second buffer and the first buffer may be provided on different memory modules, and the second buffer and the arithmetic logic unit are provided on the same memory module. For example, the arithmetic logic unit and the second buffer are disposed on the second memory module, and the first buffer is disposed on the first memory module.
Fig. 5 is a schematic structural diagram of a computing unit provided in an embodiment of the present disclosure.
As shown in fig. 5, a plurality of storage modules are disposed on a data path between the memory 120 and the tensor core 102, where the storage modules include a first storage module and a second storage module, and the first storage module and the second storage module are located inside the computing unit and are exclusive to the tensor core when the tensor core performs a tensor operation, and the first storage module is closer to the tensor core 102. Reference may be made to the foregoing for the first storage module and the second storage module, and the description thereof will not be repeated here.
As shown in fig. 5, the second buffer area and the arithmetic logic unit are disposed on the second memory module, and the first buffer area is disposed on the first memory module.
As shown in fig. 5, the first tensor a, the second tensor B, and the third tensor C are high-precision tensors, such as BF16, FP16, etc., which may be from a memory or other computing unit or a previous output of a current computing unit, etc., which is not particularly limited by the present disclosure.
The first tensor A, the second tensor B and the third tensor C enter an arithmetic logic unit, the arithmetic logic unit determines the scaling factor alpha of the first tensor A, the scaling factor beta of the second tensor B and the scaling factor sigma of the third tensor C, and quantizes the first tensor A, the second tensor B and the third tensor C to obtain a first quantized tensor A ', a second quantized tensor B ' and a third quantized tensor C '. The specific process of determining the scaling factor and the floating point number quantization may refer to the foregoing, and will not be described herein.
The arithmetic logic unit also obtains, through floating point number quantization, the first quantized tensor A′, the second quantized tensor B′, and the third quantized tensor C′ respectively corresponding to the first tensor A, the second tensor B, and the third tensor C, and these quantized tensors are transmitted through the data path between the second storage module and the first storage module and buffered in the first storage module. The floating point quantized tensors A′, B′, and C′ are low-precision tensors, for example in FP4 or FP8 format.
In this way, the first quantized tensor A′, the second quantized tensor B′, and the third quantized tensor C′ are still stored in the first storage module (e.g., the shared memory), and only the scaling factor α of the first tensor A, the scaling factor β of the second tensor B, and the scaling factor σ of the third tensor C are cached in the first buffer, so the capacity of the added buffer and the resulting hardware area are kept small, data transmission efficiency can be improved, and delay caused by insufficient buffer capacity is reduced.
Thereafter, the scaling factor α of the first tensor A, the scaling factor β of the second tensor B, and the scaling factor σ of the third tensor C, together with the first quantized tensor A′, the second quantized tensor B′, and the third quantized tensor C′, are transferred via the data path between the first storage module and the tensor core into the tensor core for the matrix multiplication operation using the scaling factors, for example performing D = (expand(α) ⊙ A′) × (expand(β) ⊙ B′) + expand(σ) ⊙ C′ or alternatively the variant without the expand(σ) ⊙ C′ term; the present disclosure is not limited to a particular implementation of the matrix multiplication operation using the scaling factor.
The tensor kernel obtains a fourth tensor D after performing the matrix multiplication operation using the scaling factor. As shown in fig. 5, the tensor core transmits the fourth tensor D to the first memory module via a data path between the first memory module and the tensor core, and to the arithmetic logic unit of the second memory module via a data path between the first memory module and the second memory module.
The arithmetic logic unit is further configured to determine a scaling factor gamma of the fourth tensor according to the fourth tensor D, buffer the scaling factor gamma in the second buffer, and perform floating point number quantization on the fourth tensor D to obtain a fourth quantized tensor D' in a second floating point number format after the floating point number quantization corresponding to the fourth tensor. The fourth quantized tensor D' is also in a low precision format, such as FP4 or FP8 format, etc.
The fourth quantized tensor D′ and the scaling factor γ of the fourth tensor are then transmitted to other relevant modules, for example to other computing units or memory, or they reenter the current computing unit for the next round of operation.
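To tie the pieces of this data flow together, the following software emulation (an illustrative sketch only) walks the fig. 5 path: the "ALU" quantizes A, B, and C and produces their scaling factors, the "tensor core" multiplies the low-precision tensors with the factors applied, and the "ALU" then derives γ and D′ from the result. Using float16 for the second floating point format, Q_max = 6, r = 1 for σ and γ, and storing β transposed are all assumptions made for the sketch:

```python
import numpy as np

Q_MAX, T = 6.0, 1e-12   # assumed low-precision range maximum; anti-div-by-zero term

def quant(X, k):
    """Quantize X block-wise; returns (quantized tensor, scaling factors)."""
    m, sk = X.shape
    blocks = X.reshape(m, sk // k, k)
    S = (np.max(np.abs(blocks), axis=2) + T) / Q_MAX      # scaling factor parameters
    Xq = (blocks / S[:, :, None]).astype(np.float16)      # low-precision elements
    return Xq.reshape(m, sk), S

m, n, s, k = 4, 8, 3, 2
A = np.random.randn(m, s * k).astype(np.float32)
B = np.random.randn(s * k, n).astype(np.float32)
C = np.random.randn(m, n).astype(np.float32)

Aq, alpha = quant(A, k)       # "ALU": quantize + scaling factors
Bq, beta = quant(B.T, k)      # per-column blocks of B = per-row blocks of B.T
Cq, sigma = quant(C, n)       # one factor per row of C (r = 1 assumed)

# "Tensor core": matrix multiply with the scaling factors applied.
Af = Aq.astype(np.float32) * np.repeat(alpha, k, axis=1)
Bf = (Bq.astype(np.float32) * np.repeat(beta, k, axis=1)).T
Cf = Cq.astype(np.float32) * sigma             # sigma broadcasts from [m, 1]
D = Af @ Bf + Cf                               # high-precision fourth tensor

Dq, gamma = quant(D, n)                        # "ALU": gamma and D'
print(D.shape, Dq.dtype, gamma.shape)
```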
In an embodiment not shown in fig. 5, after determining the fourth tensor D, the tensor core directly outputs the fourth tensor via the first storage module and the second storage module, without performing scaling factor calculation and floating point number quantization of the fourth tensor D, where the second buffer is still disposed in the second storage module.
For example, in other embodiments, quantization of the fourth tensor is not required, and the second buffer may not be set at this time, and the fourth tensor D may be directly transmitted to other related modules via the first storage module and the second storage module, which is not described herein.
For example, in other embodiments, the matrix multiplication operation does not perform the addition with the third tensor C and only includes calculating A×B; the specific process is similar to the matrix multiplication operation described above, except that the floating point number quantization and scaling factor calculation for the third tensor C need not be performed, and the details are not repeated here.
In at least one embodiment of the present disclosure, an additional hardware module, that is, a scaling factor processing module, is set in the computing unit to improve the transmission efficiency of the scaling factor, reduce the delay caused by insufficient buffer capacity, relieve the computing pressure of the vector computing core, improve the efficiency of other vector operators, reduce the computing power of the vector computing core to save the hardware area, thereby improving the overall execution efficiency of low-precision matrix multiplication using the scaling factor.
In at least one embodiment of the present disclosure, the processor may be a processor of any architecture, such as a graphics processor, tensor processor, data processor, or the like. A schematic structure of a graphics processor provided in at least one embodiment of the present disclosure is described below using a graphics processor as an example.
Fig. 6 is a schematic block diagram of a graphics processor provided in at least one embodiment of the present disclosure. As shown in fig. 6, the graphics processor 200 includes a plurality of streaming processor clusters, each of which includes a plurality of computing units, and memory. The description of the streaming processor cluster, the computing unit and the memory may refer to the description of fig. 1, and will not be repeated here.
As shown in fig. 6, each computing unit comprises a tensor core, a storage module, which may comprise at least one of a shared memory, a tensor data processing unit, for example.
As shown in fig. 6, a scaling factor processing module 101 is further disposed on the storage module of each computing unit. The scaling factor processing module 101 is configured to determine and buffer the scaling factors of each tensor associated with a matrix multiplication operation in the processor, and to quantize each tensor from the first floating point format to the second floating point format. The scaling factor is used to scale the expression range of the data when the corresponding tensor is converted from the first floating point format to the second floating point format, and the floating point precision of the first floating point format is higher than that of the second floating point format. For a more detailed description of the scaling factor processing module 101, reference may be made to the related description above; the repeated details are omitted.
For example, the arithmetic logic unit and the buffer are arranged on the same memory module, for example, both on a shared memory, or both on the tensor data processing unit.
For example, the arithmetic logic unit and the buffer are provided on different memory modules.
For example, the buffer comprises a first buffer configured to buffer the scaling factor of the first tensor, the scaling factor of the second tensor, and the scaling factor of the third tensor, and a second buffer configured to buffer the scaling factor of the fourth tensor.
The at least one memory module includes a first memory module and a second memory module, the first memory module being closer to the tensor kernel than the second memory module. The arithmetic logic unit and the second buffer area are arranged on the second storage module, and the first buffer area is arranged on the first storage module.
For example, in one embodiment, the first storage module is a shared memory, the second storage module is a tensor data processing unit, where the first buffer is disposed on the shared memory, and the second buffer and the arithmetic logic unit are disposed on the tensor data processing unit. For example, in another embodiment, the first storage module is a tensor data processing unit, and the second storage module is a shared memory, where the first buffer is disposed on the tensor data processing unit, and the second buffer and the arithmetic logic unit are disposed on the shared memory.
For more details of the scaling factor processing module and its interactions with other units in the graphics processor, reference may be made to the foregoing description, and details are not repeated.
In at least one embodiment, providing an additional hardware module, namely a scaling factor processing module, in a computing unit of the graphics processor improves the transmission efficiency of the scaling factors, reduces delays caused by insufficient buffer capacity, relieves the computing pressure on the vector computing core, improves the efficiency of other vector operators, and allows the computing power of the vector computing core to be reduced to save hardware area, thereby improving the overall execution efficiency of low-precision matrix multiplication using scaling factors.
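As a purely numerical illustration of what a matrix multiplication using scaling factors computes, the sketch below assumes one scaling factor per tensor (the disclosure uses one per tensor block) and uses float32 stand-ins for the quantized operands; each real tensor is approximated as its scaling factor times its quantized counterpart.

import numpy as np

def scaled_matmul(a_q, s_a, b_q, s_b, c_q, s_c):
    """D = A @ B + C, reconstructed from quantized operands with A ~ s_a * a_q,
    B ~ s_b * b_q, C ~ s_c * c_q (per-tensor scales assumed for readability)."""
    return (s_a * s_b) * (a_q @ b_q) + s_c * c_q

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8)).astype(np.float32)
B = rng.standard_normal((8, 4)).astype(np.float32)
C = rng.standard_normal((4, 4)).astype(np.float32)
s_a, s_b, s_c = 0.5, 0.25, 2.0   # arbitrary illustrative scaling factors
D = scaled_matmul(A / s_a, s_a, B / s_b, s_b, C / s_c, s_c)
assert np.allclose(D, A @ B + C, atol=1e-4)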
Further, it should be noted that the components of the graphics processor 200 shown in FIG. 6 are exemplary only and not limiting, and that the graphics processor 200 may have other components as desired for practical applications.
Fig. 7 is a schematic diagram of an electronic device according to at least one embodiment of the present disclosure.
For example, as shown in fig. 7, the electronic device 300 includes a processor 200. For example, the processor 200 may be implemented using the architecture shown in FIG. 6. For example, the electronic device 300 may be any electronic device including computing functionality, such as a notebook computer, tablet computer, desktop computer, web server, etc., to which embodiments of the present disclosure are not limited.
For example, the electronic device may also include a central processing unit (CPU), other forms of processing units having data processing and/or instruction execution capabilities such as a digital signal processor (DSP), storage units, and the like, with an operating system and application programming interfaces (e.g., OpenGL (Open Graphics Library), Metal, etc.) also installed thereon. For example, the electronic device may further include an output component, such as a display component, for example a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or a quantum dot light-emitting diode (QLED) display, to which embodiments of the present disclosure are not limited.
It should be noted that, for clarity and brevity, not all constituent elements of the electronic device 300 are described in the embodiments of the present disclosure. To realize the necessary functions of the electronic device 300, those skilled in the art may provide or configure other constituent elements not shown according to specific needs, and the embodiments of the present disclosure are not limited in this respect.
Referring now to fig. 8, a schematic structural diagram is shown of an electronic device 300 (e.g., a terminal device or a server) suitable for implementing a processor according to embodiments of the present disclosure.
The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as stationary terminals such as digital TVs and desktop computers. For example, the electronic device may take the form of a server for various application scenarios such as deep learning and artificial intelligence, scientific computing, graphics rendering and video editing, virtual reality and game development, and cloud services; for instance, it may be a dedicated server deployed in a data center or cloud computing environment for tasks such as deep learning training, large-scale data analysis, and high-performance computing.
The electronic device shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 8, the electronic device 300 may include a processing device 301, which includes, for example, the aforementioned processor 200 and may perform various suitable actions and processes in accordance with non-transitory computer-readable instructions stored in a memory to implement various functions. The processing device 301 may also comprise a central processing unit (CPU), a tensor processor (TPU), or the like having instruction optimization capabilities and/or program execution capabilities. The central processing unit (CPU) may be of an X86, ARM, or RISC-V architecture, or the like. The GPU may be integrated directly into an SoC, mounted directly on the motherboard, or built into the motherboard's north bridge chip.
As shown in fig. 8, for example, the memory may comprise any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) 303 and/or cache memory (cache), and computer-readable instructions may be loaded from a storage device 308 into the random access memory (RAM) 303 for execution. Non-volatile memory may include, for example, read-only memory (ROM) 302, a hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like. Various applications and various data, such as style images and various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
For example, a processing device 301, a Read Only Memory (ROM) 302, and a Random Access Memory (RAM) 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
In general, the following devices may be connected to the input/output (I/O) interface 305: input devices 306 such as a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output devices 307 including a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 308 including a magnetic tape, a hard disk, and flash memory; and communication devices 309. The communication devices 309 may allow the electronic device 300 to communicate wirelessly or by wire with other electronic devices to exchange data. While fig. 8 shows the electronic device 300 with various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided; the electronic device 300 may alternatively implement or provide more or fewer devices. For example, the processing device 301 may control other components in the electronic device 300 to perform desired functions.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, fiber optic cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be included in the electronic device or may exist alone without being incorporated into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: obtain at least two internet protocol addresses; send a node evaluation request including the at least two internet protocol addresses to a node evaluation device, wherein the node evaluation device selects an internet protocol address from the at least two internet protocol addresses and returns it; and receive the internet protocol address returned by the node evaluation device, wherein the obtained internet protocol address indicates an edge node in a content distribution network.
Alternatively, the computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: receive a node evaluation request including at least two internet protocol addresses; select an internet protocol address from the at least two internet protocol addresses; and return the selected internet protocol address, wherein the received internet protocol address indicates an edge node in a content distribution network.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The units involved in the embodiments of the present disclosure may be implemented by software or by hardware. The names of the units do not, in some cases, constitute a limitation on the units themselves.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions in which the above features are replaced with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.
For the purposes of this disclosure, the following points are also noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures related to the embodiments of the present disclosure, and other structures may refer to the general design.
(2) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely a specific embodiment of the disclosure, but the scope of the disclosure is not limited thereto and should be determined by the scope of the claims.

Claims (11)

1. A processor, characterized in that the processor comprises a computing unit and a memory, the computing unit comprising a tensor core configured to perform a matrix multiplication operation using scaling factors; the computing unit further comprises a scaling factor processing module configured to determine and cache the scaling factors of the tensors associated with the matrix multiplication operation, and to quantize each of the tensors from a first floating-point format to a second floating-point format, wherein a scaling factor is used to scale the expression range of the data when the corresponding tensor is converted from the first floating-point format to the second floating-point format, and the floating-point precision of the first floating-point format is higher than the floating-point precision of the second floating-point format; the computing unit further comprises at least one storage module disposed on the data path between the tensor core and the memory, the at least one storage module being exclusively occupied by the tensor core when the tensor core performs tensor-related operations; and the scaling factor processing module is disposed on the at least one storage module.

2. The processor according to claim 1, wherein the scaling factor processing module comprises an arithmetic logic unit and a buffer; the arithmetic logic unit is configured to receive the tensors in the first floating-point format, perform the calculation of the scaling factor of each tensor, and perform floating-point quantization on each tensor to obtain a corresponding quantized tensor, the quantized tensor being in the second floating-point format; and the buffer is configured to cache the scaling factors of the tensors.

3. The processor according to claim 2, wherein, for any tensor in the first floating-point format received by the scaling factor processing module, the tensor is divided into a plurality of tensor blocks, each tensor block comprises a plurality of tensor elements, each tensor block corresponds to one scaling factor parameter, and the plurality of tensor elements share that scaling factor parameter; and when the arithmetic logic unit receives the tensors and performs the calculation of their scaling factors, it performs the following operations for each tensor block: determining the sum of the maximum absolute value among the plurality of tensor elements and a preset floating-point number; determining the product of the sum and a preset constant, wherein the preset constant is the quotient of 1 and the maximum value of the precision expression range of the second floating-point format; and converting the product to the precision specified by the scaling factor of the tensor to obtain the scaling factor parameter corresponding to the tensor block.

4. The processor according to claim 3, wherein, when the arithmetic logic unit performs floating-point quantization on each tensor to obtain the corresponding quantized tensor, it performs the following operations for each tensor block: determining a division lookup table based on the scaling factor parameter; and, based on the division lookup table, determining the quotients of the plurality of tensor elements and the scaling factor parameter, and quantizing the quotients into the second floating-point format to obtain the quantized tensor elements corresponding to the plurality of tensor elements.

5. The processor according to claim 2, wherein the arithmetic logic unit and the buffer are disposed on the same storage module.

6. The processor according to claim 2, wherein the matrix multiplication operation using the scaling factors comprises performing a matrix multiplication of a first tensor and a second tensor in combination with the scaling factor of the first tensor and the scaling factor of the second tensor, and, in combination with the scaling factor of a third tensor, determining the sum of the result of the matrix multiplication and the third tensor to obtain a fourth tensor as the result of the matrix multiplication operation; the buffer comprises a first buffer and a second buffer, the first buffer being configured to cache the scaling factors of the first tensor, the second tensor, and the third tensor, and the second buffer being configured to cache the scaling factor of the fourth tensor; the at least one storage module comprises a first storage module and a second storage module, the first storage module being closer to the tensor core than the second storage module; and the arithmetic logic unit and the second buffer are disposed on the second storage module, and the first buffer is disposed on the first storage module.

7. The processor according to claim 6, wherein the scaling factors of the first tensor, the second tensor, and the third tensor calculated by the arithmetic logic unit are transmitted via the data path between the second storage module and the first storage module and cached in the first buffer; the first quantized tensor, the second quantized tensor, and the third quantized tensor obtained by the arithmetic logic unit through floating-point quantization of the first tensor, the second tensor, and the third tensor, respectively, are transmitted via the data path between the second storage module and the first storage module and cached in the first storage module; and the scaling factors of the first tensor, the second tensor, and the third tensor, together with the first quantized tensor, the second quantized tensor, and the third quantized tensor, are transmitted to the tensor core via the data path between the first storage module and the tensor core to perform the matrix multiplication operation using the scaling factors.

8. The processor according to claim 6, wherein the fourth tensor is transmitted to the first storage module via the data path between the first storage module and the tensor core, and is transmitted to the arithmetic logic unit of the second storage module via the data path between the first storage module and the second storage module; and the arithmetic logic unit is further configured to determine, from the fourth tensor, the scaling factor of the fourth tensor and cache it in the second buffer, and to perform floating-point quantization on the fourth tensor to obtain a fourth quantized tensor in the second floating-point format.

9. The processor according to claim 2, wherein the matrix multiplication operation using the scaling factors comprises performing a matrix multiplication of a first tensor and a second tensor in combination with the scaling factor of the first tensor and the scaling factor of the second tensor, and, in combination with the scaling factor of a third tensor, determining the sum of the result of the matrix multiplication and the third tensor to obtain a fourth tensor as the result of the matrix multiplication operation; and the fourth tensor is transmitted directly to another computing unit or to the memory via the transmission path between the tensor core and the memory.

10. The processor according to any one of claims 2-9, wherein the at least one storage module comprises at least one of a tensor data storage unit in the computing unit and a shared memory; the tensor data storage unit is dedicated to preprocessing data related to the tensor core; and the shared memory is a storage area shared by all threads in the computing unit.

11. An electronic device, characterized by comprising the processor according to any one of claims 1-10.
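A note on the division lookup table of claim 4: its hardware realization is left open by the claim. One software analogue, offered here only as an assumption, exploits the fact that all elements of a tensor block divide by the same scaling factor parameter, so a cached reciprocal turns every division into a multiplication.

import numpy as np

class ReciprocalLUT:
    """Caches 1/scale per distinct scaling factor parameter so that the
    per-element divisions reduce to one lookup plus multiplications."""
    def __init__(self):
        self._table = {}

    def divide(self, block: np.ndarray, scale: float) -> np.ndarray:
        key = float(scale)
        if key not in self._table:
            self._table[key] = 1.0 / key   # computed once per scaling factor
        return block * self._table[key]

lut = ReciprocalLUT()
block = np.array([1.0, -2.0, 4.0], dtype=np.float32)
assert np.allclose(lut.divide(block, 0.5), block / 0.5)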
CN202511299546.6A 2025-09-12 2025-09-12 Processors, electronic devices Active CN120803395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511299546.6A CN120803395B (en) 2025-09-12 2025-09-12 Processors, electronic devices

Publications (2)

Publication Number Publication Date
CN120803395A CN120803395A (en) 2025-10-17
CN120803395B true CN120803395B (en) 2025-11-18

Family

ID=97319979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511299546.6A Active CN120803395B (en) 2025-09-12 2025-09-12 Processors, electronic devices

Country Status (1)

Country Link
CN (1) CN120803395B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120805999B (en) * 2025-09-09 2025-11-11 上海壁仞科技股份有限公司 Scaling factor processing module, processor, electronic devices

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114492771A (en) * 2020-11-13 2022-05-13 联发科技股份有限公司 Neural network processing unit and system
CN119883375A (en) * 2024-12-31 2025-04-25 摩尔线程智能科技(北京)股份有限公司 Processor, chip product, computer device and tensor calculation method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11347511B2 (en) * 2019-05-20 2022-05-31 Arm Limited Floating-point scaling operation
US20230133360A1 (en) * 2021-10-28 2023-05-04 Taiwan Semiconductor Manufacturing Company, Ltd. Compute-In-Memory-Based Floating-Point Processor
US20250045572A1 (en) * 2023-08-04 2025-02-06 Texas Instruments Incorporated Quantization for neural networks


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant