TWI845081B

TWI845081B - Graphics processor

Info

Publication number: TWI845081B
Application number: TW111149317A
Authority: TW
Inventors: 陳中和; 許峰銘; 期开蔡
Original assignee: 國立成功大學
Priority date: 2022-12-21
Filing date: 2022-12-21
Publication date: 2024-06-11
Also published as: TW202427231A

Abstract

A graphics processor includes a plurality of stream multiprocessors. Each of the stream multiprocessors includes N stream processors and a tensor processing module. The N stream processors are configured to operate in an SIMT architecture, wherein each of the N stream processors includes a product-accumulation unit. The tensor processing module is connected with the N product-accumulation units, and is configured to arrange the N product-accumulation units as a systolic array that performs a matrix multiplication.

Description

Graphics processor

本發明係關於圖形處理器的技術領域，特別是關於一種實現脈動陣列的圖形處理器。 The present invention relates to the technical field of graphics processors, and in particular to a graphics processor that implements a pulse array.

神經網路的應用正在飛速地成長當中，加上資訊安全的注重以及即時應用的需求，人工智慧的應用從雲端普及到了終端，這也促使了許多為了加速神經網路運算所設計的硬體加速器。一般來說，矩陣乘法為大多數網路層中的基礎運算，因此，若能加速矩陣乘法的運算便能達到加速神經網路的運算。為此，各大廠商例如NVIDIA公司以及Google公司皆提出了用於加速神經網路運算的硬體加速器。 The application of neural networks is growing rapidly. With the focus on information security and the need for real-time applications, the application of artificial intelligence has spread from the cloud to the terminal, which has also prompted many hardware accelerators designed to accelerate neural network operations. Generally speaking, matrix multiplication is the basic operation in most network layers. Therefore, if the operation of matrix multiplication can be accelerated, the operation of neural networks can be accelerated. For this reason, major manufacturers such as NVIDIA and Google have proposed hardware accelerators for accelerating neural network operations.

NVIDIA公司從Volta架構的圖像處理單元(Graphics Processing Unit，GPU)上加入了Tensor Core，它是一個獨立的硬體單元，可以執行兩個4x4大小的矩陣相乘的運算，並再與一個4x4矩陣執行一個累加的運算，進而完成一f個乘積累加運算(Fused Multiply-Add，FMA)。Tensor Core跟GPU裡的Cuda Core是不同的計算單元，所以它們可以同時執行以提升指令的並行能力(Instruction Level Parallelism)。 NVIDIA has added Tensor Core to the Graphics Processing Unit (GPU) of the Volta architecture. It is an independent hardware unit that can perform a multiplication operation of two 4x4 matrices and then perform an accumulation operation with a 4x4 matrix to complete a Fused Multiply-Add (FMA). Tensor Core and Cuda Core in the GPU are different computing units, so they can be executed simultaneously to improve instruction level parallelism.

Google公司所研發的TPU(Tensor Processing Unit)是一f個插在PCIe匯流排上的一個晶片，其用來在雲端伺服器中來應對人工智慧應用所需的大量運算，而非像NVIDIA的Tensor core是設計在GPU內部的硬體。Google TPU至今已推出許多版本，但運算單元皆採用脈動陣列(Systolic Array)的硬體架構。TPU無法單獨執行，需要藉由CPU透過PCIe匯流排傳送TPU CISC-like指令來進行控制。 The TPU (Tensor Processing Unit) developed by Google is a chip plugged into the PCIe bus. It is used in cloud servers to handle the large amount of computing required for artificial intelligence applications, rather than hardware designed inside the GPU like NVIDIA's Tensor core. Google TPU has launched many versions so far, but the computing units all use the systolic array hardware architecture. TPU cannot run alone and needs to be controlled by the CPU through the PCIe bus to send TPU CISC-like instructions.

無論是NVIDIA Tensor Core或是Google TPU，皆需要巨大功耗來滿足運算效能的需求，且非為終端的應用設計所考量。然而，在面積和功耗有限的終端裝置上，新增一顆矩陣運算單元帶來的成本非常之高。因此，如何在不增加太多硬體成本的條件下提升矩陣運算能的效能，是目前所遇到的技術瓶頸。 Whether it is NVIDIA Tensor Core or Google TPU, they both require huge power consumption to meet the demand for computing performance, and are not considered for terminal application design. However, on terminal devices with limited area and power consumption, the cost of adding a matrix computing unit is very high. Therefore, how to improve the performance of matrix computing without increasing hardware costs too much is the current technical bottleneck.

本發明之一目的在於提供一種利用圖形處理器實現加速矩陣乘法運算的硬體架構。 One of the purposes of the present invention is to provide a hardware architecture for accelerating matrix multiplication operations using a graphics processor.

為達上述之目的，本發明提供一種圖形處理器，其包含多個串流多處理器。該等串流多處理器的每一者包含N個串流處理器及張量運算模組。N個串流處理器經組態以單一指令對應多個執行緒(Single Instruction Multiple Threads，SIMT)的架構進行運算，其中該等N個串流處理器的每一者包含乘積累加運算單元，其中N為正整數且N

4。張量處理模組連接該等N個乘積累加運算單元，且經組態以將該等N個乘積累加運算單元配置為執行矩陣乘法運算的脈動陣列。 To achieve the above-mentioned object, the present invention provides a graphics processor, which includes a plurality of stream multiprocessors. Each of the stream multiprocessors includes N stream processors and a tensor operation module. The N stream processors are configured to operate in a single instruction to multiple threads (Single Instruction Multiple Threads, SIMT) architecture, wherein each of the N stream processors includes a multiply-accumulate operation unit, wherein N is a positive integer and N

4. The tensor processing module is connected to the N product-accumulation operation units and is configured to configure the N product-accumulation operation units as a pulse array for performing matrix multiplication operations.

在本發明的一實施例中，該脈動陣列的維數(D)符合下式，

In one embodiment of the present invention, the dimension (D) of the pulse array conforms to the following formula,

其中N=2ⁿ，n為正整數且n

2。 Where N=2 ⁿ , n is a positive integer and n

2.

在本發明的一實施例中，N=4，該等四個乘積累加運算單元為MAC₀、MAC₁、MAC₂、MAC₃分別屬於該等N個串流處理器中依序的四者，且該等四個乘積累加運算單元被配置為下列方式以形成該脈動陣列：

In one embodiment of the present invention, N=4, the four multiply-accumulate operation units are MAC ₀ , MAC ₁ , MAC ₂ , and MAC ₃ , which belong to four of the N stream processors in sequence, and the four multiply-accumulate operation units are configured in the following manner to form the pulse array:

在本發明的一實施例中，該等N個乘積累加運算單元為MAC₀、MAC₁、...、MAC_(N-1)分別屬於依序的該等N個串流處理器，其中響應於n為偶數且n

4時，該等N個乘積累加運算單元被配置為下列方式以形成該脈動陣列：

In one embodiment of the present invention, the N multiply-accumulate units are MAC ₀ , MAC ₁ , ..., MAC _(N-1) and belong to the N stream processors in sequence, wherein in response to n being an even number and n being

4, the N multiplication and accumulation units are configured in the following manner to form the pulse array:

在本發明的一實施例中，N=8，該等八個乘積累加運算單元為MAC₀、MAC₁、MAC₂、...、MAC₇分別屬於該等N個串流處理器中依序的八者，且該等八個乘積累加運算單元被配置為下列方式以形成該脈動陣列：

In one embodiment of the present invention, N=8, the eight multiply-accumulate operation units are MAC ₀ , MAC ₁ , MAC ₂ , ..., MAC ₇ , which belong to eight of the N stream processors in sequence, and the eight multiply-accumulate operation units are configured in the following manner to form the pulse array:

在本發明的一實施例中，該等N個乘積累加運算單元為MAC₀、MAC₁、...、MAC_(N-1)分別屬於依序的該等N個串流處理器，其中響應於n為奇數且n

5時，該等N個乘積累加運算單元被配置為下列方式以形成該脈動陣列：

In one embodiment of the present invention, the N multiply-accumulate units are MAC ₀ , MAC ₁ , ..., MAC _(N-1) and belong to the N stream processors in sequence, wherein in response to n being an odd number and n being

5, the N product accumulation units are configured in the following manner to form the pulse array:

在本發明的一實施例中，該張量處理模組包含一第一暫緩單元及一第二暫緩單元，其中該第一暫緩單元經組態以將資料傳送給該脈動陣列中對應該行的乘積累加運算單元，該第二暫緩單元經組態以將資料傳送給該脈動陣列中對應該列的乘積累加運算單元。 In one embodiment of the present invention, the tensor processing module includes a first buffer unit and a second buffer unit, wherein the first buffer unit is configured to transmit data to the multiplication and accumulation unit corresponding to the row in the pulse array, and the second buffer unit is configured to transmit data to the multiplication and accumulation unit corresponding to the column in the pulse array.

在本發明的一實施例中，該等串流多處理器的每一者還包含局部記憶體及讀取/儲存模組。局部記憶體連接該張量處理模組，且經組態以儲存資料。讀取/儲存模組連接該局部記憶體與該等N個串流處理器，且經組態以從該局部記憶體讀取資料或是將資料儲存在該局部記憶體。 In one embodiment of the present invention, each of the streaming multiprocessors further includes a local memory and a read/store module. The local memory is connected to the tensor processing module and is configured to store data. The read/store module is connected to the local memory and the N streaming processors and is configured to read data from the local memory or store data in the local memory.

在本發明的一實施例中，響應於執行矩陣乘法的一第一運算，該張量處理模組從該局部記憶體存取資料並將資料傳送給該脈動陣列中的乘積累加運算單元以進行該第一運算。 In one embodiment of the present invention, in response to performing a first operation of matrix multiplication, the tensor processing module accesses data from the local memory and transmits the data to the multiplication and accumulation unit in the pulse array to perform the first operation.

在本發明的一實施例中，響應於執行非矩陣乘法的一第二運算，該讀取/儲存模組從局部記憶體存取資料並將資料傳送給該等N個串流處理器以進行該第二運算。 In one embodiment of the present invention, in response to performing a second operation that is not a matrix multiplication, the read/store module accesses data from the local memory and transmits the data to the N stream processors to perform the second operation.

本發明提出的圖形處理器，透過將串流處理器內的乘積累加運算單元配置為執行矩陣乘法運算的脈動陣列，除了硬體架構設計上的一些改變，還考量運算平臺的整合以及神經網路在軟體層的優化。因此，在未增加太多的硬體的成本的情況下，本發明的圖形處理器可以有效地加速矩陣乘法的運算。 The graphics processor proposed in the present invention configures the multiplication and accumulation unit in the stream processor as a pulse array for performing matrix multiplication operations. In addition to some changes in the hardware architecture design, the integration of the computing platform and the optimization of the neural network at the software layer are also considered. Therefore, without increasing the cost of hardware too much, the graphics processor of the present invention can effectively accelerate the operation of matrix multiplication.

100:圖形處理器 100: Graphics processor

110:互連網路模組 110: Internet module

120:串流多處理器 120: Streaming multiprocessor

121:局部記憶體 121: Local memory

123:讀取/儲存模組 123: Read/save module

125:串流處理器 125: Stream Processor

127:張量處理模組 127: Tensor processing module

1271:第一暫緩單元 1271: First temporary unit

1273:第二暫緩單元 1273: Second temporary unit

129:調配模組 129: Allocation module

130:工作排程模組 130: Task Scheduling Module

140:記憶體 140:Memory

PE₀~PE_(N-1):串流處理器 PE ₀ ~PE _(N-1) : Stream Processor

MAC₀~MAC_(N-1):乘積累加運算單元 MAC ₀ ~MAC _(N-1) : Multiplication and Accumulation Unit

L₀~L_(N-1):通道 L ₀ ~L _(N-1) : Channel

Aout[0]、Aout[1]、Aout[2]、Aout[3]、Bout[0]、Bout[1]、Bout[2]、Bout[3]:資料 Aout[0], Aout[1], Aout[2], Aout[3], Bout[0], Bout[1], Bout[2], Bout[3]: data

第1A圖係根據本發明一較佳實施例繪示的圖形處理器的方塊系統示意圖。 Figure 1A is a block system schematic diagram of a graphics processor according to a preferred embodiment of the present invention.

第1B圖係根據本發明一較佳實施例繪示的圖形處理器實現脈動陣列的示意圖。 Figure 1B is a schematic diagram of a pulse array implemented by a graphics processor according to a preferred embodiment of the present invention.

第2圖係根據本發明一較佳實施例繪示的n=5時所實現的脈動陣列的示意圖。 Figure 2 is a schematic diagram of a pulse array realized when n=5 according to a preferred embodiment of the present invention.

第3圖係根據本發明一較佳實施例繪示的張量處理模組傳遞資料的時序示意圖。 Figure 3 is a timing diagram of the tensor processing module transmitting data according to a preferred embodiment of the present invention.

為了讓本發明之上述及其他目的、特徵、優點能更明顯易懂，下文將特舉本發明較佳實施例，並配合所附圖式，作詳細說明如下。 In order to make the above and other purposes, features and advantages of the present invention more clearly understood, the following will specifically cite the preferred embodiments of the present invention and provide a detailed description with reference to the attached drawings.

請參照第1A圖，第1A圖係根據本發明一較佳實施例繪示的圖形處理器100的方塊系統示意圖。圖形處理器100是單指令對應多執行緒(Single Instruction Multiple Threads，SIMT)的架構，其包含互連網路模組110、多個串流多處理器(Streaming Multiprocessor，SM)120、工作排程模組130、以及記憶體140。互連網路模組110電性連接於各個串流多處理器120、工作排程模組130、以及記憶體140，且經組態以在這些元件之間進行資料的傳輸。串流多處理器120經組態以進行運算與執行指令。工作排程模組130經組態以跟外部的中央處理器(圖未繪示)進行通訊，並接收來自中央處理器指派的工作以及將工作排程給串流多處理器120執行。 Please refer to FIG. 1A, which is a block diagram of a graphics processor 100 according to a preferred embodiment of the present invention. The graphics processor 100 is a single instruction multiple threads (SIMT) architecture, which includes an interconnection network module 110, a plurality of streaming multiprocessors (SMs) 120, a task scheduling module 130, and a memory 140. The interconnection network module 110 is electrically connected to each of the streaming multiprocessors 120, the task scheduling module 130, and the memory 140, and is configured to transmit data between these components. The streaming multiprocessor 120 is configured to perform calculations and execute instructions. The task scheduling module 130 is configured to communicate with an external central processor (not shown), receive tasks assigned from the central processor, and schedule the tasks to the streaming multiprocessor 120 for execution.

每個串流多處理器120包含局部記憶體121、讀取/儲存模組123、N個串流處理器(Streaming Processor，PE)125、張量(Tensor)處理模組127、調配模組129。局部記憶體121連接於記憶體140，用以儲存待運算的資料以及已經運算好的資料。讀取/儲存模組123連接局部記憶體121與調配模組129，經組態以從局部記憶體121中存取待運算的資料並將這些資料傳送到調配模組129，調配模組129再根據資料的型態分送給對應的串流處理器125以進行運算。N個串流處理器125分別包含各自的乘積累加運算單元(Multiply Accumulate Unit，MAC)MAC₀~MAC_N，其用於進行乘積累加的運算。具體來說，串流多處理器120的N個串流處理器125是以SIMT的架構進行運算，亦即，對於同一個指令，N個串流處理器125以執行緒的單位同時進行運算，因此可達到平行處理的效果。 Each streaming multiprocessor 120 includes a local memory 121, a read/storage module 123, N streaming processors (PE) 125, a tensor processing module 127, and a dispatching module 129. The local memory 121 is connected to the memory 140 to store data to be calculated and data that has been calculated. The read/storage module 123 is connected to the local memory 121 and the dispatching module 129, and is configured to access the data to be calculated from the local memory 121 and transmit the data to the dispatching module 129. The dispatching module 129 then distributes the data to the corresponding streaming processor 125 for calculation according to the type of the data. The N stream processors 125 each include a respective multiply-accumulate unit (MAC) MAC ₀ to MAC _N for performing multiply-accumulate operations. Specifically, the N stream processors 125 of the stream multiprocessor 120 perform operations based on a SIMT architecture, that is, for the same instruction, the N stream processors 125 perform operations simultaneously in thread units, thereby achieving a parallel processing effect.

在本發明的圖形處理器100中，每個串流多處理器120進一步包含了張量處理模組127，張量處理模組連接N個串流處理器125裡的N個乘積累加運算單元MAC₀~MAC_N，張量處理模組127經組態以將N個乘積累加運算單元MAC₀~MAC_N配置為執行矩陣乘法運算的脈動陣列(Systolic Array)，如第1B圖所示，第1B圖係根據本發明一較佳實施例繪示的圖形處理器100實現脈動陣列的示意圖。在第1B圖的示例中，N=16。脈動陣列的概念就是讓資料在運算單元的陣列中進行流動，減少存取的次數，並且使得結構更加完整，佈線更加統一，提高流動的頻率。脈動陣列本身只是一個有資料流動的結構，根據不同的應用會有不同的資料進行流動以及不同的流動方向。 In the graphics processor 100 of the present invention, each stream multiprocessor 120 further includes a tensor processing module 127, which is connected to N multiplication and accumulation units MAC ₀ ~MAC _N in the N stream processors 125. The tensor processing module 127 is configured to configure the N multiplication and accumulation units MAC ₀ ~MAC _N as a systolic array for performing matrix multiplication operations, as shown in FIG. 1B, which is a schematic diagram of the graphics processor 100 implementing a systolic array according to a preferred embodiment of the present invention. In the example of FIG. 1B, N=16. The concept of pulse array is to make data flow in the array of operation units, reduce the number of accesses, make the structure more complete, make the wiring more uniform, and increase the flow frequency. The pulse array itself is just a structure with data flow. Different data will flow and different flow directions depending on different applications.

雖然第1B圖繪示了一個陣列(MAC₀~MAC₁₅)，但實際上，張量處理模組127並非真的將N個乘積累加運算單元MAC₀~MAC_N排列成實際脈動陣列的模樣，而是透過不同時序的資料的傳遞，讓這些乘積累加運算單元MAC₀~MAC_N實現脈動陣列的運算，例如，L₀代表著連接於串流處理器PE₀裡的乘積累加運算單元MAC₀的通道，L₁代表著連接於串流處理器PE₁裡的乘積累加運算單元MAC₁的通道，以此類推，藉此量處理模組127可通過通道L₀~L₁₅將資料傳遞給乘積累加運算單元MAC₀~MAC₁₅進行運算。然而，為了方便理解，以下皆以矩陣的方式呈現。 Although FIG. 1B shows an array (MAC ₀ ~ MAC ₁₅ ), in reality, the tensor processing module 127 does not really arrange the N multiplication and accumulation units MAC ₀ ~ MAC _N into an actual pulse array. Instead, it transmits data at different timings to allow these multiplication and accumulation units MAC ₀ ~ MAC _N to perform pulse array operations. For example, L ₀ represents the channel of the multiplication and accumulation unit MAC ₀ connected to the stream processor PE ₀ , and L ₁ represents the channel of the multiplication and accumulation unit MAC ₁ connected to the stream processor PE ₁ , and so on. In this way, the tensor processing module 127 can transmit the multiplication and accumulation units MAC 0 ~ MAC N through the channels L ₀ ~ L ₁₅ passes the data to the multiplication and accumulation units MAC ₀ ~ MAC ₁₅ for operation. However, for ease of understanding, the following is presented in the form of a matrix.

在本發明的一實施例中，該脈動陣列的維數(Dimension，D)符合下式，

In one embodiment of the present invention, the dimension (Dimension, D) of the pulse array conforms to the following formula,

其中N為串流多處理器120裡所具有的乘積累加運算單元的數量(亦即，串流處理器125的數量)，N=2ⁿ，N為正整數且N

4，及n為正整數且n

2。 Where N is the number of MPUs in the streaming multiprocessor 120 (i.e., the number of streaming processors 125), N=2 ⁿ , N is a positive integer and N

4, and n is a positive integer and n

2.

具體來說，本發明實現的脈動陣列根據每個串流多處理器120裡的乘積累加運算單元的數量不同(亦即，不同的N(n))，從而改變可以實現的脈動陣列的維數。請參照第2圖，第2圖係根據本發明一較佳實施例繪示的n=5時所實現的脈動陣列的示意圖。當n=5(亦即，乘積累加運算單元的數量N有32個)時，n為奇數，根據上式，本發明的圖形處理器所實現的脈動陣列為4x4+4x4的矩陣，其中矩陣裡的箭頭方向代表著執行矩陣乘法運算時資料的傳遞方向。以下將介紹在不同N(n)的情況下，各個乘積累加運算單元MAC₀~MAC_N對應脈動陣列的位置。 Specifically, the pulse array implemented by the present invention changes the dimension of the pulse array that can be implemented according to the different number of product-accumulation operation units in each stream multiprocessor 120 (that is, different N(n)). Please refer to Figure 2, which is a schematic diagram of the pulse array implemented when n=5 according to a preferred embodiment of the present invention. When n=5 (that is, the number of product-accumulation operation units N is 32), n is an odd number. According to the above formula, the pulse array implemented by the graphics processor of the present invention is a 4x4+4x4 matrix, wherein the direction of the arrow in the matrix represents the direction of data transmission when performing matrix multiplication operation. The following will introduce the positions of the pulse arrays corresponding to each product-accumulation unit MAC ₀ ~MAC _N under different N(n) conditions.

在本實施例中，脈動陣列的維數最小為2x2，也就是n=2(亦即，乘積累加運算單元的數量N有4個)。在n=2(N=4)的情況下，每個串流多處理器120包含的乘積累加運算單元可例如為MAC₀、MAC₁、MAC₂、MAC₃共四個，分別屬於對應連接於依序的通道L₀~L₃的四個串流處理器PE₀~PE₃，且這四個乘積累加運算單元被配置為下列方式以形成該脈動陣列：

In the present embodiment, the minimum dimension of the pulse array is 2x2, that is, n=2 (that is, the number N of the multiplication and accumulation operation units is 4). In the case of n=2 (N=4), each stream multiprocessor 120 may include four multiplication and accumulation operation units, for example, MAC ₀ , MAC ₁ , MAC ₂ , and MAC ₃ , which respectively belong to four stream processors PE ₀ to PE ₃ connected to the sequential channels L ₀ to L ₃ , and these four multiplication and accumulation operation units are configured in the following manner to form the pulse array:

在n為偶數且n

4的情況下，每個串流多處理器120包含的乘積累加運算單元可例如為MAC₀、MAC₁、...、MAC_(N-1)共N個，分別屬於對應連接於依序的通道L₀~L_(N-1)的N個串流處理器PE₀~PE_(N-1)，N個乘積累加運算單元被配置為下列方式以形成該脈動陣列：

When n is even and n

4, each stream multiprocessor 120 may include N multiplication and accumulation units, for example, MAC ₀ , MAC ₁ , ..., MAC _(N-1) , which belong to N stream processors PE ₀ ~PE ( _N-1) connected to the sequential channels L ₀ ~L _(N-1) , respectively. The N multiplication and accumulation units are configured in the following manner to form the pulse array:

在n=3(N=8)的情況下，每個串流多處理器120包含的乘積累加運算單元可例如為MAC₀、MAC₁、MAC₂、...、MAC₇共八個，分別屬於對應連接於依序的通道L₀~L₇的八個串流處理器PE₀~PE₇，且這八個乘積累加運算單元被配置為下列方式以形成該脈動陣列：

In the case of n=3 (N=8), each stream multiprocessor 120 may include eight multiplication and accumulation units, for example, MAC ₀ , MAC ₁ , MAC ₂ , ..., MAC ₇ , which belong to eight stream processors PE ₀ -PE ₇ connected to the sequential channels L ₀ -L ₇ , respectively, and these eight multiplication and accumulation units are configured in the following manner to form the pulse array:

在n為奇數且n

5的情況下，每個串流多處理器120包含的乘積累加運算單元可例如為MAC₀、MAC₁、...、MAC_(N-1)共N個，分別屬於對應連接於依序的通道L₀~L_(N-1)的N個串流處理器PE₀~PE_(N-1)，N個乘積累加運算單元被配置為下列方式以形成該脈動陣列：

When n is odd and n

5, each stream multiprocessor 120 may include N multiplication and accumulation units, for example, MAC ₀ , MAC ₁ , ..., MAC _(N-1) , which belong to N stream processors PE ₀ ~PE ( _N-1) connected to the sequential channels L ₀ ~L _(N-1) , respectively. The N multiplication and accumulation units are configured in the following manner to form the pulse array:

根據上述各個實施方式，本發明提出的圖形處理器100考量在不同數量的乘積累加運算單元的情況下實現各種可執行的脈動陣列的方式，因此僅需要利用串流處理器125本身內建的乘積累加運算單元，無須設置額外的硬體(如Tensor core或是TPU)，利用張量處理模組127對資料的排程便可實現加速矩陣乘法運算的脈動陣列。 According to the above-mentioned implementation methods, the graphics processor 100 proposed by the present invention considers various executable pulse array methods under different numbers of multiplication and accumulation units, so it only needs to use the built-in multiplication and accumulation unit of the stream processor 125 itself, without setting up additional hardware (such as Tensor core or TPU), and the tensor processing module 127 can use the data scheduling to realize the pulse array of accelerated matrix multiplication operation.

請參照第3圖，第3圖係根據本發明一較佳實施例繪示的張量處理模組127傳遞資料的時序示意圖。在本例中，n=4。如第3圖所示，張量處理模組127包含第一暫緩單元1271及第二暫緩單元1273。張量處理模組127可將待進行乘法運算的矩陣(例如，第一矩陣和第二矩陣)的資料從局部記憶體121中存取出來並根據對應的行列資料分別儲存在第一暫緩單元1271(例如，第一矩陣的資料)及第二暫緩單元1273(例如，第二矩陣的資料)裡。第一暫緩單元1271經組態以將第一矩陣的資料(例如，Aout[0]、Aout[1]、Aout[2]、Aout[3])傳送給該脈動陣列中對應該行的乘積累加運算單元。第二暫緩單元1273經組態以將第二矩陣的資料(例如，Bbout[0]、Bout[1]、Bout[2]、Bout[3])傳送給該脈動陣列中對應該列的乘積累加運算單元。 Please refer to FIG. 3, which is a timing diagram of the tensor processing module 127 transmitting data according to a preferred embodiment of the present invention. In this example, n=4. As shown in FIG. 3, the tensor processing module 127 includes a first buffer unit 1271 and a second buffer unit 1273. The tensor processing module 127 can retrieve the data of the matrix to be multiplied (e.g., the first matrix and the second matrix) from the local memory 121 and store them in the first buffer unit 1271 (e.g., the data of the first matrix) and the second buffer unit 1273 (e.g., the data of the second matrix) according to the corresponding row and column data. The first buffer unit 1271 is configured to transmit the data of the first matrix (e.g., Aout[0], Aout[1], Aout[2], Aout[3]) to the multiplication and accumulation unit corresponding to the row in the pulse array. The second buffer unit 1273 is configured to transmit the data of the second matrix (e.g., Bbout[0], Bout[1], Bout[2], Bout[3]) to the multiplication and accumulation unit corresponding to the row in the pulse array.

一開始，張量處理模組127將待進行乘法運算的矩陣的資料從局部記憶體121中存取至第一暫緩單元1271及第二暫緩單元1273裡。然後，第一暫緩單元1271及第二暫緩單元1273分別將對應第一行(例如，Aout[0])及第一列(例如，Bout[0])的矩陣要運算的資料透過通道L₀傳遞給乘積累加運算單元MAC₁以進行運算。待運算完畢後，再將Aout[0]及Bout[1]透過通道L₄傳遞給乘積累加運算單元MAC₄以進行運算，以及將Bout[0]及Aout[1]透過通道L₁傳遞給乘積累加運算單元MAC₁以進行運算。待運算完畢後，接著，將Aout[0]及Bout[2]透過通道L₈傳遞給乘積累加運算單元MAC₈以進行運算，將Bout[0]及Aout[2]透過通道L₂傳遞給乘積累加運算單元MAC₂以進行運算，將Aout[1]及Bout[1]透過通道L₅傳遞給乘積累加運算單元MAC₅以進行運算，以此類推，直到完成待運算的矩陣的資料皆完成傳遞和計算，便可成完矩陣的乘法運算。 Initially, the tensor processing module 127 accesses the matrix data to be multiplied from the local memory 121 to the first buffer unit 1271 and the second buffer unit 1273. Then, the first buffer unit 1271 and the second buffer unit 1273 respectively transmit the matrix data to be operated corresponding to the first row (e.g., Aout[0]) and the first column (e.g., Bout[0]) through the channel _L0 to the multiplication and accumulation unit _MAC1 for operation. After the operation is completed, Aout[0] and Bout[1] are transmitted to the multiplication and accumulation unit MAC ₄ through channel L ₄ for operation, and Bout[0] and Aout[1] are transmitted to the multiplication and accumulation unit MAC ₁ through channel L ₁ for operation. After the operation is completed, Aout[0] and Bout[2] are then passed to the multiplication and accumulation unit MAC ₈ through channel L ₈ for operation, Bout[0] and Aout[2] are passed to the multiplication and accumulation unit MAC ₂ through channel L ₂ for operation, and Aout[1] and Bout[1] are passed to the multiplication and accumulation unit MAC ₅ through channel L ₅ for operation, and so on, until all the matrix data to be operated are completed. The matrix multiplication operation can be completed.

在一些實施例中，張量處理模組127可包含的暫緩單元的數量可不侷限於兩個，可視待進行乘法運算的矩陣的資料的大小而有多於兩個以上數量的暫緩單元。 In some embodiments, the number of buffer units that the tensor processing module 127 may include is not limited to two, and may include more than two buffer units depending on the size of the matrix data to be multiplied.

進一步來說，本實施例中的張量處理模組127可先向局部記憶體121存取待運算矩陣乘法的矩陣資料，並將矩陣資料分別先暫存在第一暫緩單元 1271及第二暫緩單元1273中，等到要進行矩陣乘法時，便可直接從第一暫緩單元1271及第二暫緩單元1273存取這些資料並將資料直接透過對應的通道提供給對應乘積累加運算單元，而無須透過讀取/儲存模組123從局部記憶體121存取資料再由調配模組129進行分配資料的操作。因此，可大幅減少資料從外部記憶體存取的時間，從而進一步減少執行矩陣的乘法運算的時間。換言之，在響應於執行矩陣乘法的運算時，串流多處理器120可利用張量處理模組127從局部記憶體121直接存取資料並儲存在第一暫緩單元1271及第二暫緩單元1273中，然後從第一暫緩單元1271及第二暫緩單元1273透過通道將資料傳送給該脈動陣列中的乘積累加運算單元以進行該運算；而當響應於執行非矩陣乘法的運算時，串流多處理器120則是利用讀取/儲存模組123從局部記憶體121存取資料並將資料傳送給調配模組129，再由調配模組129分配資料給N個串流處理器125以進行該運算。 Furthermore, the tensor processing module 127 in this embodiment can first access the matrix data to be operated on the matrix multiplication from the local memory 121, and temporarily store the matrix data in the first buffer unit 1271 and the second buffer unit 1273. When the matrix multiplication is to be performed, the data can be directly accessed from the first buffer unit 1271 and the second buffer unit 1273 and provided to the corresponding product-accumulation operation unit through the corresponding channel, without the need to access the data from the local memory 121 through the read/storage module 123 and then the allocation module 129 performs the operation of allocating the data. Therefore, the time for accessing data from the external memory can be greatly reduced, thereby further reducing the time for performing matrix multiplication operations. In other words, in response to performing matrix multiplication operations, the streaming multiprocessor 120 can use the tensor processing module 127 to directly access data from the local memory 121 and store it in the first buffer unit 1271 and the second buffer unit 1273, and then transmit the data from the first buffer unit 1271 and the second buffer unit 1273 to the pulse through the channel. The multiplication and accumulation unit in the array performs the operation; when responding to the execution of non-matrix multiplication operations, the streaming multiprocessor 120 uses the read/store module 123 to access data from the local memory 121 and transmit the data to the allocation module 129, and then the allocation module 129 distributes the data to N streaming processors 125 to perform the operation.

本發明提出的圖形處理器，透過將串流處理器本身內建的乘積累加運算單元配置為執行矩陣乘法運算的脈動陣列，除了硬體架構設計上的一些改變，還考量運算平臺的整合以及神經網路在軟體層的優化。因此，在未增加太多的硬體的成本的情況下，本發明的圖形處理器可以有效地加速矩陣乘法的運算。 The graphics processor proposed in the present invention configures the built-in multiplication and accumulation unit of the stream processor itself as a pulse array for performing matrix multiplication operations. In addition to some changes in the hardware architecture design, the integration of the computing platform and the optimization of the neural network at the software layer are also considered. Therefore, without increasing the cost of hardware too much, the graphics processor of the present invention can effectively accelerate the operation of matrix multiplication.

雖然本發明已以較佳實施例揭露，然其並非用以限制本發明，任何熟習此項技藝之人士，在不脫離本發明之精神和範圍內，當可作各種更動與修飾，因此本發明之保護範圍當視後附之申請專利範圍所界定者為準。 Although the present invention has been disclosed with preferred embodiments, it is not intended to limit the present invention. Anyone skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of the patent application attached hereto.

100:圖形處理器 100: Graphics processor

110:互連網路模組 110: Internet module

120:串流多處理器 120: Streaming multiprocessor

121:局部記憶體 121: Local memory

123:讀取/儲存模組 123: Read/save module

125:串流處理器 125: Stream Processor

127:張量處理模組 127: Tensor processing module

129:調配模組 129: Allocation module

130:工作排程模組 130: Task Scheduling Module

140:記憶體 140:Memory

PE₀~PE_(N-1):串流處理器 PE ₀ ~PE _(N-1) : Stream Processor

L₀~L_(N-1):通道 L ₀ ~L _(N-1) : Channel

Claims

A graphics processor includes: a plurality of stream multiprocessors, each of the stream multiprocessors includes: N stream processors configured to operate with a single instruction corresponding to a plurality of execution threads, wherein each of the N stream processors includes a multiply-accumulate operation unit, wherein N is a positive integer and N

4; and a tensor processing module connected to the N product-accumulation operation units and configured to configure the N product-accumulation operation units into a pulse array for performing matrix multiplication operations, wherein the dimension (D) of the pulse array meets the following formula,

Where N=2 ⁿ , n is a positive integer and n

2.

The graphics processor as claimed in claim 1, wherein N=4, the four multiply-accumulate operation units are MAC ₀ , MAC ₁ , MAC ₂ , and MAC ₃ , which belong to four of the N stream processors in sequence, and the four multiply-accumulate operation units are configured in the following manner to form the pulse array:

The graphics processor of claim 1, wherein the N multiply-accumulate units are MAC ₀ , MAC ₁ , ..., MAC _(N-1) and belong to the N stream processors in sequence, wherein in response to n being an even number and n

The graphics processor as claimed in claim 1, wherein N=8, the eight multiply-accumulate operation units are MAC ₀ , MAC ₁ , MAC ₂ , . . . , MAC ₇ , which belong to eight of the N stream processors in sequence, and the eight multiply-accumulate operation units are configured in the following manner to form the pulse array:

The graphics processor of claim 1, wherein the N multiply-accumulate units are MAC ₀ , MAC ₁ , ..., MAC _(N-1) and belong to the N stream processors in sequence, wherein in response to n being an odd number and n

A graphics processor as described in claim 1, wherein the size of the pulse array is AxB, wherein the tensor processing module includes a first buffer unit and a second buffer unit, wherein the first buffer unit is configured to transmit data to a corresponding row of the B row of the pulse array, and the second buffer unit is configured to transmit data to a corresponding row of the A row of the pulse array.

A graphics processor as described in claim 1, wherein each of the streaming multiprocessors further comprises: a local memory connected to the tensor processing module and configured to store data; and a read/store module connected to the local memory and the N streaming processors and configured to read data from the local memory or store data in the local memory.

A graphics processor as described in claim 7, wherein in response to performing a first operation of matrix multiplication, the tensor processing module accesses data from the local memory and transmits the data to the N multiply-accumulate operation units in the pulse array to perform the first operation.

A graphics processor as described in claim 7, wherein in response to performing a second operation that is not a matrix multiplication, the read/store module accesses data from the local memory and transmits the data to the N stream processors to perform the second operation.