TWI845081B - Graphics processor - Google Patents
Graphics processor
- Publication number
- TWI845081B
- Authority
- TW
- Taiwan
- Prior art keywords
- mac
- data
- graphics processor
- units
- systolic array
- Prior art date
Landscapes
- Image Processing (AREA)
- Multi Processors (AREA)
Abstract
Description
The present invention relates to the technical field of graphics processors, and in particular to a graphics processor that implements a systolic array.
The use of neural networks is growing rapidly. Together with the emphasis on information security and the demand for real-time applications, artificial intelligence has spread from the cloud to edge devices, which has prompted many hardware accelerators designed to speed up neural-network computation. Generally speaking, matrix multiplication is the fundamental operation in most network layers, so accelerating matrix multiplication accelerates the neural network as a whole. To this end, major vendors such as NVIDIA and Google have proposed hardware accelerators for neural-network computation.
Starting with the Volta architecture, NVIDIA added Tensor Cores to its graphics processing units (GPUs). A Tensor Core is an independent hardware unit that multiplies two 4x4 matrices and accumulates the result with a third 4x4 matrix, completing one fused multiply-add (FMA) operation. Tensor Cores and the GPU's CUDA cores are separate computing units, so they can execute simultaneously to increase instruction-level parallelism.
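As a concrete illustration of the fused multiply-add just described, the following plain-Python sketch computes D = A × B + C on 4x4 tiles. It models only the arithmetic a Tensor Core performs per operation; the function name and list-of-lists representation are illustrative, not NVIDIA's API.

```python
# Illustrative model of one Tensor Core FMA step: D = A @ B + C on 4x4 tiles.
# Plain Python for clarity; real Tensor Cores do this in hardware, in mixed precision.

def fma_4x4(A, B, C):
    """Return D = A x B + C for 4x4 matrices given as lists of lists."""
    n = 4
    return [[sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j]
             for j in range(n)]
            for i in range(n)]

# Example: I @ I + C simply adds 1 to the diagonal of C.
I4 = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
C = [[2] * 4 for _ in range(4)]
D = fma_4x4(I4, I4, C)
```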
The TPU (Tensor Processing Unit) developed by Google is a chip that plugs into the PCIe bus and handles the heavy computation that artificial-intelligence applications demand in cloud servers, rather than being hardware built into a GPU like NVIDIA's Tensor Core. Google has released many TPU versions, but the computing units in all of them use a systolic-array hardware architecture. A TPU cannot run on its own; a CPU must control it by sending CISC-like TPU instructions over the PCIe bus.
Both NVIDIA Tensor Cores and Google TPUs require enormous power to meet their performance targets, and neither was designed with edge applications in mind. On edge devices with limited area and power budgets, however, the cost of adding a dedicated matrix-computation unit is very high. How to improve matrix-computation performance without adding much hardware cost is therefore the current technical bottleneck.
One object of the present invention is to provide a hardware architecture that uses a graphics processor to accelerate matrix multiplication.
To achieve the above object, the present invention provides a graphics processor comprising a plurality of streaming multiprocessors. Each streaming multiprocessor includes N streaming processors and a tensor processing module. The N streaming processors are configured to operate under a Single Instruction Multiple Threads (SIMT) architecture, and each of the N streaming processors includes a multiply-accumulate (MAC) unit, where N is a positive integer and N ≥ 4. The tensor processing module is connected to the N MAC units and is configured to arrange the N MAC units as a systolic array that performs matrix multiplication.
In one embodiment of the present invention, the dimensions (D) of the systolic array satisfy

D = 2^(n/2) × 2^(n/2), when n is even,
D = 2^((n-1)/2) × 2^((n-1)/2) + 2^((n-1)/2) × 2^((n-1)/2), when n is odd,

where N = 2^n, n is a positive integer, and n ≥ 2.
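The case split between even and odd n can be sanity-checked with a small helper. This is an inference from the embodiments in this description (a 2x2 array for n = 2, a 4x4 + 4x4 arrangement for n = 5), not code from the patent; the function name is ours.

```python
# Dimensions of the systolic array(s) built from N = 2**n MAC units,
# as inferred from the embodiments: one square array for even n,
# two equal square arrays for odd n.

def systolic_dims(n):
    """Return a list of (rows, cols) tuples, one per systolic array."""
    if n < 2:
        raise ValueError("the described embodiments require n >= 2")
    if n % 2 == 0:
        side = 2 ** (n // 2)
        return [(side, side)]        # e.g. n=4 -> one 4x4 array
    side = 2 ** ((n - 1) // 2)
    return [(side, side)] * 2        # e.g. n=5 -> 4x4 + 4x4
```

Note that every MAC unit is used: the array sizes always sum to N = 2^n.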
In one embodiment of the present invention, N = 4, the four MAC units MAC0, MAC1, MAC2, and MAC3 belong, in order, to four of the N streaming processors, and the four MAC units are arranged as follows to form the systolic array:

| MAC0 MAC1 |
| MAC2 MAC3 |
In one embodiment of the present invention, the N MAC units MAC0, MAC1, ..., MAC(N-1) belong, in order, to the N streaming processors. When n is even and n ≥ 4, the N MAC units are arranged to form the systolic array as a √N × √N grid filled in row-major order, so that the unit at row i and column j is MAC(√N·i + j).
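For the even-n layout, the flat MAC index and the grid coordinate are related by a simple row-major rule, consistent with the channel numbering in the Figure 3 timing example (where MAC4 is the first unit of the second row of a 4x4 array). A sketch, with function names of our own choosing:

```python
# Row-major mapping between a MAC unit's flat index and its (row, column)
# position in a side x side systolic array: index = side*row + col.

def mac_position(index, side):
    """Return (row, col) of MAC<index> in a side x side array."""
    return divmod(index, side)

def mac_index(row, col, side):
    """Inverse mapping: flat MAC index of the unit at (row, col)."""
    return side * row + col
```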
In one embodiment of the present invention, N = 8, the eight MAC units MAC0, MAC1, MAC2, ..., MAC7 belong, in order, to eight of the N streaming processors, and the eight MAC units are arranged as two 2x2 arrays to form the systolic array:

| MAC0 MAC1 |   | MAC4 MAC5 |
| MAC2 MAC3 |   | MAC6 MAC7 |
In one embodiment of the present invention, the N MAC units MAC0, MAC1, ..., MAC(N-1) belong, in order, to the N streaming processors. When n is odd and n ≥ 5, the N MAC units are arranged to form the systolic array as two square arrays of N/2 units each: MAC0 through MAC(N/2-1) fill the first array and MAC(N/2) through MAC(N-1) fill the second, each in row-major order.
In one embodiment of the present invention, the tensor processing module includes a first buffer unit and a second buffer unit. The first buffer unit is configured to transmit data to the MAC units in the corresponding columns of the systolic array, and the second buffer unit is configured to transmit data to the MAC units in the corresponding rows of the systolic array.
In one embodiment of the present invention, each of the streaming multiprocessors further includes a local memory and a load/store module. The local memory is connected to the tensor processing module and is configured to store data. The load/store module is connected to the local memory and the N streaming processors, and is configured to read data from the local memory or store data into the local memory.
In one embodiment of the present invention, in response to performing a first operation that is a matrix multiplication, the tensor processing module accesses data from the local memory and transmits the data to the MAC units in the systolic array to perform the first operation.
In one embodiment of the present invention, in response to performing a second operation that is not a matrix multiplication, the load/store module accesses data from the local memory and transmits the data to the N streaming processors to perform the second operation.
By configuring the MAC units inside its streaming processors as a systolic array for matrix multiplication, the graphics processor proposed by the present invention involves only minor changes to the hardware architecture, while also taking into account integration with the computing platform and software-level optimization of neural networks. The graphics processor of the present invention can therefore effectively accelerate matrix multiplication without adding much hardware cost.
100: Graphics processor
110: Interconnection network module
120: Streaming multiprocessor
121: Local memory
123: Load/store module
125: Streaming processor
127: Tensor processing module
1271: First buffer unit
1273: Second buffer unit
129: Dispatch module
130: Work scheduling module
140: Memory
PE0~PE(N-1): Streaming processors
MAC0~MAC(N-1): Multiply-accumulate units
L0~L(N-1): Channels
Aout[0], Aout[1], Aout[2], Aout[3], Bout[0], Bout[1], Bout[2], Bout[3]: Data
Figure 1A is a block diagram of a graphics processor according to a preferred embodiment of the present invention.
Figure 1B is a schematic diagram of a systolic array implemented by the graphics processor according to a preferred embodiment of the present invention.
Figure 2 is a schematic diagram of the systolic array implemented when n = 5 according to a preferred embodiment of the present invention.
Figure 3 is a timing diagram of data transfer by the tensor processing module according to a preferred embodiment of the present invention.
To make the above and other objects, features, and advantages of the present invention easier to understand, preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Please refer to Figure 1A, a block diagram of a graphics processor 100 according to a preferred embodiment of the present invention. The graphics processor 100 has a Single Instruction Multiple Threads (SIMT) architecture and includes an interconnection network module 110, a plurality of streaming multiprocessors (SM) 120, a work scheduling module 130, and a memory 140. The interconnection network module 110 is electrically connected to each streaming multiprocessor 120, the work scheduling module 130, and the memory 140, and is configured to transfer data among these components. The streaming multiprocessors 120 are configured to perform computation and execute instructions. The work scheduling module 130 is configured to communicate with an external central processing unit (not shown), receive work assigned by the central processing unit, and schedule that work to the streaming multiprocessors 120 for execution.
Each streaming multiprocessor 120 includes a local memory 121, a load/store module 123, N streaming processors (PE) 125, a tensor processing module 127, and a dispatch module 129. The local memory 121 is connected to the memory 140 and stores both data awaiting computation and finished results. The load/store module 123 is connected to the local memory 121 and the dispatch module 129; it is configured to fetch data awaiting computation from the local memory 121 and pass it to the dispatch module 129, which then distributes the data to the corresponding streaming processors 125 according to its type. The N streaming processors 125 each contain their own multiply-accumulate unit (MAC) MAC0~MAC(N-1), which performs multiply-accumulate operations. Specifically, the N streaming processors 125 of a streaming multiprocessor 120 operate under the SIMT architecture; that is, for the same instruction, the N streaming processors 125 compute simultaneously at thread granularity, achieving parallel processing.
In the graphics processor 100 of the present invention, each streaming multiprocessor 120 further includes a tensor processing module 127 connected to the N MAC units MAC0~MAC(N-1) inside the N streaming processors 125. The tensor processing module 127 is configured to arrange the N MAC units as a systolic array that performs matrix multiplication, as shown in Figure 1B, a schematic diagram of the systolic array implemented by the graphics processor 100 according to a preferred embodiment of the present invention. In the example of Figure 1B, N = 16. The idea of a systolic array is to let data flow through an array of computing units, reducing the number of memory accesses, making the structure more regular and the wiring more uniform, and raising the rate of data flow. A systolic array by itself is just a structure through which data flows; different applications flow different data in different directions.
Although Figure 1B depicts an array (MAC0~MAC15), the tensor processing module 127 does not physically rearrange the N MAC units into an actual systolic array. Instead, by passing data at different times, it makes the MAC units carry out the systolic array's computation. For example, L0 is the channel connected to MAC unit MAC0 inside streaming processor PE0, L1 is the channel connected to MAC unit MAC1 inside streaming processor PE1, and so on; the tensor processing module 127 can thus send data through channels L0~L15 to MAC units MAC0~MAC15 for computation. For ease of understanding, however, the following description presents the units in matrix form.
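The flow through such a grid can be modeled with a small output-stationary simulation: each cell keeps one accumulator, and operands reach cell (i, j) along a diagonal wavefront, one cycle later per row and per column. This is an illustrative model of the systolic technique, not the patent's hardware.

```python
# Output-stationary systolic matrix multiply: cell (i, j) accumulates
# A[i][k] * B[k][j]; the k-th operand pair reaches it at cycle t = i + j + k,
# mimicking data "flowing" through the MAC grid instead of being re-fetched.

def systolic_matmul(A, B):
    n = len(A)
    acc = [[0] * n for _ in range(n)]   # one accumulator per MAC cell
    for t in range(3 * n - 2):          # cycles until the last cell finishes
        for i in range(n):
            for j in range(n):
                k = t - i - j           # operand index arriving this cycle
                if 0 <= k < n:
                    acc[i][j] += A[i][k] * B[k][j]
    return acc

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
```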
In one embodiment of the present invention, the dimensions (Dimension, D) of the systolic array satisfy

D = 2^(n/2) × 2^(n/2), when n is even,
D = 2^((n-1)/2) × 2^((n-1)/2) + 2^((n-1)/2) × 2^((n-1)/2), when n is odd,

where N is the number of MAC units in a streaming multiprocessor 120 (that is, the number of streaming processors 125), N = 2^n, N is a positive integer and N ≥ 4, and n is a positive integer and n ≥ 2.
Specifically, the systolic array implemented by the present invention changes its realizable dimensions according to the number of MAC units in each streaming multiprocessor 120 (that is, according to different values of N and n). Please refer to Figure 2, a schematic diagram of the systolic array implemented when n = 5 according to a preferred embodiment of the present invention. When n = 5 (that is, the number N of MAC units is 32), n is odd, and by the formula above the systolic array implemented by the graphics processor of the present invention is a 4x4 + 4x4 arrangement, where the arrow directions in the matrices indicate the direction in which data travels during matrix multiplication. The following describes, for different values of N (n), the position of each MAC unit MAC0~MAC(N-1) in the systolic array.
In this embodiment, the minimum systolic-array dimension is 2x2, that is, n = 2 (the number N of MAC units is 4). In the case n = 2 (N = 4), the MAC units of each streaming multiprocessor 120 may be, for example, MAC0, MAC1, MAC2, and MAC3, belonging to the four streaming processors PE0~PE3 connected to channels L0~L3 in order, and the four MAC units are arranged as follows to form the systolic array:

| MAC0 MAC1 |
| MAC2 MAC3 |
In the case where n is even and n ≥ 4, the MAC units of each streaming multiprocessor 120 may be, for example, MAC0, MAC1, ..., MAC(N-1), N in total, belonging to the N streaming processors PE0~PE(N-1) connected to channels L0~L(N-1) in order, and the N MAC units are arranged to form the systolic array as a √N × √N grid filled in row-major order, so that the unit at row i and column j is MAC(√N·i + j).
In the case n = 3 (N = 8), the MAC units of each streaming multiprocessor 120 may be, for example, MAC0, MAC1, MAC2, ..., MAC7, eight in total, belonging to the eight streaming processors PE0~PE7 connected to channels L0~L7 in order, and the eight MAC units are arranged as two 2x2 arrays to form the systolic array:

| MAC0 MAC1 |   | MAC4 MAC5 |
| MAC2 MAC3 |   | MAC6 MAC7 |
In the case where n is odd and n ≥ 5, the MAC units of each streaming multiprocessor 120 may be, for example, MAC0, MAC1, ..., MAC(N-1), N in total, belonging to the N streaming processors PE0~PE(N-1) connected to channels L0~L(N-1) in order, and the N MAC units are arranged to form the systolic array as two square arrays of N/2 units each: MAC0 through MAC(N/2-1) fill the first array and MAC(N/2) through MAC(N-1) fill the second, each in row-major order.
According to the embodiments above, the graphics processor 100 proposed by the present invention considers how to realize executable systolic arrays for different numbers of MAC units. It therefore needs only the MAC units already built into the streaming processors 125; no additional hardware (such as Tensor Cores or a TPU) is required, and the systolic array that accelerates matrix multiplication is realized simply by the tensor processing module 127 scheduling the data.
Please refer to Figure 3, a timing diagram of data transfer by the tensor processing module 127 according to a preferred embodiment of the present invention. In this example, n = 4. As shown in Figure 3, the tensor processing module 127 includes a first buffer unit 1271 and a second buffer unit 1273. The tensor processing module 127 can fetch the data of the matrices to be multiplied (for example, a first matrix and a second matrix) from the local memory 121 and store it, according to the corresponding rows and columns, in the first buffer unit 1271 (for example, the first matrix's data) and the second buffer unit 1273 (for example, the second matrix's data). The first buffer unit 1271 is configured to transmit the first matrix's data (for example, Aout[0], Aout[1], Aout[2], Aout[3]) to the MAC units in the corresponding columns of the systolic array. The second buffer unit 1273 is configured to transmit the second matrix's data (for example, Bout[0], Bout[1], Bout[2], Bout[3]) to the MAC units in the corresponding rows of the systolic array.
At the start, the tensor processing module 127 fetches the data of the matrices to be multiplied from the local memory 121 into the first buffer unit 1271 and the second buffer unit 1273. The first buffer unit 1271 and the second buffer unit 1273 then pass the data of the first column and the first row (for example, Aout[0] and Bout[0]) through channel L0 to MAC unit MAC0 for computation. Once that computation finishes, Aout[0] and Bout[1] are passed through channel L4 to MAC unit MAC4, and Bout[0] and Aout[1] are passed through channel L1 to MAC unit MAC1. Once those computations finish, Aout[0] and Bout[2] are passed through channel L8 to MAC unit MAC8, Bout[0] and Aout[2] are passed through channel L2 to MAC unit MAC2, and Aout[1] and Bout[1] are passed through channel L5 to MAC unit MAC5, and so on, until all the data of the matrices to be multiplied has been transferred and computed, completing the matrix multiplication.
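The firing order described in the Figure 3 timing (MAC0; then MAC1 and MAC4; then MAC2, MAC5, and MAC8; ...) is a diagonal wavefront over the 4x4 grid. A small sketch reproduces the schedule; the function name is ours.

```python
# Cycle-by-cycle activation schedule of a side x side systolic array:
# cell (r, c) first fires at cycle r + c, and its flat MAC index is side*r + c.

def wavefront_schedule(side):
    """Return, per cycle, the sorted list of MAC indices that start computing."""
    by_cycle = {}
    for r in range(side):
        for c in range(side):
            by_cycle.setdefault(r + c, []).append(side * r + c)
    return [sorted(by_cycle[t]) for t in sorted(by_cycle)]
```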
In some embodiments, the tensor processing module 127 is not limited to two buffer units; more than two buffer units may be provided, depending on the size of the matrix data to be multiplied.
More specifically, the tensor processing module 127 in this embodiment can first fetch the matrix data for the matrix multiplication from the local memory 121 and temporarily store it in the first buffer unit 1271 and the second buffer unit 1273. When the matrix multiplication is to be performed, the data can be read directly from the first buffer unit 1271 and the second buffer unit 1273 and delivered through the corresponding channels to the corresponding MAC units, without going through the load/store module 123 to access the local memory 121 and then the dispatch module 129 to distribute the data. This greatly reduces the time spent fetching data from external memory and thus further reduces the time needed to execute the matrix multiplication. In other words, in response to performing a matrix multiplication, the streaming multiprocessor 120 uses the tensor processing module 127 to access data directly from the local memory 121, stores it in the first buffer unit 1271 and the second buffer unit 1273, and then sends it from those buffer units through the channels to the MAC units in the systolic array to perform the operation; in response to performing an operation that is not a matrix multiplication, the streaming multiprocessor 120 instead uses the load/store module 123 to access data from the local memory 121 and pass it to the dispatch module 129, which distributes the data to the N streaming processors 125 to perform the operation.
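The two data paths can be summarized in a toy dispatcher: matrix multiplies bypass the load/store and dispatch path and are fed from the tensor module's buffer units instead. All names here are illustrative, not identifiers from the patent.

```python
# Toy model of the two data paths in a streaming multiprocessor:
# matmul -> tensor module buffers feed the systolic array directly;
# anything else -> load/store module + dispatch module feed the PEs.

def route_operation(op_kind):
    """Return the ordered list of stages data passes through for op_kind."""
    if op_kind == "matmul":
        return ["local_memory", "tensor_module_buffers", "systolic_array"]
    return ["local_memory", "load_store_module", "dispatch_module",
            "stream_processors"]
```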
By configuring the MAC units already built into its streaming processors as a systolic array for matrix multiplication, the graphics processor proposed by the present invention involves only minor changes to the hardware architecture, while also taking into account integration with the computing platform and software-level optimization of neural networks. The graphics processor of the present invention can therefore effectively accelerate matrix multiplication without adding much hardware cost.
Although the present invention has been disclosed through preferred embodiments, they are not intended to limit the invention. Anyone skilled in the art may make various changes and modifications without departing from the spirit and scope of the present invention; the scope of protection of the present invention shall therefore be defined by the appended claims.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW111149317A TWI845081B (en) | 2022-12-21 | 2022-12-21 | Graphics processor |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TWI845081B true TWI845081B (en) | 2024-06-11 |
| TW202427231A TW202427231A (en) | 2024-07-01 |
Family
ID=92541622
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW111149317A TWI845081B (en) | 2022-12-21 | 2022-12-21 | Graphics processor |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI845081B (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TW202143032A (en) * | 2020-03-30 | 2021-11-16 | Qualcomm Incorporated | Processing data stream modification to reduce power effects during parallel processing |
| TW202143031A (en) * | 2020-05-05 | 2021-11-16 | Intel Corporation | Scalable sparse matrix multiply acceleration using systolic arrays with feedback inputs |
| CN114787822A (en) * | 2019-10-28 | 2022-07-22 | Micron Technology, Inc. | Distributed neural network processing on a smart image sensor stack |
| TW202238509A (en) * | 2017-04-28 | 2022-10-01 | Intel Corporation | General-purpose graphics processing unit and data processing system for compute optimizations for low precision machine learning operations |
| US20220366527A1 (en) * | 2017-04-24 | 2022-11-17 | Intel Corporation | Coordination and increased utilization of graphics processors during inference |
- 2022-12-21: TW application TW111149317A filed; patent TWI845081B (en) active
Also Published As
| Publication number | Publication date |
|---|---|
| TW202427231A (en) | 2024-07-01 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110516801B (en) | High-throughput-rate dynamic reconfigurable convolutional neural network accelerator | |
| CN107729989B (en) | A device and method for performing forward operation of artificial neural network | |
| CN107679620B (en) | Artificial Neural Network Processing Device | |
| CN107704922B (en) | Artificial Neural Network Processing Device | |
| US8281053B2 (en) | Performing an all-to-all data exchange on a plurality of data buffers by performing swap operations | |
| CN107679621A (en) | Artificial Neural Network Processing Device | |
| EP3869352A1 (en) | Network-on-chip data processing method and device | |
| CN111859273A (en) | matrix multiplier | |
| US9612750B2 (en) | Autonomous memory subsystem architecture | |
| US20090006663A1 (en) | Direct Memory Access ('DMA') Engine Assisted Local Reduction | |
| CN111630505A (en) | Deep learning accelerator system and method thereof | |
| CN115880132B (en) | Graphics processor, matrix multiplication task processing method, device and storage medium | |
| CN110059797B (en) | Computing device and related product | |
| US20250278318A1 (en) | Data processing method and apparatus, electronic device, and computer-readable storage medium | |
| TWI845081B (en) | Graphics processor | |
| CN110059809B (en) | Computing device and related product | |
| Chu et al. | High-performance adaptive MPI derived datatype communication for modern Multi-GPU systems | |
| CN113434813B (en) | Matrix multiplication operation method based on neural network and related device | |
| US20210334264A1 (en) | System, method, and program for increasing efficiency of database queries | |
| CN114692853B (en) | Computing unit architecture, computing unit cluster, and convolution operation execution method | |
| CN117332809A (en) | Neural network inference chips, methods and terminal equipment | |
| KR20190029124A (en) | Optimal gpu coding method | |
| RU2830044C1 (en) | Vector computing device | |
| KR102775919B1 (en) | System and method for cooperative working with cpu-gpu server | |
| CN113867798B (en) | Integrated computing devices, integrated circuit chips, circuit boards, and computing methods |