
CN111199275A - System-on-Chip for Neural Networks - Google Patents

System-on-Chip for Neural Networks

Info

Publication number
CN111199275A
CN111199275A
Authority
CN
China
Prior art keywords
data
chip
matrix
cluster
clusters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811383562.3A
Other languages
Chinese (zh)
Other versions
CN111199275B (en)
Inventor
王平
孙洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Denglin Technology Co ltd
Original Assignee
Shanghai Denglin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Denglin Technology Co Ltd filed Critical Shanghai Denglin Technology Co Ltd
Priority to CN201811383562.3A
Publication of CN111199275A
Application granted
Publication of CN111199275B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Multi Processors (AREA)

Abstract

The present invention provides a system-on-chip for neural networks. The system-on-chip includes a plurality of compute clusters, a forward data forwarding path, a backward data sharing path, and a task allocation unit. The compute clusters implement the multiplication of the input neuron matrix and the weight matrix in a neural network, and each compute cluster includes a local on-chip memory and a corresponding off-chip memory. The forward data forwarding path forwards input neuron data among the compute clusters; the backward data sharing path passes weight data or calculation results among the compute clusters; and the task allocation unit determines each compute cluster's task allocation strategy according to the scale of the input neuron matrix to be computed, thereby assigning to each compute cluster the input neuron data on which the matrix multiplication is to be performed. The system-on-chip of the present invention improves resource utilization and operation efficiency.


Description

System on chip for neural networks
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a system on chip for a neural network.
Background
Artificial intelligence technology has developed rapidly in recent years and attracted wide attention worldwide. Research on artificial intelligence is carried out in both industry and academia, and the technology has already penetrated fields such as visual perception, speech recognition, assisted driving, smart home, and traffic scheduling.
The deep neural network is one of the most highly developed perception models in the field of artificial intelligence. By building a model, it simulates the neural connection structure of the human brain and describes data features through multiple layered transformation stages, bringing breakthrough progress to large-scale data processing tasks such as images, video, and audio. A deep neural network model is an operational model consisting of a large number of nodes, called neurons, connected through a mesh interconnection structure. The strength of the connection between every two nodes represents the coefficient, i.e., the weight, carried by the signal passing through that connection, corresponding to memory in the human neural network.
Matrix multiplication of input neuron data with a weight matrix is the typical operation in neural network inference and training. The arithmetic circuits of the matrix multiplication unit and the associated data input/output paths are therefore the focus of performance and power optimization. The basic characteristic is that the larger the matrix multiplication unit, the better its power efficiency; its drawback is that in the shallow layers of an application (i.e., where the input neuron data is large but the weights are small), computational resources are wasted because the number of output components is insufficient. Conversely, the smaller the matrix multiplication unit, the higher the utilization of computational resources, but repeated scheduling and frequent data accesses (e.g., to off-chip DRAM, on-chip general register files, or shared memory) make it less power efficient.
Existing systems on chip for neural network processing mainly adopt three types of architectures. The first is a single-processor network with centralized on-chip storage; its architecture is simple, but shallow-layer processing efficiency is low and the energy consumed by data movement is high. The second is a symmetric multiprocessor network; in such architectures there is no data communication between the processors, the storage of each processor unit is limited, and frequent memory swapping in and out leads to higher energy consumption. The third is a cascaded network-on-chip architecture, in which performance imbalance among the different layers of the neural network limits processing efficiency to the worst-performing layer, wasting on-chip processing resources.
Therefore, to push neural networks into wider applications, for example intelligent wearables, intelligent robots, automatic driving, and pattern recognition, the prior art needs to be improved to increase the efficiency of neural network data processing, reduce operating power consumption, and improve the utilization of computing resources.
Disclosure of Invention
The present invention is directed to overcoming the above-mentioned drawbacks of the prior art, and providing a system on chip for a neural network, which improves the computational energy efficiency ratio by improving the processor architecture and the resource scheduling manner of the system on chip.
According to a first aspect of the invention, a system on chip for a neural network is provided. The system on chip comprises a plurality of computing clusters, a forward data forwarding path, a backward data sharing path and a task allocation unit, wherein:
the plurality of computing clusters are used for realizing multiplication operation of an input neuron matrix and a weight matrix in a neural network, wherein each computing cluster comprises a local on-chip memory and a corresponding off-chip memory;
the forward data forwarding path is used to forward input neuron data among the plurality of compute clusters;
the backward data sharing path is used for transmitting weight data or calculation results among the plurality of calculation clusters;
the task allocation unit is used for determining a task allocation strategy of each computing cluster according to the input neuron matrix scale to be computed, so as to allocate input neuron data to be subjected to matrix multiplication operation for each computing cluster.
In one embodiment, the computing cluster includes a data flow control module, a data buffer module, a multiply-accumulate module, a data transfer module, and an on-chip memory, wherein:
the data caching module is used for storing neuron data, weight data or calculation result data;
the multiplication and accumulation module is used for realizing multiplication operation of the input neuron matrix and the corresponding weight matrix;
the data flow control module is used for controlling the loading of data to the data cache module, the multiply-accumulate module, the data transmission module and the on-chip memory;
the data transfer module is used for forwarding the neuron data to other computing clusters.
In one embodiment, the backward data sharing path is formed by a plurality of repeaters connected in sequence, wherein each repeater corresponds to a computation cluster and is used for transmitting the weight data or computation results received from other repeaters to the corresponding computation cluster.
In one embodiment, the task allocation unit is further configured to determine a storage policy of the weight matrix in the local on-chip memories of the plurality of computing clusters and the corresponding off-chip memories according to at least one of a size of the weight matrix or a computing capability of the plurality of computing clusters.
In one embodiment, in the case that the input neuron matrix is B × N × K, the weight matrix is K × M, there are b computation clusters, each computation cluster has a computation capability of k × m, and N, M, K, k, m, b are any positive integers:
the task allocation strategy is to allocate ⌈B/b⌉ input neuron data matrices to each computing cluster in parallel.
In an embodiment, when M is less than or equal to m, the weight matrix is stored in the off-chip memory corresponding to each computation cluster, or the weight matrix is stored in the off-chip memory corresponding to a single computation cluster, or the weight matrix is divided into a plurality of sub-matrices that are stored in the off-chip memories corresponding to the respective computation clusters.
In one embodiment, when performing a matrix multiplication operation while storing the weight matrix in an off-chip memory corresponding to one compute cluster, the compute cluster loads the weight matrix from its corresponding off-chip memory to a local on-chip memory, and transfers the weight matrix to the remaining compute clusters via the backward data sharing path.
In an embodiment, when the weight matrix is divided evenly into a plurality of sub-matrices stored in the off-chip memories corresponding to the respective compute clusters, each compute cluster, when performing a matrix multiplication operation, loads its sub-matrix from its corresponding off-chip memory to the local on-chip memory and transfers it to the other compute clusters via the backward data sharing path.
In one embodiment, where the input neuron matrix is B × N × K, the weight matrix is K × M, there are b computing clusters, each computing cluster has a computing capability of k × m, N, M, K, k, m, b are any positive integers, and M ≥ b × m:
the task allocation strategy is to allocate ⌈B/b⌉ input neuron matrices to each computing cluster in parallel;
the storage strategy of the weight matrix is to divide the weight matrix into a plurality of sub-matrices according to the computing capabilities of the computing clusters and distribute the sub-matrices in off-chip memories corresponding to the computing clusters.
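As a concrete illustration of the allocation and storage strategies described above, the following Python sketch shows how a task allocation unit might choose between replicating the weight matrix and splitting it into column blocks. This is not code from the patent: the function name plan_allocation, the round-robin assignment of column blocks, and the handling of the intermediate case m < M < b × m are illustrative assumptions.

```python
import math

def plan_allocation(B, M, b, m):
    """Illustrative sketch of the task-allocation unit's decision logic.

    B: number of N x K input neuron matrices to process
    M: number of columns of the K x M weight matrix
    b: number of compute clusters
    m: weight-column capability of one cluster for a single matrix multiply
    """
    # Each cluster is assigned ceil(B / b) input neuron matrices in parallel.
    matrices_per_cluster = math.ceil(B / b)

    if M <= m:
        # Small weight: every cluster can process the whole K x M weight at
        # once, so it can be replicated in each cluster's off-chip memory,
        # kept in a single cluster, or split evenly and circulated over the
        # backward data sharing path.
        weight_policy = "replicate-or-split"
        columns_per_cluster = [[(0, M)] for _ in range(b)]
    else:
        # Large weight (e.g. M >= b * m): divide the weight matrix into column
        # blocks of width at most m and distribute them round-robin, so the
        # heavy weights stay local to the clusters that own them.
        weight_policy = "split-by-columns"
        blocks = [(c, min(c + m, M)) for c in range(0, M, m)]
        columns_per_cluster = [blocks[i::b] for i in range(b)]

    return matrices_per_cluster, weight_policy, columns_per_cluster

# Example: 8 input matrices, 4 clusters, a weight matrix with 512 columns,
# and a per-cluster column capability of 64.
per_cluster, policy, cols = plan_allocation(B=8, M=512, b=4, m=64)
print(per_cluster, policy)
print(cols[0])   # column blocks held locally by the first cluster
```

In the small-weight branch no weight needs to cross the backward data sharing path unless the split option is chosen; in the large-weight branch the wide weight columns never leave their home cluster, which is the behaviour described in the embodiments above.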
In one embodiment, the forward data forwarding path sequentially connects the plurality of computational clusters in series in a first direction to form a loop that passes input neuron data, and the backward data sharing path sequentially connects the plurality of computational clusters in series in a second direction to form a loop that passes weight data or computation results.
According to a second aspect of the invention, an electronic device is provided. The electronic device comprises the system on chip of the invention.
Compared with the prior art, the advantages of the invention are: aiming at the operational characteristics of different layers in neural network inference applications, a unified, coordinated multi-compute-cluster system-on-chip architecture is provided, which solves the low computational efficiency of a single operation unit in the shallow layers of inference applications; data sharing among the multiple compute clusters is realized by dedicated data forwarding paths and an on-chip network; and the heavier bandwidth load is scheduled inside a compute cluster according to the scale of the input neuron matrix or of the weight matrix, while the lighter bandwidth load is transmitted over the data forwarding paths, thereby optimizing the energy consumption of local memory accesses.
Drawings
The invention is illustrated and described, by way of example only and not by way of limitation, in the following drawings, in which:
FIG. 1 is an architectural diagram of a system on a chip for a neural network, according to one embodiment of the invention;
FIG. 2 is a schematic diagram of a structure of a compute cluster of a system on a chip, according to one embodiment of the invention;
FIG. 3 is a block diagram of a system on a chip according to another embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions, design methods, and advantages of the present invention more apparent, the present invention will be further described in detail by specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not as a limitation. Thus, other examples of the exemplary embodiments may have different values.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In the description herein, input neuron data is node data in a neural network model, weights refer to coefficients connecting two nodes, which can be obtained by training, and data generally refers to each type of data such as input neuron data, weight data, and calculation results unless otherwise indicated by context.
According to an embodiment of the present invention, a system on chip for neural network processing is provided. Referring to fig. 1, the system includes a plurality of compute clusters (or processor clusters), shown as compute cluster 101, compute cluster 102, compute cluster 103 and compute cluster 104, a forward data forwarding path 120, a backward data sharing path 130, and a task allocation unit 140.
The compute clusters 101-104 are used for performing the matrix multiplication function, and each may be formed by one or more processing units, for example only a matrix multiplication processing unit, or a matrix multiplication processing unit together with other types of units. Each compute cluster may have the same or a different circuit configuration, implemented for example as an ASIC or a DSP, and the computing power of each compute cluster may be the same or different. Furthermore, each compute cluster has its own on-chip memory (also referred to herein as local on-chip memory), which may be, for example, SRAM or another type, and off-chip memory, which may be, for example, DDR memory chips or another type; the invention is not limited in this respect.
The forward data forwarding path 120 forms a ring path for forwarding input neuron data between a plurality of computing clusters, each computing cluster being capable of forwarding neuron data read from the outside (e.g., off-chip memory) or received neuron data forwarded by other computing clusters to other computing clusters connected thereto in turn via the forward data forwarding path 120, so that neuron data can circulate between the plurality of computing clusters.
The backward data sharing path 130 forms a ring path for passing the weight data or the calculation result of the matrix multiplication among the plurality of calculation clusters, and each calculation cluster can forward the weight data read from the outside (such as an off-chip memory) or the received weight data from other calculation clusters to other calculation clusters connected thereto in turn via the backward data sharing path 130, so that the weight data circularly flow among the plurality of calculation clusters. In this way, each compute cluster is able to access both on-chip memory and off-chip memory.
In one embodiment, the on-chip memory resources used for data sharing by each compute cluster may be uniformly addressed, and the off-chip memory resources may also be uniformly addressed; the address bits include a portion identifying off-chip versus on-chip memory, a portion identifying the selected compute cluster, and a portion identifying a particular off-chip or on-chip memory address. For example, the unified address is 35 bits wide: the highest bit (bit 34) identifies whether the address is on-chip or off-chip, bit 33 and bit 32 select a compute cluster (for 4 compute clusters, 2 bits suffice to select any cluster), and the lower 32 bits, bit 31 through bit 0, address a 4 GB space. See Table 1.
Table 1: bit identification

bit 34: on-chip/off-chip memory flag
bits 33-32: compute cluster select
bits 31-0: memory address (4 GB space)
As can be seen from Table 1, in this way each compute cluster can access both a 4 GB on-chip memory space and a 4 GB off-chip memory space.
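For illustration, the bit layout of Table 1 can be expressed as a small encode/decode routine. This is a sketch under stated assumptions: the four-cluster configuration comes from the example above, while the flag polarity of bit 34 and the function names are mine, not the patent's.

```python
ON_CHIP_BIT = 34           # 1 = on-chip memory, 0 = off-chip memory (assumed polarity)
CLUSTER_SHIFT = 32         # bits 33-32 select one of 4 compute clusters
ADDR_MASK = (1 << 32) - 1  # bits 31-0 address a 4 GB space

def encode_unified(on_chip: bool, cluster: int, offset: int) -> int:
    """Pack an (on-chip flag, cluster id, 32-bit offset) triple into a 35-bit address."""
    assert 0 <= cluster < 4 and 0 <= offset <= ADDR_MASK
    return (int(on_chip) << ON_CHIP_BIT) | (cluster << CLUSTER_SHIFT) | offset

def decode_unified(addr: int):
    """Unpack a 35-bit unified address back into its three fields."""
    on_chip = bool((addr >> ON_CHIP_BIT) & 0x1)
    cluster = (addr >> CLUSTER_SHIFT) & 0x3
    offset = addr & ADDR_MASK
    return on_chip, cluster, offset

# A load issued by any cluster can target cluster 2's on-chip memory directly:
addr = encode_unified(on_chip=True, cluster=2, offset=0x1000)
print(hex(addr), decode_unified(addr))
```

Any access issued inside one cluster can thus reach the on-chip or off-chip memory of any other cluster simply by setting the cluster-select bits.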
The task allocation unit 140 is configured to determine task allocation policies and on-chip and off-chip storage policies of the multiple compute clusters according to requirements of the tasks to be computed and computing capabilities of each compute cluster. The task assigning unit 140 may be implemented by a software module of the system on chip. Further details regarding the task allocation policies and on-chip and off-chip storage policies are provided below.
Fig. 2 is a block diagram of a computing cluster according to an embodiment of the present invention, where the computing cluster includes a data flow control module 210, a data buffer module 220, a multiply-accumulate module 230, a data transfer module 240, an on-chip memory 250, and a repeater 260.
The data flow control module 210 is in communication connection with the data buffer module 220, the multiply-accumulate module 230, the data transfer module 240, and the repeater 260 (connections to the repeater 260 are not shown), and can receive the task allocation policy and the storage policy of the on-chip and off-chip memories from the task allocation unit, and control data (including neuron data, weights or matrix multiplication results, etc.) to be transferred between modules in the computing cluster and data interaction with the outside of the computing cluster according to the policies and the task execution condition. For example, including but not limited to controlling the data cache module 220 to select data from the on-chip memory 250, controlling the repeater 260 to receive weight data from repeaters of other computing clusters, controlling loading of data from outside the computing cluster to the data cache module 220 and then controlling passing to the multiply-accumulate module 230, or after the multiply-accumulate module 230 has performed a matrix multiply operation, controlling passing of neuron data to the data transfer module 240 and then controlling the data transfer module 240 to pass the neuron data to a subsequent computing cluster, and so on.
The data caching module 220 is used for caching various types of data. For example, including but not limited to weight data of the matrix multiplication operation to be performed, input neuron data, calculation results of the multiply-accumulate module 230, and the like.
The multiply-accumulate module 230 is used to perform multiplication operations of the weight matrix and the input neuron matrix, and may include one or more matrix multiply processing units to quickly process matrix multiplication operations of different sizes.
The data transfer module 240 is configured to form part of the forward data forwarding path; it is communicatively connected to the data flow control module 210 inside the compute cluster and also to other compute clusters so as to pass data to them, for example to the data caching modules of other compute clusters.
The on-chip memory 250, i.e. the local on-chip memory of the computational cluster, is used to store various types of data, such as neuron data or weight data.
The repeater 260 is configured to form a backward data sharing path, and may load weight data from an external memory, receive weight data from other computing clusters, or forward the weight data to other computing clusters (for example, by interacting with repeaters of other computing clusters), or receive matrix multiplication results from other computing clusters, store the matrix multiplication results in the on-chip memory 250, or forward the matrix multiplication results to other computing clusters.
For clarity of illustration, the connection relationship between the data flow control module 210 and the on-chip memory 250 and the repeater 260 is not shown in fig. 2. Such a connection relationship is understandable to those skilled in the art in order to realize the functions of the present invention. Moreover, it will be apparent to those skilled in the art that the components and the connection relationship between the components shown in fig. 2 are not limited, and that some components may be added or deleted and the connection relationship between the components may be changed according to the needs and purposes of the system.
In conjunction with the compute cluster architecture of fig. 2, when the system on chip includes multiple compute clusters, a forward data forwarding path and a backward data sharing path may be formed under the control of the data flow control module 210.
For example, the forward data forwarding path is composed of the data buffering module 220, the multiply-accumulate module 230 and the data transfer module 240, together with the corresponding modules in the other compute clusters; that is, neuron data can be forwarded in turn through the data buffering module 220, the multiply-accumulate module 230 and the data transfer module 240, and then through the data buffering, multiply-accumulate and data transfer modules of the other compute clusters. As another example, the forward data forwarding path is formed by the data buffering module 220 and the data transfer module 240 together with the data buffering and data transfer modules of the other compute clusters, in which case some neuron data may be forwarded directly to other compute clusters without participating in the multiply-accumulate operation.
For example, the backward data sharing path includes the repeater 260 and repeaters in other computation clusters, i.e., the computation result of the weight data or matrix multiplication is transmitted by the repeater 260 to the repeater of the computation cluster connected thereto.
It should be understood that the connection relationship between the modules in fig. 2 is only used for illustration, and in practical applications, those skilled in the art may make appropriate modifications, for example, the calculation result of the multiply-accumulate module 230 may also be temporarily stored in the data buffer module 220 and forwarded to other calculation clusters as appropriate.
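To make the module topology of fig. 2 and its two ring paths easier to follow, here is a minimal structural sketch in Python. The class and field names are illustrative only; the sketch models state and connectivity, not timing, and assumes four clusters wired into a forward ring and a backward ring as described in the text.

```python
from dataclasses import dataclass, field

@dataclass
class ComputeCluster:
    """Skeleton of one compute cluster from fig. 2 (state and links only)."""
    cluster_id: int
    data_cache: list = field(default_factory=list)       # data caching module 220
    on_chip_memory: dict = field(default_factory=dict)    # local on-chip memory 250
    forward_next: "ComputeCluster | None" = None           # via data transfer module 240
    backward_next: "ComputeCluster | None" = None          # via repeater 260 / 40x

    def forward_neurons(self, neurons):
        # Forward data forwarding path: pass input neuron data downstream.
        if self.forward_next is not None:
            self.forward_next.data_cache.append(neurons)

    def share_backward(self, key, value):
        # Backward data sharing path: pass weights or partial results the other way.
        if self.backward_next is not None:
            self.backward_next.on_chip_memory[key] = value

# Wire four clusters into a forward ring (first direction) and a backward ring
# (second direction), matching the loop structure described in the text.
clusters = [ComputeCluster(i) for i in range(4)]
for i, c in enumerate(clusters):
    c.forward_next = clusters[(i + 1) % 4]
    c.backward_next = clusters[(i - 1) % 4]

clusters[0].forward_neurons("neuron tile 0")
clusters[0].share_backward("weight part 1", "w1-data")
print(clusters[1].data_cache, clusters[3].on_chip_memory)
```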
Fig. 3 shows a system on chip according to another embodiment of the present invention. It is similar to the system shown in fig. 1 and likewise includes compute clusters 301-304, a forward data forwarding path 320, a backward data sharing path 430 and a task allocation unit (not shown), but compared with fig. 1, the repeaters constituting the backward data sharing path are arranged outside the compute clusters. The backward data sharing path 430 is formed by connecting a plurality of repeaters, marked as repeater 401, repeater 402, repeater 403 and repeater 404; the repeaters correspond one-to-one to the compute clusters and can exchange data with their corresponding compute clusters.
The task allocation policy and the on-chip and off-chip storage policy and the corresponding data processing procedure will be described below with reference to fig. 3.
In one embodiment, the tasks to be executed by each compute cluster and the on-chip and off-chip storage strategies used during computation are determined according to the scale of the matrices to be computed (including the scale of the input neuron matrix and/or of the weight matrix) or the computing capability of each processor, so that by selecting different schemes the computing resources are used efficiently and data transfers are minimized as far as possible.
For example, assume that the input neuron data has a size of B × N × K, i.e., there are B input neuron data matrices of size N × K (N is the row dimension and K the column dimension), and that the weight matrices have a size of K × M (K is the row dimension and M the column dimension). For convenience of description, all b computation clusters have the same computation capability k × m (that is, in one matrix multiplication the row dimension of the weight sub-matrix is k and the column dimension is m), where N, M, K, k, m and b are any positive integers. The task allocation strategy and the on-chip/off-chip storage strategy are introduced below for the two cases of a small weight scale and a large weight scale.
1) The case of smaller weight scale
For example, if M ≤ m. Such calculations typically occur in the shallower layers of image recognition applications, where M is generally small and K is also small, so that the scale K × M of the weight matrix is small as well.
In this case, the task allocation strategy is the input-neuron-matrix-parallel mode: each computation cluster is allocated ⌈B/b⌉ input neuron matrices to be computed in parallel.
In one embodiment, the storage policy for the weight matrix is to store a copy of the weight matrix in the off-chip memory corresponding to each computation cluster; when matrix multiplication is executed, each computation cluster loads the weight matrix from its off-chip memory into its local on-chip memory, so that during inference all input neuron matrices and the weight matrix are processed locally within the computation clusters, no data communication is needed between them, and the backward data sharing path is not used. In this way, access latency and access power consumption can be reduced.
In another embodiment, the storage strategy of the weight matrix is to distribute the weight matrix evenly across the on-chip memories of the computing clusters; when matrix multiplication is performed, the multiply-accumulate module in each computing cluster loads its portion of the weight matrix from the local on-chip memory and obtains the portions held by the other computing clusters through the backward data sharing path.
For ease of understanding, Table 2 below illustrates the behavior of the compute clusters at different times and Table 3 the behavior of the repeaters, in connection with the system on chip shown in fig. 3. Specifically, each compute cluster is allocated ⌈B/b⌉ input neuron matrices in parallel, and the weight matrix is divided equally into four sub-weight matrices, labeled weight portions 1-4 and assigned to compute clusters 301-304 respectively. At time T0, compute cluster 301 performs the matrix multiplication of its allocated neuron matrices and weight portion 1, while the repeater 401 corresponding to this cluster reads weight portion 2 from compute cluster 302; at time T1, compute cluster 301 performs the matrix multiplication of its allocated neuron matrices and weight portion 2. The other compute clusters and their corresponding repeaters behave similarly; see Tables 2 and 3.
Table 2: computing cluster behavior at different times
Figure BDA0001872448310000096
Table 3: repeater behavior at different times
Figure BDA0001872448310000097
Figure BDA0001872448310000101
As can be seen from Tables 2 and 3, while each computation cluster performs a matrix multiplication, the corresponding repeater can read, via the backward data sharing path, the weight data that will be needed at a later time from other computation clusters, so that the weight data flows among the repeaters and is loaded (into the data cache module, the on-chip memory, and so on) when a computation cluster needs it. In this way the flow of weight data among the computation clusters can be controlled, improving the resource utilization of the computation clusters and the efficiency of the matrix multiplication.
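The rotation pattern implied by Tables 2 and 3 can be sketched numerically as follows. This is an inferred schedule, assuming four clusters, that the four equal sub-weight matrices are column blocks, and a fixed rotation direction; it is not a reproduction of the original tables.

```python
import numpy as np

def rotated_matmul(neurons, weight_parts):
    """Small-weight case: each cluster keeps its own neuron matrices local and,
    over successive time steps, multiplies them by every weight part, which
    circulates between clusters over the backward data sharing path.

    neurons:      list of per-cluster neuron matrices, each of shape (N, K)
    weight_parts: list of weight column blocks of shape (K, m); part i starts
                  in cluster i's local on-chip memory.
    """
    b = len(neurons)
    # outputs[c][p] holds cluster c's product with weight part p.
    outputs = [[None] * b for _ in range(b)]
    for t in range(b):                 # time steps T0, T1, ...
        for c in range(b):
            p = (c + t) % b            # weight part held by cluster c at time t
            outputs[c][p] = neurons[c] @ weight_parts[p]
            # Meanwhile, cluster c's repeater would prefetch part (c + t + 1) % b
            # from the neighbouring cluster for the next time step.
    # Each cluster splices its partial products back into a full-width result.
    return [np.concatenate(row, axis=1) for row in outputs]

rng = np.random.default_rng(0)
neurons = [rng.standard_normal((8, 16)) for _ in range(4)]
weights = np.hsplit(rng.standard_normal((16, 32)), 4)    # four K x m parts
results = rotated_matmul(neurons, weights)
reference = [n @ np.concatenate(weights, axis=1) for n in neurons]
print(all(np.allclose(r, ref) for r, ref in zip(results, reference)))  # True
```

The final concatenation corresponds to each cluster assembling the full-width product of its own neuron matrices with the complete weight matrix.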
2) Case of large weight scale
If M ≥ b × m, the weight matrix is large while the neuron matrix is relatively small, and a computation cluster cannot complete the multiplication of the input neuron matrix and the weight matrix in a single pass. In this case, each compute cluster may still be allocated ⌈B/b⌉ matrices to be calculated in parallel, and the weight matrix is divided into several smaller sub-matrices distributed across different computing clusters; for example, compute cluster 101 is assigned the sub-matrix of columns K × [0, m-1], compute cluster 102 the sub-matrix of columns K × [m, 2m-1], and so on. In this way, when matrix multiplication is performed, the heavier communication load, i.e., the large-scale weights, remains local, while the neuron data can be read once from memory (e.g., SDRAM) and propagated to the other compute clusters of the system-on-chip by means of the forward data forwarding path. Only the results of the matrix multiplication, or intermediate calculation results, are written back to the on-chip shared memory through the backward data sharing path; all other accesses occur inside the compute clusters.
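To see why keeping the large weights local pays off, a rough back-of-the-envelope estimate of inter-cluster traffic can be made for the two possible placements. The formulas and example numbers below are my own illustration of the bandwidth argument, not figures from the patent.

```python
def inter_cluster_traffic(B, N, K, M, b, bytes_per_elem=2):
    """Rough estimate of bytes moved between clusters for one pass over
    B neuron matrices of shape (N, K) and one K x M weight matrix."""
    # Option A: weights stay local (column-split); each neuron matrix is read
    # once and forwarded to the other b-1 clusters over the forward path, and
    # only the N x M results travel back over the backward sharing path.
    neurons_forwarded = B * N * K * (b - 1) * bytes_per_elem
    results_written_back = B * N * M * bytes_per_elem
    split_weights = neurons_forwarded + results_written_back

    # Option B: neurons stay local and the whole weight matrix is circulated
    # to every other cluster instead.
    circulate_weights = K * M * (b - 1) * bytes_per_elem

    return split_weights, circulate_weights

# Deep layer: few, small neuron matrices and a very wide weight matrix.
print(inter_cluster_traffic(B=4, N=32, K=1024, M=4096, b=4))
```

In this example, circulating the weights moves roughly an order of magnitude more bytes than forwarding the neurons and writing back the results, which is the trade-off the column-split placement exploits.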
Still referring to fig. 3, Tables 4 and 5 below illustrate the behavior of the compute clusters and of the repeaters, respectively, at different times. As before, each compute cluster is allocated ⌈B/b⌉ input neuron matrices in parallel, and one weight matrix is divided into four sub-weight matrices, marked as weight parts 1-4 and distributed to compute clusters 301-304 respectively; when the multiplication of a neuron matrix and the weight matrix is performed, the partial results obtained with the four sub-weight matrices must be spliced together.
In this example, at time T0, compute cluster 301 performs the matrix multiplication of a neuron matrix and weight part 1 while compute cluster 302 performs the matrix multiplication of a neuron matrix and weight part 2; at time T1, compute cluster 301 again processes a neuron matrix with weight part 1 and compute cluster 302 a neuron matrix with weight part 2. At time T2, the repeater 401 corresponding to compute cluster 301 reads from compute cluster 302 the result obtained with weight part 2, and at time T3 repeater 401 reads from compute cluster 303 the result obtained with weight part 3, and so on. After a compute cluster has obtained the results of all the sub-weight matrices belonging to one weight matrix, the calculation result of the neuron matrix and the weight matrix is obtained by splicing them. The other compute clusters and their corresponding repeaters behave similarly; see Tables 4 and 5.
Table 4: processor behavior at different times
Figure BDA0001872448310000117
Table 5: repeater behavior at different times
Figure BDA0001872448310000118
Figure BDA0001872448310000121
As can be seen from Tables 4 and 5, while each computation cluster performs a matrix multiplication, the corresponding repeater can read, via the backward data sharing path, the result of the matrix multiplication performed at an earlier time by another computation cluster, so that the calculation results flow in turn among the repeaters and the computation clusters can splice them as needed.
It should be understood that the timing with which the above weight data and calculation results flow between the compute clusters is not fixed; the order in which data is transferred between the modules may be controlled according to the scale of the data to be processed, the computation capability of the multiply-accumulate module, and the capacity of the data cache module and the on-chip memory. For example, the ⌈B/b⌉ matrices allocated to each compute cluster need not all be processed before the calculation results are forwarded onto the data sharing path; the results of a part of the matrices may be forwarded once that part has been processed. Further, although Tables 4 and 5 illustrate the processing of a single weight matrix, the case of multiple weight matrices is handled simply by processing them in sequence in the same way.
Aiming at the operational characteristics of different layers in neural network inference applications, the system on chip of the invention provides a unified, coordinated multiprocessor architecture in which each compute cluster has its own larger storage that it can access efficiently, which solves the low shallow-layer processing efficiency and high data-movement energy consumption of the first type of architecture and, at the same time, the limited-storage problem of the second type. In addition, by coordinating the scheduling of processing tasks and selecting different storage strategies, the system adapts to the operational characteristics of different layers of the neural network, thereby solving the performance-imbalance problem of the third type of architecture. On the software/hardware coordination side, the task partitioning scheme and non-uniform memory storage strategy applied at the software level keep the heavier bandwidth load required by an operation layer inside a compute cluster and transmit the lighter bandwidth load over the on-chip interconnect, thereby optimizing the energy consumption of local memory accesses.
The invention improves the calculation energy efficiency ratio in the field of artificial intelligence reasoning, and is particularly suitable for application scenarios with high performance reasoning requirements, such as data centers, unmanned driving and the like.
The system on chip of the invention can be applied to various electronic devices, such as mobile devices, embedded electronic devices, intelligent computing processing devices, robots and the like, and can be applied to the fields of word processing, voice recognition and processing, multi-national language translation, image recognition, biological feature recognition, intelligent control and the like.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A system on a chip for a neural network, comprising a plurality of compute clusters, a forward data forwarding path, a backward data sharing path, and a task allocation unit, wherein:
the plurality of computing clusters are used for realizing multiplication operation of an input neuron matrix and a weight matrix in a neural network, wherein each computing cluster comprises a local on-chip memory and a corresponding off-chip memory;
the forward data forwarding path is to forward input neuron data between the plurality of compute clusters;
the backward data sharing path is used for transmitting weight data or calculation results among the plurality of calculation clusters;
the task allocation unit is used for determining a task allocation strategy of each computing cluster according to the input neuron matrix scale to be computed, so as to allocate input neuron data to be subjected to matrix multiplication operation for each computing cluster.
2. The system on a chip of claim 1, wherein the compute cluster comprises a data flow control module, a data cache module, a multiply-accumulate module, a data transfer module, and an on-chip memory, wherein:
the data caching module is used for storing neuron data, weight data or calculation result data;
the multiplication and accumulation module is used for realizing multiplication operation of the input neuron matrix and the corresponding weight matrix;
the data flow control module is used for controlling the loading of data to the data cache module, the multiply-accumulate module, the data transmission module and the on-chip memory;
the data transfer module is used for forwarding the neuron data to other computing clusters.
3. The system on chip according to claim 1 or 2, wherein the backward data sharing path is formed by a plurality of repeaters connected in sequence, wherein each repeater corresponds to a computation cluster for transmitting the weight data or computation results received from other repeaters to the corresponding computation cluster.
4. The system on chip according to claim 1 or 2, wherein the task allocation unit is further configured to determine a storage policy of the weight matrix in the local on-chip memories of the plurality of computing clusters and the corresponding off-chip memories according to at least one of a size of the weight matrix or a computing capability of the plurality of computing clusters.
5. The system-on-chip of claim 4, wherein, when the input neuron matrix is B × N × K, the weight matrix is K × M, there are b computation clusters, each computation cluster has a computation capability of k × m, and N, M, K, k, m, b are any positive integers:
the task allocation strategy is to allocate ⌈B/b⌉ input neuron data matrices to each computation cluster in parallel.
6. The system on chip as claimed in claim 5, wherein the weight matrix is stored in the off-chip memory corresponding to each compute cluster, or stored in the off-chip memory corresponding to one compute cluster, or divided into a plurality of sub-matrices and stored in the off-chip memory corresponding to each compute cluster.
7. The system on a chip of claim 6, wherein when performing a matrix multiplication operation with the weight matrix stored in an off-chip memory corresponding to a compute cluster, the compute cluster loads the weight matrix from its corresponding off-chip memory to a local on-chip memory and passes the weight matrix to the remaining compute clusters via the backward data sharing path.
8. The system on a chip of claim 6, wherein when performing matrix multiplication with the weight matrix divided into a plurality of sub-matrices stored in the off-chip memory corresponding to each compute cluster, each compute cluster loads the weight matrix from its corresponding off-chip memory to a local on-chip memory and transfers the weight matrix to the remaining compute clusters via the backward data sharing path.
9. The system-on-chip of claim 4, wherein, when the input neuron matrix is B × N × K, the weight matrix is K × M, there are b computation clusters, each computation cluster has a computation capability of k × m, N, M, K, k, m, b are any positive integers, and M ≥ b × m:
the task allocation strategy is to allocate ⌈B/b⌉ input neuron matrices to each computation cluster in parallel;
the storage strategy of the weight matrix is to divide the weight matrix into a plurality of sub-matrices according to the computing capabilities of the computing clusters and distribute the sub-matrices in off-chip memories corresponding to the computing clusters.
10. The system on chip of claim 1 or 2, wherein the forward data forwarding path sequentially concatenates the plurality of computational clusters in a first direction to form a loop that passes input neuron data, and the backward data sharing path sequentially concatenates the plurality of computational clusters in a second direction to form a loop that passes weight data or computation results.
11. An electronic device comprising the system-on-chip of any of claims 1 to 10.
CN201811383562.3A 2018-11-20 2018-11-20 System-on-Chip for Neural Networks Active CN111199275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811383562.3A CN111199275B (en) 2018-11-20 2018-11-20 System-on-Chip for Neural Networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811383562.3A CN111199275B (en) 2018-11-20 2018-11-20 System-on-Chip for Neural Networks

Publications (2)

Publication Number Publication Date
CN111199275A true CN111199275A (en) 2020-05-26
CN111199275B CN111199275B (en) 2023-04-28

Family

ID=70744235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811383562.3A Active CN111199275B (en) 2018-11-20 2018-11-20 System-on-Chip for Neural Networks

Country Status (1)

Country Link
CN (1) CN111199275B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010845A (en) * 2021-03-22 2021-06-22 上海寒武纪信息科技有限公司 Computing device and method for executing matrix multiplication and related products
CN113434813A (en) * 2021-06-26 2021-09-24 上海寒武纪信息科技有限公司 Matrix multiplication method based on neural network and related device
CN113742266A (en) * 2021-09-10 2021-12-03 中科寒武纪科技股份有限公司 Integrated circuit device, electronic equipment, board card and calculation method
CN113791996A (en) * 2021-09-10 2021-12-14 中科寒武纪科技股份有限公司 Integrated circuit device, electronic equipment, board card and calculation method
CN113900917A (en) * 2021-09-30 2022-01-07 上海商汤智能科技有限公司 A performance determination method, device, computer equipment and storage medium
CN114064561A (en) * 2021-11-17 2022-02-18 北京灵汐科技有限公司 Data processing method, device, chip and medium
CN114282659A (en) * 2020-09-28 2022-04-05 中科寒武纪科技股份有限公司 Device, board card and method for calculating neural network and readable storage medium
CN114648087A (en) * 2020-12-17 2022-06-21 北京灵汐科技有限公司 Neural network computing method, electronic device and computer readable medium
CN114721599A (en) * 2022-04-29 2022-07-08 北京灵汐科技有限公司 Weight data storage method and device, chip, electronic equipment and readable medium
CN114792128A (en) * 2022-04-29 2022-07-26 北京灵汐科技有限公司 Method for weighted data transmission, many-core system, electronic device, medium
CN114861895A (en) * 2022-04-29 2022-08-05 北京灵汐科技有限公司 Neural network neuron information storage method and device, many-core system, medium
WO2022174733A1 (en) * 2021-02-19 2022-08-25 山东英信计算机技术有限公司 Neuron accelerated processing method and apparatus, and device and readable storage medium
WO2023208027A1 (en) * 2022-04-29 2023-11-02 北京灵汐科技有限公司 Information processing method and information processing unit, and device, medium and product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107003989A (en) * 2014-12-19 2017-08-01 英特尔公司 Method and apparatus for distributed and collaborative computing in artificial neural networks
CN107818367A (en) * 2017-10-30 2018-03-20 中国科学院计算技术研究所 Processing system and processing method for neutral net
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array
WO2018107383A1 (en) * 2016-12-14 2018-06-21 上海寒武纪信息科技有限公司 Neural network convolution computation method and device, and computer-readable storage medium
EP3373210A1 (en) * 2017-03-09 2018-09-12 Google LLC Transposing neural network matrices in hardware

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107003989A (en) * 2014-12-19 2017-08-01 英特尔公司 Method and apparatus for distributed and collaborative computing in artificial neural networks
WO2018107383A1 (en) * 2016-12-14 2018-06-21 上海寒武纪信息科技有限公司 Neural network convolution computation method and device, and computer-readable storage medium
EP3373210A1 (en) * 2017-03-09 2018-09-12 Google LLC Transposing neural network matrices in hardware
CN107818367A (en) * 2017-10-30 2018-03-20 中国科学院计算技术研究所 Processing system and processing method for neutral net
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭文生; 李国和: "Research on the design of artificial neural networks on parallel computer clusters" *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114282659A (en) * 2020-09-28 2022-04-05 中科寒武纪科技股份有限公司 Device, board card and method for calculating neural network and readable storage medium
CN114648087A (en) * 2020-12-17 2022-06-21 北京灵汐科技有限公司 Neural network computing method, electronic device and computer readable medium
WO2022174733A1 (en) * 2021-02-19 2022-08-25 山东英信计算机技术有限公司 Neuron accelerated processing method and apparatus, and device and readable storage medium
CN113010845A (en) * 2021-03-22 2021-06-22 上海寒武纪信息科技有限公司 Computing device and method for executing matrix multiplication and related products
CN113434813A (en) * 2021-06-26 2021-09-24 上海寒武纪信息科技有限公司 Matrix multiplication method based on neural network and related device
CN113434813B (en) * 2021-06-26 2024-05-14 上海寒武纪信息科技有限公司 Matrix multiplication operation method based on neural network and related device
CN113742266B (en) * 2021-09-10 2024-02-06 中科寒武纪科技股份有限公司 Integrated circuit device, electronic apparatus, board and computing method
CN113742266A (en) * 2021-09-10 2021-12-03 中科寒武纪科技股份有限公司 Integrated circuit device, electronic equipment, board card and calculation method
CN113791996A (en) * 2021-09-10 2021-12-14 中科寒武纪科技股份有限公司 Integrated circuit device, electronic equipment, board card and calculation method
CN113791996B (en) * 2021-09-10 2024-02-06 中科寒武纪科技股份有限公司 Integrated circuit device, electronic apparatus, board and computing method
CN113900917A (en) * 2021-09-30 2022-01-07 上海商汤智能科技有限公司 A performance determination method, device, computer equipment and storage medium
CN114064561A (en) * 2021-11-17 2022-02-18 北京灵汐科技有限公司 Data processing method, device, chip and medium
CN114721599A (en) * 2022-04-29 2022-07-08 北京灵汐科技有限公司 Weight data storage method and device, chip, electronic equipment and readable medium
WO2023208027A1 (en) * 2022-04-29 2023-11-02 北京灵汐科技有限公司 Information processing method and information processing unit, and device, medium and product
CN114861895A (en) * 2022-04-29 2022-08-05 北京灵汐科技有限公司 Neural network neuron information storage method and device, many-core system, medium
CN114792128A (en) * 2022-04-29 2022-07-26 北京灵汐科技有限公司 Method for weighted data transmission, many-core system, electronic device, medium
CN114721599B (en) * 2022-04-29 2025-12-02 北京灵汐科技有限公司 Weighted data storage methods and devices, chips, electronic devices, and readable media

Also Published As

Publication number Publication date
CN111199275B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN111199275A (en) System-on-Chip for Neural Networks
CN109190756B (en) Arithmetic device based on Winograd convolution and neural network processor comprising same
US11354570B2 (en) Machine learning network implemented by statically scheduled instructions, with MLA chip
CN110325963B (en) Multipurpose unit for programmable hardware nodes for neural network processing
US11609792B2 (en) Maximizing resource utilization of neural network computing system
CN111630505B (en) Deep learning accelerator system and method thereof
CN113986816B (en) Reconfigurable computing chip
CN110347626B (en) server system
JP2019204492A (en) Neuromorphic accelerator multitasking
CN116644804B (en) Distributed training system, neural network model training method, device and medium
US20230334374A1 (en) Allocating computations of a machine learning network in a machine learning accelerator
CN114580606B (en) Data processing method, device, computer equipment and storage medium
US20250111217A1 (en) Data layout conscious processing in memory architecture for executing neural network model
US12333351B2 (en) Synchronization of processing elements that execute statically scheduled instructions in a machine learning accelerator
Min et al. NeuralHMC: An efficient HMC-based accelerator for deep neural networks
CN119939097A (en) A matrix multiplication and addition operation implementation method, device and medium based on RISC-V
EP4202774A1 (en) Runtime predictors for neural network computation reduction
CN112766475B (en) Processing component and artificial intelligence processor
CN117114055A (en) FPGA binary neural network acceleration method for industrial application scene
CN114358269A (en) Neural network processing component and multi-neural network processing method
US20250307345A1 (en) Integrated Heterogeneous Processing Cores for Unified Independent Computation Execution
CN114691589B (en) A processing device and related products
CN119883634A (en) Task processing method and device, electronic equipment and storage medium
CN116561046A (en) Mapping method, data processing method and many-core system
CN116894462A (en) A data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20200526

Assignee: Suzhou Heyu Finance Leasing Co.,Ltd.

Assignor: Shanghai Denglin Technology Co.,Ltd.

Contract record no.: X2024980007796

Denomination of invention: On chip systems for neural networks

Granted publication date: 20230428

License type: Exclusive License

Record date: 20240625

EE01 Entry into force of recordation of patent licensing contract
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: On chip systems for neural networks

Granted publication date: 20230428

Pledgee: Suzhou Heyu Finance Leasing Co.,Ltd.

Pledgor: Shanghai Denglin Technology Co.,Ltd.

Registration number: Y2024980025096

PE01 Entry into force of the registration of the contract for pledge of patent right
CP03 Change of name, title or address

Address after: Room 1101, Building 5, South Bank New Land Phase I, No. 11 Yangfu Road, Suzhou Industrial Park, Suzhou Area, China (Jiangsu) Pilot Free Trade Zone, Suzhou City, Jiangsu Province 215101

Patentee after: Suzhou Denglin Technology Co.,Ltd.

Country or region after: China

Address before: Room 901, 570 shengxia Road, Pudong New Area, Shanghai 201203

Patentee before: Shanghai Denglin Technology Co.,Ltd.

Country or region before: China

CP03 Change of name, title or address