Disclosure of Invention
The present invention is directed to overcoming the above-mentioned drawbacks of the prior art by providing a system on chip for a neural network, which improves the computational energy efficiency ratio by improving the processor architecture and the resource scheduling of the system on chip.
According to a first aspect of the invention, a system on chip for a neural network is provided. The system on chip comprises a plurality of computing clusters, a forward data forwarding path, a backward data sharing path and a task allocation unit, wherein:
the plurality of computing clusters are used for realizing multiplication operation of an input neuron matrix and a weight matrix in a neural network, wherein each computing cluster comprises a local on-chip memory and a corresponding off-chip memory;
the forward data forwarding path is used for forwarding input neuron data among the plurality of computing clusters;
the backward data sharing path is used for transmitting weight data or calculation results among the plurality of computing clusters;
the task allocation unit is used for determining a task allocation strategy for each computing cluster according to the scale of the input neuron matrix to be computed, so as to allocate to each computing cluster the input neuron data on which matrix multiplication is to be performed.
In one embodiment, the computing cluster includes a data flow control module, a data cache module, a multiply-accumulate module, a data transfer module, and an on-chip memory, wherein:
the data cache module is used for storing neuron data, weight data, or calculation result data;
the multiply-accumulate module is used for realizing the multiplication operation of the input neuron matrix and the corresponding weight matrix;
the data flow control module is used for controlling the loading of data to the data cache module, the multiply-accumulate module, the data transfer module, and the on-chip memory;
the data transfer module is used for forwarding the neuron data to other computing clusters.
In one embodiment, the backward data sharing path is formed by a plurality of repeaters connected in sequence, wherein each repeater corresponds to a computation cluster and is used for transmitting the weight data or computation results received from other repeaters to the corresponding computation cluster.
In one embodiment, the task allocation unit is further configured to determine a storage policy of the weight matrix in the local on-chip memories of the plurality of computing clusters and the corresponding off-chip memories according to at least one of a size of the weight matrix or a computing capability of the plurality of computing clusters.
In one embodiment, in the case that the input neuron matrix is B × N × K, the weight matrix is K × M, there are B computing clusters, each computing cluster has a computing capability of k × m, and N, K, M, k, m, B are any positive integers:
the task allocation strategy is to allocate, in parallel, one N × K matrix of input neuron data to each computing cluster.
In an embodiment, when M ≤ m, a copy of the weight matrix is stored in the off-chip memory corresponding to each computing cluster, or the weight matrix is stored in the off-chip memory corresponding to one computing cluster, or the weight matrix is divided into a plurality of sub-matrices which are stored in the off-chip memories corresponding to the respective computing clusters.
In one embodiment, when the weight matrix is stored in the off-chip memory corresponding to one computing cluster, that computing cluster, when performing a matrix multiplication operation, loads the weight matrix from its corresponding off-chip memory to its local on-chip memory and transfers the weight matrix to the remaining computing clusters via the backward data sharing path.
In an embodiment, when the weight matrix is divided equally into a plurality of sub-matrices stored in the off-chip memories corresponding to the respective computing clusters, each computing cluster, when performing a matrix multiplication operation, loads its sub-matrix from its corresponding off-chip memory to its local on-chip memory and transfers it to the other computing clusters via the backward data sharing path.
In one embodiment, where the input neuron matrix is B × N × K, the weight matrix is K × M, there are B computing clusters, each computing cluster has a computing capability of k × m, N, K, M, k, m, B are any positive integers, and M ≥ B × m:
the task allocation strategy is to allocate, in parallel, one N × K input neuron matrix to each computing cluster;
the storage strategy of the weight matrix is to divide the weight matrix into a plurality of sub-matrices according to the computing capabilities of the computing clusters and distribute the sub-matrices in off-chip memories corresponding to the computing clusters.
In one embodiment, the forward data forwarding path sequentially connects the plurality of computational clusters in series in a first direction to form a loop that passes input neuron data, and the backward data sharing path sequentially connects the plurality of computational clusters in series in a second direction to form a loop that passes weight data or computation results.
According to a second aspect of the invention, an electronic device is provided. The electronic device comprises the system on chip of the invention.
Compared with the prior art, the invention has the following advantages: for the operation characteristics of different layers in neural network inference applications, a unified, coordinated multi-computing-cluster system-on-chip architecture is provided, which solves the problem of low computational efficiency of a single operation unit in the shallow layers of inference applications; data sharing among multiple computing clusters is realized by designing dedicated data forwarding paths and an on-chip network; and, according to the scale of the input neuron matrix or the weight matrix, the heavier bandwidth load required by the computation is kept within the computing cluster while the lighter bandwidth load is transmitted through the data forwarding paths, thereby optimizing the energy consumption of local memory access.
Detailed Description
In order to make the objects, technical solutions, design methods, and advantages of the present invention more apparent, the present invention will be further described in detail by specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not as a limitation. Thus, other examples of the exemplary embodiments may have different values.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In the description herein, input neuron data is node data in a neural network model, weights refer to coefficients connecting two nodes, which can be obtained by training, and data generally refers to each type of data such as input neuron data, weight data, and calculation results unless otherwise indicated by context.
According to an embodiment of the present invention, a system on chip for neural network processing is provided. Referring to Fig. 1, the system includes a plurality of computing clusters (or processor clusters), shown as computing cluster 101, computing cluster 102, computing cluster 103, and computing cluster 104, a forward data forwarding path 120, a backward data sharing path 130, and a task allocation unit 140.
The computing clusters 101-104 are used for performing the matrix multiplication function and may each be formed by one or more processing units, for example, including only a matrix multiplication processing unit, or including a matrix multiplication processing unit and other types of units. The computing clusters may have the same or different circuit structures, may be implemented by various types of circuits such as ASICs or DSPs, and may have the same or different computing capabilities. Furthermore, each computing cluster has its own on-chip memory (also referred to herein as local on-chip memory), which may be, for example, SRAM or another type, and its own off-chip memory, which may be, for example, DDR chips or another type; the present invention is not limited in this respect.
The forward data forwarding path 120 forms a ring path for forwarding input neuron data between a plurality of computing clusters, each computing cluster being capable of forwarding neuron data read from the outside (e.g., off-chip memory) or received neuron data forwarded by other computing clusters to other computing clusters connected thereto in turn via the forward data forwarding path 120, so that neuron data can circulate between the plurality of computing clusters.
The backward data sharing path 130 forms a ring path for passing the weight data or the calculation result of the matrix multiplication among the plurality of calculation clusters, and each calculation cluster can forward the weight data read from the outside (such as an off-chip memory) or the received weight data from other calculation clusters to other calculation clusters connected thereto in turn via the backward data sharing path 130, so that the weight data circularly flow among the plurality of calculation clusters. In this way, each compute cluster is able to access both on-chip memory and off-chip memory.
In one embodiment, the on-chip memory resources used for data sharing by the computing clusters may be uniformly addressed, and the off-chip memory resources may also be uniformly addressed, the address bits including a portion for distinguishing off-chip memory from on-chip memory, a portion for identifying the selected computing cluster, and a portion for identifying a particular off-chip or on-chip memory address. For example, the unified address is 35 bits wide, where the highest bit, bit 34, identifies whether the address is on-chip or off-chip, bits 33 and 32 select a computing cluster (for the case of 4 computing clusters, 2 bits suffice to select any cluster), and the lower 32 bits, bit 31 to bit 0, can address a 4 GB space. See Table 1.
Table 1: Bit identification
bit 34: on-chip/off-chip indicator; bits 33-32: computing cluster selector; bits 31-0: address within the selected 4 GB space
As can be seen from Table 1, in this way each computing cluster has access to both a 4 GB on-chip memory space and a 4 GB off-chip memory space.
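For illustration only, the following minimal Python sketch shows how such a 35-bit unified address could be encoded and decoded under the bit layout described above; the function names and constants are hypothetical and not part of the invention.

```python
# Illustrative encoding/decoding of the 35-bit unified address described above
# (bit 34: on-chip vs. off-chip, bits 33-32: cluster select, bits 31-0: offset).
ON_CHIP = 1
OFF_CHIP = 0

def encode_address(space: int, cluster: int, offset: int) -> int:
    assert space in (ON_CHIP, OFF_CHIP)
    assert 0 <= cluster < 4          # 2 bits select one of 4 computing clusters
    assert 0 <= offset < (1 << 32)   # 32-bit offset -> 4 GB per space
    return (space << 34) | (cluster << 32) | offset

def decode_address(addr: int):
    return (addr >> 34) & 0x1, (addr >> 32) & 0x3, addr & 0xFFFFFFFF

addr = encode_address(ON_CHIP, cluster=2, offset=0x1000)
print(hex(addr), decode_address(addr))   # -> space=1, cluster=2, offset=0x1000
```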
The task allocation unit 140 is configured to determine the task allocation policies and the on-chip/off-chip storage policies of the multiple computing clusters according to the requirements of the tasks to be computed and the computing capability of each computing cluster. The task allocation unit 140 may be implemented as a software module of the system on chip. Further details regarding the task allocation policies and the on-chip/off-chip storage policies are provided below.
Fig. 2 is a block diagram of a computing cluster according to an embodiment of the present invention, where the computing cluster includes a data flow control module 210, a data cache module 220, a multiply-accumulate module 230, a data transfer module 240, an on-chip memory 250, and a repeater 260.
The data flow control module 210 is communicatively connected with the data cache module 220, the multiply-accumulate module 230, the data transfer module 240, and the repeater 260 (the connection to the repeater 260 is not shown). It receives the task allocation policy and the on-chip/off-chip storage policy from the task allocation unit and, according to these policies and the state of task execution, controls how data (including neuron data, weights, matrix multiplication results, and the like) are transferred between the modules within the computing cluster and exchanged with the outside of the computing cluster. This includes, but is not limited to, controlling the data cache module 220 to fetch data from the on-chip memory 250, controlling the repeater 260 to receive weight data from the repeaters of other computing clusters, controlling the loading of data from outside the computing cluster into the data cache module 220 and then passing it to the multiply-accumulate module 230, or, after the multiply-accumulate module 230 has performed a matrix multiplication operation, controlling the transfer of neuron data to the data transfer module 240 and then controlling the data transfer module 240 to pass the neuron data to a subsequent computing cluster, and so on.
The data cache module 220 is used for caching various types of data, including but not limited to weight data of the matrix multiplication operation to be performed, input neuron data, and calculation results of the multiply-accumulate module 230.
The multiply-accumulate module 230 is used to perform multiplication operations of the weight matrix and the input neuron matrix, and may include one or more matrix multiply processing units to quickly process matrix multiplication operations of different sizes.
The data transfer module 240 is configured to form part of the forward data forwarding path; it is communicatively connected to the data flow control module 210 inside the computing cluster and is also communicatively connected to other computing clusters to pass data to them, for example to the data cache modules of other computing clusters.
The on-chip memory 250, i.e. the local on-chip memory of the computational cluster, is used to store various types of data, such as neuron data or weight data.
The repeater 260 is configured to form a backward data sharing path, and may load weight data from an external memory, receive weight data from other computing clusters, or forward the weight data to other computing clusters (for example, by interacting with repeaters of other computing clusters), or receive matrix multiplication results from other computing clusters, store the matrix multiplication results in the on-chip memory 250, or forward the matrix multiplication results to other computing clusters.
For clarity of illustration, the connection relationship between the data flow control module 210 and the on-chip memory 250 and the repeater 260 is not shown in fig. 2. Such a connection relationship is understandable to those skilled in the art in order to realize the functions of the present invention. Moreover, it will be apparent to those skilled in the art that the components and the connection relationship between the components shown in fig. 2 are not limited, and that some components may be added or deleted and the connection relationship between the components may be changed according to the needs and purposes of the system.
In conjunction with the compute cluster architecture of fig. 2, when the system on chip includes multiple compute clusters, a forward data forwarding path and a backward data sharing path may be formed under the control of the data flow control module 210.
For example, the forward data forwarding path is composed of the data cache module 220, the multiply-accumulate module 230, the data transfer module 240, and the corresponding modules in other computing clusters; that is, the neuron data can be forwarded in sequence through the data cache module 220, the multiply-accumulate module 230, and the data transfer module 240, and then through the data cache modules, multiply-accumulate modules, and data transfer modules of other computing clusters. As another example, the forward data forwarding path is formed by the data cache module 220, the data transfer module 240, and the data cache modules, multiply-accumulate modules, and data transfer modules of other computing clusters; in this case some neuron data may be forwarded directly to other computing clusters without participating in the local multiply-accumulate operation.
For example, the backward data sharing path includes the repeater 260 and repeaters in other computation clusters, i.e., the computation result of the weight data or matrix multiplication is transmitted by the repeater 260 to the repeater of the computation cluster connected thereto.
It should be understood that the connection relationship between the modules in fig. 2 is only used for illustration, and in practical applications, those skilled in the art may make appropriate modifications, for example, the calculation result of the multiply-accumulate module 230 may also be temporarily stored in the data buffer module 220 and forwarded to other calculation clusters as appropriate.
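For illustration only, the following behavioral Python sketch (not a hardware implementation) models how four computing clusters might be wired into a forward data forwarding ring and a backward data sharing ring running in opposite directions, as described above; the class and method names are hypothetical.

```python
# A behavioral sketch of the two ring paths among four computing clusters.
class ComputeCluster:
    def __init__(self, cid):
        self.cid = cid
        self.data_cache = []     # data cache module
        self.on_chip = {}        # local on-chip memory
        self.next_fwd = None     # next hop on the forward data forwarding path
        self.next_bwd = None     # next hop on the backward data sharing path

    def forward_neurons(self, neurons):
        """Pass neuron data to the next cluster on the forward ring."""
        self.next_fwd.data_cache.append(neurons)

    def share_weights(self, key, weights):
        """Pass weight data (or results) to the next cluster on the backward ring."""
        self.next_bwd.on_chip[key] = weights

clusters = [ComputeCluster(i) for i in range(4)]
for i, c in enumerate(clusters):
    c.next_fwd = clusters[(i + 1) % 4]   # first direction, e.g. 0 -> 1 -> 2 -> 3 -> 0
    c.next_bwd = clusters[(i - 1) % 4]   # second (opposite) direction, e.g. 0 -> 3 -> 2 -> 1 -> 0

clusters[0].forward_neurons("neuron tile A")
clusters[0].share_weights("w0", "weight tile 0")
print(clusters[1].data_cache, clusters[3].on_chip)
```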
Fig. 3 shows a system on chip according to another embodiment of the present invention, which is similar to the system shown in Fig. 1 and also includes computing clusters 301-304, a forward data forwarding path 320, a backward data sharing path 430, and a task allocation unit (not shown); compared with Fig. 1, however, the repeaters constituting the backward data sharing path 430 are arranged outside the computing clusters. The backward data sharing path 430 is formed by connecting a plurality of repeaters, denoted repeater 401, repeater 402, repeater 403, and repeater 404, which correspond one-to-one to the computing clusters and can exchange data with their corresponding computing clusters.
The task allocation policy and the on-chip and off-chip storage policy and the corresponding data processing procedure will be described below with reference to fig. 3.
In one embodiment, the tasks to be executed by each computing cluster and the on-chip/off-chip storage strategies used during computation are determined according to the scale of the matrices to be computed (the scale of the input neuron matrix and/or of the weight matrix) or the computing capability of each computing cluster, so that, by selecting different schemes, computing resources are utilized efficiently and data movement is minimized as far as possible.
For example, assume the input neuron data has a size of B × N × K, representing B input neuron data matrices of size N × K (N is the row dimension and K is the column dimension), and the weight matrices each have a size of K × M (K is the row dimension and M is the column dimension). For convenience of description, all B computing clusters have the same computing capability, k × m (that is, in one matrix multiplication the weight matrix processed has row dimension k and column dimension m), where N, K, M, k, m, and B are any positive integers. The task allocation strategy and the on-chip/off-chip storage strategy are introduced below for the two cases of a small weight scale and a large weight scale.
1) The case of smaller weight scale
For example, M ≤ m. Such a calculation typically exists in the shallower layers of image recognition applications, where M is generally small and K is also small, so the scale K × M of the weight matrix is also small.
In this case, the task allocation strategy is to allocate, in an input-neuron-matrix-parallel manner, one N × K input neuron matrix to be computed to each computing cluster.
In one embodiment, the storage policy for the weight matrix is: a copy of the weight matrix is stored in the off-chip memory corresponding to each computing cluster, and each computing cluster loads the weight matrix from its off-chip memory to its local on-chip memory when matrix multiplication is executed. In this way, all input neuron matrices and the weight matrix are processed locally within the computing clusters during inference, no data communication is needed among the computing clusters, and the backward data sharing path is not used, which reduces memory access latency and memory access power consumption.
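For illustration only, the following minimal numpy sketch shows the replicated-weights policy just described: each cluster holds a full copy of the weight matrix and multiplies only its own neuron matrix, so no inter-cluster communication is required; all shapes are made-up examples.

```python
import numpy as np

# Replicated-weights policy (M <= m): every cluster holds a full copy of the
# K x M weight matrix and multiplies only its own N x K neuron matrix locally.
B, N, K, M = 4, 8, 16, 12
rng = np.random.default_rng(0)
neurons = rng.standard_normal((B, N, K))   # one N x K neuron matrix per cluster
weights = rng.standard_normal((K, M))      # replicated in every cluster's memory

local_results = [neurons[b] @ weights for b in range(B)]   # purely local work
assert np.allclose(np.stack(local_results), neurons @ weights)
```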
In another embodiment, the storage strategy of the weight matrix is to divide it equally among the computing clusters; when matrix multiplication is performed, the multiply-accumulate module in each computing cluster loads its own portion of the weight matrix from the local on-chip memory and obtains the weight portions of the other computing clusters through the backward data sharing path.
For ease of understanding, Table 2 below illustrates the behavior of the computing clusters at different times, and Table 3 illustrates the behavior of the repeaters at different times, in connection with the system on chip shown in Fig. 3. Take as an example the case where each computing cluster is allocated one input neuron matrix in parallel (denoted neuron matrices 1 to 4), and the weight matrix is equally divided into four sub-weight matrices, labeled weight portions 1-4, respectively assigned to the computing clusters 301-304. At time T0, computing cluster 301 performs the matrix multiplication of neuron matrix 1 and weight portion 1, while the repeater 401 corresponding to this computing cluster reads weight portion 2 from computing cluster 302 (through repeater 402); at time T1, computing cluster 301 performs the matrix multiplication of neuron matrix 1 and weight portion 2. The other computing clusters and corresponding repeaters behave similarly; see Tables 2 and 3.
Table 2: computing cluster behavior at different times
Table 3: repeater behavior at different times
As can be seen from Tables 2 and 3, while each computing cluster performs matrix multiplication, the corresponding repeater can read, via the backward data sharing path, the weight data to be used at a subsequent time from other computing clusters, so that the weight data flows among the repeaters and is loaded by a computing cluster when needed, for example into the data cache module or the on-chip memory. In this way, the flow of the weight data among the computing clusters can be controlled, improving the resource utilization of the computing clusters and the efficiency of the matrix multiplication.
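For illustration only, the following numpy sketch models the schedule suggested by Tables 2 and 3, under the assumption that the weight matrix is split into B equal column blocks that circulate over the backward data sharing path, one hop per time step, while each cluster keeps its own neuron matrix; the spliced result is checked against a direct matrix multiplication, and all shapes are made-up examples.

```python
import numpy as np

# Small-weight case: weight portions (column blocks of the K x M weight matrix)
# circulate among clusters; each cluster keeps its own neuron matrix.
B, N, K, M = 4, 8, 16, 12                 # M divisible by B in this example
rng = np.random.default_rng(1)
neurons = rng.standard_normal((B, N, K))  # one neuron matrix per cluster
weights = rng.standard_normal((K, M))
portions = np.split(weights, B, axis=1)   # weight portions 1..B

partial = [[None] * B for _ in range(B)]
for t in range(B):                        # time steps T0, T1, ...
    for b in range(B):
        p = (b + t) % B                   # portion held by cluster b at step t
        partial[b][p] = neurons[b] @ portions[p]
    # meanwhile each repeater fetches the portion needed at the next step
    # from its neighbor on the backward data sharing path

results = [np.hstack(partial[b]) for b in range(B)]   # splice the column blocks
assert np.allclose(np.stack(results), neurons @ weights)
```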
2) Case of large weight scale
If M ≥ B × m, the weight matrix is large and the neuron matrices are relatively small, so a computing cluster cannot complete the multiplication of an input neuron matrix and the weight matrix in one operation. In this case, each computing cluster may still be allocated one input neuron matrix to be computed in parallel, while the weight matrix is divided into several smaller sub-matrices distributed across different computing clusters; for example, computing cluster 301 is assigned the sub-matrix K × [0, m-1], computing cluster 302 is assigned the sub-matrix K × [m, 2m-1], and so on. In this way, when matrix multiplication operations are performed, the heavier communication traffic, namely the large-scale weights, remains local, while the neuron data can be read once from memory (e.g., SDRAM) and propagated to the other computing clusters of the system on chip by means of the forward data forwarding paths in the computing clusters. Only the results of the matrix multiplication, or intermediate calculation results, are written back to the on-chip shared memory through the backward data sharing path; all other memory accesses occur inside the computing cluster.
Still referring to Fig. 3, Tables 4 and 5 below illustrate the behavior of the computing clusters and the behavior of the repeaters, respectively, at different times. Again take as an example the case where each computing cluster is allocated one input neuron matrix in parallel (denoted neuron matrices 1 to 4), and one weight matrix is divided into four sub-weight matrices, marked as weight portions 1-4 and distributed to the computing clusters 301-304 respectively; when the multiplication of a neuron matrix and the whole weight matrix is performed, the operation results for the four sub-weight matrices need to be spliced together.
In this example, at time T0, computing cluster 301 performs the matrix multiplication of neuron matrix 1 and weight portion 1, and computing cluster 302 performs the matrix multiplication of neuron matrix 2 and weight portion 2; at time T1, computing cluster 301 performs the matrix multiplication of the neuron matrix forwarded to it over the forward data forwarding path and weight portion 1, and computing cluster 302 performs the matrix multiplication of neuron matrix 1 (forwarded from computing cluster 301) and weight portion 2; at time T2, the repeater 401 corresponding to computing cluster 301 reads the result of neuron matrix 1 and weight portion 2 from computing cluster 302 (through repeater 402); at time T3, repeater 401 reads the result of neuron matrix 1 and weight portion 3 from computing cluster 303 (through repeater 403); and so on. After a computing cluster has obtained the results for all sub-weight matrices of a weight matrix, the calculation result of the neuron matrix and the weight matrix can be obtained by splicing. The behavior of the other computing clusters and corresponding repeaters is similar; see Tables 4 and 5.
Table 4: processor behavior at different times
Table 5: repeater behavior at different times
As can be seen from Tables 4 and 5, while each computing cluster performs matrix multiplication, the corresponding repeater can read, via the backward data sharing path, the results of the matrix multiplications performed at earlier times from other computing clusters, so that the calculation results flow in sequence among the repeaters and each computing cluster can splice the partial results it has gathered.
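For illustration only, the following numpy sketch models the large-weight schedule suggested by Tables 4 and 5, under the assumption that each cluster permanently holds one K × m column block of the weight matrix while the neuron matrices rotate over the forward data forwarding path, after which the partial results are gathered over the backward path and spliced; the spliced result is checked against a direct matrix multiplication, and all shapes are made-up examples.

```python
import numpy as np

# Large-weight case: weight column blocks stay local, neuron matrices rotate,
# and the partial results for each neuron matrix are gathered and spliced.
B, N, K, m = 4, 8, 16, 5
M = B * m                                  # boundary case M >= B * m
rng = np.random.default_rng(2)
neurons = rng.standard_normal((B, N, K))
weights = rng.standard_normal((K, M))
blocks = np.split(weights, B, axis=1)      # K x [0,m-1], K x [m,2m-1], ...

partial = [[None] * B for _ in range(B)]   # partial[i][j] = neurons[i] @ blocks[j]
for t in range(B):                         # compute phase: neuron data rotates
    for c in range(B):                     # cluster c holds blocks[c] permanently
        i = (c - t) % B                    # neuron matrix visiting cluster c at step t
        partial[i][c] = neurons[i] @ blocks[c]

# gather phase: repeaters return partial results to the owning cluster,
# which splices them column-wise into the final N x M result
results = [np.hstack(partial[i]) for i in range(B)]
assert np.allclose(np.stack(results), neurons @ weights)
```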
It should be understood that the timing of the flow of the above weight data and calculation results between the computing clusters is not fixed; the transfer order of data between the modules may be controlled according to the scale of the data to be processed, the computing capability of the multiply-accumulate module, and the capacities of the data cache module and the on-chip memory. For example, a computing cluster does not necessarily forward its calculation result to the backward data sharing path only after the entire matrix allocated to it has been processed; it may also forward a partial result after processing part of the matrix. Further, although Tables 4 and 5 illustrate the processing of one weight matrix, the case of multiple weight matrices can be handled similarly by processing them in sequence.
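To summarize the two cases, the following minimal Python sketch illustrates one way the task allocation unit could choose between the storage policies described above, based only on the weight-matrix width M, the number of clusters B, and the per-cluster capability width m; the function name and the returned labels are hypothetical, and intermediate widths are not specified by the text above.

```python
def choose_policy(M: int, B: int, m: int) -> str:
    """Pick a weight-storage policy from the weight width M, the number of
    clusters B, and the per-cluster capability width m (illustrative only)."""
    if M <= m:
        # Case 1 (small weights): replicate the whole K x M weight matrix in
        # each cluster's off-chip memory, or split it into B portions that
        # circulate over the backward data sharing path.
        return "small weights: replicate, or circulate portions"
    if M >= B * m:
        # Case 2 (large weights): keep a K x m column block local to each
        # cluster and rotate the neuron matrices over the forward path.
        return "large weights: column blocks stay local, neurons rotate"
    return "intermediate width: mixed policy (not described above)"

print(choose_policy(M=64,   B=4, m=256))   # small-weight case
print(choose_policy(M=2048, B=4, m=256))   # large-weight case
```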
The system on chip of the invention provides a unified, coordinated multi-processor architecture for the operation characteristics of different layers in neural network inference applications. Each computing cluster has its own large memory that it can access efficiently, which addresses the low shallow-layer processing efficiency and high data-movement energy consumption of the first type of architecture, while also avoiding the limited-memory problem of the second type of architecture. In addition, by coordinating the scheduling of processing tasks and selecting different storage strategies, the invention can adapt to the operation characteristics of different layers of a neural network, thereby solving the unbalanced performance problem of the third type of architecture. In terms of software/hardware coordination, based on the task partitioning scheme and the non-uniform memory storage strategy applied at the software level, the invention keeps the heavier bandwidth load required by an operation layer within the computing cluster and transmits the lighter bandwidth load over the on-chip interconnection network, thereby optimizing the energy consumption of local memory access.
The invention improves the computational energy efficiency ratio in the field of artificial intelligence inference, and is particularly suitable for application scenarios with high-performance inference requirements, such as data centers and autonomous driving.
The system on chip of the invention can be applied to various electronic devices, such as mobile devices, embedded electronic devices, intelligent computing processing devices, and robots, and can be applied to fields such as word processing, speech recognition and processing, multilingual translation, image recognition, biometric recognition, and intelligent control.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.