
WO2019041251A1 - Chip device and related product - Google Patents

Chip device and related product

Info

Publication number
WO2019041251A1
WO2019041251A1 (PCT/CN2017/099991)
Authority
WO
WIPO (PCT)
Prior art keywords
data block
unit
basic
main unit
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2017/099991
Other languages
English (en)
Chinese (zh)
Inventor
刘少礼
陈天石
王秉睿
张尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to CN201811462676.7A priority Critical patent/CN109615061B/zh
Priority to EP19212002.0A priority patent/EP3651031A1/fr
Priority to EP19211995.6A priority patent/EP3651030A1/fr
Priority to KR1020197037895A priority patent/KR102481256B1/ko
Priority to EP19212365.1A priority patent/EP3654209A1/fr
Priority to CN201910530860.9A priority patent/CN110245751B/zh
Priority to CN202010628834.2A priority patent/CN111860815A/zh
Priority to JP2019553977A priority patent/JP7065877B2/ja
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to PCT/CN2017/099991 priority patent/WO2019041251A1/fr
Priority to CN201910534528.XA priority patent/CN110245752B/zh
Priority to KR1020197029020A priority patent/KR102467688B1/ko
Priority to CN201910534527.5A priority patent/CN110083390B/zh
Priority to CN201910534118.5A priority patent/CN110231958B/zh
Priority to CN201780002287.3A priority patent/CN109729734B8/zh
Priority to EP17923228.5A priority patent/EP3605402B1/fr
Priority to CN201910531031.2A priority patent/CN110222308B/zh
Priority to EP19212010.3A priority patent/EP3654208A1/fr
Priority to CN201910102972.4A priority patent/CN109902804B/zh
Priority to KR1020197037903A priority patent/KR102477404B1/ko
Priority to EP19212368.5A priority patent/EP3654210A1/fr
Priority to TW107125681A priority patent/TWI749249B/zh
Priority to US16/168,778 priority patent/US11409535B2/en
Publication of WO2019041251A1 publication Critical patent/WO2019041251A1/fr
Priority to US16/663,210 priority patent/US11354133B2/en
Priority to US16/663,206 priority patent/US11334363B2/en
Priority to US16/663,205 priority patent/US11347516B2/en
Priority to US16/663,164 priority patent/US11531553B2/en
Priority to US16/663,181 priority patent/US11561800B2/en
Priority to US16/663,174 priority patent/US11775311B2/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818Decoding for concurrent execution
    • G06F9/3822Parallel decoding, e.g. parallel decode units
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/02Preprocessing

Definitions

  • the present disclosure relates to the field of communication and chip technologies, and in particular to a chip device and related products.
  • An ANN (Artificial Neural Network), or neural network, is an operational model consisting of a large number of interconnected nodes (or neurons).
  • The calculation of existing neural networks is based on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit); such calculation has high power consumption and a long computation time.
  • The embodiments of the present disclosure provide a neural network operation method and related products, which can reduce computation time and the power consumption of the module.
  • In a first aspect, an embodiment of the present disclosure provides a method for computing a neural network, applied to a chip device that includes a main unit and a plurality of basic units. The method includes the following steps: the main unit acquires a data block to be calculated and an operation instruction, and divides the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; the main unit splits the distribution data block to obtain a plurality of basic data blocks, distributes the plurality of basic data blocks to the plurality of basic units, and broadcasts the broadcast data block to the plurality of basic units; each basic unit performs inner product operations on its basic data blocks and the broadcast data block to obtain an operation result and sends the operation result to the main unit; the main unit processes the operation results to obtain the instruction result of the operation instruction for the data block to be calculated.
  • In an optional implementation, the main unit broadcasting the broadcast data block to the multiple basic units includes: the main unit broadcasts the broadcast data block to the plurality of basic units in a single broadcast.
  • In an optional implementation, the basic unit performing an inner product operation on the basic data block and the broadcast data block and sending the operation result to the main unit includes: the basic unit performs inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulates the inner product processing result to obtain the operation result, and transmits the operation result to the main unit.
  • In an optional implementation, when the operation result is the result of the inner product processing, the main unit processing the operation result includes: the main unit accumulates the operation results to obtain an accumulation result, and arranges the accumulation result to obtain the instruction result of the operation instruction for the data block to be calculated.
  • In an optional implementation, the main unit broadcasting the broadcast data block to the multiple basic units includes: the main unit divides the broadcast data block into a plurality of partial broadcast data blocks and broadcasts the plurality of partial broadcast data blocks to the plurality of basic units over multiple broadcasts.
  • In an optional implementation, the basic unit performing an inner product operation on the basic data block and the broadcast data block includes: the basic unit performs inner product processing on a partial broadcast data block and the basic data block to obtain an inner product processing result, accumulates the inner product processing result to obtain a partial operation result, and sends the partial operation result to the main unit.
  • In an optional implementation, the basic unit performing an inner product operation on the basic data block and the broadcast data block includes: the basic unit multiplexes the partial broadcast data block n times, performing inner product operations between the partial broadcast data block and n basic data blocks to obtain n partial processing results; the n partial processing results are accumulated separately to obtain n partial operation results, which are transmitted to the main unit, where n is an integer greater than or equal to 2.
  • In a second aspect, a chip device is provided, including a main unit and a plurality of basic units. The main unit is configured to acquire a data block to be calculated and an operation instruction, divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction, split the distribution data block to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to the plurality of basic units, and broadcast the broadcast data block to the plurality of basic units. The basic units are configured to perform inner product operations on the basic data blocks and the broadcast data block to obtain operation results and transmit the operation results to the main unit. The main unit is further configured to process the operation results to obtain the instruction result of the operation instruction for the data block to be calculated.
  • In an optional implementation, the chip device further includes a branch unit disposed between the main unit and the basic units; the branch unit is configured to forward data.
  • In an optional implementation, the main unit is specifically configured to broadcast the broadcast data block to the plurality of basic units in a single broadcast.
  • In an optional implementation, the basic unit is specifically configured to perform inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulate the inner product processing result to obtain the operation result, and send the operation result to the main unit.
  • In an optional implementation, when the operation result is the result of the inner product processing, the main unit is configured to accumulate the operation results to obtain an accumulation result and arrange the accumulation result to obtain the instruction result of the operation instruction for the data block to be calculated.
  • In an optional implementation, the main unit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and broadcast the plurality of partial broadcast data blocks to the plurality of basic units over multiple broadcasts.
  • In an optional implementation, the basic unit is specifically configured to perform inner product processing on the partial broadcast data block and the basic data block to obtain an inner product processing result, accumulate the inner product processing result to obtain a partial operation result, and transmit the partial operation result to the main unit.
  • In an optional implementation, the basic unit is specifically configured to multiplex the partial broadcast data block to perform inner product operations between the partial broadcast data block and n basic data blocks, obtaining n partial processing results; after the n partial processing results are accumulated separately, n partial operation results are obtained and sent to the main unit, where n is an integer greater than or equal to 2.
  • In an optional implementation, the main unit includes a main register and/or a main on-chip buffer circuit; the basic unit includes a basic register and/or a basic on-chip buffer circuit.
  • In an optional implementation, the main unit includes one or any combination of a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, or a data rearrangement circuit; the basic unit includes one or any combination of an inner product operator circuit or an accumulator circuit.
  • In an optional implementation, there are a plurality of branch units; the main unit is separately connected to the plurality of branch units, and each branch unit is connected to at least one basic unit.
  • In an optional implementation, there are a plurality of branch units; the plurality of branch units are connected in series and then connected to the main unit, and each branch unit is connected to at least one basic unit.
  • In an optional implementation, the branch unit is specifically configured to forward data between the main unit and the basic units.
  • In an optional implementation, the branch unit is specifically configured to forward data between the main unit and the basic units or other branch units.
  • In an optional implementation, the data is one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, or an n-dimensional data block.
  • In an optional implementation, when the operation instruction is a multiplication instruction, the multiplier data block is determined to be the broadcast data block and the multiplicand data block to be the distribution data block; when the operation instruction is a convolution instruction, the input data block is determined to be the broadcast data block and the convolution kernel to be the distribution data block.
  • A method of applying the chip device provided by the second aspect is also provided; the chip device is used to perform one or any combination of a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a convolution operation, or a fully connected operation.
  • A chip is provided that integrates the chip device provided by the second aspect.
  • A smart device is provided, comprising the chip provided by the sixth aspect.
  • In the embodiments of the present disclosure, the data is divided into distribution data and broadcast data, and the distribution data is split into basic data blocks that are distributed to a plurality of basic units to perform inner product operations. The inner product operations, which account for the largest amount of computation, are thus distributed to a plurality of basic units for simultaneous execution, which reduces calculation time and saves power.
  • FIG. 1a is a schematic structural diagram of a chip device provided by the present disclosure.
  • FIG. 1b is a schematic structural diagram of another chip device provided by the present disclosure.
  • FIG. 1c is a schematic diagram of data distribution of the chip device provided by the present disclosure.
  • FIG. 1d is a schematic diagram of data back transmission of a chip device.
  • FIG. 2 is a schematic flow chart of a method for computing a neural network according to an embodiment of the present disclosure.
  • FIG. 2a is a schematic diagram of matrix A multiplied by matrix B according to an embodiment of the present disclosure.
  • FIG. 3 is a schematic flowchart of a method for computing a neural network according to an embodiment of the present disclosure.
  • FIG. 3a is a schematic diagram of single-sample data for full connection 1.
  • FIG. 3b is a schematic diagram of multi-sample data for full connection 2.
  • FIG. 3c is a schematic diagram of M convolution kernel data for convolution 1.
  • FIG. 3d is a schematic diagram of input data for convolution 2.
  • FIG. 3e is a schematic diagram of an operation window of a three-dimensional data block of input data.
  • FIG. 3f is a schematic diagram of another operation window of a three-dimensional data block of input data.
  • FIG. 3g is a schematic diagram of yet another operation window of a three-dimensional data block of input data.
  • References to "an embodiment" herein mean that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present disclosure. The appearances of this phrase in various places in the specification do not necessarily refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art will explicitly and implicitly understand that the embodiments described herein can be combined with other embodiments.
  • The CPU is taken as an example to illustrate the operation method of a neural network. Matrix-matrix multiplication is widely used in neural networks; consider the multiplication of a matrix A and a matrix B to obtain C = A * B.
  • When computing C, the CPU typically proceeds row by row: it first completes the calculation for the first row, then the second row, and finally the third row; that is, the CPU only starts the second row of results after the first row of data is fully calculated.
  • For a 3x3 example, to complete the first row the CPU must compute a11*b11 + a12*b21 + a13*b31, a11*b12 + a12*b22 + a13*b32, and a11*b13 + a12*b23 + a13*b33.
  • For a CPU or GPU, the computation proceeds line by line: after the first row is calculated, the second row is calculated, and then the third, until all rows are calculated.
  • In practical applications a matrix may have thousands of rows, so the calculation time is very long; during the calculation the CPU remains in a working state for a long time, and the energy consumption is correspondingly high.
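The row-by-row procedure described above can be sketched as follows (an illustrative model, not taken from the patent; the function name `matmul_row_by_row` is assumed). Each output row is finished before the next one starts, which is why thousands of rows translate into a long serial computation:

```python
def matmul_row_by_row(A, B):
    """Multiply A (M x L) by B (L x N) one output row at a time, as a CPU would."""
    M, L, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):            # each row of C is finished before the next starts
        for j in range(N):
            # inner product of row i of A with column j of B
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(L))
    return C
```

With thousands of rows, the outer loop runs thousands of serial inner product passes; the scheme described later in this disclosure distributes exactly these inner products across basic units.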
  • FIG. 1b is a schematic structural diagram of a chip device. As shown in FIG. 1b, the device includes a main unit circuit, basic unit circuits, and branch unit circuits.
  • The main unit may include a register and/or an on-chip buffer circuit, and may further include one or any combination of a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a matrix transposition circuit, a DMA (direct memory access) circuit, and a data rearrangement circuit.
  • Each basic unit may include a basic register and/or a basic on-chip buffer circuit, and may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like.
  • The above circuits can all be integrated circuits. If branch units are present, the main unit is connected to the branch units and the branch units are connected to the basic units: the basic units perform inner product operations between data blocks, the main unit transmits and receives external data and distributes it to the branch units, and the branch units transmit and receive data of the main unit or the basic units.
  • The structure shown in FIG. 1b is suitable for the calculation of complex data: because the number of units the main unit can connect to directly is limited, branch units are added between the main unit and the basic units to allow more basic units to be connected, thereby enabling the computation of complex data blocks.
  • The connection structure of the branch units and the basic units may be arbitrary and is not limited to the H-type structure of FIG. 1b.
  • In an optional implementation, the main-unit-to-basic-unit direction is a broadcast or distribution structure, and the basic-unit-to-main-unit direction is a gather structure. Broadcasting, distribution, and gathering are defined as follows:
  • The data transfer manner from the main unit to the basic units may include the following connection schemes:
  • The main unit is connected to a plurality of branch units, and each branch unit is connected to a plurality of basic units.
  • The main unit is connected to a branch unit, that branch unit is connected to another branch unit, and so on; a plurality of branch units are connected in series, and each branch unit is then connected to a plurality of basic units.
  • The main unit is connected to a plurality of branch units, and each branch unit is connected in series with a plurality of basic units.
  • The main unit is connected to a branch unit, that branch unit is connected to another branch unit, and so on; a plurality of branch units are connected in series, and each branch unit is then connected in series with a plurality of basic units.
  • When distributing data, the main unit transmits data to some or all of the basic units, and the data received by each basic unit may be different.
  • When broadcasting data, the main unit transmits data to some or all of the basic units, and every basic unit that receives data receives the same data.
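The two transfer modes can be contrasted with a minimal sketch (the function names `distribute` and `broadcast` are illustrative, not from the patent): distribution may give each basic unit a different piece of the data, while broadcasting gives every basic unit the same data.

```python
def distribute(data_blocks, num_units):
    """Distribution: spread the split data blocks over the basic units;
    each unit may end up holding different data."""
    units = [[] for _ in range(num_units)]
    for i, block in enumerate(data_blocks):
        units[i % num_units].append(block)   # round-robin assignment
    return units

def broadcast(data_block, num_units):
    """Broadcast: every basic unit receives the same data block."""
    return [data_block for _ in range(num_units)]
```

For example, distributing four basic data blocks over two basic units gives each unit two different blocks, while broadcasting one block to three units gives all three the same block.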
  • The chip device shown in FIG. 1a or FIG. 1b may be a single physical chip. Of course, in practical applications, the chip device may also be integrated in another chip (for example, a CPU or GPU). The specific embodiments do not limit the physical form of the above chip device.
  • FIG. 1c is a schematic diagram of the data distribution of the chip device; the arrows in FIG. 1c indicate the data distribution direction. As shown in FIG. 1c, after the main unit receives external data, it splits the data and distributes the split data to a plurality of branch units, and each branch unit transmits the split data to the basic units.
  • FIG. 1d is a schematic diagram of the data return of the chip device; the arrows in FIG. 1d indicate the data return direction. As shown in FIG. 1d, the basic units pass data (for example, an inner product calculation result) back to the branch units, and the branch units pass it back to the main unit.
  • FIG. 1a is a schematic structural diagram of another chip device. This chip device includes a main unit and basic units, with the main unit connected directly to the basic units. Because the basic units in the structure shown in FIG. 1a are directly physically connected to the main unit, the number of basic units that can be connected is limited, which makes this structure suitable for simple data calculations.
  • FIG. 2 provides a method for computing a neural network using the above chip device. The method is implemented by a chip device as shown in FIG. 1a or FIG. 1b, and includes the following steps:
  • Step S201: The main unit of the chip device acquires a data block to be calculated and an operation instruction.
  • The data block to be calculated in step S201 may specifically be a matrix, a vector, three-dimensional data, four-dimensional data, multi-dimensional data, and the like; the specific embodiments of the present disclosure do not limit the specific form of the data block. The operation instruction may specifically be a multiplication instruction, a convolution instruction, an addition instruction, a subtraction instruction, a BLAS (Basic Linear Algebra Subprograms) function, an activation function, and so on.
  • Step S202: The main unit divides the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction.
  • The implementation of step S202 may specifically be: if the operation instruction is a multiplication instruction, the multiplier data block is determined to be the broadcast data block and the multiplicand data block is determined to be the distribution data block; if the operation instruction is a convolution instruction, the input data block is determined to be the broadcast data block and the convolution kernel is determined to be the distribution data block.
  • Step S2031: The main unit splits the distribution data block to obtain a plurality of basic data blocks and distributes the plurality of basic data blocks to the multiple basic units.
  • Step S2032: The main unit broadcasts the broadcast data block to the multiple basic units.
  • In an optional implementation, steps S2031 and S2032 may also be performed in a loop: the main unit splits the distribution data block to obtain a plurality of basic data blocks, splits each basic data block into basic data sub-blocks, and likewise splits the broadcast data block into m broadcast data sub-blocks; the main unit then distributes one basic data sub-block and broadcasts one broadcast data sub-block at a time. The basic data sub-blocks and broadcast data sub-blocks are all data blocks on which parallel neural network calculations can be performed.
  • For example, for a matrix multiplication A * B, the basic data block may be the z-th row of matrix A, and the basic data sub-block may be the first 20 columns of data in that z-th row; correspondingly, the broadcast data sub-block may be the first 20 rows of data in the z-th column of matrix B.
  • The basic data block in the above step S203 may specifically be the smallest data block on which an inner product operation can be performed. In matrix multiplication, the basic data block may be one row of data of a matrix; for a convolution operation, the basic data block may be the weights of a convolution kernel.
  • For the manner of distribution in step S203, refer to the description of the following embodiments; details are not repeated here. Likewise, for the method of broadcasting the broadcast data block, refer to the description of the following embodiments.
  • Step S2041: The basic unit of the chip device performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result (possibly an intermediate result).
  • Step S2042: If the operation result is not an intermediate result, the operation result is transmitted back to the main unit.
  • Step S205: The main unit processes the operation results to obtain the instruction result of the operation instruction for the data block to be calculated.
  • The processing in step S205 may be accumulation, sorting, or the like; the present disclosure is not limited to a specific manner of processing. The specific manner is configured according to different operation instructions and may, for example, also include performing a nonlinear transformation.
  • In the technical solution provided by the present disclosure, the main unit receives external data including a data block to be calculated and an operation instruction, determines the distribution data block and broadcast data block of the data block to be calculated according to the operation instruction, splits the distribution data block into a plurality of basic data blocks, broadcasts the broadcast data block to a plurality of basic units, and distributes the plurality of basic data blocks to the plurality of basic units. The basic units perform inner product operations on the basic data blocks and the broadcast data block to obtain operation results and return them to the main unit, and the main unit obtains the instruction result of the operation instruction from the returned operation results.
  • The technical point of this solution is that, for a neural network, the large amount of computation lies in the inner product operations between data blocks; the overhead of the inner product operation is large and the calculation time is long. The embodiments of the present disclosure therefore first use the operation instruction and the data block to be operated on to distinguish the distribution data block from the broadcast data block within the data block to be calculated: the broadcast data block is the data block that must be used in its entirety when implementing the inner product operation, while the distribution data block is the data block that can be split.
  • Matrix multiplication is taken as an example. The data blocks to be calculated are matrix A and matrix B, and the operation instruction is a multiplication instruction (A * B). According to the rules of matrix multiplication, matrix A is determined to be the splittable distribution data block and matrix B is determined to be the broadcast data block, because the multiplicand matrix A can be split into multiple basic data blocks while the multiplier matrix B is needed as a whole.
  • In the multiplication, each row of data of the multiplicand matrix A must perform inner product operations with the multiplier matrix B. Therefore, the technical solution of the present application divides matrix A into M basic data blocks, where each basic data block may be one row of data of matrix A. The most time-consuming operation is thus performed by multiple basic units separately, so that in the inner product operation the multiple basic units can quickly compute the results in parallel; this reduces the calculation time, and the shorter calculation time in turn reduces the operating time of the chip device, thereby reducing power consumption.
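A minimal sketch of this scheme, with illustrative function names (the patent does not specify an implementation): matrix A is split into one row per basic unit, matrix B is broadcast to all of them, each unit computes its inner products, and the main unit arranges the returned rows into C = A * B. The per-unit calls are written sequentially here, but on the chip they would run in parallel.

```python
def basic_unit_inner_products(row, B):
    """One basic unit: inner products of its row of A with each column of B."""
    cols = len(B[0])
    return [sum(row[k] * B[k][j] for k in range(len(B))) for j in range(cols)]

def chip_matmul(A, B):
    """Main unit: split A into row blocks, broadcast B, gather and arrange results."""
    basic_blocks = [row for row in A]          # split: one basic data block per row
    results = [basic_unit_inner_products(r, B) for r in basic_blocks]
    return results                             # arranged rows form C = A * B
```

Because each row's inner products are independent, the M basic units need no communication with one another; only the distribute and gather steps involve the main unit.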
  • Consider the example of a matrix A multiplied by a vector B, where matrix A has M rows and L columns and vector B has L rows. Suppose a single operator computes the inner product of one row of matrix A with vector B at a time; the calculation time is then t1. With the chip device, matrix A is split into M basic data blocks, each basic data block being one row of data of matrix A, and the M basic units compute the M inner products in parallel. Let t2 be the time the main unit needs to split the data and t3 the time one inner product operation takes; the total time is then approximately t2 + t3, far less than the serial time t1.
  • The chip device provided by the present disclosure thus has a short working time, and experiments show that when the working time of the chip device is very short, its energy consumption is much lower than with long working times, so it has the advantage of saving energy.
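The row-wise split described above can be sketched in pure Python (a simulation only; on the real device the M inner products run in parallel on separate basic units, while here they are modeled sequentially):

```python
# Simulation: the main unit cuts matrix A (M rows, L columns) into M basic
# data blocks (one row each); each basic unit computes the inner product of
# its row with the broadcast vector B and returns the result.

def inner_product(row, vec):
    return sum(a * b for a, b in zip(row, vec))

def chip_matvec(A, B):
    basic_blocks = [row for row in A]                # main unit: split into M rows
    # conceptually parallel: one inner product per basic unit
    results = [inner_product(blk, B) for blk in basic_blocks]
    return results                                   # main unit collects M results

A = [[1, 2, 3], [4, 5, 6]]   # M = 2, L = 3
B = [1, 0, 1]
print(chip_matvec(A, B))     # [4, 10]
```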
  • In an optional implementation, the main unit can broadcast the broadcast data block to the multiple basic units in multiple manners.
  • Mode A: broadcast the data block to the plurality of basic units in a single transmission. Here, broadcast refers to "one-to-many" data transmission, i.e., the main unit simultaneously sends the same data block to a plurality of (all or some of the) basic units. For example, in matrix A * matrix B, where matrix B is the broadcast data block, matrix B is broadcast to the plurality of basic units at one time; as another example, when the input data is the broadcast data block, the input data block is broadcast to the plurality of basic units at one time.
  • Mode B: divide the broadcast data block into a plurality of partial broadcast data blocks, and broadcast the plurality of partial broadcast data blocks to the plurality of basic units over multiple transmissions; for example, matrix B is broadcast to the plurality of basic units in several transmissions, a portion of matrix B each time.
  • The advantage of this mode is that it reduces the hardware requirements of the basic units. The register space configured for a basic unit cannot be large, and if a matrix B with a large amount of data were sent to a basic unit in one transmission, storing it would require a large register space; since the number of basic units is large, increasing their register space would significantly raise cost. With the scheme of broadcasting the broadcast data block over multiple transmissions, each basic unit only needs to store the part of the broadcast data block received in the current transmission, thereby reducing cost.
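Mode B can be sketched as follows (a simulation; the chunk width of 2 columns is an arbitrary illustration, not a value fixed by the disclosure):

```python
def column_chunks(B, width):
    """Main unit: split broadcast matrix B (given as a list of rows) into
    partial broadcast data blocks of `width` columns each; each yielded
    chunk corresponds to one broadcast transmission."""
    cols = len(B[0])
    for start in range(0, cols, width):
        yield [row[start:start + width] for row in B]

B = [[1, 2, 3],
     [4, 5, 6]]
parts = list(column_chunks(B, 2))
# transmission 1 carries columns 0-1, transmission 2 carries column 2
print(parts)  # [[[1, 2], [4, 5]], [[3], [6]]]
```

Each basic unit only ever holds one chunk at a time, which is what keeps its register requirement small.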
  • In an optional implementation, the distribution of the plurality of basic data blocks to the plurality of basic units in step S203 may likewise adopt mode A or mode B above; the difference is that the transmission mode is unicast and the data transmitted are the basic data blocks.
  • The implementation of the foregoing step S204 may specifically be: if the broadcast data block is sent in a single transmission, the basic unit performs inner product processing on its basic data block and the broadcast data block, i.e., executes the inner product of one row at a time, obtains an inner product processing result, and sends the inner product processing result (one kind of operation result) to the main unit, which accumulates the inner product processing results. Alternatively, the basic unit may itself accumulate the inner product processing results and then send the accumulated result (another kind of operation result) to the main unit.
  • The above method reduces the amount of data transmitted between the main unit and the basic units, thereby increasing the calculation speed.
  • If the broadcast data block is sent over multiple transmissions, then each time the basic unit receives a partial broadcast data block, it performs a partial inner product operation between its basic data block and the partial broadcast data block to obtain a partial processing result, and sends the partial processing result to the main unit, which accumulates the processing results.
  • In another implementation, if the basic unit has received n basic data blocks, it multiplexes the partial broadcast data block, performing the inner product operation between the broadcast data block and each of the n basic data blocks to obtain n partial processing results, and sends the n processing results to the main unit, which accumulates the n processing results respectively. The above accumulation can also be performed in the basic unit.
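The multiplexing of one broadcast round against n basic data blocks can be sketched as (a simulation; the row and column values are illustrative):

```python
def basic_unit_round(basic_blocks, partial_broadcast_cols):
    """One basic unit, one broadcast round: reuse (multiplex) the received
    partial broadcast data block (given as a list of columns) against each
    of the unit's n basic data blocks (rows), producing n partial processing
    results, one list of per-column inner products per row."""
    results = []
    for row in basic_blocks:                       # n basic data blocks
        per_col = [sum(a * b for a, b in zip(row, col))
                   for col in partial_broadcast_cols]
        results.append(per_col)
    return results

# this basic unit holds 2 rows; the broadcast round delivers 2 columns
rows = [[1, 2, 3], [4, 5, 6]]
cols = [[1, 0, 1], [0, 1, 0]]
print(basic_unit_round(rows, cols))  # [[4, 2], [10, 5]]
```

The single received chunk is reused n times, which is the data-reuse the text describes.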
  • In practical applications, the amount of data in the broadcast data block is generally very large, as is the distribution data block, because for the chip device, which is a hardware configuration, the number of configured basic units is theoretically unlimited but in practice limited, generally a few dozen basic units; this number may change constantly as technology develops, for example increase.
  • In a matrix-multiply-matrix operation in a neural network, matrix A may have thousands of rows and matrix B may have thousands of columns, so sending matrix B to the basic units in one broadcast is infeasible. One implementation is therefore to broadcast part of the data of matrix B at a time, for example the data of the first five columns, and a similar approach may be adopted for matrix A. Each basic unit then performs a partial inner product calculation each time, stores the partial result in its register, and after all inner product operations of a row have been executed, accumulates the partial inner product results of that row to obtain one operation result, which it sends to the main unit. This approach has the advantage of increasing the calculation speed.
  • Referring to FIG. 3, which provides a calculation method of a neural network, the calculation in this embodiment is based on the matrix A * matrix B computation, where matrix A * matrix B may be as shown in the schematic diagram of FIG. 3a.
  • The calculation method of the neural network shown in FIG. 3 is performed in the chip device shown in FIG. 1b, which has 16 basic units. For convenience of description and illustration, the value of M shown in FIG. 3a may be 32, the value of N may be 15, and the value of L may be 20. It will of course be understood that the computing device can have any number of basic units.
  • The method is shown in FIG. 3 and includes the following steps:
  • Step S301: the main unit receives matrix A, matrix B, and the multiplication operation instruction A*B.
  • Step S302: the main unit determines, according to the multiplication operation instruction A*B, that matrix B is the broadcast data block and matrix A is the distribution data block, and divides matrix A into 32 basic data blocks, each basic data block being one row of data of matrix A.
  • Step S303: the main unit evenly allocates the 32 basic data blocks to the 16 basic units, i.e., each basic unit receives 2 basic data blocks; the two blocks may be allocated in any non-repeating order.
  • The allocation in the foregoing step S303 may also adopt other allocation methods; for example, when the data blocks cannot be allocated evenly to each basic unit, they may be allocated unevenly. The embodiment of the present disclosure does not limit the manner in which the above basic data blocks are allocated to the plurality of basic units.
  • Step S304: the main unit extracts the partial data of the first few columns (such as the first 5 columns) of matrix B and broadcasts that partial data of the first 5 columns to the 16 basic units.
  • Step S305: the 16 basic units multiplex the partial data of the first 5 columns against their 2 basic data blocks, performing the inner product and accumulation operations to obtain 32*5 pre-processing results, and send the 32*5 pre-processing results to the main unit.
  • Step S306: the main unit extracts the partial data of the middle 5 columns of matrix B and broadcasts that partial data of the middle 5 columns to the 16 basic units.
  • Step S307: the 16 basic units multiplex the partial data of the middle 5 columns against their 2 basic data blocks, performing the inner product and accumulation operations to obtain 32*5 middle processing results, and send the 32*5 middle processing results to the main unit.
  • Step S308: the main unit extracts the partial data of the last 5 columns of matrix B and broadcasts that partial data of the last 5 columns to the 16 basic units.
  • Step S309: the 16 basic units multiplex the partial data of the last 5 columns against their 2 basic data blocks, performing the inner product and accumulation operations to obtain 32*5 post-processing results, and send the 32*5 post-processing results to the main unit.
  • Step S310: the main unit combines the 32*5 pre-processing results, 32*5 middle processing results, and 32*5 post-processing results front, middle, and back to obtain a 32*15 matrix C, which is the instruction result of matrix A * matrix B.
  • The technical solution shown in FIG. 3 splits matrix A into 32 basic data blocks and then broadcasts matrix B in batches, so that the basic units can obtain the instruction result in batches. Since the inner products are split across 16 basic units computing in parallel, the calculation time is greatly reduced, giving the advantages of short calculation time and low energy consumption.
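The steps S301-S310 above can be simulated in pure Python (a sketch only; the per-unit parallelism is modeled sequentially, and the round-robin row assignment is one choice of the non-repeating allocation the text allows):

```python
import random

def matmul_chip(A, B, n_units=16, cols_per_round=5):
    """Simulation of the FIG. 3 flow: rows of A are distributed round-robin
    over the basic units; B is broadcast a few columns at a time; the main
    unit stitches the per-round column blocks together into C."""
    M, L, N = len(A), len(A[0]), len(B[0])
    # S302/S303: split A into row blocks and distribute them to the units
    unit_rows = [list(range(u, M, n_units)) for u in range(n_units)]
    C = [[0] * N for _ in range(M)]
    for start in range(0, N, cols_per_round):       # S304/S306/S308: broadcast in batches
        cols = range(start, min(start + cols_per_round, N))
        for rows in unit_rows:                      # S305/S307/S309: units compute (conceptually parallel)
            for i in rows:
                for j in cols:
                    C[i][j] = sum(A[i][k] * B[k][j] for k in range(L))
    return C                                        # S310: combined 32*15 result

random.seed(0)
A = [[random.randint(0, 9) for _ in range(20)] for _ in range(32)]  # M=32, L=20
B = [[random.randint(0, 9) for _ in range(15)] for _ in range(20)]  # L=20, N=15
ref = [[sum(A[i][k] * B[k][j] for k in range(20)) for j in range(15)]
       for i in range(32)]
assert matmul_chip(A, B) == ref   # matches the plain matrix product
```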
  • Referring to FIG. 1a, which shows a chip device according to the disclosure, the chip device includes a main unit and basic units, the main unit being a hardware chip unit and the basic units also being hardware chip units.
  • The main unit is configured to perform each successive operation in a neural network operation and to transmit data with the basic units.
  • The basic units are configured to perform the operations accelerated in parallel in the neural network according to the data transmitted by the main unit, and to transmit the operation results to the main unit.
  • The above parallel-accelerated operations include, but are not limited to, large-scale, parallelizable operations such as multiplication between data blocks and convolution operations.
  • The above successive operations include, but are not limited to, continuous operations such as accumulation operations, matrix transposition operations, and data sorting operations.
  • Specifically, the main unit is configured to acquire a data block to be calculated and an operation instruction, and to divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; to split the distribution data block to obtain a plurality of basic data blocks; to distribute the plurality of basic data blocks to the plurality of basic units; and to broadcast the broadcast data block to the plurality of basic units. The basic units are configured to perform inner product operations on the basic data blocks and the broadcast data block to obtain operation results and to send the operation results to the main unit. The main unit is configured to process the operation results to obtain the instruction result of the data block to be calculated and the operation instruction.
  • In an optional implementation, the chip device further includes a branch unit disposed between the main unit and the basic units; the branch unit is configured to forward data.
  • In an optional implementation, the main unit is specifically configured to broadcast the broadcast data block to the plurality of basic units in a single transmission.
  • In an optional implementation, the basic unit is specifically configured to perform inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulate the inner product processing results to obtain an operation result, and send the operation result to the main unit.
  • In an optional implementation, the main unit is configured, when the operation results are inner product processing results, to accumulate the operation results to obtain accumulation results, and to arrange the accumulation results to obtain the instruction result of the data block to be calculated and the operation instruction.
  • In an optional implementation, the main unit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and to broadcast the plurality of partial broadcast data blocks to the plurality of basic units over multiple transmissions.
  • In an optional implementation, the basic unit is specifically configured to perform inner product processing on a partial broadcast data block and the basic data block to obtain an inner product processing result, accumulate the inner product processing results to obtain a partial operation result, and send the partial operation result to the main unit.
  • In an optional implementation, the basic unit is specifically configured to multiplex the partial broadcast data block, performing inner product operations between that partial broadcast data block and n basic data blocks to obtain n partial processing results; after accumulating these respectively, n partial operation results are obtained and sent to the main unit, where n is an integer greater than or equal to 2.
  • The specific embodiment of the present disclosure further provides an application method of the chip device shown in FIG. 1a; the application method may specifically be used to perform one of, or any combination of, a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a convolution operation, or a fully connected operation.
  • The main unit may also perform pooling operations, regularization (normalization) operations such as batch normalization and LRN, and other neural network operation steps.
  • A specific embodiment of the present application also provides a chip including the chip device shown in FIG. 1a or FIG. 1b.
  • A specific implementation of the present application further provides a smart device, where the smart device includes the foregoing chip, which integrates the chip device shown in FIG. 1a or FIG. 1b.
  • Smart devices include, but are not limited to, smartphones, tablet computers, personal digital assistants, smart watches, smart cameras, smart TVs, smart refrigerators, and the like; the foregoing devices are for illustrative purposes only, and the specific embodiment of the present application does not limit the specific form of the device.
  • For the fully connected operation: if the input data of the fully connected layer is a vector of length L (such as vector B in "fully connected 1 - single sample" shown in FIG. 3a), i.e., the input of the neural network is a single sample, and the output of the fully connected layer is a vector of length M, then the weight of the fully connected layer is an M*L matrix (such as matrix A in "FIG. 3b fully connected 1 - single sample"). The weight matrix of the fully connected layer serves as matrix A (i.e., the distribution data block), the input data serves as vector B (i.e., the broadcast data block), and the operation is performed in accordance with method 1 shown in FIG. 2.
  • The specific operation method can also be: if the input data of the fully connected layer is a matrix (i.e., the input of the neural network is a case where multiple samples are operated on together as a batch), the input data of the fully connected layer represents N input samples, each sample being a vector of length L, so the input data is represented by an L*N matrix, as shown by matrix B in "FIG. 3b fully connected 1 - multiple samples". Since the output of the fully connected layer for each sample is a vector of length M, the output data of the fully connected layer is an M*N matrix, such as the result matrix in "FIG. 3b fully connected 1 - multiple samples".
  • When using the chip device for artificial neural network operations such as the convolution layer, the pooling layer, and the regularization layer (also called the normalization layer, e.g., BN (batch normalization) or LRN (local response normalization)) in a neural network:
  • the main unit uses its data rearrangement circuit to place each sample of the input data in a certain order, where the order may be an arbitrary order;
  • for example, the order may place the input data in a layout such as NHWC or NWHC, in which the coordinate of the C dimension represented in the above schematic diagram changes fastest;
  • here C is the innermost dimension of the data block, N is the outermost dimension, and H and W are the middle dimensions.
  • H and W are the dimensions along which the operation windows of convolution and pooling operations slide (an example of the operation window sliding in the W dimension is shown in "FIG. 3e convolution 3 - sliding a" and "FIG. 3f convolution 3 - sliding b").
  • FIG. 3g shows that the size of the operation window equals the size of one of the M convolution kernels. As shown in "FIG. 3c M convolution kernels", each convolution kernel is a 5*3*3 three-dimensional data block, so its operation window is also a 5*3*3 three-dimensional data block.
  • For the KH and KW of the M convolution kernels shown, KH corresponds to the H dimension of the input data, and KW corresponds to the W dimension of the input data. The gray parts of the figures in FIGS. 3e, 3f, and 3g are the data used for calculation at each sliding of the operation window; the sliding direction may be H first and then W, or W first and then H.
  • Specifically, for convolution, the operation at each sliding window position is an inner product of the data block represented by the gray part of the figure with each of the M convolution kernels shown in "FIG. 3c"; the convolution outputs one value for each convolution kernel at each sliding window position, i.e., M output values per sliding window position.
  • For pooling, the operation at each sliding window position is performed on the data block represented by the gray part of the figure in the H and W dimensions (in the example in the figure, the 9 numbers of the gray data block lying in the same plane), selecting the maximum value or calculating the average value; pooling outputs C values for each sliding window position.
  • C is the dimension of the single-sample three-dimensional data block other than H and W, and N represents a total of N samples simultaneously performing the operation of this layer.
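The sliding-window pooling just described can be sketched in pure Python (the 3*3 plane, 2*2 window, and stride 1 are chosen for illustration; for a C-channel block this runs once per channel, giving the C outputs per window position mentioned above):

```python
def max_pool_2d(plane, kh, kw, stride):
    """Slide a kh*kw operation window over one H*W feature plane; each
    window position outputs one value (here the maximum; an average pool
    would take the mean of the window instead)."""
    H, W = len(plane), len(plane[0])
    out = []
    for i in range(0, H - kh + 1, stride):
        row = []
        for j in range(0, W - kw + 1, stride):
            window = [plane[i + di][j + dj]
                      for di in range(kh) for dj in range(kw)]
            row.append(max(window))
        out.append(row)
    return out

plane = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(max_pool_2d(plane, 2, 2, 1))  # [[5, 6], [8, 9]]
```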
  • For the regularization algorithm LRN, the C dimension is defined as follows: each basic LRN operation selects a continuous data block along the C dimension (i.e., a data block of Y*1*1), where the Y in the Y*1*1 data block is the value in the C dimension, Y being less than or equal to the maximum value of the C dimension; the first 1 represents the H dimension and the second 1 represents the W dimension. The remaining two dimensions are defined as the H and W dimensions; that is, for the three-dimensional data block of each sample, each LRN regularization operation is performed on a continuous portion of data having the same W coordinate and the same H coordinate but different C coordinates.
  • For the regularization algorithm BN, the mean and variance (or standard deviation) are computed over all values at the same C-dimension coordinate in the three-dimensional data blocks of the N samples.
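The per-channel BN statistics just described can be sketched as follows (a simulation assuming the population variance, i.e., dividing by the count; the tiny N=2, C=1, H=W=2 batch is illustrative):

```python
def bn_stats(batch):
    """batch: N samples, each a C*H*W nested list. For each channel c, BN
    computes the mean and variance over all N*H*W values that share that
    C-dimension coordinate."""
    N, C = len(batch), len(batch[0])
    stats = []
    for c in range(C):
        vals = [v for sample in batch for row in sample[c] for v in row]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats.append((mean, var))
    return stats

# N = 2 samples, C = 1 channel, H = W = 2
batch = [[[[1, 2], [3, 4]]], [[[5, 6], [7, 8]]]]
print(bn_stats(batch))  # [(4.5, 5.25)]
```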
  • In the schematic diagrams, a square represents one numerical value, which may also be called a weight; the numbers used in the diagrams are only examples. In practice, the dimension data may take any numerical value (including cases where some dimension is 1, in which case the four-dimensional data block automatically becomes a three-dimensional data block; for example, when the number of samples computed together is 1, the input data is a three-dimensional data block; likewise, when the number of convolution kernels is 1, the kernel data is a three-dimensional data block).
  • For a convolutional layer, the weight (all of the convolution kernels) is as shown in "FIG. 3c convolution 1 - convolution kernels"; the number of convolution kernels is M, and each convolution kernel consists of C matrices of KH rows and KW columns, so the weight of the convolutional layer can be expressed as a four-dimensional data block with the four dimensions M, C, KH, and KW. The input data of the convolutional layer is a four-dimensional data block composed of N three-dimensional data blocks, each of which consists of C feature matrices of H rows and W columns (i.e., a data block whose four dimensions are N, C, H, and W), as shown in "FIG. 3d convolution 2 - input data".
  • Each of the M convolution kernels can serve as one basic data block; the basic data block may also be chosen at a smaller granularity, such as one plane matrix of a convolution kernel.
  • The convolution kernel weight set distributed to the i-th basic unit is denoted Ai, which contains a total of 46 convolution kernels. The i-th basic unit stores the received convolution kernel weights Ai, distributed by the main unit, in its register and/or on-chip cache. The parts of the input data (i.e., the operation windows as shown in FIG. 3e, FIG. 3f, or FIG. 3g) are transmitted to each basic unit in a broadcast manner (the broadcast may follow mode A or mode B above).
  • The data of the operation window can be broadcast to all the basic units over multiple broadcasts; part of the operation window data can be broadcast each time, for example one plane matrix per broadcast. Taking FIG. 3e as an example, one KH*KW matrix of a C plane can be broadcast each time, or the data of the first n rows or first n columns of one KH*KW matrix of a C plane can be broadcast at a time.
  • The present disclosure does not limit the manner of transmitting the partial data or the arrangement of the partial data: the placement of the input data may be converted into an arrangement of any dimension order, and the parts of the input data are then broadcast to the basic units in sequence. The foregoing distribution data may also be sent in a manner similar to the operation windows of the input data, and details are not described here again.
  • In an optional implementation, the input data is converted into a layout in which C is the innermost dimension. The effect of this is that the data along C is arranged contiguously, which increases the degree of parallelism of the convolution operation and makes it easier to perform parallel operations on multiple feature maps.
  • In an optional implementation, the placement of the input data is converted into an NHWC or NWHC layout order. Each basic unit, for example the i-th basic unit, calculates the inner product of a convolution kernel in the weights Ai with the corresponding part (i.e., the operation window) of the received broadcast data; the data of the corresponding part of the weights Ai can be read directly from the on-chip cache, or can be read into the register for multiplexing.
  • In an optional implementation, the inner product results of each basic unit are accumulated and transmitted back to the main unit. Alternatively, the partial sum obtained by each inner product operation performed by a basic unit may be transferred to the main unit for accumulation; or the partial sums obtained by the inner product operations performed by each basic unit may be stored in the register and/or on-chip cache of the basic unit, accumulated there, and transferred back to the main unit after the accumulation is completed; or the partial sums may in some cases be accumulated partly in the register and/or on-chip cache of the basic unit and partly in the main unit, and transferred back to the main unit after the accumulation is completed.
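The per-window inner product at the heart of the convolution flow above can be sketched as (a simulation; the 1*3*3 input and single 1*2*2 kernel are illustrative, and for M kernels the same call runs once per kernel, giving M values per window position):

```python
def conv_at(sample, kernel, h0, w0):
    """Inner product of one C*KH*KW convolution kernel with the operation
    window of the C*H*W input anchored at (h0, w0); one sliding position
    yields one output value per kernel."""
    C, KH, KW = len(kernel), len(kernel[0]), len(kernel[0][0])
    return sum(sample[c][h0 + i][w0 + j] * kernel[c][i][j]
               for c in range(C) for i in range(KH) for j in range(KW))

# C = 1 input plane (3*3), one 1*2*2 kernel, window slid over all positions
sample = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
kernel = [[[1, 0], [0, 1]]]
out = [[conv_at(sample, kernel, i, j) for j in range(2)] for i in range(2)]
print(out)  # [[6, 8], [12, 14]]
```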
  • GEMM: a GEMM calculation refers to the matrix-matrix multiplication operation in the BLAS library, whose usual form is C = alpha*op(A)*op(B) + beta*C, where A and B are the two input matrices, C is the output matrix, and alpha and beta are scalar coefficients; auxiliary integers are passed as parameters to describe the width and height of matrices A and B.
  • The input matrix A and matrix B are each subjected to their respective op operations; the op operation may be a matrix transposition, or of course other operations such as non-linear function operations or pooling. The matrix op operation is implemented using the vector operation function of the main unit; the op of a given matrix may also be empty, in which case the main unit performs no op operation on that matrix.
  • GEMV: a GEMV calculation refers to the matrix-vector multiplication operation in the BLAS library. The corresponding op operation is performed on the input matrix A; the chip device then uses the method shown in FIG. 2 to complete the matrix-vector multiplication between matrix op(A) and vector B; using the vector operation function of the main unit, each element of the result op(A)*B is multiplied by alpha; and the vector operation function of the main unit is used to add the corresponding positions of alpha*op(A)*B and beta*C.
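The BLAS-style GEMM flow above can be sketched in pure Python (a simulation; `op` is shown as transpose here, matching the transposition example in the text, and `None` models the empty op):

```python
def transpose(X):
    return [list(col) for col in zip(*X)]

def gemm(alpha, A, B, beta, C, op_a=None, op_b=None):
    """BLAS-style GEMM: returns alpha*op(A)*op(B) + beta*C. `op_a`/`op_b`
    are optional per-matrix ops (e.g. transpose); None means the op is
    empty and the matrix is used as-is."""
    A = op_a(A) if op_a else A
    B = op_b(B) if op_b else B
    M, K, N = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][k] * B[k][j] for k in range(K)) + beta * C[i][j]
             for j in range(N)] for i in range(M)]

A = [[1, 2], [3, 4]]
B = [[1, 0], [0, 1]]
C = [[1, 1], [1, 1]]
print(gemm(2, A, B, 1, C, op_a=transpose))  # [[3, 7], [5, 9]]
```

GEMV is the same flow with B a single-column matrix.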
  • Activation function: an activation function usually refers to performing a non-linear operation on each element of a data block (which may be a vector or a multidimensional matrix). The chip device uses the vector calculation function of the main unit to compute the activation vector of an input vector: the main unit passes each value in the input vector through an activation function (the input of the activation function is one value and its output is also one value) to calculate a value output to the corresponding position of the output vector.
  • The sources of the above input vector include, but are not limited to, external data of the chip device and the calculation result data of the basic units forwarded by the branch units of the chip device. The calculation result data may specifically be the operation result of a matrix-multiply-vector operation or of a matrix-multiply-matrix operation, and the input data may be the calculation result after the main unit adds a bias.
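The element-wise activation step can be sketched as (a simulation; ReLU is shown as the example function, which is an assumption for illustration, since the text leaves the activation function unspecified):

```python
import math

def activate(vec, fn=lambda x: max(0.0, x)):
    """Main unit activation: pass each value of the input vector through a
    one-in/one-out activation function and write the result to the
    corresponding position of the output vector (ReLU by default; any
    scalar function such as math.tanh works the same way)."""
    return [fn(v) for v in vec]

print(activate([-1.0, 0.5, 2.0]))        # [0.0, 0.5, 2.0]
print(activate([0.0], fn=math.tanh))     # [0.0]
```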
  • Using the main unit, the function of adding two vectors or two matrices can be realized; the main unit can also be used to add a vector to each row, or to each column, of a matrix.
  • The matrix may come from a matrix-multiply-matrix operation performed by the device, or from a matrix-multiply-vector operation performed by the device, or from data received externally by the main unit of the device; the vector may come from data received externally by the main unit of the device.
  • the disclosed apparatus may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • For example, the division of the units is only a logical function division, and the actual implementation may have another division manner: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical or otherwise.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated units/modules are implemented in the form of hardware.
  • the hardware can be a circuit, including a digital circuit, an analog circuit, and the like.
  • Physical implementations of the hardware structures include, but are not limited to, physical devices such as transistors and memristors.
  • the computing modules in the computing device can be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like.
  • the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, and the like.
  • the described units may or may not be physically separate, ie may be located in one place, or may be distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Neurology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Multi Processors (AREA)
  • Complex Calculations (AREA)
  • Image Processing (AREA)
  • Advance Control (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present invention relates to a chip device and a related product. The chip device comprises a main unit and multiple basic units communicating with the main unit. The functions of the main unit comprise: obtaining a data block to be calculated and an operation instruction (S201); classifying the data block to be calculated as a distribution data block or a broadcast data block on the basis of the operation instruction (S202); and splitting the distribution data block to obtain multiple basic data blocks, distributing the multiple basic data blocks among the multiple basic units, and broadcasting the broadcast data block to the multiple basic units (S203). The functions of the basic units comprise: performing an inner product operation on the basic data blocks and the broadcast data block to obtain an operation result, and sending the operation result to the main unit (S204). The main unit processes the operation result to obtain the instruction result of the data block to be calculated and the operation instruction (S205). The invention shortens processing times and has low power consumption.
PCT/CN2017/099991 2017-08-31 2017-08-31 Dispositif à puce et produit associé Ceased WO2019041251A1 (fr)

Priority Applications (28)

Application Number Priority Date Filing Date Title
CN201780002287.3A CN109729734B8 (zh) 2017-08-31 2017-08-31 芯片装置及相关产品
EP19211995.6A EP3651030A1 (fr) 2017-08-31 2017-08-31 Puce et produits associés
KR1020197037895A KR102481256B1 (ko) 2017-08-31 2017-08-31 칩 장치 및 관련 제품
EP19212365.1A EP3654209A1 (fr) 2017-08-31 2017-08-31 Dispositif de puce et produits associés
CN201910530860.9A CN110245751B (zh) 2017-08-31 2017-08-31 一种gemm运算方法及装置
CN202010628834.2A CN111860815A (zh) 2017-08-31 2017-08-31 一种卷积运算方法及装置
JP2019553977A JP7065877B2 (ja) 2017-08-31 2017-08-31 チップ装置および関連製品
CN201910531031.2A CN110222308B (zh) 2017-08-31 2017-08-31 一种矩阵乘矩阵运算方法及装置
PCT/CN2017/099991 WO2019041251A1 (fr) 2017-08-31 2017-08-31 Dispositif à puce et produit associé
CN201910534528.XA CN110245752B (zh) 2017-08-31 2017-08-31 一种使用芯片装置进行全连接运算方法及装置
KR1020197029020A KR102467688B1 (ko) 2017-08-31 2017-08-31 칩 장치 및 관련 제품
CN201910534527.5A CN110083390B (zh) 2017-08-31 2017-08-31 一种gemv运算运算方法及装置
CN201910534118.5A CN110231958B (zh) 2017-08-31 2017-08-31 一种矩阵乘向量运算方法及装置
EP17923228.5A EP3605402B1 (fr) 2017-08-31 2017-08-31 Dispositif à puce et produit associé
EP19212010.3A EP3654208A1 (fr) 2017-08-31 2017-08-31 Puce et produits associés
CN201811462676.7A CN109615061B (zh) 2017-08-31 2017-08-31 一种卷积运算方法及装置
EP19212002.0A EP3651031A1 (fr) 2017-08-31 2017-08-31 Puce et produits associés
CN201910102972.4A CN109902804B (zh) 2017-08-31 2017-08-31 一种池化运算方法及装置
KR1020197037903A KR102477404B1 (ko) 2017-08-31 2017-08-31 칩 장치 및 관련 제품
EP19212368.5A EP3654210A1 (fr) 2017-08-31 2017-08-31 Dispositif de puce et produits associés
TW107125681A TWI749249B (zh) 2017-08-31 2018-07-25 芯片裝置、芯片、智能設備以及神經網絡的運算方法
US16/168,778 US11409535B2 (en) 2017-08-31 2018-10-23 Processing device and related products
US16/663,210 US11354133B2 (en) 2017-08-31 2019-10-24 Processing device and related products
US16/663,206 US11334363B2 (en) 2017-08-31 2019-10-24 Processing device and related products
US16/663,205 US11347516B2 (en) 2017-08-31 2019-10-24 Processing device and related products
US16/663,164 US11531553B2 (en) 2017-08-31 2019-10-24 Processing device and related products
US16/663,181 US11561800B2 (en) 2017-08-31 2019-10-24 Processing device and related products
US16/663,174 US11775311B2 (en) 2017-08-31 2019-10-24 Processing device and related products

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/099991 WO2019041251A1 (fr) 2017-08-31 2017-08-31 Dispositif à puce et produit associé

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/168,778 Continuation US11409535B2 (en) 2017-08-31 2018-10-23 Processing device and related products

Publications (1)

Publication Number Publication Date
WO2019041251A1 true WO2019041251A1 (fr) 2019-03-07

Family

ID=65436282

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/099991 Ceased WO2019041251A1 (fr) 2017-08-31 2017-08-31 Dispositif à puce et produit associé

Country Status (7)

Country Link
US (7) US11409535B2 (fr)
EP (6) EP3651031A1 (fr)
JP (1) JP7065877B2 (fr)
KR (3) KR102481256B1 (fr)
CN (8) CN110222308B (fr)
TW (1) TWI749249B (fr)
WO (1) WO2019041251A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126582A (zh) * 2019-12-20 2020-05-08 上海寒武纪信息科技有限公司 数据处理方法和相关产品
CN111161705A (zh) * 2019-12-19 2020-05-15 上海寒武纪信息科技有限公司 语音转换方法及装置
CN113743598A (zh) * 2020-05-27 2021-12-03 杭州海康威视数字技术股份有限公司 一种ai芯片的运行方式的确定方法和装置
CN114936633A (zh) * 2022-06-15 2022-08-23 北京爱芯科技有限公司 用于转置运算的数据处理单元及图像转置运算方法

Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992743B (zh) * 2017-12-29 2020-06-16 华为技术有限公司 矩阵乘法器
CN116991225A (zh) * 2018-02-14 2023-11-03 上海寒武纪信息科技有限公司 处理器的控制装置、方法及设备
CN110210610B (zh) * 2018-03-27 2023-06-20 腾讯科技(深圳)有限公司 卷积计算加速器、卷积计算方法及卷积计算设备
US11277455B2 (en) 2018-06-07 2022-03-15 Mellanox Technologies, Ltd. Streaming system
US20200106828A1 (en) * 2018-10-02 2020-04-02 Mellanox Technologies, Ltd. Parallel Computation Network Device
CN110162799B (zh) * 2018-11-28 2023-08-04 腾讯科技(深圳)有限公司 模型训练方法、机器翻译方法以及相关装置和设备
US11175946B2 (en) * 2018-12-06 2021-11-16 Advanced Micro Devices, Inc. Pipelined matrix multiplication at a graphics processing unit
US11657119B2 (en) * 2018-12-10 2023-05-23 Advanced Micro Devices, Inc. Hardware accelerated convolution
US11625393B2 (en) 2019-02-19 2023-04-11 Mellanox Technologies, Ltd. High performance computing system
EP3699770B1 (fr) 2019-02-25 2025-05-21 Mellanox Technologies, Ltd. Système et procédés de communication collective
JPWO2021009901A1 (ja) * 2019-07-18 2021-09-13 技術研究組合光電子融合基盤技術研究所 並列計算方法およびシステム
US11688032B2 (en) * 2019-08-16 2023-06-27 Meta Platforms, Inc. Three-dimensional convolution pipeline with memory organizer unit
US11481471B2 (en) * 2019-08-16 2022-10-25 Meta Platforms, Inc. Mapping convolution to a matrix processor unit
CN110516793B (zh) * 2019-08-27 2022-06-17 Oppo广东移动通信有限公司 一种池化处理方法及装置、存储介质
CN110826687B (zh) * 2019-08-30 2023-11-21 安谋科技(中国)有限公司 数据处理方法及其装置、介质和系统
WO2021081854A1 (fr) * 2019-10-30 2021-05-06 华为技术有限公司 Circuit d'opération de convolution et procédé d'opération de convolution
US12039430B2 (en) * 2019-11-15 2024-07-16 Samsung Electronics Co., Ltd. Electronic device and method for inference binary and ternary neural networks
KR102785402B1 (ko) * 2019-12-06 2025-03-21 삼성전자주식회사 뉴럴 네트워크의 행렬 곱셈 연산을 수행하는 장치 및 방법
US11750699B2 (en) 2020-01-15 2023-09-05 Mellanox Technologies, Ltd. Small message aggregation
US11252027B2 (en) 2020-01-23 2022-02-15 Mellanox Technologies, Ltd. Network element supporting flexible data reduction operations
US10713493B1 (en) * 2020-02-06 2020-07-14 Shenzhen Malong Technologies Co., Ltd. 4D convolutional neural networks for video recognition
CN113298843B (zh) 2020-02-24 2024-05-14 中科寒武纪科技股份有限公司 数据量化处理方法、装置、电子设备和存储介质
US11876885B2 (en) 2020-07-02 2024-01-16 Mellanox Technologies, Ltd. Clock queue with arming and/or self-arming features
CN114115995B (zh) * 2020-08-27 2025-05-16 华为技术有限公司 人工智能芯片及运算板卡、数据处理方法及电子设备
CN112491555B (zh) * 2020-11-20 2022-04-05 山西智杰软件工程有限公司 医疗电子签名的处理方法及电子设备
CN112416433B (zh) * 2020-11-24 2023-01-17 中科寒武纪科技股份有限公司 一种数据处理装置、数据处理方法及相关产品
US11556378B2 (en) 2020-12-14 2023-01-17 Mellanox Technologies, Ltd. Offloading execution of a multi-task parameter-dependent operation to a network device
CN112953701B (zh) * 2021-02-04 2023-10-31 沈阳建筑大学 一种四维混沌电路装置
CN112799598B (zh) * 2021-02-08 2022-07-15 清华大学 一种数据处理方法、处理器及电子设备
CN113240570B (zh) * 2021-04-13 2023-01-06 华南理工大学 一种GEMM运算加速器及基于GoogLeNet的图像处理加速方法
CN112990370B (zh) * 2021-04-26 2021-09-10 腾讯科技(深圳)有限公司 图像数据的处理方法和装置、存储介质及电子设备
CN115481713A (zh) * 2021-06-15 2022-12-16 瑞昱半导体股份有限公司 改进卷积神经网络进行计算的方法
KR20230068572A (ko) * 2021-11-11 2023-05-18 삼성전자주식회사 메모리 어레이 내의 연결 회로
CN116150555B (zh) * 2021-11-19 2025-10-31 中科寒武纪科技股份有限公司 计算装置、利用计算装置实施卷积运算的方法及相关产品
US12309070B2 (en) 2022-04-07 2025-05-20 Nvidia Corporation In-network message aggregation for efficient small message transport
US11922237B1 (en) 2022-09-12 2024-03-05 Mellanox Technologies, Ltd. Single-step collective operations
US20240338144A1 (en) * 2023-04-10 2024-10-10 Silicon Storage Technology, Inc. Masking sparse inputs and outputs in neural network array
WO2025105928A1 (fr) * 2023-11-15 2025-05-22 서울대학교산학협력단 Procédé distribué de multiplication de matrices implanté dans une structure de réseau de système informatique distribué
CN117974417B (zh) * 2024-03-28 2024-07-02 腾讯科技(深圳)有限公司 Ai芯片、电子设备及图像处理方法
CN120234516A (zh) * 2025-06-03 2025-07-01 上海无问芯穹智能科技有限公司 用于包括多个计算单元的处理器执行矩阵乘法运算的方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105488565A (zh) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 加速深度神经网络算法的加速芯片的运算装置及方法
CN105608490A (zh) * 2015-07-29 2016-05-25 上海磁宇信息科技有限公司 细胞阵列计算系统以及其中的通信方法
CN105930902A (zh) * 2016-04-18 2016-09-07 中国科学院计算技术研究所 一种神经网络的处理方法、系统
CN105956659A (zh) * 2016-05-11 2016-09-21 北京比特大陆科技有限公司 数据处理装置和系统、服务器
US20170193368A1 (en) * 2015-12-30 2017-07-06 Amazon Technologies, Inc. Conditional parallel processing in fully-connected neural networks

Family Cites Families (86)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5023833A (en) * 1987-12-08 1991-06-11 California Institute Of Technology Feed forward neural network for unary associative memory
US5956703A (en) * 1995-07-28 1999-09-21 Delco Electronics Corporation Configurable neural network integrated circuit
JPH117438A (ja) * 1997-06-18 1999-01-12 Fuji Xerox Co Ltd 積和演算処理方法、装置及び記録媒体
JP2001188767A (ja) * 1999-12-28 2001-07-10 Fuji Xerox Co Ltd ニューラルネットワーク演算装置及びニューラルネットワークの演算方法
US7672952B2 (en) * 2000-07-13 2010-03-02 Novell, Inc. System and method of semantic correlation of rich content
US6925479B2 (en) * 2001-04-30 2005-08-02 Industrial Technology Research Institute General finite-field multiplier and method of the same
US7065544B2 (en) * 2001-11-29 2006-06-20 Hewlett-Packard Development Company, L.P. System and method for detecting repetitions in a multimedia stream
US7737994B1 (en) * 2003-09-26 2010-06-15 Oracle America, Inc. Large-kernel convolution using multiple industry-standard graphics accelerators
US20050125477A1 (en) * 2003-12-04 2005-06-09 Genov Roman A. High-precision matrix-vector multiplication on a charge-mode array with embedded dynamic memory and stochastic method thereof
US7634137B2 (en) * 2005-10-14 2009-12-15 Microsoft Corporation Unfolded convolution for fast feature extraction
GB2453263A (en) * 2006-05-16 2009-04-01 Douglas S Greer System and method for modeling the neocortex and uses therefor
US8644643B2 (en) * 2006-06-14 2014-02-04 Qualcomm Incorporated Convolution filtering in a graphics processor
JP4942095B2 (ja) * 2007-01-25 2012-05-30 インターナショナル・ビジネス・マシーンズ・コーポレーション マルチコア・プロセッサにより演算を行う技術
US20080288756A1 (en) * 2007-05-18 2008-11-20 Johnson Timothy J "or" bit matrix multiply vector instruction
US8190543B2 (en) * 2008-03-08 2012-05-29 Tokyo Electron Limited Autonomous biologically based learning tool
EP2996035A1 (fr) * 2008-10-15 2016-03-16 Hyperion Core, Inc. Dispositif de traitement de données
US20100122070A1 (en) * 2008-11-07 2010-05-13 Nokia Corporation Combined associative and distributed arithmetics for multiple inner products
US20110025816A1 (en) * 2009-07-31 2011-02-03 Microsoft Corporation Advertising as a real-time video call
US8577950B2 (en) * 2009-08-17 2013-11-05 International Business Machines Corporation Matrix multiplication operations with data pre-conditioning in a high performance computing architecture
US8583896B2 (en) * 2009-11-13 2013-11-12 Nec Laboratories America, Inc. Massively parallel processing core with plural chains of processing elements and respective smart memory storing select data received from each chain
US20110314256A1 (en) * 2010-06-18 2011-12-22 Microsoft Corporation Data Parallel Programming Model
US8577820B2 (en) * 2011-03-04 2013-11-05 Tokyo Electron Limited Accurate and fast neural network training for library-based critical dimension (CD) metrology
US10078620B2 (en) * 2011-05-27 2018-09-18 New York University Runtime reconfigurable dataflow processor with multi-port memory access module
CN102214160B (zh) * 2011-07-08 2013-04-17 中国科学技术大学 一种基于龙芯3a的单精度矩阵乘法优化方法
CN103631761B (zh) * 2012-08-29 2018-02-27 睿励科学仪器(上海)有限公司 并行处理架构进行矩阵运算并用于严格波耦合分析的方法
DE102013104567A1 (de) * 2013-05-03 2014-11-06 Infineon Technologies Ag Chipanordnung, Chipkartenanordnung und Verfahren zum Herstellen einer Chipanordnung
CN103440121B (zh) * 2013-08-20 2016-06-29 中国人民解放军国防科学技术大学 一种面向向量处理器的三角矩阵乘法向量化方法
DE102013109200A1 (de) * 2013-08-26 2015-02-26 Infineon Technologies Austria Ag Chip, Chip-Anordnung und Verfahren zum Herstellen eines Chips
CN104425299B (zh) * 2013-08-27 2017-08-11 珠海艾派克微电子有限公司 芯片加工装置以及应用芯片加工装置进行芯片加工的方法
US20150324686A1 (en) * 2014-05-12 2015-11-12 Qualcomm Incorporated Distributed model learning
CN104036451B (zh) * 2014-06-20 2018-12-11 深圳市腾讯计算机系统有限公司 基于多图形处理器的模型并行处理方法及装置
CN104317352B (zh) * 2014-10-13 2017-10-24 中国科学院光电技术研究所 一种自适应光学控制系统快速去倾斜分量处理方法
CN104346318B (zh) * 2014-10-15 2017-03-15 中国人民解放军国防科学技术大学 面向通用多核dsp的矩阵乘加速方法
CN104463324A (zh) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 一种基于大规模高性能集群的卷积神经网络并行处理方法
CN105701120B (zh) * 2014-11-28 2019-05-03 华为技术有限公司 确定语义匹配度的方法和装置
CN104992430B (zh) * 2015-04-14 2017-12-22 杭州奥视图像技术有限公司 基于卷积神经网络的全自动的三维肝脏分割方法
CN104866855A (zh) * 2015-05-07 2015-08-26 华为技术有限公司 一种图像特征提取方法及装置
US10489703B2 (en) 2015-05-20 2019-11-26 Nec Corporation Memory efficiency for convolutional neural networks operating on graphics processing units
US10417555B2 (en) * 2015-05-29 2019-09-17 Samsung Electronics Co., Ltd. Data-optimized neural network traversal
CN104866904B (zh) * 2015-06-16 2019-01-01 中电科软件信息服务有限公司 一种基于spark的遗传算法优化的BP神经网络并行化方法
CN106293893B (zh) * 2015-06-26 2019-12-06 阿里巴巴集团控股有限公司 作业调度方法、装置及分布式系统
CN105005911B (zh) * 2015-06-26 2017-09-19 深圳市腾讯计算机系统有限公司 深度神经网络的运算系统及运算方法
US10970617B2 (en) * 2015-08-21 2021-04-06 Institute Of Automation Chinese Academy Of Sciences Deep convolutional neural network acceleration and compression method based on parameter quantification
CN105260776B (zh) * 2015-09-10 2018-03-27 华为技术有限公司 神经网络处理器和卷积神经网络处理器
CN106548124B (zh) * 2015-09-17 2021-09-07 松下知识产权经营株式会社 主题推定系统、主题推定方法
EP3154001B1 (fr) * 2015-10-08 2019-07-17 VIA Alliance Semiconductor Co., Ltd. Réseau neuronal avec mémoire neuronale et matrice de processeurs neuronaux de décalage de lignes de données reçues depuis la mémoire neuronale
CN106485322B (zh) * 2015-10-08 2019-02-26 上海兆芯集成电路有限公司 同时执行长短期记忆胞计算的神经网络单元
CN105426344A (zh) * 2015-11-09 2016-03-23 南京大学 基于Spark的分布式大规模矩阵乘法的矩阵计算方法
CN105373517A (zh) * 2015-11-09 2016-03-02 南京大学 基于Spark的分布式稠密矩阵求逆并行化运算方法
CN105608056A (zh) * 2015-11-09 2016-05-25 南京大学 一种基于Flink的大规模矩阵并行化的计算方法
US11024024B2 (en) * 2015-12-15 2021-06-01 The Regents Of The University Of California Systems and methods for analyzing perfusion-weighted medical imaging using deep neural networks
CN106991478B (zh) * 2016-01-20 2020-05-08 中科寒武纪科技股份有限公司 用于执行人工神经网络反向训练的装置和方法
CN107563497B (zh) * 2016-01-20 2021-03-19 中科寒武纪科技股份有限公司 用于稀疏人工神经网络的计算装置和运算方法
CN111353589B (zh) * 2016-01-20 2024-03-01 中科寒武纪科技股份有限公司 用于执行人工神经网络正向运算的装置和方法
US11055063B2 (en) * 2016-05-02 2021-07-06 Marvell Asia Pte, Ltd. Systems and methods for deep learning processor
US10796220B2 (en) * 2016-05-24 2020-10-06 Marvell Asia Pte, Ltd. Systems and methods for vectorized FFT for multi-dimensional convolution operations
EP3465550B1 (fr) * 2016-05-26 2023-09-27 Samsung Electronics Co., Ltd. Accélérateur pour réseaux neuronaux profonds
CN106126481B (zh) * 2016-06-29 2019-04-12 华为技术有限公司 一种计算系统和电子设备
CN106203621B (zh) * 2016-07-11 2019-04-30 北京深鉴智能科技有限公司 用于卷积神经网络计算的处理器
CN106228240B (zh) * 2016-07-30 2020-09-01 复旦大学 基于fpga的深度卷积神经网络实现方法
US10891538B2 (en) * 2016-08-11 2021-01-12 Nvidia Corporation Sparse convolutional neural network accelerator
US20180046903A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Deep processing unit (dpu) for implementing an artificial neural network (ann)
CN106407561B (zh) * 2016-09-19 2020-07-03 复旦大学 一种并行gpdt算法在多核soc上的划分方法
CN106446546B (zh) * 2016-09-23 2019-02-22 西安电子科技大学 基于卷积自动编解码算法的气象数据填补方法
CN106650922B (zh) * 2016-09-29 2019-05-03 清华大学 硬件神经网络转换方法、计算装置、软硬件协作系统
CN106504232B (zh) * 2016-10-14 2019-06-14 北京网医智捷科技有限公司 一种基于3d卷积神经网络的肺部结节自动检测系统
US9779786B1 (en) * 2016-10-26 2017-10-03 Xilinx, Inc. Tensor operations and acceleration
CN107239824A (zh) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 用于实现稀疏卷积神经网络加速器的装置和方法
KR102224510B1 (ko) * 2016-12-09 2021-03-05 베이징 호라이즌 인포메이션 테크놀로지 컴퍼니 리미티드 데이터 관리를 위한 시스템들 및 방법들
CN106844294B (zh) * 2016-12-29 2019-05-03 华为机器有限公司 卷积运算芯片和通信设备
US11562115B2 (en) * 2017-01-04 2023-01-24 Stmicroelectronics S.R.L. Configurable accelerator framework including a stream switch having a plurality of unidirectional stream links
IT201700008949A1 (it) * 2017-01-27 2018-07-27 St Microelectronics Srl Procedimento di funzionamento di reti neurali, rete, apparecchiatura e prodotto informatico corrispondenti
CN106940815B (zh) * 2017-02-13 2020-07-28 西安交通大学 一种可编程卷积神经网络协处理器ip核
CN106951395B (zh) * 2017-02-13 2018-08-17 上海客鹭信息技术有限公司 面向压缩卷积神经网络的并行卷积运算方法及装置
US11132599B2 (en) * 2017-02-28 2021-09-28 Microsoft Technology Licensing, Llc Multi-function unit for programmable hardware nodes for neural network processing
CN107066239A (zh) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 一种实现卷积神经网络前向计算的硬件结构
US10528147B2 (en) * 2017-03-06 2020-01-07 Microsoft Technology Licensing, Llc Ultrasonic based gesture recognition
CN110312992A (zh) * 2017-03-20 2019-10-08 英特尔公司 用于片矩阵乘法和累加的系统、方法和装置
CN106970896B (zh) * 2017-03-30 2020-05-12 中国人民解放军国防科学技术大学 面向向量处理器的二维矩阵卷积的向量化实现方法
US10186011B2 (en) * 2017-04-28 2019-01-22 Intel Corporation Programmable coarse grained and sparse matrix compute hardware with advanced scheduling
US10169298B1 (en) * 2017-05-11 2019-01-01 NovuMind Limited Native tensor processor, using outer product unit
CN110574050A (zh) * 2017-05-31 2019-12-13 英特尔公司 用于基于四元数的机器学习系统的基于梯度的训练引擎
US10167800B1 (en) * 2017-08-18 2019-01-01 Microsoft Technology Licensing, Llc Hardware node having a matrix vector unit with block-floating point processing
US10963780B2 (en) * 2017-08-24 2021-03-30 Google Llc Yield improvements for three-dimensionally stacked neural network accelerators
US12131250B2 (en) * 2017-09-29 2024-10-29 Intel Corporation Inner product convolutional neural network accelerator
US11222256B2 (en) * 2017-10-17 2022-01-11 Xilinx, Inc. Neural network processing system having multiple processors and a neural network accelerator

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161705A (zh) * 2019-12-19 2020-05-15 上海寒武纪信息科技有限公司 语音转换方法及装置
CN111126582A (zh) * 2019-12-20 2020-05-08 上海寒武纪信息科技有限公司 数据处理方法和相关产品
CN111126582B (zh) * 2019-12-20 2024-04-05 上海寒武纪信息科技有限公司 数据处理方法和相关产品
CN113743598A (zh) * 2020-05-27 2021-12-03 杭州海康威视数字技术股份有限公司 一种ai芯片的运行方式的确定方法和装置
CN113743598B (zh) * 2020-05-27 2023-08-04 杭州海康威视数字技术股份有限公司 一种ai芯片的运行方式的确定方法和装置
CN114936633A (zh) * 2022-06-15 2022-08-23 北京爱芯科技有限公司 用于转置运算的数据处理单元及图像转置运算方法

Also Published As

Publication number Publication date
CN109902804B (zh) 2020-12-18
EP3605402A4 (fr) 2020-10-21
EP3605402A1 (fr) 2020-02-05
US20200057649A1 (en) 2020-02-20
EP3654208A1 (fr) 2020-05-20
US11347516B2 (en) 2022-05-31
CN109729734B (zh) 2020-10-27
CN110083390A (zh) 2019-08-02
US20200057648A1 (en) 2020-02-20
CN110245751A (zh) 2019-09-17
US20200057651A1 (en) 2020-02-20
CN110245751B (zh) 2020-10-09
KR20200037749A (ko) 2020-04-09
JP2020530916A (ja) 2020-10-29
KR102481256B1 (ko) 2022-12-23
US20190065208A1 (en) 2019-02-28
EP3654210A1 (fr) 2020-05-20
CN110245752A (zh) 2019-09-17
CN109902804A (zh) 2019-06-18
JP7065877B2 (ja) 2022-05-12
CN111860815A (zh) 2020-10-30
TWI749249B (zh) 2021-12-11
EP3605402B1 (fr) 2022-08-31
TW201913460A (zh) 2019-04-01
CN110245752B (zh) 2020-10-09
CN109729734A (zh) 2019-05-07
KR20200037748A (ko) 2020-04-09
CN110083390B (zh) 2020-08-25
US11409535B2 (en) 2022-08-09
EP3651030A1 (fr) 2020-05-13
CN110231958B (zh) 2020-10-27
US11531553B2 (en) 2022-12-20
CN110231958A (zh) 2019-09-13
US11354133B2 (en) 2022-06-07
US11561800B2 (en) 2023-01-24
US20200057650A1 (en) 2020-02-20
US20200057647A1 (en) 2020-02-20
KR102477404B1 (ko) 2022-12-13
CN110222308B (zh) 2020-12-29
CN110222308A (zh) 2019-09-10
US20200057652A1 (en) 2020-02-20
KR20200008544A (ko) 2020-01-28
KR102467688B1 (ko) 2022-11-15
EP3654209A1 (fr) 2020-05-20
US11775311B2 (en) 2023-10-03
CN109729734B8 (zh) 2020-11-24
EP3651031A1 (fr) 2020-05-13
US11334363B2 (en) 2022-05-17

Similar Documents

Publication Publication Date Title
CN109729734B (zh) 芯片装置及相关产品
CN109615061B (zh) 一种卷积运算方法及装置
JP6888074B2 (ja) チップ装置および関連製品
JP6888073B2 (ja) チップ装置および関連製品
CN109615062B (zh) 一种卷积运算方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17923228

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019553977

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20197029020

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2017923228

Country of ref document: EP

Effective date: 20191024

NENP Non-entry into the national phase

Ref country code: DE