
WO2019136751A1 - Artificial intelligence parallel processing method and apparatus, computer-readable storage medium, and terminal - Google Patents


Info

Publication number
WO2019136751A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
module
artificial intelligence
storage module
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2018/072663
Other languages
English (en)
Chinese (zh)
Inventor
肖梦秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Corerain Technologies Co Ltd
Original Assignee
Shenzhen Corerain Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Corerain Technologies Co Ltd filed Critical Shenzhen Corerain Technologies Co Ltd
Priority to CN201880002151.7A priority Critical patent/CN109416755B/zh
Priority to PCT/CN2018/072663 priority patent/WO2019136751A1/fr
Publication of WO2019136751A1 publication Critical patent/WO2019136751A1/fr
Priority to US16/929,819 priority patent/US11874898B2/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present invention relates to the field of artificial intelligence, and in particular to an artificial intelligence parallel processing method, apparatus, readable storage medium, and terminal.
  • The artificial intelligence algorithm is a neural network model algorithm that simulates the human brain, and its computational load is enormous: AlphaGo, which uses artificial intelligence algorithms, requires thousands of traditional processors (CPUs) and hundreds of graphics processors (GPUs). Clearly, as artificial intelligence ushers in a new wave of revival, traditional processors are becoming a bottleneck that hinders its spread.
  • The object of the present invention is to provide an artificial intelligence parallel processing method and an artificial intelligence processing apparatus that solve technical problems of the prior art such as insufficient parallelism in the processing of artificial intelligence algorithms.
  • To this end, the present invention provides an artificial intelligence parallel processing method, applied to a processing module, which includes: causing a data transmission module to fetch multiple channel data from an external storage module according to a preset data size; and causing the data transmission module to transmit the channel data fetched according to the preset data size to a convolution operation module, where the convolution operation module includes a plurality of convolution kernel matrices for performing parallel convolution operations on the channel data.
  • Fetching the plurality of channel data from the external storage module according to the preset data size specifically includes the following stages: (1) each channel data is fetched from the external storage module to the first storage module according to a 1*1 data size; (2) each channel data is fetched from the first storage module to the second storage module according to a pv*1 data size, where pv is the data transmission parallelism and the number of columns of the channel data is an integer multiple of pv; (3) each channel data is fetched from the second storage module to the matrix module according to a pv*k data size, where k is the size of the convolution kernel matrix; and (4) each channel data is fetched from the matrix module according to a pv*k*k data size to undergo the parallel convolution operation with the plurality of convolution kernel matrices. A software sketch of these stages follows.
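To make the staged fetching concrete, here is a minimal Python/NumPy sketch of the first three stages under assumed embodiment values (pv = 8, k = 3, one 34*40 channel). The module names mirror the description above, but the buffers are plain arrays rather than hardware memories, so only the access pattern is illustrated.

```python
import numpy as np

pv, k = 8, 3                                   # assumed embodiment values
channel = np.arange(34 * 40).reshape(34, 40)   # one channel of data

# Stage 1: external storage -> first storage module, one 1*1 element at a time.
first_storage = np.empty_like(channel)
for r in range(channel.shape[0]):
    for c in range(channel.shape[1]):
        first_storage[r, c] = channel[r, c]

# Stage 2: first -> second storage module, one pv*1 strip of a row at a time.
second_storage = np.empty_like(first_storage)
for r in range(first_storage.shape[0]):
    for c in range(0, first_storage.shape[1], pv):
        second_storage[r, c:c + pv] = first_storage[r, c:c + pv]

# Stage 3: second storage -> matrix module, one pv*k block at a time
# (k consecutive rows by pv consecutive columns; NumPy shape (k, pv)).
blocks = [second_storage[r:r + k, c:c + pv]
          for r in range(0, second_storage.shape[0] - k + 1, k)
          for c in range(0, second_storage.shape[1], pv)]
print(len(blocks), blocks[0].shape)  # 55 blocks of shape (3, 8)
```

Stage 4, the pv*k*k fetch from the matrix module, is sketched a few paragraphs below.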
  • Fetching each channel data from the second storage module to the matrix module according to a pv*k data size specifically includes: grouping the channel data by every k rows, with the data transmission module performing the following operation on each group of data in sequence: in each clock cycle, one block of first to-be-processed data of pv*k data size is fetched from the group, until all data of the group have been fetched.
  • Fetching each channel data from the matrix module according to a pv*k*k data size specifically includes: starting from the second block of first to-be-processed data fetched from each group, each block of first to-be-processed data is combined with the last two columns of the preceding block to form second to-be-processed data of (pv+2)*k data size; each second to-be-processed data is then matrix-extracted with a step size of 1 to obtain pv third to-be-processed data blocks of size k*k, each of which undergoes a parallel convolution operation with the plurality of convolution kernel matrices (see the sketch below).
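Under the same assumptions (pv = 8, k = 3), the column borrowing and stride-1 extraction just described can be sketched as follows; the function name extract_windows and the toy inputs are illustrative only.

```python
import numpy as np

pv, k = 8, 3  # assumed embodiment values

def extract_windows(prev_block, block):
    """Widen a pv*k block (NumPy shape (k, pv)) with the last k-1 columns of
    its predecessor, then slice the result with step size 1 into pv windows
    of size k*k (the 'third to-be-processed data')."""
    widened = np.hstack([prev_block[:, -(k - 1):], block])  # (pv+2) columns
    return [widened[:, i:i + k] for i in range(pv)]

prev_block = np.arange(k * pv).reshape(k, pv)     # e.g. the matrix M1 below
block = np.arange(k * pv).reshape(k, pv) + 100    # e.g. the matrix M2 below
windows = extract_windows(prev_block, block)
print(len(windows), windows[0].shape)  # 8 windows, each (3, 3)
```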
  • The plurality of convolution kernel matrices comprises weight matrices with different weights, each of which performs a convolution operation with the third to-be-processed data simultaneously, as sketched below.
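The simultaneous application of the kernels can be pictured with the sketch below; the random weights are placeholders (the patent fixes no values), and in hardware each (window, kernel) pair would have its own multiply-accumulate tree rather than a Python loop.

```python
import numpy as np

pv, k, n_kernels = 8, 3, 3  # assumed: 8 windows, 3*3 kernels, 3 kernels
rng = np.random.default_rng(0)
kernels = [rng.standard_normal((k, k)) for _ in range(n_kernels)]
windows = [rng.standard_normal((k, k)) for _ in range(pv)]

# Every window is convolved with every kernel in the same cycle; the nested
# comprehension stands in for the all-at-once hardware computation.
results = np.array([[np.sum(w * kern) for w in windows] for kern in kernels])
print(results.shape)  # (3, 8): n_kernels * pv result values per cycle
```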
  • The present invention further provides an artificial intelligence parallel processing apparatus including: an external storage module that stores a plurality of channel data; a processing module communicatively connected to the external storage module; a data transmission module that fetches the plurality of channel data from the external storage module according to a preset data size and transmits them; and a convolution operation module that includes a plurality of convolution kernel matrices for performing parallel convolution operations on the channel data fetched according to the preset data size.
  • the artificial intelligence parallel processing device includes a first storage module for storing the channel data from the external storage module.
  • the artificial intelligence parallel processing device includes a second storage module for storing the channel data from the first storage module.
  • the artificial intelligence parallel processing device includes a matrix module for storing the channel data from the second storage module.
  • the present invention provides a computer readable storage medium having stored thereon a computer program that implements the artificial intelligence parallel processing method when executed by a processor.
  • The present invention further provides an artificial intelligence processing terminal including a processor and a memory; the memory is for storing a computer program, and the processor is configured to execute the computer program stored in the memory, causing the terminal to execute the artificial intelligence parallel processing method.
  • The artificial intelligence parallel processing method, apparatus, readable storage medium, and terminal of the present invention have the following advantageous effects: the convolution operation of one convolution kernel matrix does not need to finish before the convolution operation of the next convolution kernel matrix begins, and the parallel convolution operation is realized by hardware such as a convolution operation circuit, so that, especially when facing a large amount of data computation, the convolution efficiency is greatly improved compared with software computation. The present invention therefore greatly improves processing parallelism and computational efficiency through the artificial intelligence parallel processing method.
  • FIG. 1 is a flow chart showing a method for parallel processing of artificial intelligence according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram showing a data matrix to be processed in an embodiment of the present invention.
  • FIG. 3 is a schematic diagram showing data to be processed by a data transmission module according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram showing data to be processed by a data transmission module according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram showing an artificial intelligence parallel processing apparatus according to an embodiment of the present invention.
  • The artificial intelligence parallel processing method is applied to a processing module, which may be, for example, an ARM module, an MCU module, or an SoC module.
  • the artificial intelligence parallel processing method specifically includes:
  • the data transmission module is configured to take out multiple channel data from the external storage module according to a preset data size.
  • The data transmission module may transfer data by DMA (Direct Memory Access), a high-speed transfer mechanism that moves data directly between the external memory and the Programmable Logic terminal without CPU intervention.
  • the external storage module may be, for example, a DDR memory, and is disposed outside the Programmable Logic terminal for storing a plurality of channel data.
  • the channel data is data to be processed, and is usually stored in a memory in the form of a data matrix.
  • the data transmission module is configured to transmit the extracted channel data to a convolution operation module for parallel convolution operation with multiple convolution kernel matrices.
  • The convolution operation module is a convolution operation circuit, which may be composed of multipliers and adders.
  • the convolution operation module includes a plurality of convolution kernel matrices, and each of the convolution kernel matrices has different weights.
  • As an example, suppose an image has three channel data, R, G, and B, i.e., three two-dimensional data matrices, and suppose the convolution kernel matrices are of size K*K, where K is an odd number, here 3. Suppose further that the data transmission module fetches the channel data according to a data size of 8*3*3, i.e., the data transmission module takes out eight 3*3 matrices at a time.
  • If the three two-dimensional matrices of R, G, and B are not subjected to parallel convolution operations, three consecutive calculations are needed to complete the computation, which is time-consuming and computationally inefficient.
  • In the present invention, the three two-dimensional matrices of R, G, and B are convolved in parallel with the eight 3*3 matrices, so that each set of eight 3*3 matrices yields 8*3 = 24 convolution result values, as the count below illustrates.
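As a quick arithmetic check on that count, using only the example's numbers:

```python
windows_per_cycle = 8   # eight 3*3 matrices fetched per cycle
channels = 3            # R, G, and B handled in parallel
print(windows_per_cycle * channels)  # 24 = 8*3 result values per cycle,
                                     # versus three sequential per-channel passes
```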
  • Hence the present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting the convolution operation of the next convolution kernel matrix, and it realizes the parallel convolution operation with hardware such as a convolution operation circuit; especially when facing a large amount of data computation, the convolution efficiency is greatly improved compared with software computation. The artificial intelligence parallel processing method therefore greatly improves processing parallelism and computational efficiency.
  • First, the data transmission module fetches the channel data from the external storage module to the first storage module according to a 1*1 data size.
  • The first storage module may be a RAM or a ROM memory, such as DDR3 or DDR4 SDRAM.
  • Referring to FIG. 2, a schematic diagram of channel data in an embodiment of the present invention is shown.
  • Next, the data transmission module fetches the channel data from the first storage module to the second storage module according to a pv*1 data size.
  • Here pv is the data transmission parallelism, indicating the number of columns the data transmission module processes per transfer; its value is tied to the efficiency of the artificial intelligence parallel processing method, and the number of columns of the channel data is an integer multiple of pv.
  • The following describes, with a specific illustration, how the transmission module fetches channel data according to an 8*1 data size.
  • Referring to FIG. 3, a schematic diagram of the data transmission module fetching channel data in an embodiment of the present invention is shown.
  • The data transmission module starts from the leftmost side of the first row of data to be processed and fetches 8*1 data at a time until all of the first row's pending data has been fetched. On the same principle, it continues with the second row, the third row, and so on, until the entire 34*40 matrix has been fetched, in the order sketched below.
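A small sketch of that fetch order, under the same 34*40 matrix and pv = 8 assumptions:

```python
import numpy as np

pv = 8
matrix = np.arange(34 * 40).reshape(34, 40)
strips = [matrix[r, c:c + pv]
          for r in range(34)            # first row, then second row, ...
          for c in range(0, 40, pv)]    # each row left to right in 8*1 strips
print(len(strips), strips[0].shape)  # 170 strips (34 * 40/8) of 8 elements each
```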
  • After the data transmission module has stored the 34*40 matrix in the first storage module (and, via the pv*1 transfers, in the second), it proceeds according to the pv*k data size, where k is the size of the convolution kernel matrix, i.e., the weight matrix used for the convolution operation; the convolution kernel matrix may be set as an odd-order matrix, and in the present embodiment it is set to a 3*3 matrix. That is, the data transmission module fetches the 34*40 matrix from the second storage module in batches of 8*3 matrices and places them into the matrix module for data combination.
  • In each clock cycle, the data transmission module fetches 8*3 matrices from the first three rows of the 34*40 matrix, in order from left to right; a total of five 8*3 matrices can thus be fetched from the first three rows. On the same principle, once the first three rows have been fetched, the data transmission module continues with the pending data of the subsequent rows.
  • The rectangular dotted frames R1 to R5 in FIG. 2 represent the five 8*3 matrices in the first three rows.
  • Referring to FIG. 4, a schematic diagram of the data transmission module fetching data in an embodiment of the present invention is shown.
  • The first 8*3 matrix M1 that the data transmission module fetches from the second storage module in each row has no preceding block whose columns it can borrow, so its convolution operation can only yield fewer than 8 convolution result values. The first 8*3 matrix fetched per row is therefore set as invalid data, which keeps the pipeline of the artificial intelligence processing uniform; accordingly, the convolution result of the 8*3 matrix M1 is an invalid value.
  • Next, the data transmission module fetches a second 8*3 matrix M2, and the 8*3 matrix M2 is combined with the last two columns of the 8*3 matrix M1 into a 10*3 matrix M12; in FIG. 4, the line L1 marks the matrix data that are combined with each other. That is, the data matrix M2 and the last two columns of the data matrix M1 form a data matrix M12 with (pv+2), i.e., 10, columns.
  • Matrix extraction can then be performed on the 10*3 matrix M12 with a step size of 1, yielding eight 3*3 matrices. The rectangular dotted frame R6, starting from the position shown in FIG. 4, moves to the right column by column with a step size of 1 and yields one 3*3 matrix at each position; R6 can move a total of 7 times within the 10*3 matrix M12, giving a total of eight 3*3 matrices, that is, pv matrices of size k*k.
  • The eight 3*3 matrices are transmitted to the convolution operation module to undergo parallel convolution operations with the three 3*3 convolution kernel matrices, yielding 3*8 = 24 calculation result values.
  • In the next clock cycle, the data transmission module fetches a third 8*3 matrix M3, and the 8*3 matrix M3 is combined with the last two columns of the 8*3 matrix M2 into a 10*3 matrix M23; the line L2 marks the matrix data that are combined with each other. That is, the data matrix M3 and the last two columns of the data matrix M2 form a data matrix M23 with 10 columns. Matrix extraction on the 10*3 matrix M23 with a step size of 1 again yields eight 3*3 matrices, which are transmitted to the convolution operation module, convolved in parallel with the three 3*3 convolution kernel matrices, and produce 3*8 = 24 calculation result values.
  • Proceeding on the same principle, the data transmission module completes the processing of the entire 34*40 matrix after a number of clock cycles.
  • As shown in FIG. 5, an artificial intelligence parallel processing apparatus includes: a first storage module 51, a second storage module 52, a data transmission module 53, a processing module 54, a matrix module 55, and a convolution operation module 56. The first storage module 51, the second storage module 52, the data transmission module 53, the matrix module 55, and the convolution operation module 56 are disposed together on the Programmable Logic terminal 50 of the FPGA, generally referred to as the PL terminal.
  • The data transmission module is specifically configured to transmit the channel data from the external storage module 57 to the first storage module 51 over the system bus according to the 1*1 data size, to fetch it from the first storage module 51 and transfer it to the second storage module 52 according to the pv*1 data size, to fetch it from the second storage module 52 and transmit it to the matrix module according to the pv*k data size, and finally to fetch it from the matrix module and transmit it to the convolution operation module 56 according to the pv*k*k data size. An end-to-end software model of this data path is sketched below.
  • the convolution operation module 56 is provided with a plurality of convolution kernel matrices for parallel convolution operations.
  • the plurality of convolution kernel matrices are specifically: a convolution kernel matrix 1, a convolution kernel matrix 2, ..., a convolution kernel matrix n.
  • The first storage module 51 may be, for example, a BRAM memory, i.e., block RAM, an on-chip RAM storage resource of an FPGA (Field-Programmable Gate Array).
  • The processing module 54 may be, for example, an ARM module, an MCU module, or an SoC module.
  • The implementation of the artificial intelligence processing apparatus is similar to that of the artificial intelligence parallel processing method and is therefore not described again; those skilled in the art should be able to understand the principle and implementation of the artificial intelligence processing apparatus from the artificial intelligence parallel processing method.
  • the aforementioned computer program can be stored in a computer readable storage medium.
  • When executed, the program performs the steps of the foregoing method embodiments; the foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
  • The present invention also provides an artificial intelligence processing terminal, comprising a processor and a memory; the memory is for storing a computer program, and the processor is configured to execute the computer program stored in the memory, so that the terminal performs the artificial intelligence parallel processing method.
  • The above memory may include a random access memory (RAM) and may also include a non-volatile memory, such as at least one disk storage device.
  • The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • In summary, the present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting the convolution operation of the next convolution kernel matrix, and it realizes the parallel convolution operation with hardware such as a convolution operation circuit; especially when facing a large amount of data computation, the convolution efficiency is greatly improved compared with software computation. The present invention therefore greatly improves processing parallelism and computational efficiency through the artificial intelligence parallel processing method, effectively overcomes various shortcomings of the prior art, and has high industrial value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Multi Processors (AREA)

Abstract

Disclosed is an artificial intelligence parallel processing method for use in a processing module (54), the method comprising: causing a data transmission module to fetch a plurality of channel data from an external storage module according to a preset data size (S101); and causing the data transmission module to transmit the fetched channel data to a convolution operation module to undergo parallel convolution operations with a plurality of convolution kernel matrices (S102). The method need not wait for the convolution operation of one convolution kernel matrix to finish before carrying out the convolution operation of the next convolution kernel matrix, and it implements the parallel convolution operations by means of a hardware device such as a convolution operation circuit; in particular, when facing a large amount of data computation, it considerably improves the efficiency of convolution operations compared with software computation. Processing parallelism and computational efficiency are thereby considerably improved by the present artificial intelligence parallel processing method.
PCT/CN2018/072663 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, computer-readable storage medium, and terminal Ceased WO2019136751A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201880002151.7A CN109416755B (zh) 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, readable storage medium, and terminal
PCT/CN2018/072663 WO2019136751A1 (fr) 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, computer-readable storage medium, and terminal
US16/929,819 US11874898B2 (en) 2018-01-15 2020-07-15 Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/072663 WO2019136751A1 (fr) 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, computer-readable storage medium, and terminal

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072665 Continuation-In-Part WO2019136752A1 (fr) 2018-01-15 2018-01-15 Artificial intelligence convolution processing method and device, storage medium, and terminal

Related Child Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2018/072665 Continuation-In-Part WO2019136752A1 (fr) 2018-01-15 2018-01-15 Artificial intelligence convolution processing method and device, storage medium, and terminal
US16/929,819 Continuation-In-Part US11874898B2 (en) 2018-01-15 2020-07-15 Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal

Publications (1)

Publication Number Publication Date
WO2019136751A1 true WO2019136751A1 (fr) 2019-07-18

Family

ID=65462117

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072663 Ceased WO2019136751A1 (fr) 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, computer-readable storage medium, and terminal

Country Status (2)

Country Link
CN (1) CN109416755B (fr)
WO (1) WO2019136751A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132275A (zh) * 2020-09-30 2020-12-25 南京风兴科技有限公司 Parallel computing method and apparatus
CN112306949A (zh) * 2019-07-31 2021-02-02 中科寒武纪科技股份有限公司 Data processing method and apparatus, and related product

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298441B (zh) * 2019-05-24 2022-01-11 深圳云天励飞技术有限公司 Data processing method, electronic device, and computer-readable storage medium
CN110928216B (zh) * 2019-11-14 2020-12-15 深圳云天励飞技术有限公司 Artificial intelligence apparatus
CN113705795B (zh) * 2021-09-16 2024-12-17 深圳思谋信息科技有限公司 Convolution processing method and apparatus, convolutional neural network accelerator, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090132794A1 (en) * 2007-11-16 2009-05-21 Paul Michael Ebert Method and apparatus for performing complex calculations in a multiprocessor array
CN106530210A (zh) * 2016-10-31 2017-03-22 北京大学 Device and method for implementing parallel convolution computation based on a resistive memory device array
CN106875012A (zh) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 FPGA-based pipelined acceleration system for deep convolutional neural networks
CN106909970A (zh) * 2017-01-12 2017-06-30 南京大学 Binary-weight convolutional neural network hardware accelerator computation module based on approximate computing

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328644A1 (en) * 2015-05-08 2016-11-10 Qualcomm Incorporated Adaptive selection of artificial neural networks
CN106228238B (zh) * 2016-07-27 2019-03-22 中国科学技术大学苏州研究院 Method and system for accelerating deep learning algorithms on a field-programmable gate array platform
CN106845635A (zh) * 2017-01-24 2017-06-13 东南大学 Cascade-based CNN convolution kernel hardware design method
CN106951395B (zh) * 2017-02-13 2018-08-17 上海客鹭信息技术有限公司 Parallel convolution operation method and apparatus for compressed convolutional neural networks
CN106970896B (zh) * 2017-03-30 2020-05-12 中国人民解放军国防科学技术大学 Vectorized implementation method of two-dimensional matrix convolution for vector processors

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090132794A1 (en) * 2007-11-16 2009-05-21 Paul Michael Ebert Method and apparatus for performing complex calculations in a multiprocessor array
CN106530210A (zh) * 2016-10-31 2017-03-22 北京大学 Device and method for implementing parallel convolution computation based on a resistive memory device array
CN106909970A (zh) * 2017-01-12 2017-06-30 南京大学 Binary-weight convolutional neural network hardware accelerator computation module based on approximate computing
CN106875012A (zh) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 FPGA-based pipelined acceleration system for deep convolutional neural networks

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112306949A (zh) * 2019-07-31 2021-02-02 中科寒武纪科技股份有限公司 Data processing method and apparatus, and related product
CN112306949B (zh) * 2019-07-31 2022-11-01 中科寒武纪科技股份有限公司 Data processing method and apparatus, and related product
CN112132275A (zh) * 2020-09-30 2020-12-25 南京风兴科技有限公司 Parallel computing method and apparatus

Also Published As

Publication number Publication date
CN109416755B (zh) 2021-11-23
CN109416755A (zh) 2019-03-01

Similar Documents

Publication Publication Date Title
CN109992743B (zh) 矩阵乘法器
CN106445471B (zh) 处理器和用于在处理器上执行矩阵乘运算的方法
CN112214726B (zh) 运算加速器
WO2019136751A1 (fr) Procédé et appareil de traitement parallèle d'intelligence artificielle, support d'informations lisible par ordinateur et terminal
CN108388537B (zh) 一种卷积神经网络加速装置和方法
US11550586B2 (en) Method and tensor traversal engine for strided memory access during execution of neural networks
CN108090565A (zh) 一种卷积神经网络并行化训练加速方法
WO2017185389A1 (fr) Dispositif et procédé servant à exécuter des opérations de multiplication de matrices
WO2018107383A1 (fr) Procédé et dispositif de calcul de convolution d'un réseau de neurones artificiels, et support d'enregistrement lisible par ordinateur
CN102053948A (zh) 在单指令多数据多核处理器架构上转置矩阵的方法和系统
WO2019136764A1 (fr) Convoluteur et dispositif de traitement intelligent artificiel appliqué à celui-ci
WO2017185393A1 (fr) Appareil et procédé d'exécution d'une opération de produit interne de vecteurs
CN108388527A (zh) 直接存储器存取引擎及其方法
CN111353575A (zh) 用于卷积神经网络的图块化格式
CN110929854B (zh) 一种数据处理方法、装置及硬件加速器
CN114995782B (zh) 数据处理方法、装置、设备和可读存储介质
WO2019136750A1 (fr) Dispositif et procédé de traitement assisté par ordinateur basé sur l'intelligence artificielle, support de stockage, et terminal
CN109313723B (zh) 人工智能卷积处理方法、装置、可读存储介质、及终端
WO2021083101A1 (fr) Procédé et appareil de traitement de données, et produit connexe
KR20210014561A (ko) 다수 컨벌루션 윈도우 중의 이미지 데이터를 추출하는 방법, 장치, 기기 및 컴퓨터 판독 가능한 저장매체
US11409840B2 (en) Dynamically adaptable arrays for vector and matrix operations
WO2020103883A1 (fr) Procédé d'exécution de multiplication de matrice, circuit et soc
CN110837483A (zh) 张量维度变换的方法以及装置
US11874898B2 (en) Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal
CN111047021A (zh) 一种计算装置及相关产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18899322

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.11.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18899322

Country of ref document: EP

Kind code of ref document: A1