WO2025124574A1 - Computing system, method executed by computing system, and storage medium - Google Patents
- Publication number
- WO2025124574A1 (PCT/CN2024/139346)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- computing
- instruction
- tensor
- data
- general
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7821—Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/173—Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
- G06F15/17306—Intercommunication techniques
- G06F15/17331—Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Definitions
- the present invention relates to the field of computer technology, and more specifically, to a computing system, a method executed by the computing system, and a storage medium.
- artificial intelligence chip solutions for detection tasks and computing tasks are mainly divided into two categories.
- One is general-purpose processors, such as CPU (Central Processing Unit), FPGA (Field Programmable Gate Array), GPU (Graphics Processing Unit), etc.; the other is acceleration processors specifically for artificial neural networks, such as Google's TPU (Tensor Processing Units).
- the purpose of the present invention is to provide a computing system, a method executed by the computing system and a storage medium, which combine a general-purpose processor and a tensor processor together, use the general-purpose processor to process general-purpose calculations in neural networks and use the tensor processor to perform in-memory calculations and non-in-memory calculations, thereby reducing data movement and improving computing efficiency while ensuring versatility.
- a computing system comprising a general-purpose processor and a tensor processor; wherein the general-purpose processor comprises an instruction scheduling module and at least one computing unit, the instruction scheduling module being used to identify computing instructions, and when the computing instructions are identified as tensor operation instructions, sending the tensor operation instructions to the tensor processor; the tensor processor being used to perform in-memory computing and/or non-in-memory computing according to the tensor operation instructions.
- the instruction scheduling module is further used to send the general operation instruction to the computing unit when identifying that the computing instruction is a general operation instruction; the computing unit is used to perform general calculation according to the general operation instruction.
- the computing unit comprises at least one computing core, and the computing core is used to perform general computing according to general operation instructions.
- the computing instruction includes an operation code and an operation field, wherein the operation code includes the instruction type and operation code of the computing instruction, and the operation field includes the source address and target address of the data to be operated.
- the instruction types include tensor operations, vector operations, scalar operations and transcendental function operations.
- the instruction scheduling module identifies the computing instruction as a tensor operation instruction; when the instruction type in the computing instruction is one of a vector operation, a scalar operation and a transcendental function operation, the instruction scheduling module identifies the computing instruction as a general operation instruction.
- the tensor operation instruction includes at least one of a matrix multiplication operation, a matrix multiplication-addition operation and a convolution operation;
- the general operation instruction includes at least one of a floating-point multiplication-addition operation, an integer multiplication-addition operation and a transcendental function operation.
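The classification rule above can be sketched as a small dispatch function. This is an illustrative sketch only: the names `InstrType` and `dispatch`, and the mapping of type tags to the mnemonics Tensor/Float/INT/SUF mentioned later in the description, are assumptions, not part of the claimed hardware.

```python
# Hypothetical sketch of the dispatch rule: the instruction scheduling
# module inspects the instruction-type field and routes tensor operations
# to the tensor processor and everything else to a computing unit.
from enum import Enum

class InstrType(Enum):
    TENSOR = "Tensor"        # matrix multiplication, multiply-add, convolution
    VECTOR = "Float"         # floating-point multiply-add
    SCALAR = "INT"           # integer multiply-add
    TRANSCENDENTAL = "SUF"   # transcendental / special functions

def dispatch(instr_type: InstrType) -> str:
    """Return the destination implied by the instruction type."""
    if instr_type is InstrType.TENSOR:
        return "tensor_processor"
    return "computing_unit"
```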
- the tensor processor includes an instruction decoding module, a first computing module and a second computing module.
- the instruction decoding module is used to parse the tensor operation instruction, generate a selection signal according to the source address of the data to be operated and obtain the data to be operated, divide the data to be operated into multiple groups of operation data, and send the operation code and the multiple groups of operation data to the first computing module or the second computing module according to the selection signal;
- the first computing module is used to perform in-memory calculations according to the received operation code, multiple groups of operation data and target address;
- the second computing module is used to perform non-in-memory calculations according to the received operation code, multiple groups of operation data and target address.
- the instruction decoding module includes: an instruction parsing unit, used to parse the tensor operation instruction to obtain the operation code and the source address and target address of the data to be operated; a data grouping unit, used to obtain the data to be operated according to the source address of the data to be operated, and divide the data to be operated into multiple groups of operation data; a control unit, used to generate a selection signal according to the source address of the data to be operated, and send the operation code, multiple groups of data and the target address to the first computing module or the second computing module according to the selection signal.
- the first computing module and the second computing module are connected to a general processor using a unified interface.
- the unified interface includes one of a PCIE interface and a UCIE interface.
- the general purpose processor and the tensor processor are packaged in the same core.
- the general-purpose processor and the tensor processor are packaged into different cores and integrated into the same chip or deployed on different chips.
- the general-purpose processor is any one of a CPU, a GPU, a DSP, and a GPGPU.
- a computing method performed by a computing system, wherein the computing system comprises a general-purpose processor and a tensor processor, the general-purpose processor comprises an instruction scheduling module and at least one computing unit, and the computing method comprises: the instruction scheduling module of the general-purpose processor identifies computing instructions; when the instruction scheduling module identifies that the computing instruction is a tensor operation instruction, the tensor operation instruction is sent to the tensor processor; the tensor processor performs in-memory computing and/or non-in-memory computing according to the tensor operation instruction.
- the calculation method further comprises: when the instruction scheduling module identifies that the computing instruction is a general operation instruction, the general operation instruction is sent to the computing unit; and the computing unit performs general computing according to the general operation instruction.
- the computing instruction includes an operation code and an operation field, wherein the operation code includes the instruction type and operation code of the computing instruction, and the operation field includes the source address and target address of the data to be operated.
- the instruction types include tensor operations, vector operations and scalar operations.
- the tensor processor performs in-memory calculations and/or non-in-memory calculations according to the tensor operation instructions, including: parsing the tensor operation instructions to obtain the operation code and the source address and target address of the data to be operated; obtaining the data to be operated according to the source address of the data to be operated, and dividing the data to be operated into multiple groups of operation data; generating a selection signal according to the source address of the data to be operated, and performing in-memory calculations and/or non-in-memory calculations according to the selection signal, the operation code, the multiple groups of operation data and the target address.
- a computer-readable storage medium wherein the computer-readable storage medium stores a computer program, and wherein the computer program implements the above-described method when executed by a processor.
- FIG2 is a schematic diagram showing the structure of an instruction decoding module in a tensor processor provided in an embodiment of the present invention
- FIG3 is a block diagram showing a computing system according to another embodiment of the present invention.
- FIG4 is a block diagram showing a computing system according to another embodiment of the present invention.
- FIG5 is a flow chart showing a calculation method provided by an embodiment of the present invention.
- FIG. 6 shows a flow chart of step S530 provided by an embodiment of the present invention.
- In-memory computing generally adopts the ASIC (application-specific integrated circuit) model, which supports very limited operator types, offers poor programming flexibility and versatility, and cannot flexibly keep pace with the current development speed of neural network models.
- the basic idea of the present application is to combine a general-purpose processor and a tensor processor, use the instruction type in the computing instruction to distinguish between tensor operation instructions and general-purpose operation instructions, send the tensor operation instructions to the tensor processor to perform in-memory calculations or non-in-memory calculations, and send the general-purpose operation instructions to the general-purpose processor to perform general-purpose calculations.
- This can not only ensure versatility but also improve computing power, and support complex and changeable computing operators in neural networks.
- Fig. 1 is a schematic diagram showing the structure of a computing system provided according to an embodiment of the present application.
- the computing system 100 includes a general purpose processor 110 and a tensor processor 120 .
- the general purpose processor 110 includes an instruction scheduling module 111 and at least one computing unit CU.
- the instruction scheduling module 111 is used to identify a computing instruction, and when the computing instruction is identified as a tensor operation instruction, send the tensor operation instruction to the tensor processor 120 .
- the instruction type is used to describe the type of operation involved in the computing instruction, and the instruction type includes tensor operation, vector operation, scalar operation and transcendental function operation.
- the operation code is used to describe the operation to be completed by the computing instruction (for example, addition, subtraction, multiplication, division, special function, etc.), which specifically describes the nature and function of the operation.
- the operation domain includes the source address and target address of the data to be operated.
- the source address and the target address can be a memory address or a register address (i.e., a register number).
- the memory or register can be an off-chip memory. Of course, in practical applications, it can also be an on-chip memory for storing data.
- the opcode may be the part of the instruction or field (usually represented by a code) that specifies the operation to be performed; it serves as an instruction sequence number that informs the device executing the instruction which specific operation needs to be executed.
- the operation domain may be the source of all data required to execute the corresponding instruction, such as the corresponding address, etc. All data required to execute the corresponding instruction include the data to be calculated and the corresponding instruction processing method, etc.
- A calculation instruction must include an opcode and an operation domain, wherein the operation domain at least includes the source address and the target address of the data to be calculated. It should be understood that those skilled in the art can set the instruction format of the calculation instruction and the opcode and operation domain contained therein as needed, and the present disclosure does not limit this.
- the source address of the data to be operated may be the starting address of the storage space where the data to be operated is located, and the general processor 110 or the tensor processor 120 may obtain instructions and data through a data input and output unit, which may be one or more data I/O interfaces or I/O pins. Further, the general processor 110 or the tensor processor 120 may determine the data to be operated according to the source address of the data to be operated, and obtain the data to be operated. Of course, in other embodiments, the general processor 110 or the tensor processor 120 may also determine the data required for the operation according to the operation code of the calculation instruction.
- the instruction format of the calculation instruction may be as shown in the following table:
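The instruction-format table itself is not reproduced here; as a rough sketch of the fields described above, the layout can be modeled as an operation code (instruction type plus op code) and an operation domain holding the source and target addresses. All field and class names below are illustrative assumptions.

```python
# Minimal model of the described instruction layout. Field names
# (instr_type, opcode, src_addr, dst_addr) are assumptions for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class OperationField:
    src_addr: int   # memory address or register number of the data to be operated
    dst_addr: int   # where the result is written

@dataclass(frozen=True)
class ComputeInstruction:
    instr_type: str          # e.g. "Tensor", "Float", "INT", "SUF"
    opcode: str              # e.g. "Conv", "MAC", "MM"
    operands: OperationField

instr = ComputeInstruction("Tensor", "Conv",
                           OperationField(src_addr=0x1000, dst_addr=0x2000))
```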
- the scalar operation and vector operation in the general operation instruction may be shown in the following table:
- When executing the operation instructions shown in the above table, the general processor 110 reads Num1 floating-point or integer data items from the address specified in register Reg1, reads Num2 floating-point or integer data items from the address specified in register Reg2, performs the multiplication and addition operations, and stores the calculation results in the address space specified in register Reg3.
- the transcendental function operation in the general operation instruction may be as shown in the following table:
- When executing the operation instructions shown in the above table, the general processor 110 reads Num1 data items from the address specified in register Reg1, reads Num2 data items from the address specified in register Reg2, performs the function operation, and stores the calculation result in the address space specified in register Reg3.
- the tensor operation instruction may be as shown in the following table:
- When executing the operation instructions shown in the above table, the tensor processor 120 reads Num1 tensor data items from the address specified in register Reg1, reads Num2 weight data items from the address specified in register Reg2, performs a convolution operation or a multiplication and addition operation, and stores the calculation result in the address space specified in register Reg3.
- the data that needs to be read is a fixed amount of data, which can be a single value, a row of data, or a column of data, etc.; the embodiment of the present application does not impose specific restrictions.
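The read-compute-write pattern just described can be sketched as follows, with the register file and memory modeled as plain dictionaries. This is a behavioral sketch under stated assumptions, not the hardware datapath; the function name `execute_mac` is hypothetical.

```python
# Sketch of the tensor processor's execution pattern: read Num1 tensor
# values from the address in Reg1, Num2 weight values from the address in
# Reg2, perform a multiply-accumulate, and store the result at the address
# held in Reg3.
def execute_mac(regs, mem, num1, num2):
    a = [mem[regs["Reg1"] + i] for i in range(num1)]   # tensor data
    w = [mem[regs["Reg2"] + i] for i in range(num2)]   # weight data
    result = sum(x * y for x, y in zip(a, w))          # multiply-accumulate
    mem[regs["Reg3"]] = result                         # write back
    return result
```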
- the instruction scheduling module 111 identifies the computing instruction as a tensor operation instruction; when the instruction type in the computing instruction is one of a vector operation, a scalar operation, and a transcendental function operation (for example, the instruction type is Float, INT, SUF), the instruction scheduling module 111 identifies the computing instruction as a general operation instruction.
- when the instruction type in the computing instruction is a tensor operation (for example, the instruction type is Tensor), the instruction scheduling module 111 identifies the computing instruction as a tensor operation instruction.
- the tensor operation instruction includes at least one of a matrix multiplication operation (MM), a matrix multiplication and addition operation (MAC), and a convolution operation (Conv);
- the general operation instruction includes at least one of a floating-point multiplication and addition operation, an integer multiplication and addition operation, and a transcendental function operation.
- the instruction scheduling module 111 is further configured to send the general operation instruction to the computing unit CU when identifying that the computing instruction is a general operation instruction; the computing unit CU is configured to perform general calculation according to the general operation instruction.
- FIG1 specifically shows two computing units CU as examples, and omits other possible computing units.
- Each computing unit CU includes an instruction distribution module, multiple computing cores (Kernel), a register file, a shared L1 cache, etc.
- the instruction scheduling module 111 is also used to schedule the execution of computing tasks between multiple computing units CU.
- the general-purpose processor in this embodiment is any one of a CPU, a GPU, a DSP, and a GPGPU.
- the computing system can be used for computing tasks such as matrix calculations, which can be executed in parallel by multiple threads. For example, before execution, these threads are divided into multiple thread blocks in the instruction scheduling module 111, and these thread blocks are then distributed to the computing units CU (for example, streaming multiprocessors (SM)). All threads in a thread block are usually assigned to the same computing unit for execution. At the same time, each thread block is split into thread bundles (warps), where each warp contains a fixed number of threads (or fewer than this fixed number), for example, 32 threads. Multiple thread blocks can be executed in the same computing unit or in different computing units.
- the instruction distribution module 112 schedules and distributes the thread bundles so that the multiple computing cores of the computing unit CU run the corresponding thread bundles.
- Each computing core includes an arithmetic logic unit (ALU), a floating point computing unit, etc.
- multiple thread bundles in a thread block can be executed simultaneously or in time-sharing. Multiple threads in each thread bundle will execute the same instruction, and the result obtained after the instruction is executed is updated to the register corresponding to each thread bundle.
- the instructions and data corresponding to each computing unit CU are sent to the shared cache (e.g., shared L1 cache) in the computing unit or further sent to the unified cache for read and write operations, etc.
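The block-to-warp partitioning described above amounts to simple fixed-size chunking; the following sketch (with an assumed warp size of 32, as in the example) shows the arithmetic. The function name is illustrative.

```python
# Illustrative arithmetic for warp partitioning: a thread block is split
# into warps of a fixed size; the last warp of a block may hold fewer
# threads than the fixed size.
def split_into_warps(block_size: int, warp_size: int = 32):
    """Return the thread count of each warp in one thread block."""
    full, rem = divmod(block_size, warp_size)
    return [warp_size] * full + ([rem] if rem else [])
```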
- the tensor processor 120 is used to perform in-memory calculations and/or non-in-memory calculations according to the tensor operation instructions.
- the tensor processor 120 includes an instruction decoding module 121 , a first computing module 122 , and a second computing module 123 .
- the instruction decoding module 121 is used to parse the tensor operation instruction, generate a selection signal according to the source address of the data to be operated and obtain the data to be operated, divide the data to be operated into multiple groups of operation data, and send the operation code and multiple groups of operation data to the first computing module 122 or the second computing module 123 according to the selection signal.
- the instruction decoding module 121 obtains the data to be operated and generates a selection signal according to the source address of the data to be operated described in the tensor operation instruction.
- the data to be operated includes activation data and weight data; if the source address of the weight data is a storage-computation integrated unit, the weight data is static data, and the instruction decoding module 121 sends the operation code and multiple groups of operation data to the first calculation module 122 according to the selection signal; if the source address of the weight data is a memory, the weight data is dynamic data, and the instruction decoding module 121 sends the operation code and multiple groups of operation data to the second calculation module 123 according to the selection signal.
- the instruction decoding module 121 can also generate a selection signal according to the operation code.
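The address-based selection rule above can be sketched as follows. The address range standing in for the storage-computation integrated (compute-in-memory) unit is made up for the demo; in the described system the decision depends on where the weight data's source address actually resides.

```python
# Sketch of the selection rule: if the weight data's source address lies in
# the compute-in-memory (storage-computation integrated) unit, the weights
# are static and the first (in-memory) computing module is selected; if it
# lies in ordinary memory, the weights are dynamic and the second
# (non-in-memory) computing module is selected.
CIM_REGION = range(0x0000, 0x4000)   # assumed compute-in-memory address range

def select_module(weight_src_addr: int) -> str:
    if weight_src_addr in CIM_REGION:
        return "first_module_in_memory"      # static weights, CIM path
    return "second_module_non_in_memory"     # dynamic weights, e.g. GEMM path
```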
- the instruction decoding module 121 includes an instruction parsing unit 1211, a data grouping unit 1212 and a control unit 1213, wherein the instruction parsing unit 1211 is used to parse the tensor operation instruction to obtain the operation code and the source address and target address of the data to be operated; the data grouping unit 1212 is used to obtain the data to be operated according to the source address of the data to be operated, and divide the data to be operated into multiple groups of operation data; the control unit 1213 is used to generate a selection signal according to the source address of the data to be operated, and send the operation code, multiple groups of data and the target address to the first computing module 122 or the second computing module 123 according to the selection signal.
- the instruction parsing unit 1211 is used to parse the tensor operation instruction to obtain the operation code and the source address and target address of the data to be operated
- the data grouping unit 1212 is used to obtain the data to be operated according to the source address of the data to be operated, and divide the data to be operated
- the operation domain may also include an execution amount.
- the instruction decoding module 121 is also used to obtain the execution amount and divide the data to be operated into multiple groups of operation data according to the execution amount.
- the execution amount is the amount of data that the first calculation module 122 or the second calculation module 123 can execute and process at one time.
- the data to be operated can be divided into multiple groups of operation data according to a preset default execution amount.
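The grouping step reduces to chunking the data by the execution amount; a minimal sketch, assuming a default execution amount of 4 when the operation domain does not supply one (the default value is illustrative):

```python
# Divide the data to be operated on into groups no larger than the
# execution amount (the data volume one computing module can process
# in a single pass).
def group_data(data, execution_amount=4):
    return [data[i:i + execution_amount]
            for i in range(0, len(data), execution_amount)]
```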
- the first calculation module 122 is used to perform in-memory calculation according to the received operation code, multiple sets of operation data and target address.
- the first computing module 122 can perform in-memory computing (CIM).
- the first computing module 122 is composed of storage-computation integrated (compute-in-memory) units built on SRAM, ReRAM, or other storage media.
- the second calculation module 123 is used to perform non-in-memory calculation according to the received operation code, multiple groups of operation data and target address.
- the non-in-memory calculation includes near-memory calculation or other calculations, such as general matrix multiplication (GEMM).
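For concreteness, a textbook general matrix multiplication (GEMM) is the kind of non-in-memory computation the second computing module performs. The naive triple loop below is a reference sketch only, not the hardware algorithm.

```python
# Naive reference GEMM: C = A x B for dense row-major nested lists.
def gemm(a, b):
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            for j in range(m):
                c[i][j] += a[i][p] * b[p][j]
    return c
```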
- the first computing module 122 and the second computing module 123 are connected to the general processor 110 using a unified interface, and the unified interface can be one of a PCIE interface and a UCIE interface.
- the data structures of the input data and the output data of the first computing module 122 and the second computing module 123 are the same, and the first computing module 122 and the second computing module 123 can be driven by one instruction to perform calculations, and the first computing module 122 or the second computing module 123 can be switched to perform calculations through instructions.
- the general processor 110 and the tensor processor 120 can be packaged in the same core.
- the general processor 110 and the tensor processor 120 can also be packaged into different cores, integrated into the same chip or deployed on different chips.
- the positional relationship between the general processor 110 and the tensor processor 120 can be set according to actual applications, and is not limited thereto.
- the computing system provided by the present invention combines a general-purpose processor and a tensor processor together, uses the general-purpose processor to process general-purpose calculations in neural networks and uses the tensor processor to perform in-memory calculations or non-in-memory calculations, which can not only ensure versatility but also improve computing power and support complex and changeable computing operators in neural networks.
- the computing instructions use a unified instruction format, and one instruction can be used to drive different computing modules in the tensor processor to perform in-memory or non-in-memory computing, greatly reducing programming complexity.
- FIG4 is a schematic diagram showing the structure of a computing system provided by another embodiment of the present invention. As shown in FIG4 , the computing system includes a task distribution module 310 and a plurality of computing devices 320 .
- This embodiment is described by taking two computing devices 320A and 320B as an example, but is not limited thereto.
- the task distribution module 310 is used to distribute the computing instructions to multiple computing devices 320.
- the computing device 320 receives the computing instruction and performs corresponding calculations according to the computing instruction.
- the computing devices 320A and 320B include a general processor 321 and a tensor processor 322.
- the general processor 321 and the tensor processor 322 are the same as those described in the above embodiment, and will not be repeated here.
- In other embodiments, the computing devices 320A and 320B include only a general-purpose processor 321; the tensor processor 322 is located outside the computing devices, and multiple computing devices 320 share the same tensor processor.
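The task distribution module's behavior can be sketched as a round-robin assignment of computing instructions to devices. The round-robin policy and all names here are assumptions for illustration; the patent text does not specify a distribution strategy.

```python
# Hypothetical sketch of the task distribution module: computing
# instructions are handed out round-robin to the available computing
# devices (e.g. 320A and 320B).
from itertools import cycle

def distribute(instructions, devices):
    """Assign each instruction to a device in round-robin order."""
    assignment = {}
    for instr, dev in zip(instructions, cycle(devices)):
        assignment[instr] = dev
    return assignment
```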
- Fig. 5 shows a flow chart of a calculation method provided by an embodiment of the present invention. Referring to Fig. 5, the calculation method is executed by the calculation system 100 provided by the above embodiment, and includes the following steps.
- step S510 an instruction dispatching module of a general purpose processor identifies a computing instruction.
- the computing instruction includes an operation code and an operation field, wherein the operation code includes the instruction type and operation code of the computing instruction, and the operation field includes at least a source address and a target address of data to be operated.
- the instruction type is used to describe the type of operation involved in the computing instruction, and the instruction type includes tensor operation, vector operation, scalar operation and transcendental function operation.
- the operation code is used to describe the operation to be completed by the computing instruction (for example, addition, subtraction, multiplication, division, special function, etc.), which specifically describes the nature and function of the operation.
- the operation domain includes the source address and target address of the data to be operated.
- the source address and the target address can be a memory address or a register address (i.e., a register number).
- the memory or register can be an off-chip memory. Of course, in practical applications, it can also be an on-chip memory for storing data.
- the instruction scheduling module 111 identifies the computing instruction as a tensor operation instruction; when the instruction type in the computing instruction is one of a vector operation, a scalar operation, and a transcendental function operation (for example, the instruction type is Float, INT, SUF), the instruction scheduling module 111 identifies the computing instruction as a general operation instruction.
- when the instruction type in the computing instruction is a tensor operation (for example, the instruction type is Tensor), the instruction scheduling module 111 identifies the computing instruction as a tensor operation instruction.
- the tensor operation instruction includes at least one of a matrix multiplication operation (MM), a matrix multiplication and addition operation (MAC), and a convolution operation (Conv);
- the general operation instruction includes at least one of a floating-point multiplication and addition operation, an integer multiplication and addition operation, and a transcendental function operation.
- step S520 when the instruction scheduling module identifies that the computing instruction is a tensor operation instruction, the tensor operation instruction is sent to the tensor processor.
- step S530 the tensor processor performs in-memory calculations and/or non-in-memory calculations according to the tensor operation instructions.
- step S530 includes steps S531 to S533 .
- step S531 the tensor operation instruction is parsed to obtain the operation code and the source address and target address of the data to be operated.
- step S532 the data to be operated is obtained according to the source address of the data to be operated, and the data to be operated is divided into a plurality of groups of operation data.
- the operation domain may also include an execution amount.
- the instruction decoding module 121 is also used to obtain the execution amount and divide the data to be operated into multiple groups of operation data according to the execution amount.
- the execution amount is the amount of data that the first calculation module 122 or the second calculation module 123 can execute and process at one time.
- the data to be operated can be divided into multiple groups of operation data according to a preset default execution amount.
- in step S533, a selection signal is generated according to the source address of the data to be operated, and in-memory calculation and/or non-in-memory calculation is performed according to the selection signal, the operation code, the multiple groups of operation data, and the target address.
- the data to be operated includes activation data and weight data; if the source address of the weight data is a storage-computation integrated unit, the weight data is static data, and the instruction decoding module 121 sends the operation code and multiple groups of operation data to the first calculation module 122 according to the selection signal; if the source address of the weight data is a memory, the weight data is dynamic data, and the instruction decoding module 121 sends the operation code and multiple groups of operation data to the second calculation module 123 according to the selection signal.
- the instruction decoding module 121 can also generate a selection signal according to the operation code.
- in step S540, when the instruction scheduling module identifies the computing instruction as a general operation instruction, the general operation instruction is sent to the computing unit.
- in step S550, the computing unit performs general computing according to the general operation instruction.
- the computing method provided by the present invention combines a general-purpose processor and a tensor processor, uses the general-purpose processor to process general-purpose calculations in a neural network, and uses the tensor processor to perform in-memory or non-in-memory calculations, which both ensures versatility and improves computing power, and supports the complex and changeable computing operators in neural networks.
- the computing instructions use a unified instruction format, and one instruction can be used to drive different computing modules in the tensor processor to perform in-memory or non-memory computing, greatly reducing programming complexity.
- An embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the above-mentioned method embodiments can be implemented.
- An embodiment of the present application provides a computer program product.
- when the computer program product runs on an electronic device, the electronic device can implement the steps in the above-mentioned method embodiments.
- if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- all or part of the processes in the above-mentioned method embodiments of the present application can be implemented by instructing the relevant hardware through a computer program, and the computer program can be stored in a computer-readable storage medium.
- when the computer program is executed by the processor, the steps of the above-mentioned various method embodiments can be implemented.
- the computer program includes computer program code, and the computer program code can be in source code form, object code form, an executable file, or some intermediate form.
- the computer-readable medium may at least include: any entity or device that can carry the computer program code to the device/electronic device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, and a software distribution medium, for example, a USB flash drive, a mobile hard disk, a magnetic disk, or an optical disk.
- in some jurisdictions, according to legislation and patent practice, computer-readable media cannot include electric carrier signals and telecommunication signals.
- the disclosed devices/electronic devices and methods can be implemented in other ways.
- the device/electronic device embodiments described above are merely illustrative.
- the division of the modules or units is only a logical functional division; in actual implementation there may be other division methods, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
- in addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces or through the indirect coupling or communication connection of devices or units, and may be electrical, mechanical, or in other forms.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
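The decode-and-dispatch flow of steps S531 to S533 can be sketched as follows. This is an illustrative sketch only: the function and address names (`group_data`, `decode_and_select`, `"storage_computing_unit"`, etc.) are hypothetical, since the patent does not specify a programming interface.

```python
# Illustrative sketch of steps S531-S533: parse a tensor operation
# instruction, group the operand data by execution amount, and choose the
# in-memory or non-in-memory path from the weight source address.
# All names are hypothetical assumptions, not from the patent.

IN_MEMORY = "first_calculation_module"       # in-memory computing path
NON_IN_MEMORY = "second_calculation_module"  # non-in-memory computing path

def group_data(data, execution_amount):
    """Step S532: split the data to be operated into groups of at most
    `execution_amount` elements (the amount one module processes at once)."""
    return [data[i:i + execution_amount]
            for i in range(0, len(data), execution_amount)]

def decode_and_select(opcode, weight_source, activations, execution_amount):
    """Steps S531/S533: generate the selection signal from the weight
    source address and route the opcode plus grouped data accordingly."""
    groups = group_data(activations, execution_amount)
    # Static weights already resident in the storage-computing integrated
    # unit go to the in-memory path; weights fetched from ordinary memory
    # are dynamic and go to the non-in-memory path.
    select = IN_MEMORY if weight_source == "storage_computing_unit" else NON_IN_MEMORY
    return select, opcode, groups

select, op, groups = decode_and_select("MAC", "storage_computing_unit",
                                       list(range(10)), execution_amount=4)
print(select, op, groups)
```

The selection signal here is derived purely from the weight source address, mirroring the rule stated for instruction decoding module 121; generating it from the operation code instead would be a one-line change.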
Description
This application claims priority to the Chinese patent application filed with the China Patent Office on December 13, 2023, with application number 202311713616.9 and application name "Computing system, method executed by a computing system, and storage medium", the entire contents of which are incorporated into this application by reference.

The present invention relates to the field of computer technology, and more specifically, to a computing system, a method executed by the computing system, and a storage medium.

In the existing technology, artificial intelligence chip solutions for detection tasks and computing tasks are mainly divided into two categories. One is general-purpose processors, such as the CPU (Central Processing Unit), FPGA (Field Programmable Gate Array), and GPU (Graphics Processing Unit); the other is acceleration processors designed specifically for artificial neural networks, such as Google's TPU (Tensor Processing Unit). These chips all belong to the von Neumann architecture, in which computing and storage are separated and work together to complete data access and computation. However, since processor design mainly pursues computing speed while storage design focuses on capacity and cost optimization, there is a performance mismatch between "storage" and "computing", which leads to problems such as low memory access bandwidth, increased latency, and high power consumption, commonly referred to as the "memory wall" and "power wall". The denser the memory accesses, the more serious the "wall" problem, and the harder it is to improve computing power.

As a new computing architecture, storage-computing integration fully fuses storage and computing, effectively overcomes the bottleneck of the von Neumann architecture, and can greatly improve computing power. However, in-memory computing generally adopts the ASIC (application-specific integrated circuit) model, supports very limited operator types, has poor programming flexibility and versatility, and cannot flexibly keep pace with the current development speed of neural network models. CPUs, GPUs, and the like have advantages in data-intensive general computing, but general computing has difficulty dealing with complex and changeable computing operators.

In view of the above problems, the purpose of the present invention is to provide a computing system, a method executed by the computing system, and a storage medium, which combine a general-purpose processor and a tensor processor, use the general-purpose processor to handle general-purpose calculations in neural networks, and use the tensor processor to perform in-memory and non-in-memory calculations, thereby reducing data movement and improving computing efficiency while ensuring versatility.

According to a first aspect of the present invention, there is provided a computing system, comprising a general-purpose processor and a tensor processor; wherein the general-purpose processor comprises an instruction scheduling module and at least one computing unit, the instruction scheduling module being used to identify computing instructions and, when a computing instruction is identified as a tensor operation instruction, to send the tensor operation instruction to the tensor processor; and the tensor processor being used to perform in-memory computing and/or non-in-memory computing according to the tensor operation instruction.
Preferably, the instruction scheduling module is further used to send the general operation instruction to the computing unit when identifying that the computing instruction is a general operation instruction; the computing unit is used to perform general calculation according to the general operation instruction.

Preferably, the computing unit comprises at least one computing core, and the computing core is used to perform general computing according to general operation instructions.

Preferably, the computing instruction includes an operation code and an operation domain, wherein the operation code includes the instruction type and operation code of the computing instruction, and the operation domain includes the source address and target address of the data to be operated.

Preferably, the instruction types include tensor operations, vector operations, scalar operations, and transcendental function operations.

Preferably, when the instruction type in the computing instruction is a tensor operation, the instruction scheduling module identifies the computing instruction as a tensor operation instruction; when the instruction type in the computing instruction is one of a vector operation, a scalar operation, and a transcendental function operation, the instruction scheduling module identifies the computing instruction as a general operation instruction.

Preferably, the tensor operation instruction includes at least one of a matrix multiplication operation, a matrix multiply-add operation, and a convolution operation; the general operation instruction includes at least one of a floating-point multiply-add operation, an integer multiply-add operation, and a transcendental function operation.

Preferably, the tensor processor includes an instruction decoding module, a first computing module, and a second computing module. The instruction decoding module is used to parse the tensor operation instruction, generate a selection signal according to the source address of the data to be operated, obtain the data to be operated, divide the data to be operated into multiple groups of operation data, and send the operation code and the multiple groups of operation data to the first computing module or the second computing module according to the selection signal; the first computing module is used to perform in-memory calculations according to the received operation code, multiple groups of operation data, and target address; the second computing module is used to perform non-in-memory calculations according to the received operation code, multiple groups of operation data, and target address.

Preferably, the instruction decoding module includes: an instruction parsing unit, used to parse the tensor operation instruction to obtain the operation code and the source address and target address of the data to be operated; a data grouping unit, used to obtain the data to be operated according to its source address and divide it into multiple groups of operation data; and a control unit, used to generate a selection signal according to the source address of the data to be operated, and send the operation code, the multiple groups of operation data, and the target address to the first computing module or the second computing module according to the selection signal.

Preferably, the first computing module and the second computing module are connected to the general-purpose processor through a unified interface.

Preferably, the unified interface includes one of a PCIE interface and a UCIE interface.

Preferably, the general-purpose processor and the tensor processor are packaged in the same chiplet.

Preferably, the general-purpose processor and the tensor processor are packaged as different chiplets, integrated on the same chip or deployed on different chips.

Preferably, the general-purpose processor is any one of a CPU, a GPU, a DSP, and a GPGPU.
According to a second aspect of the present invention, there is provided a computing method performed by a computing system, wherein the computing system comprises a general-purpose processor and a tensor processor, the general-purpose processor comprises an instruction scheduling module and at least one computing unit, and the computing method comprises: the instruction scheduling module of the general-purpose processor identifies computing instructions; when the instruction scheduling module identifies that a computing instruction is a tensor operation instruction, the tensor operation instruction is sent to the tensor processor; and the tensor processor performs in-memory computing and/or non-in-memory computing according to the tensor operation instruction.

Preferably, the computing method further comprises: when the instruction scheduling module identifies that the computing instruction is a general operation instruction, the general operation instruction is sent to the computing unit; and the computing unit performs general computing according to the general operation instruction.

Preferably, the computing instruction includes an operation code and an operation domain, wherein the operation code includes the instruction type and operation code of the computing instruction, and the operation domain includes the source address and target address of the data to be operated.

Preferably, the instruction types include tensor operations, vector operations, and scalar operations.

Preferably, when the instruction type in the computing instruction is a tensor operation, the instruction scheduling module identifies the computing instruction as a tensor operation instruction; when the instruction type in the computing instruction is a vector operation or a scalar operation, the instruction scheduling module identifies the computing instruction as a general operation instruction.

Preferably, the tensor processor performing in-memory computing and/or non-in-memory computing according to the tensor operation instruction includes: parsing the tensor operation instruction to obtain the operation code and the source address and target address of the data to be operated; obtaining the data to be operated according to its source address, and dividing it into multiple groups of operation data; and generating a selection signal according to the source address of the data to be operated, and performing in-memory computing and/or non-in-memory computing according to the selection signal, the operation code, the multiple groups of operation data, and the target address.

According to a third aspect of the present invention, there is provided a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program implements the above-described method when executed by a processor.

The computing system, the method executed by the computing system, and the storage medium provided by the present invention combine a general-purpose processor and a tensor processor, use the general-purpose processor to process general-purpose calculations in a neural network, and use the tensor processor to perform in-memory or non-in-memory calculations, which both ensures versatility and improves computing power, and supports the complex and changeable computing operators in neural networks.

Furthermore, the computing instructions use a unified instruction format, and one instruction can drive different computing modules in the tensor processor to perform in-memory or non-in-memory computing, greatly reducing programming complexity.
The above and other objects, features, and advantages of the present invention will become more apparent from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:

FIG. 1 is a block diagram of a computing system provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of the structure of the instruction decoding module in a tensor processor provided by an embodiment of the present invention;

FIG. 3 is a block diagram of a computing system provided by another embodiment of the present invention;

FIG. 4 is a block diagram of a computing system provided by another embodiment of the present invention;

FIG. 5 is a flow chart of a computing method provided by an embodiment of the present invention;

FIG. 6 is a flow chart of step S530 provided by an embodiment of the present invention.
Various embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. In the drawings, the same elements are represented by the same or similar reference numerals. For the sake of clarity, the various parts in the drawings are not drawn to scale.

The specific implementation of the present invention is described in further detail below in conjunction with the drawings and embodiments.

As mentioned above, general-purpose CPUs and GPUs have advantages in data-intensive general-purpose computing related to artificial intelligence, but general-purpose computing has difficulty coping with complex and changeable computing operators. In-memory computing, in turn, generally adopts the ASIC (application-specific integrated circuit) model, supports very limited operator types, has poor programming flexibility and versatility, and cannot flexibly keep pace with the current development speed of neural network models.

In response to the above technical problems, the basic idea of the present application is to combine a general-purpose processor and a tensor processor, describe the instruction type in the computing instruction to distinguish tensor operation instructions from general operation instructions, send the tensor operation instructions to the tensor processor to perform in-memory or non-in-memory calculations, and send the general operation instructions to the general-purpose processor to perform general calculations. This both ensures versatility and improves computing power, and supports the complex and changeable computing operators in neural networks.

FIG. 1 is a schematic diagram of the structure of a computing system provided according to an embodiment of the present application. As shown in FIG. 1, the computing system 100 includes a general-purpose processor 110 and a tensor processor 120.
The general-purpose processor 110 includes an instruction scheduling module 111 and at least one computing unit CU.

The instruction scheduling module 111 is used to identify a computing instruction and, when the computing instruction is identified as a tensor operation instruction, to send the tensor operation instruction to the tensor processor 120.

In this embodiment, the computing instruction includes an operation code and an operation domain, wherein the operation code includes the instruction type and operation code of the computing instruction, and the operation domain includes at least the source address and target address of the data to be operated.

Specifically, the instruction type is used to describe the type of operation involved in the computing instruction, and includes tensor operations, vector operations, scalar operations, and transcendental function operations. The operation code is used to describe the operation to be completed by the computing instruction (for example, addition, subtraction, multiplication, division, or a special function); it specifies the nature and function of the operation.

The operation domain includes the source address and target address of the data to be operated. The source address and target address can be a memory address or a register address (i.e., a register number). The memory or register can be an off-chip memory or, in practical applications, an on-chip memory for storing data. The data can specifically be a single data value (a scalar) or n-dimensional data (a vector or tensor), where n is an integer greater than or equal to 1. For example, when n=1, the data is 1-dimensional, i.e., a vector; when n=2, it is 2-dimensional, i.e., a matrix; and when n=3 or more, it is a multi-dimensional tensor.

In this embodiment, the opcode may be the part of the instruction or field (usually represented by a code) specified in the computer program to perform an operation; it is an instruction sequence number used to tell the device executing the instruction which specific instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, such as the corresponding addresses; all data required to execute the corresponding instruction include the data to be operated and the corresponding instruction processing method, and so on. A computing instruction must include an operation code and an operation domain, wherein the operation domain includes at least the source address and target address of the data to be operated. It should be understood that those skilled in the art can set the instruction format of the computing instruction and the operation code and operation domain contained therein as needed, and the present disclosure does not limit this.

In a preferred embodiment, the source address of the data to be operated may be the starting address of the storage space where the data to be operated is located. The general-purpose processor 110 or the tensor processor 120 can obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins. Further, the general-purpose processor 110 or the tensor processor 120 can determine the data to be operated according to its source address and obtain the data. Of course, in other embodiments, the general-purpose processor 110 or the tensor processor 120 can also determine the data required for the operation according to the operation code of the computing instruction.
In one possible implementation, the instruction format of the computing instruction may be as shown in the following table:

Table 1:

Here, type and opcode form the operation code of the computing instruction, and dst and src form the operation domain, where dst is the target address and src is the source address of the data to be operated. When there are multiple data to be operated, src may include multiple source addresses src0, src1, ..., srcn, and dst may also include multiple target addresses dst0, dst1, ..., dstn; the present disclosure does not limit this.
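Assuming the unified format of Table 1 (type and opcode as the operation code, dst and src0..srcn as the operation domain), parsing a computing instruction can be sketched as follows; the dictionary-based representation and the function name are illustrative assumptions, since the patent does not fix a binary layout.

```python
# Illustrative parse of the unified instruction format of Table 1:
# operation code = (type, opcode); operation domain = (dst, src0..srcn).
# The concrete encoding is not specified in the patent, so a computing
# instruction is modeled here as a simple mapping (an assumption).

def parse_instruction(inst):
    """Split a computing instruction into its operation code and domain."""
    op_code = (inst["type"], inst["opcode"])              # e.g. ("Tensor", "MAC")
    op_domain = {"dst": inst["dst"], "src": inst["src"]}  # addresses/registers
    return op_code, op_domain

op_code, op_domain = parse_instruction(
    {"type": "Tensor", "opcode": "Conv", "dst": ["dst0"], "src": ["src0", "src1"]})
print(op_code, op_domain)
```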
In some specific embodiments, the scalar and vector operations among the general operation instructions may be as shown in the following table:

Table 2:

When executing the operation instructions shown in the above table, the general-purpose processor 110 reads Num1 floating-point or integer data items from the address specified in register Reg1, reads Num2 floating-point or integer data items from the address specified in register Reg2, performs the multiply-add operation, and stores the calculation result in the address space specified in register Reg3.
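The multiply-add semantics just described can be sketched as follows. The register names follow the paragraph above, but the dictionary-based memory model and the function name are illustrative assumptions.

```python
# Sketch of the vector multiply-add of Table 2: read Num1 operands from
# the address in Reg1 and Num2 operands from the address in Reg2, perform
# an elementwise multiply-add, and store the result at the address in
# Reg3. Registers and memory are modeled as dictionaries (an assumption).

def vector_mac(mem, regs, num):
    a = mem[regs["Reg1"]][:num]               # operands from Reg1's address
    b = mem[regs["Reg2"]][:num]               # operands from Reg2's address
    acc = mem.get(regs["Reg3"], [0] * num)    # prior contents, else zeros
    mem[regs["Reg3"]] = [acc[i] + a[i] * b[i] for i in range(num)]

mem = {"addr_a": [1, 2, 3], "addr_b": [4, 5, 6]}
regs = {"Reg1": "addr_a", "Reg2": "addr_b", "Reg3": "addr_c"}
vector_mac(mem, regs, 3)
print(mem["addr_c"])  # → [4, 10, 18]
```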
In some specific embodiments, the transcendental function operations among the general operation instructions may be as shown in the following table:

Table 3:

When executing the operation instructions shown in the above table, the general-purpose processor 110 reads Num1 data items from the address specified in register Reg1, reads Num2 data items from the address specified in register Reg2, performs the function operation, and stores the calculation result in the address space specified in register Reg3.
In some specific embodiments, the tensor operation instructions may be as shown in the following table:

Table 4:

When executing the operation instructions shown in the above table, the tensor processor 120 reads Num1 tensor data items from the address specified in register Reg1, reads Num2 weight data items from the address specified in register Reg2, performs the convolution operation or multiply-add operation, and stores the calculation result in the address space specified in register Reg3.
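For the tensor instructions of Table 4, the matrix multiplication (MM) case can be sketched in pure Python as follows; the shapes, addresses, and memory model are illustrative assumptions.

```python
# Sketch of the MM tensor operation of Table 4: read tensor data and
# weight data from their source addresses, multiply, and store the result
# at the target address. Memory is modeled as a dictionary (an assumption).

def matmul(a, b):
    """Plain matrix multiply: a is m×k, b is k×n."""
    k, n = len(b), len(b[0])
    return [[sum(row[i] * b[i][j] for i in range(k)) for j in range(n)]
            for row in a]

mem = {"src_act": [[1, 2], [3, 4]],   # activation data at its source address
       "src_wgt": [[5, 6], [7, 8]]}   # weight data at its source address
mem["dst"] = matmul(mem["src_act"], mem["src_wgt"])
print(mem["dst"])  # → [[19, 22], [43, 50]]
```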
In a preferred embodiment, if the operation domain of the operation instruction contains no information on the number of data items to be read, this indicates that a fixed number of data items is to be read, which can be one data item, one row of data, or one column of data, etc.; the embodiments of the present application do not specifically limit this.

In this embodiment, when the instruction type in the computing instruction is a tensor operation (for example, the instruction type is Tensor), the instruction scheduling module 111 identifies the computing instruction as a tensor operation instruction; when the instruction type is one of a vector operation, a scalar operation, and a transcendental function operation (for example, the instruction type is Float, INT, or SUF), the instruction scheduling module 111 identifies the computing instruction as a general operation instruction. The tensor operation instruction includes at least one of a matrix multiplication operation (MM), a matrix multiply-add operation (MAC), and a convolution operation (Conv); the general operation instruction includes at least one of a floating-point multiply-add operation, an integer multiply-add operation, and a transcendental function operation.
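The type-based identification performed by instruction scheduling module 111 can be sketched as follows. The type names Tensor, Float, INT, and SUF come from the paragraph above; the function and target names are illustrative assumptions.

```python
# Sketch of the dispatch in instruction scheduling module 111: the Tensor
# instruction type is routed to the tensor processor, while the Float,
# INT, and SUF types (vector/scalar/transcendental) are routed to a
# computing unit CU of the general-purpose processor.

GENERAL_TYPES = {"Float", "INT", "SUF"}  # general operation instruction types

def dispatch(instruction):
    itype = instruction["type"]
    if itype == "Tensor":
        return "tensor_processor"
    if itype in GENERAL_TYPES:
        return "computing_unit"
    raise ValueError(f"unknown instruction type: {itype}")

print(dispatch({"type": "Tensor", "opcode": "MM"}))   # → tensor_processor
print(dispatch({"type": "Float", "opcode": "MAC"}))   # → computing_unit
```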
所述指令调度模块111还用于当识别出所述计算指令为通用运算指令时,将所述通用运算指令发送至计算单元CU;所述计算单元CU用于根据通用运算指令进行通用计算。The instruction scheduling module 111 is further configured to send the general operation instruction to the computing unit CU when identifying that the computing instruction is a general operation instruction; the computing unit CU is configured to perform general calculation according to the general operation instruction.
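The dispatch rule above can be summarized in a short sketch. The type names follow the examples in the text (Tensor, Float, INT, SUF); the returned target names are placeholders, not actual hardware identifiers.

```python
# Sketch of the instruction scheduling module's dispatch rule:
# "Tensor" instructions go to the tensor processor; vector, scalar, and
# transcendental-function instructions go to a computing unit CU.
def dispatch(instr_type):
    if instr_type == "Tensor":
        return "tensor_processor"   # MM / MAC / Conv
    if instr_type in ("Float", "INT", "SUF"):
        return "computing_unit"     # vector / scalar / transcendental ops
    raise ValueError(f"unknown instruction type: {instr_type}")
```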
FIG. 1 shows two computing units CU as an example and omits other possible computing units. Each computing unit CU includes an instruction distribution module, multiple computing cores (kernels), a register file, a shared L1 cache, and the like. The instruction scheduling module 111 is also used to schedule computing tasks across the multiple computing units CU. The general-purpose processor in this embodiment is any one of a CPU, a GPU, a DSP, and a GPGPU.
The computing system can be used for computing tasks such as matrix computation, which can be executed in parallel by multiple threads. For example, before execution, these threads are divided into multiple thread blocks in the instruction scheduling module 111, and the thread blocks are then distributed to the computing units CU (for example, streaming multiprocessors (SMs)). All threads in a thread block are usually assigned to the same computing unit for execution. A thread block is in turn split into warps (thread warps); each warp contains a fixed number of threads, or fewer than that number, for example 32 threads. Multiple thread blocks can be executed in the same computing unit or in different computing units.
In each computing unit, the instruction distribution module 112 schedules and allocates warps so that the multiple computing cores of the computing unit CU run the corresponding warps. Each computing core includes an arithmetic logic unit (ALU), a floating-point computing unit, and the like. Depending on the number of computing cores in the computing unit, the warps of a thread block can be executed simultaneously or in a time-shared manner. The threads in a warp execute the same instruction, and the result of each executed instruction is written back to the registers corresponding to that warp. The instructions and data of each computing unit CU are sent to the shared cache in the computing unit (for example, the shared L1 cache), or further to a unified cache, for read and write operations.
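The thread-block-to-warp partitioning described above can be sketched as follows; the 32-thread warp size follows the example in the text, and the helper name is an assumption.

```python
# Sketch: split a thread block into warps of a fixed size (32 here).
# The final warp may hold fewer threads than the fixed size, matching the
# "fixed number (or fewer)" behaviour described in the text.
def split_into_warps(thread_ids, warp_size=32):
    return [thread_ids[i:i + warp_size]
            for i in range(0, len(thread_ids), warp_size)]
```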
The tensor processor 120 is configured to perform in-memory computation and/or non-in-memory computation according to the tensor operation instruction.
In this embodiment, the tensor processor 120 includes an instruction decoding module 121, a first computing module 122, and a second computing module 123.
The instruction decoding module 121 is configured to parse the tensor operation instruction, generate a selection signal according to the source address of the data to be operated on, obtain the data to be operated on, divide that data into multiple groups of operation data, and send the operation code and the groups of operation data to the first computing module 122 or the second computing module 123 according to the selection signal.
In this embodiment, the instruction decoding module 121 obtains the data to be operated on and generates the selection signal according to the source address given in the tensor operation instruction. The data to be operated on includes activation data and weight data. If the source address of the weight data is a compute-in-memory unit, the weight data is static data, and the instruction decoding module 121 sends the operation code and the groups of operation data to the first computing module 122 according to the selection signal; if the source address of the weight data is a memory, the weight data is dynamic data, and the instruction decoding module 121 sends the operation code and the groups of operation data to the second computing module 123 according to the selection signal. In other embodiments, the instruction decoding module 121 may also generate the selection signal according to the operation code.
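The selection rule can be sketched as a simple routing function. The source tags and module names are illustrative labels for this example, not identifiers from the actual design.

```python
# Sketch of the selection signal: weights resident in a compute-in-memory
# (CIM) unit are static data, routed to the first (in-memory) computing
# module; weights resident in ordinary memory are dynamic data, routed to
# the second (non-in-memory) computing module.
def select_module(weight_source):
    if weight_source == "cim_unit":
        return "first_module"    # in-memory computation path
    if weight_source == "memory":
        return "second_module"   # near-memory / GEMM path
    raise ValueError(f"unknown weight source: {weight_source}")
```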
Referring to FIG. 2, the instruction decoding module 121 includes an instruction parsing unit 1211, a data grouping unit 1212, and a control unit 1213. The instruction parsing unit 1211 is configured to parse the tensor operation instruction to obtain the operation code and the source and target addresses of the data to be operated on; the data grouping unit 1212 is configured to obtain the data to be operated on according to its source address and divide it into multiple groups of operation data; the control unit 1213 is configured to generate a selection signal according to the source address of the data to be operated on and, according to the selection signal, send the operation code, the groups of data, and the target address to the first computing module 122 or the second computing module 123.
In this embodiment, the operation field may further include an execution amount. The instruction decoding module 121 is also configured to obtain the execution amount and divide the data to be operated on into multiple groups of operation data according to it. The execution amount is the amount of data that the first computing module 122 or the second computing module 123 can process in one pass.
In a preferred embodiment, when the operation field does not include an execution amount, the data to be operated on can be divided into multiple groups of operation data according to a preset default execution amount.
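The grouping by execution amount, including the fall-back to a preset default, can be sketched as follows. The default value of 8 is an arbitrary illustration; the text does not specify one.

```python
# Sketch: divide the data to be operated on into groups of at most
# `execution_amount` items; when the operation field carries no execution
# amount, a preset default is used instead.
DEFAULT_EXECUTION_AMOUNT = 8   # illustrative preset default

def group_operands(data, execution_amount=None):
    n = execution_amount or DEFAULT_EXECUTION_AMOUNT
    return [data[i:i + n] for i in range(0, len(data), n)]
```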
The first computing module 122 is configured to perform in-memory computation according to the received operation code, groups of operation data, and target address.
In this embodiment, the first computing module 122 can perform compute-in-memory (CIM) operations. The first computing module 122 is composed of compute-in-memory units based on SRAM, ReRAM, or other storage media.
The second computing module 123 is configured to perform non-in-memory computation according to the received operation code, groups of operation data, and target address.
In this embodiment, non-in-memory computation includes near-memory computation and other computations, for example general matrix multiplication (GEMM).
In a preferred embodiment, the first computing module 122 and the second computing module 123 are connected to the general-purpose processor 110 through a unified interface, which may be one of a PCIe interface and a UCIe interface. In this embodiment, the input and output data of the first computing module 122 and the second computing module 123 share the same data structure, so a single instruction can drive either module to perform computation, and instructions can also switch computation between the first computing module 122 and the second computing module 123.
In this embodiment, the general-purpose processor 110 and the tensor processor 120 can be packaged in the same chiplet. In a preferred embodiment, referring to FIG. 3, the general-purpose processor 110 and the tensor processor 120 can also be packaged as different chiplets, integrated into the same chip, or deployed on different chips. The positional relationship between the general-purpose processor 110 and the tensor processor 120 can be set according to the actual application and is not limited to these arrangements.
The computing system provided by the present invention combines a general-purpose processor and a tensor processor, using the general-purpose processor for the general computations in a neural network and the tensor processor for in-memory or non-in-memory computations. This both preserves generality and increases computing power, and supports the complex and varied computing operators found in neural networks.
Furthermore, the computing instructions use a unified instruction format, so a single instruction can drive the different computing modules within the tensor processor to perform in-memory or non-in-memory computation, greatly reducing programming complexity.
FIG. 4 is a schematic structural diagram of a computing system provided by another embodiment of the present invention. As shown in FIG. 4, the computing system includes a task distribution module 310 and a plurality of computing devices 320.
This embodiment is described using two computing devices 320A and 320B as an example, but is not limited thereto.
In this embodiment, the task distribution module 310 is configured to distribute computing instructions to the multiple computing devices 320. A computing device 320 receives a computing instruction and performs the corresponding computation according to it. The computing devices 320A and 320B each include a general-purpose processor 321 and a tensor processor 322, which are the same as those described in the above embodiment and are not described again here.
In other embodiments, the computing devices 320A and 320B include only the general-purpose processor 321; the tensor processor 322 is located outside the computing devices, and the multiple computing devices 320 share the same tensor processor.
FIG. 5 is a flowchart of a computing method provided by an embodiment of the present invention. Referring to FIG. 5, the computing method is executed by the computing system 100 provided by the above embodiment and includes the following steps.
In step S510, the instruction scheduling module of the general-purpose processor identifies a computing instruction.
In this embodiment, the computing instruction includes an operation code and an operation field, where the operation code includes the instruction type and opcode of the computing instruction, and the operation field includes at least the source address and target address of the data to be operated on.
Specifically, the instruction type describes the type of operation the computing instruction involves, and includes tensor operations, vector operations, scalar operations, and transcendental function operations. The opcode describes the operation the computing instruction is to complete (for example, addition, subtraction, multiplication, division, or special functions) and specifies the nature and function of the operation.
The operation field includes the source address and target address of the data to be operated on. The source and target addresses can be memory addresses or register addresses (that is, register numbers). The memory or registers can be off-chip memory or, in practical applications, on-chip memory, used to store data. The data can be a single data value (a scalar) or n-dimensional data (a vector or tensor), where n is an integer greater than or equal to 1. For example, when n = 1, the data is one-dimensional, that is, a vector; when n = 2, it is two-dimensional, that is, a matrix; and when n = 3 or more, it is a multi-dimensional tensor.
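The instruction format described above can be modeled compactly. This is an illustrative encoding only: the field names and the optional-field dictionary are assumptions, not the actual bit-level format.

```python
# Illustrative model of the computing instruction: an operation code
# (instruction type + opcode) plus an operation field holding at least the
# source and target addresses, with room for optional fields such as an
# execution amount.
from dataclasses import dataclass, field

@dataclass
class ComputeInstruction:
    instr_type: str          # "Tensor", "Float", "INT", or "SUF"
    opcode: str              # e.g. "MM", "MAC", "Conv"
    src_addrs: list          # source addresses (memory addresses or register numbers)
    dst_addr: int            # target address
    extras: dict = field(default_factory=dict)  # optional fields, e.g. execution amount
```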
In this embodiment, when the instruction type in the computing instruction is a tensor operation (for example, the instruction type is Tensor), the instruction scheduling module 111 identifies the computing instruction as a tensor operation instruction; when the instruction type is one of a vector operation, a scalar operation, and a transcendental function operation (for example, the instruction type is Float, INT, or SUF), the instruction scheduling module 111 identifies the computing instruction as a general operation instruction. The tensor operation instruction includes at least one of a matrix multiplication operation (MM), a matrix multiply-accumulate operation (MAC), and a convolution operation (Conv); the general operation instruction includes at least one of a floating-point multiply-add operation, an integer multiply-add operation, and a transcendental function operation.
In step S520, when the instruction scheduling module identifies the computing instruction as a tensor operation instruction, it sends the tensor operation instruction to the tensor processor.
In step S530, the tensor processor performs in-memory computation and/or non-in-memory computation according to the tensor operation instruction.
In this embodiment, referring to FIG. 6, step S530 includes steps S531 to S533.
In step S531, the tensor operation instruction is parsed to obtain the operation code and the source and target addresses of the data to be operated on.
In step S532, the data to be operated on is obtained according to its source address and divided into multiple groups of operation data.
In this embodiment, the operation field may further include an execution amount. The instruction decoding module 121 is also configured to obtain the execution amount and divide the data to be operated on into multiple groups of operation data according to it. The execution amount is the amount of data that the first computing module 122 or the second computing module 123 can process in one pass.
In a preferred embodiment, when the operation field does not include an execution amount, the data to be operated on can be divided into multiple groups of operation data according to a preset default execution amount.
In step S533, a selection signal is generated according to the source address of the data to be operated on, and in-memory computation and/or non-in-memory computation is performed according to the selection signal, the operation code, the groups of data, and the target address.
In this embodiment, the data to be operated on includes activation data and weight data. If the source address of the weight data is a compute-in-memory unit, the weight data is static data, and the instruction decoding module 121 sends the operation code and the groups of operation data to the first computing module 122 according to the selection signal; if the source address of the weight data is a memory, the weight data is dynamic data, and the instruction decoding module 121 sends the operation code and the groups of operation data to the second computing module 123 according to the selection signal. In other embodiments, the instruction decoding module 121 may also generate the selection signal according to the operation code.
In step S540, when the instruction scheduling module identifies the computing instruction as a general operation instruction, it sends the general operation instruction to the computing unit.
In step S550, the computing unit performs general computation according to the general operation instruction.
The computing method provided by the present invention combines a general-purpose processor and a tensor processor, using the general-purpose processor for the general computations in a neural network and the tensor processor for in-memory or non-in-memory computations. This both preserves generality and increases computing power, and supports the complex and varied computing operators found in neural networks.
Furthermore, the computing instructions use a unified instruction format, so a single instruction can drive the different computing modules within the tensor processor to perform in-memory or non-in-memory computation, greatly reducing programming complexity.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps in each of the above method embodiments.
An embodiment of the present application provides a computer program product that, when run on an electronic device, causes the electronic device to implement the steps in each of the above method embodiments.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes of the above method embodiments of the present application can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, implements the steps of each of the above method embodiments. The computer program includes computer program code, which can be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the apparatus/electronic device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described or recorded in detail in one embodiment, reference can be made to the relevant descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
In the embodiments provided in this application, it should be understood that the disclosed apparatuses/electronic devices and methods can be implemented in other ways. For example, the apparatus/electronic device embodiments described above are merely illustrative: the division into modules or units is only a logical functional division, and other divisions are possible in actual implementation; multiple units or components can be combined or integrated into another system, or some features can be omitted or not executed. Moreover, the mutual couplings, direct couplings, or communication connections shown or discussed can be indirect couplings or communication connections through interfaces, apparatuses, or units, and can be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The embodiments of the present invention are as described above; they do not set out every detail, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations are possible in light of the above description. These embodiments were chosen and described in this specification in order to better explain the principles and practical applications of the invention, so that those skilled in the art can make good use of the invention and of modifications based on it. The invention is limited only by the claims together with their full scope and equivalents.
Claims (21)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311713616.9A (published as CN119088751A) | 2023-12-13 | 2023-12-13 | Computing system, method executed by the computing system, and storage medium |
| CN202311713616.9 | 2023-12-13 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025124574A1 | 2025-06-19 |
Family
ID=93664360
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/139346 (pending; published as WO2025124574A1) | Computing system, method executed by computing system, and storage medium | 2023-12-13 | 2024-12-13 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN119088751A (en) |
| WO (1) | WO2025124574A1 (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119088751A (en) * | 2023-12-13 | 2024-12-06 | 苏州亿铸智能科技有限公司 | Computing system, method executed by the computing system, and storage medium |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190340499A1 (en) * | 2018-05-04 | 2019-11-07 | Microsoft Technology Licensing, Llc | Quantization for dnn accelerators |
| CN111966401A (en) * | 2019-05-20 | 2020-11-20 | 上海寒武纪信息科技有限公司 | Instruction processing method, device and related products |
| CN115169541A (en) * | 2022-08-17 | 2022-10-11 | 无锡江南计算技术研究所 | Tensor, vector and scalar calculation acceleration and data scheduling system |
| CN115860079A (en) * | 2023-01-30 | 2023-03-28 | 深圳市九天睿芯科技有限公司 | Neural network acceleration device, method, chip, electronic device, and storage medium |
| CN119088751A (en) * | 2023-12-13 | 2024-12-06 | 苏州亿铸智能科技有限公司 | Computing system, method executed by the computing system, and storage medium |
- 2023-12-13: CN application CN202311713616.9A filed (published as CN119088751A; status: pending)
- 2024-12-13: PCT application PCT/CN2024/139346 filed (published as WO2025124574A1; status: pending)
Also Published As
| Publication number | Publication date |
|---|---|
| CN119088751A (en) | 2024-12-06 |
Similar Documents
| Publication | Title |
|---|---|
| US10942716B1 | Dynamic computational acceleration using a heterogeneous hardware infrastructure |
| CN103221933B | The method and apparatus moving data to simd register file from general-purpose register file |
| CN101482810B | Method and apparatus for loading vector data from different memory locations and storing same in said locations |
| US11609792B2 | Maximizing resource utilization of neural network computing system |
| CN112381220B | Neural network tensor processor |
| CN112580792B | Neural network multi-core tensor processor |
| CN112214443B | Secondary unloading device and method arranged in graphic processor |
| WO2022134729A1 | Risc-v-based artificial intelligence inference method and system |
| CN118035618B | Data processor, data processing method, electronic device, storage medium |
| WO2025124574A1 | Computing system, method executed by computing system, and storage medium |
| CN104750660A | Embedded reconfigurable processor with multiple operating modes |
| US11237994B2 | Interrupt controller for controlling interrupts based on priorities of interrupts |
| CN102629238B | Method and device for supporting vector condition memory access |
| WO2025124578A1 | Tensor processing unit and method, and computer-readable storage medium |
| CN112230931B | Compiling method, device and medium suitable for secondary unloading of graphic processor |
| CN116400926A | Scalar engine processing method and device for artificial intelligence chip |
| JP2013246816A | Reconfigurable processor of mini-core base and flexible multiple data processing method using reconfigurable processor |
| CN116136762A | A Design Method of FPGA Semi-custom Heterogeneous Computing System Based on OpenCL |
| JP7575841B2 | Reuse of adjacent SIMD units for fast and comprehensive results |
| CN119441130B | A three-dimensional reconfigurable hardware acceleration core chip |
| US20240289168A1 | Programmable look up table free hardware accelerator and instruction set architecture for activation functions |
| US20240427705A1 | Flash based transformer accelerator |
| CN111966399A | Instruction processing method, device and related products |
| CN111966403A | Instruction processing method, device and related products |
| Bailey | REPORT MICROPROCESSOR |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24902989; Country of ref document: EP; Kind code of ref document: A1 |