WO2025055399A1 - Direct memory access system, data transport method, device and storage medium - Google Patents
- Publication number
- WO2025055399A1 (PCT/CN2024/097048)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- macro instruction
- read
- micro
- write
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
Definitions
- the present application relates to the field of computer technology, and in particular to a direct memory access system, a data transfer method, a device and a storage medium.
- the Direct Memory Access (DMA) system is also known as the direct data transfer hardware unit, which can realize data transfer without the participation of the Central Processing Unit (CPU).
- when performing a data transfer, the processor in the computer device can send a data transfer instruction to the DMA system; through the instruction, the read address, data length and write address of the data to be transferred are indicated to the DMA system.
- the DMA system reads the data from the bus according to the read address and data length, and writes the read data back to the bus in sequence starting from the write address.
- the processor needs to first generate an instruction including a read address, a data length, and a write address according to the structure of the data to be moved.
- the instruction generation process takes a long time, thereby affecting the efficiency of data transfer.
- the embodiments of the present application provide a direct memory access system, a data transfer method, a device and a storage medium, which can improve the efficiency of data transfer.
- the technical solution is as follows.
- a direct memory access system comprising: a macro instruction memory, a macro instruction controller, a macro instruction parsing engine, and a data read and write transmission path;
- the macro instruction memory and the macro instruction controller are respectively connected to a processor in a computer device, and the macro instruction memory is connected to the macro instruction controller;
- the macro instruction controller is connected to the macro instruction parsing engine, the macro instruction parsing engine is connected to the data reading and writing transmission path, and the data reading and writing transmission path is connected to the bus of the computer device;
- the macro instruction memory is used to store the macro instructions issued by the processor; the macro instructions are used to instruct the transport of multi-dimensional structure data;
- the macro instruction controller is used to receive the trigger instruction sent by the processor, read the macro instruction from the macro instruction memory according to the trigger instruction, and send the macro instruction to the macro instruction parsing engine;
- the macro instruction parsing engine is used to parse the macro instruction into a micro operation code and send the micro operation code to the data read and write transmission path;
- the micro operation code is used to instruct the transfer of one-dimensional structure data in the multi-dimensional structure data;
- the data read/write transmission path is used to read and write the one-dimensional structure data to the bus according to the micro-operation code.
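The macro-to-micro decomposition described above can be sketched as a small software model. This is purely illustrative Python, not part of the disclosed hardware; the names `MacroInstruction`, `MicroOp` and `parse_macro`, and the stride-based layout, are our assumptions for an n × m transfer.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MacroInstruction:
    # Describes the transfer of n x m two-dimensional structure data.
    read_addr: int     # starting address at the source
    write_addr: int    # starting address at the destination
    rows: int          # number of one-dimensional blocks (n)
    row_len: int       # length of each one-dimensional block in bytes (m)
    read_stride: int   # distance between consecutive rows at the source
    write_stride: int  # distance between consecutive rows at the destination

@dataclass
class MicroOp:
    # Instructs the transfer of one one-dimensional structure data block.
    read_addr: int
    write_addr: int
    size: int

def parse_macro(macro: MacroInstruction) -> List[MicroOp]:
    """Disassemble one macro instruction into one micro-operation code per
    one-dimensional block, as the macro instruction parsing engine does."""
    return [
        MicroOp(
            read_addr=macro.read_addr + i * macro.read_stride,
            write_addr=macro.write_addr + i * macro.write_stride,
            size=macro.row_len,
        )
        for i in range(macro.rows)
    ]
```

In this sketch a single macro instruction for a 4 × 64 block expands into four micro-ops, one per address-contiguous row.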
- a data handling method is provided, the method being used in the direct memory access system as described above and executed by a macro instruction controller in the direct memory access system, the method comprising: receiving a trigger instruction sent by a processor; reading a macro instruction from the macro instruction memory according to the trigger instruction, the macro instruction being used to instruct the transport of multi-dimensional structure data;
- the macro instruction is sent to a macro instruction parsing engine so that the macro instruction parsing engine parses the macro instruction into a micro-operation code and sends the micro-operation code to a data read-write transmission path, and the data read-write transmission path reads and writes one-dimensional structure data to the bus according to the micro-operation code; the micro-operation code is used to instruct the transfer of the one-dimensional structure data in the multi-dimensional structure data.
- a data handling device is provided, the device is used for a macro instruction controller in the direct memory access system as described above, the device comprising:
- a trigger instruction receiving module used to receive a trigger instruction sent by a processor
- a macro instruction reading module used for reading the macro instruction from the macro instruction memory according to the trigger instruction; the macro instruction is used for instructing the transport of multi-dimensional structure data;
- a macroinstruction sending module is used to send the macroinstruction to a macroinstruction parsing engine so that the macroinstruction parsing engine parses the macroinstruction into a micro-operation code and sends the micro-operation code to a data read-write transmission path, and the data read-write transmission path reads and writes one-dimensional structure data to a bus according to the micro-operation code; the micro-operation code is used to indicate the transfer of the one-dimensional structure data in the multi-dimensional structure data.
- a computer device comprising the direct memory access system as described above.
- a computer-readable storage medium in which at least one program is stored.
- the at least one program is loaded and executed by a controller to implement the data transfer method described above; the controller is a macro instruction controller in the direct memory access system described above.
- a computer program product comprising a computer program, wherein when the computer program is executed by a controller, the data transfer method described above is implemented; the controller is a macro instruction controller in the direct memory access system described above.
- a chip is provided, wherein the chip includes the direct memory access system as described above.
- a computer device wherein the computer device comprises the chip as described above.
- What the processor sends to the DMA system is a macro instruction for instructing the transport of multi-dimensional structure data; the macro instruction controller in the DMA system sends the macro instruction to the macro instruction parsing engine; the macro instruction parsing engine parses the macro instruction into a micro operation code for instructing the transport of one-dimensional structure data in the multi-dimensional structure data, and sends it to the data read-write transmission path, and the data read-write transmission path executes the relocation of the one-dimensional structure data according to the micro operation code.
- the processor directly instructs the DMA system to transport the multi-dimensional structure data, and then the macro instruction parsing engine in the DMA system disassembles the transport operation of the multi-dimensional structure data into multiple transport operations of one-dimensional structure data.
- the above DMA system can greatly reduce the time required for the processor to generate the instruction for transporting the one-dimensional structure data, and reduce the configuration overhead of the processor sending the above instruction to the DMA system for execution, thereby improving the efficiency of data transport.
- FIG. 1 is a schematic diagram of a DMA system;
- FIG. 2 is a schematic diagram of a DMA system supporting scatter-gather discrete transfers;
- FIG. 3 is a schematic diagram of the format of a DMA instruction or descriptor;
- FIG. 4 is a flow chart of a DMA implementation of data transfer;
- FIG. 5 is a schematic diagram of the structure of a direct memory access system provided by an exemplary embodiment of the present application;
- FIG. 6 is a schematic diagram showing an example of hardware-circuit parsing of a macro instruction involved in the present application;
- FIG. 7 is a schematic diagram of the structure of a direct memory access system provided by another exemplary embodiment of the present application;
- FIG. 8 is a schematic diagram of the structure of a direct memory access system provided by another exemplary embodiment of the present application;
- FIG. 9 is a schematic diagram showing an example of format conversion involved in the present application;
- FIG. 10 is a structural diagram of a macro instruction provided by an exemplary embodiment of the present application;
- FIG. 11 is a schematic diagram of the process of parsing macro instructions and generating micro-operation codes;
- FIG. 12 is a micro-operation code provided by an exemplary embodiment of the present application;
- FIG. 13 is a timing diagram of a data transport path for stream processing involved in the present application;
- FIG. 14 is a schematic diagram of the structure of a direct memory access system provided by another exemplary embodiment of the present application;
- FIG. 16 is a flow chart of a data transport method provided by an exemplary embodiment of the present application;
- FIG. 17 is a schematic diagram of a data handling device provided by an exemplary embodiment of the present application.
- Direct Memory Access (DMA) system also known as direct data transfer hardware unit, is a technology that can realize data transfer without the participation of the Central Processing Unit (CPU). DMA transfer copies data from one address space to another, providing high-speed data transfer between peripherals and memory or between memory and memory.
- data transfer does not require CPU involvement.
- external device A and external device B provide a DMA data channel
- the data can be directly copied from external device A to external device B, and this process does not require CPU processing. Therefore, DMA can liberate the CPU to a certain extent, which plays an extremely important role in realizing efficient embedded systems and accelerating network data processing.
- when implementing DMA transmission, the DMA controller directly controls the bus, so there is a problem of bus control transfer: before the DMA transmission, the CPU must hand over bus control to the DMA controller, and after the DMA transmission ends, the DMA controller must hand bus control back to the CPU.
- a complete DMA transmission process usually requires four steps: DMA request, DMA response, DMA transmission, and DMA end.
- FIG. 1 shows a schematic diagram of a basic DMA system. As shown in FIG. 1:
- the DMA system consists of a controller, a read data interface, a First In First Out (FIFO) cache, and a write data interface.
- the controller is responsible for receiving and parsing instructions from the processor, instructing the read data interface to generate a read data request to the bus, and instructing the write data interface to generate a write data request to the bus;
- the FIFO cache is used for data caching for asynchronous execution of read and write data operations;
- the processor sends transfer instructions to the DMA system one by one, instructing the DMA system to perform the transfer operation.
- One DMA system only corresponds to one instruction at a time.
- FIG. 2 shows a schematic diagram of a DMA system supporting scatter-gather discrete transfers. As shown in FIG. 2:
- the DMA system is used to batch transfer multiple discrete data blocks, reducing the number of processor interventions, thereby improving transfer efficiency. It is mainly used in the transfer scenario of fragmented data;
- a series of transport instructions are defined as descriptors.
- the processor sends the instructions in batches to the descriptor linked list of the DMA system in advance, and expresses the order relationship between the instructions in a linked list structure;
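The descriptor linked list described above can be modeled as a singly linked list that the DMA engine walks in order. This is an illustrative Python sketch; the field names (`read_addr`, `write_addr`, `length`, `next`) are our assumptions and the actual descriptor format is the one shown in FIG. 3.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Descriptor:
    # One one-dimensional transfer: a read address, a write address and a length.
    read_addr: int
    write_addr: int
    length: int
    next: Optional["Descriptor"] = None  # link to the next descriptor; None ends the chain

def walk_chain(head: Optional[Descriptor]) -> List[Tuple[int, int, int]]:
    """Traverse the descriptor linked list in order, as the DMA engine would,
    yielding one (read_addr, write_addr, length) transfer per descriptor."""
    transfers = []
    node = head
    while node is not None:
        transfers.append((node.read_addr, node.write_addr, node.length))
        node = node.next
    return transfers
```

The linked structure is what expresses the order relationship between the batched instructions: the engine follows `next` pointers until it reaches the end of the chain.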
- FIG. 3 shows a schematic diagram of the format of a DMA instruction or descriptor.
- the DMA instructions or descriptors issued by the processor can instruct the DMA to directly initiate data read and write operations.
- Their common feature is that they can only handle the transfer of simple one-dimensional data structures, where the one-dimensional structure data refers to a data block indicated by a starting point (such as a starting address) and a length.
- the one-dimensional structure data is data with continuous addresses.
- FIG. 4 shows a flow chart of DMA implementing data transfer. As shown in FIG. 4 , the steps of implementing data transfer in the DMA system structures shown in FIG. 1 and FIG. 2 are consistent:
- Step 401: receiving an instruction and determining, according to the state of the DMA system, whether to execute the instruction;
- Step 402: when it is determined that the instruction is to be executed, initiating a read operation to the bus according to the read address and data size in the instruction, such as an Advanced eXtensible Interface (AXI) read transfer, or an Advanced Peripheral Bus (APB), Advanced High-performance Bus (AHB) or AXI Coherence Extension (ACE) read operation;
- Step 403: receiving the data returned by the bus and buffering it into the FIFO;
- Step 404: initiating a write operation to the bus according to the write address and data size in the instruction, such as an AXI write transfer, or an APB, AHB or ACE write operation;
- Step 405: writing the data in the FIFO to the bus.
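Steps 402-405 can be sketched as a minimal software simulation with a FIFO between the read and write sides. This is an illustrative model only (memory is represented as a Python dict keyed by address), not the disclosed hardware.

```python
from collections import deque

def dma_transfer(memory: dict, read_addr: int, write_addr: int, size: int) -> None:
    """Simulate one DMA transfer: read `size` units from the bus starting at
    read_addr, buffer them in a FIFO, then drain the FIFO to write_addr."""
    fifo = deque()
    # Steps 402-403: initiate the read and buffer the returned data in the FIFO.
    for offset in range(size):
        fifo.append(memory.get(read_addr + offset, 0))
    # Steps 404-405: initiate the write and drain the FIFO onto the bus.
    for offset in range(size):
        memory[write_addr + offset] = fifo.popleft()
```

In the real system the two loops run asynchronously, which is exactly why the FIFO cache sits between the read data interface and the write data interface.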
- the DMA system structures shown in FIG. 1 and FIG. 2 are structures for performing the transfer of one-dimensional structure data; that is, the processor needs to determine the read/write addresses and data length of the one-dimensional structure data to be transferred in advance, and then notify the DMA system. Since the processor determines the read/write addresses and data length of the one-dimensional structure data through program code, generating a transfer instruction for one piece of one-dimensional structure data takes a long time. Therefore, the efficiency with which the processor issues data transfer instructions has become a bottleneck affecting the overall data transfer efficiency.
- FIG. 5 shows a schematic diagram of the structure of a direct memory access system provided by an exemplary embodiment of the present application.
- the system may include: a macro instruction memory 501 , a macro instruction controller 502 , a macro instruction parsing engine 503 , and a data read and write transmission path 504 .
- the macro instruction memory 501 and the macro instruction controller 502 are respectively connected to the processor in the computer device, and the macro instruction memory 501 is connected to the macro instruction controller 502 .
- the above processor may be a scalar processor.
- the processor may be a computing device that implements computing functions by running computer instructions or codes.
- the processor may be a central processing unit (CPU) of a computer device.
- the above-mentioned macroinstruction memory 501 and macroinstruction controller 502 are respectively connected to the processor in the computer device, which means that the macroinstruction memory 501 and macroinstruction controller 502 are respectively electrically connected to the processor in the computer device, and the macroinstruction memory 501 and macroinstruction controller 502 can respectively receive instructions/signals/data sent by the processor.
- the above-mentioned macroinstruction memory 501 is connected to the macroinstruction controller 502 , which means that the macroinstruction memory 501 is electrically connected to the macroinstruction controller 502 , and the macroinstruction controller 502 can read instructions/data from the macroinstruction memory 501 .
- the above-mentioned macroinstruction memory 501 can be a random access memory (RAM), such as a resistive random access memory (ReRAM) or a dynamic random access memory (DRAM), etc.
- the macro instruction controller 502 is connected to the macro instruction parsing engine 503, the macro instruction parsing engine 503 is connected to the data reading and writing transmission path 504, and the data reading and writing transmission path 504 is connected to the bus of the computer device.
- the above-mentioned macro instruction controller 502 is connected to the macro instruction parsing engine 503 , which may mean that the macro instruction controller 502 is electrically connected to the macro instruction parsing engine 503 , and the macro instruction controller 502 can send instructions/data to the macro instruction parsing engine 503 .
- the above-mentioned macroinstruction parsing engine 503 is connected to the data read/write transmission path 504 , which means that the macroinstruction parsing engine 503 is electrically connected to the data read/write transmission path 504 , and the macroinstruction parsing engine 503 can send instructions/data to the data read/write transmission path 504 .
- the data read/write transmission path 504 being connected to the bus of the computer device means that the data read/write transmission path 504 is electrically connected to the bus interface of the computer device, so that the data read/write transmission path 504 can read data through the bus of the computer device (that is, from other memory of the computer device, or from memory connected to the computer device through the bus) and write data through the bus of the computer device (that is, to other memory of the computer device, or to memory connected to the computer device through the bus).
- the macroinstruction memory 501 is used to store the macroinstructions issued by the processor; the macroinstructions are used to instruct the transport of multi-dimensional structured data.
- the processor of the computer device may not need to disassemble the multi-dimensional structure data, but directly generate a macro instruction indicating information such as the read and write address and data volume of the multi-dimensional structure data, and send the macro instruction to the macro instruction memory 501 for storage by the macro instruction memory 501.
- the macro instruction controller 502 is used to receive the trigger instruction sent by the processor, read the macro instruction from the macro instruction memory 501 according to the trigger instruction, and send the macro instruction to the macro instruction parsing engine 503.
- the trigger instruction can be used to trigger the macro instruction controller 502 to read a specified macro instruction corresponding to a transfer operation of multi-dimensional structure data.
- the trigger instruction may indicate a macro instruction stored in the macro instruction memory 501 .
- the trigger instruction may include an address of a macro instruction stored in the macro instruction memory 501 .
- one trigger instruction may instruct the macroinstruction controller 502 to extract a single macroinstruction, or one trigger instruction may instruct the macroinstruction controller 502 to extract multiple macroinstructions.
- the trigger instruction can include the addresses of the multiple macroinstructions respectively; the macroinstruction controller 502 extracts the multiple macroinstructions respectively according to the addresses of the multiple macroinstructions.
- the trigger instruction may include the starting address and the instruction count of the multiple macroinstructions, and the macroinstruction controller 502 may read the multiple macroinstructions in sequence starting from the starting address.
- in this way, the trigger instruction only needs to carry one address and one instruction count to instruct the macroinstruction controller 502 to extract multiple macroinstructions, which saves transmission resources for the trigger instruction and improves the indication efficiency of the macro instructions.
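The start-address-plus-count form of the trigger instruction can be sketched as follows. This is an illustrative Python model; `fetch_macros` and the dict-based macro instruction memory are our assumptions, not the application's interface.

```python
def fetch_macros(macro_memory: dict, start_addr: int, count: int) -> list:
    """Read `count` macro instructions in sequence starting from `start_addr`,
    as the macro instruction controller does when a trigger instruction
    carries only a starting address and an instruction count."""
    return [macro_memory[start_addr + i] for i in range(count)]
```

One trigger thus names many macro instructions with just two fields, instead of listing every address explicitly.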
- the macroinstruction parsing engine 503 is used to parse the macroinstructions into micro-operation codes and send the micro-operation codes to the data read/write transmission path 504; the micro-operation codes are used to instruct the transfer of one-dimensional structure data in the multi-dimensional structure data.
- the macroinstruction parsing engine 503 can parse the macroinstructions through hardware circuits to parse one macroinstruction into multiple micro-operation codes, and send the micro-operation codes to the data read and write transmission path 504.
- the multidimensional structure data is composed of multiple one-dimensional structure data.
- the addresses of the multidimensional structure data may be continuous; or the addresses of the multidimensional structure data may be discontinuous, for example, among the multiple one-dimensional structure data constituting the multidimensional structure data, the addresses of each one-dimensional structure data are continuous, and the addresses of different one-dimensional structure data are discontinuous.
- the multidimensional structure data is an n ⁇ m two-dimensional structure data
- it can be regarded as a matrix composed of n one-dimensional structure data, and the length of each one-dimensional structure data is m; or, the n ⁇ m two-dimensional matrix data can also be regarded as a matrix composed of m one-dimensional structure data, and the length of each one-dimensional structure data is n.
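The two equivalent views of an n × m block described above can be checked with a small sketch. This is illustrative Python assuming row-major addressing; the function names are ours. Each view returns (start offset, length) pairs, and in the column view consecutive elements of a block are m addresses apart rather than contiguous.

```python
def rows_view(n: int, m: int):
    """View an n x m block as n one-dimensional blocks of length m
    (row-major: each row is address-contiguous)."""
    return [(row * m, m) for row in range(n)]

def cols_view(n: int, m: int):
    """View the same block as m one-dimensional blocks of length n
    (each column starts at its column offset; elements are m apart)."""
    return [(col, n) for col in range(m)]
```

Either decomposition covers all n × m elements; which one the parsing engine uses determines how many micro-operation codes a macro instruction expands into.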
- FIG. 6 shows an example schematic diagram of a hardware circuit parsing macro instruction involved in the present application.
- taking as an example a filling instruction for a two-dimensional data block, the instruction configuration of which includes:
- 0x represents a hexadecimal value.
- the macro instruction parsing engine 503 parses the two-dimensional data block into five one-dimensional transport instructions, namely three one-dimensional constant-filling micro operation codes (micro operations, uOPs) and two one-dimensional transport micro operation codes, specifically including:
- uOP 1 (constant filling): uop_type constant_filling; data_type INT32; read_addr NA; write_addr 0x0;
- uOP 2 (transport): uop_type read_write; data_type INT32; read_addr 0x0;
- uOP 3 (constant filling): uop_type constant_filling; data_type INT32; read_addr NA;
- uOP 4 (transport): uop_type read_write; data_type INT32;
- uOP 5 (constant filling): uop_type constant_filling; data_type INT32; read_addr NA;
- the above-mentioned macroinstruction parsing engine 503 can parse macroinstructions through hardware circuits, and determine the read and write addresses and data sizes (the data structure is fixed to one dimension) of each one-dimensional structure data in the multidimensional structure data according to the data structure, read and write addresses and data size of the multidimensional structure data in the macroinstructions, thereby generating corresponding micro-operation codes.
- the data read/write transmission path 504 is used to read and write one-dimensional structured data to the bus according to the micro-operation code.
- the above-mentioned read address can be the starting address for data reading
- the above-mentioned write address can be the starting address for data writing
- according to the read address, the data read and write transmission path 504 can first read, through the bus and starting from the read address, one-dimensional structure data of a length corresponding to the above-mentioned data size, and cache the one-dimensional structure data in the data read and write transmission path 504; then, according to the write address, it writes the above-mentioned one-dimensional structure data in sequence through the bus starting from the write address, thereby completing the transfer of the one-dimensional structure data from the read address to the write address. When all the micro-operation codes obtained by parsing a macro instruction have been processed, the transfer of the multi-dimensional structure data corresponding to that macro instruction can be considered complete.
- the scheme shown in the embodiment of the present application is that the processor sends a macro instruction to the DMA system, which is used to instruct the transport of multi-dimensional structure data; the macro instruction controller in the DMA system sends the macro instruction to the macro instruction parsing engine; the macro instruction parsing engine parses the macro instruction into a micro operation code for instructing the transport of one-dimensional structure data in the multi-dimensional structure data, and sends it to the data read and write transmission path, and the data read and write transmission path executes the relocation of the one-dimensional structure data according to the micro operation code.
- the processor directly instructs the DMA system to transport the multi-dimensional structure data, and then the macro instruction parsing engine in the DMA system disassembles the transport operation of the multi-dimensional structure data into multiple transport operations of one-dimensional structure data. Since the speed at which the DMA system performs the analysis of the macro instruction is much faster than the speed at which the processor disassembles the transport operation of the multi-dimensional structure data, the above DMA system can greatly reduce the time required for the processor to generate the instruction for transporting the one-dimensional structure data, as well as the configuration overhead of sending the above instruction to the DMA system for execution, thereby improving the efficiency of data transport.
- the transfer of complex data structures in the computing system is generally broken down by the processor into specific transfer data sizes and source and destination addresses (corresponding to the above read addresses and write addresses), and then sent to the DMA for transfer.
- This system cannot meet the storage and computing bandwidth requirements of heterogeneous processors (such as AI processors, image/video encoding chips) for parallel data transmission, as well as the transfer requirements of complex data structures.
- the DMA system structure shown in Figure 1 or Figure 2 can only handle the transfer of simple one-dimensional data structures, that is, data blocks with only a starting point and a length.
- when faced with the complex and diverse data structures in current popular AI (artificial intelligence) inference and training processors, this feature cannot be processed directly by the DMA and must instead be handled by software: the processor disassembles the complex data structure, instruction by instruction, into data blocks each with an exact length and read/write address before sending them to the DMA for transfer. This often consumes a large amount of processor hardware resources and time, reducing the effective performance of the other business programs executed by the processor.
- a DMA system which can parse complex instruction sequences by itself and efficiently transfer them, and has the function of executing data synchronization with the processor and other subsystems:
- the DMA system can be applied to AI reasoning and training chips, the handling of complex data structures, and scenarios where high computing power requires high handling bandwidth;
- the DMA system can also be used in image codec chips, where image codec chips also have the same usage scenarios as AI reasoning and training, namely, the scenario of moving complex data structures.
- the data read and write transmission path 504 includes a plurality of processing units, and the plurality of processing units at least include a read data interface unit 504a, a cache unit 504b, and a write data interface unit 504c;
- the macro instruction parsing engine 503 is connected to the read data interface unit 504a, and the macro instruction parsing engine 503 is used to send the micro operation code to the read data interface unit 504a;
- micro-operation codes are used to be transmitted sequentially among multiple processing units.
- the macroinstruction parsing engine 503 is connected to the read data interface unit 504a, which means that the macroinstruction parsing engine 503 is electrically connected to the read data interface unit 504a, so that the macroinstruction parsing engine 503 can send a signal to the read data interface unit 504a, thereby being able to send the micro-operation code to the read data interface unit 504a.
- multiple processing units in the above-mentioned data read and write transmission path 504 are connected in sequence, and, during the data transfer process, starting from the read data interface unit 504a, the read data interface unit 504a receives the micro-operation code sent by the macro instruction parsing engine 503, reads the one-dimensional structure data from the bus according to the micro-operation code, and sends the one-dimensional structure data and the micro-operation code together to the next processing unit.
- a processing unit receives the one-dimensional structure data and micro-operation code sent by the previous processing unit, it passes the one-dimensional structure data and the micro-operation code to the next processing unit.
- the current processing unit can also perform corresponding processing on the one-dimensional structure data, and then pass the processed one-dimensional structure data and the micro-operation code to the next processing unit.
- the write data interface unit 504c receives the one-dimensional structure data and micro-operation code sent by the previous processing unit, it writes the one-dimensional structure data to the bus according to the micro-operation code.
- the macroinstruction parsing engine 503 only needs to send the micro-operation code to the read data interface unit 504a, and does not need to send it to each processing unit that needs the micro-operation code, thereby reducing the design and wiring difficulty of the DMA system.
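The in-order hand-off of the micro-operation code alongside the data can be sketched as a chain of stages. This is an illustrative Python model; the stage functions `cache_unit` and `convert_unit` are hypothetical examples, not the disclosed hardware units, and the 8-bit truncation is only a stand-in for a format conversion.

```python
from typing import Callable, List

def run_pipeline(stages: List[Callable], uop: dict, data: list) -> list:
    """Pass (data, uop) through each processing unit in sequence, as in the
    read-interface -> cache -> write-interface chain; every stage forwards
    the micro-operation code unchanged alongside the (possibly processed) data."""
    for stage in stages:
        data = stage(data, uop)
    return data

def cache_unit(data: list, uop: dict) -> list:
    # Buffer the data and forward it untouched.
    return list(data)

def convert_unit(data: list, uop: dict) -> list:
    # Hypothetical format conversion stage: truncate each element to 8 bits
    # when the micro-op requests conversion; otherwise pass through.
    return [x & 0xFF for x in data] if uop.get("convert") else data
```

Because the micro-op travels with the data, only the first stage ever needs to receive it from the parsing engine, which is the wiring simplification claimed above.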
- the macroinstruction parsing engine 503 is used to parse the data structure of the multidimensional structure data when parsing the macroinstruction into a micro-operation code, and determine the read and write addresses and data size of the one-dimensional structure data in the multidimensional structure data; and generate a micro-operation code based on the read and write addresses and data size of the one-dimensional structure data.
- the above-mentioned macro instruction may include information such as the read address, write address, data structure, and data size of the multi-dimensional structure data.
- the macro instruction parsing engine 503 may determine the data size of each one-dimensional structure data in the multi-dimensional structure data, and then, combining the read address and write address of the multi-dimensional structure data, determine the read address and write address of each one-dimensional structure data.
- the macro instruction parsing engine 503 can then generate a micro operation code including the read address, write address and data size of the one-dimensional structure data, with each one-dimensional structure data corresponding to one micro operation code.
- the data structure of the above-mentioned multi-dimensional structure data may include the original data structure of the multi-dimensional structure data (also called the source-end data structure), and may also include the data structure of the multi-dimensional structure data after being moved (also called the destination-end data structure). If the data structure only includes the source-end data structure, the data structure of the multi-dimensional structure data can be considered unchanged after the move; if it also includes the destination-end data structure, the data structure can be considered changed after the move.
- FIG. 8 shows a schematic diagram of the structure of a direct memory access system provided by another exemplary embodiment of the present application.
- the macro instruction is also used to instruct to perform format conversion processing on multi-dimensional structure data
- the micro operation code is also used to instruct to perform format conversion processing on one-dimensional structure data;
- the multiple processing units also include at least one of a format conversion pre-processing unit 504d and a format conversion post-processing unit 504e; the format conversion pre-processing unit 504d is located before the cache unit, and the format conversion post-processing unit 504e is located after the cache unit. Here, being located before the cache unit means that, in the order in which the multiple processing units are connected in sequence, the format conversion pre-processing unit precedes the cache unit; correspondingly, being located after the cache unit means that the format conversion post-processing unit follows the cache unit in that order;
- the format conversion pre-processing unit 504d is used to perform format conversion pre-processing on the one-dimensional structure data according to the format conversion processing mode indicated by the micro-operation code, and write the result data of the pre-processing into the cache unit 504b;
- the format conversion post-processing unit 504e is used to perform format conversion post-processing on the data in the cache unit 504b according to the format conversion processing mode indicated by the micro-operation code, and then send it to the next processing unit.
- the format conversion of data may include one or more combinations of data space conversions such as data type conversion, shifting, splicing, deletion, interleaving, mirroring, and filling.
- such conversions are usually implemented by the processor executing combinations of program instructions; in an embodiment of the present application, at least one of the format conversion pre-processing unit and the format conversion post-processing unit is set before or after the cache unit, so that data format conversion is implemented through hardware circuits, thereby improving the efficiency of data format conversion.
- a "pre-processing-cache-post-processing" structure is set in the data read and write transmission path 504 to process the data format conversion process step by step.
- for example, the format conversion operation of data mirroring plus tail padding requires that a piece of data be segmented and reversed during pre-processing, stored in the cache, taken out in reverse order during post-processing, and finally output downstream after the tail padding is appended.
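The split of mirror-plus-tail-padding across the pre-processing and post-processing stages can be sketched as follows (a behavioral sketch only; the pad value and length are illustrative parameters, not the patent's encoding):

```python
def mirror_tail_pad(segment, pad_value, pad_count):
    # Pre-processing stage: write the segment into the cache reversed
    # (the mirroring step).
    cache = list(reversed(segment))
    # Post-processing stage: drain the cache and append tail padding.
    return cache + [pad_value] * pad_count

mirror_tail_pad([1, 2, 3], 0, 2)  # → [3, 2, 1, 0, 0]
```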
- each format conversion operation can correspond to a different format conversion pre-processing unit 504d or format conversion post-processing unit 504e.
- a DMA structure can support a limited number of conversion types.
- Figure 9 shows an example schematic diagram of the format conversion involved in the present application.
- the data format conversion operation is to broadcast a group of one-dimensional data (7, 8, 9) into a group of two-dimensional arrays ((7, 7, 7), (8, 8, 8), (9, 9, 9)), and the hardware circuit involved includes at least the following circuit units: a multiplexer, a counter and a replication unit.
- the counter accumulates from 0 to N-1 (N is the number of elements in the input one-dimensional array) and serves as the selection signal of the multiplexer, selecting each element of the read-in data in turn; the replication unit then copies each selected element and writes the copies into the cache unit, where the number of copies is the broadcast number configured by the instruction.
- broadcasting a set of one-dimensional data (7, 8, 9) into a set of two-dimensional arrays requires one read operation and three copy operations, which specifically include the following steps:
- Step 1: perform a read operation to input the one-dimensional data (7, 8, 9) into the format conversion pre-processing unit, and then perform the first copy operation to obtain (7, 7, 7);
- Step 2: perform the second copy operation to obtain ((7, 7, 7), (8, 8, 8));
- Step 3: perform the third copy operation to finally obtain the two-dimensional array ((7, 7, 7), (8, 8, 8), (9, 9, 9)).
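The counter/multiplexer/replication-unit datapath for this broadcast can be modeled in a few lines of Python (a behavioral sketch of the circuit described above, not the hardware implementation):

```python
def broadcast(vector, copies):
    out = []
    for select in range(len(vector)):   # counter: accumulates 0 .. N-1
        element = vector[select]        # multiplexer: select signal picks one element
        out.append([element] * copies)  # replication unit: emit `copies` duplicates
    return out

broadcast([7, 8, 9], 3)
# → [[7, 7, 7], [8, 8, 8], [9, 9, 9]]
```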
- the macro instruction includes a first format conversion configuration, which is a configuration for performing format conversion on the multi-dimensional structure data; for example, the first format conversion configuration may indicate a method for performing format conversion on the multi-dimensional structure data;
- the macro instruction parsing engine 503 is used to generate a second format conversion configuration according to the first format conversion configuration when generating a micro operation code according to the read/write address and data size of the one-dimensional structure data, wherein the second format conversion configuration is a configuration for performing format conversion on the one-dimensional structure data; for example, the second format conversion configuration may indicate a method for format conversion of the one-dimensional structure data;
- a micro-operation code is generated according to the read and write addresses, data size, and the second format conversion configuration of the one-dimensional structure data.
- a format conversion configuration can be set in the micro-operation code to support control of the format conversion operation and improve the application flexibility of the format conversion.
- in order to implement the format conversion configuration set in the micro-operation code, it needs to be indicated in the macro instruction. Since the macro instruction is set for multi-dimensional structure data, the format conversion configuration in it is the configuration corresponding to the multi-dimensional structure data. Therefore, when the macro instruction parsing engine 503 parses the macro instruction to generate the micro-operation code, it needs to generate the format conversion configuration for the one-dimensional structure data in the micro-operation code according to the format conversion configuration set for the multi-dimensional structure data in the macro instruction. In other words, the format conversion configuration set for the multi-dimensional structure data in the macro instruction is updated to obtain the format conversion configuration for the one-dimensional structure data in the micro-operation code.
- the macro instruction parsing engine 503 can query a pre-stored correspondence between a first format conversion configuration and a second format conversion configuration, and determine a second format conversion configuration corresponding to the first format conversion configuration.
- the macroinstruction is further used to instruct the execution of an operation process on multi-dimensional structure data
- the micro-operation code is further used to instruct the execution of an operation process on one-dimensional structure data
- the plurality of processing units also include a data operation unit 504f;
- the data operation unit is used to perform operation processing on the input data according to the operation processing method indicated by the micro-operation code, and then send it to the next processing unit.
- the above-mentioned data operations may include data type conversion, for example, 32-bit integer type (32bit integer, INT32) to FP32, and data calculation, such as addition, subtraction, multiplication and division of the arithmetic and logic unit (Arithmetic and Logic Unit, ALU).
- the macroinstructions and micro-operation codes respectively include operation configurations, which are configurations of operation processing.
- the operation configurations can be used to indicate the operation processing method, for example, the operation configurations can indicate that the operation processing method is a combination of one or more operation methods such as addition, subtraction, multiplication, and division.
- the operation configuration can be set in the micro-operation code to support the control of the operation operation and improve the application flexibility of the operation operation.
- to support the setting of the operation configuration in the micro-operation code, it is necessary to indicate the operation configuration in the macro instruction.
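One way to model an operation configuration selecting an ALU function is a dispatch table; the operation names and the `operand` field below are illustrative assumptions, not the patent's actual encoding:

```python
OPS = {
    "add": lambda x, c: x + c,
    "sub": lambda x, c: x - c,
    "mul": lambda x, c: x * c,
    "int32_to_fp32": lambda x, c: float(x),  # data type conversion
}

def data_operation_unit(data, uop):
    # Apply the element-wise operation named by the micro-op's
    # operation configuration; the result flows to the next unit.
    fn = OPS[uop["op"]]
    c = uop.get("operand", 0)
    return [fn(x, c) for x in data]

data_operation_unit([1, 2, 3], {"op": "mul", "operand": 4})  # → [4, 8, 12]
```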
- the macroinstruction and the micro-operation code further include a synchronization configuration, the synchronization configuration is used to instruct to perform a status check on a target address, the target address being at least one of a read address and a write address of data;
- the macro instruction controller 502 is used to check the state of the target address according to the synchronization configuration in the macro instruction when sending the macro instruction to the macro instruction parsing engine 503, and send the macro instruction to the macro instruction parsing engine 503 when the state of the target address meets the specified conditions;
- the specified conditions include: when the target address includes a read address, the state of the read address is readable; when the target address includes a write address, the state of the write address is writable;
- a read data interface unit 504a configured to read one-dimensional structure data from the read address when the target address includes the read address and the state of the read address is readable;
- the write data interface unit 504c is used to write data to the write address when the target address includes the write address and the state of the write address is writable.
- the above-mentioned synchronization status refers to the state of the source end (i.e., the external device or memory from which the DMA reads data) and of the destination end (i.e., the external device or memory to which the DMA writes data), that is, whether the source end is readable and whether the destination end is writable; each state is generally represented by a 1-bit status bit.
- These status bits are stored in the processor and are globally visible, generally containing multiple groups, such as 32 groups, to meet the read and write requirements of multiple DMA or computing units.
- each group of status bits can have an identifier (ID).
- when a macro instruction is configured with "ID0: check its read status, do not check the write status", before the DMA system executes the read operation corresponding to the macro instruction, it first checks whether ID0 is readable, and initiates the read operation only after confirming that it is readable; after reading, the processor is notified to update the read status. Since the write status is configured not to be checked, the write operation is initiated directly.
- macro instructions that can immediately execute data transfer can be issued first, thereby improving the efficiency of data transfer.
- the above-mentioned synchronization check operation can be executed respectively in the macroinstruction controller 502, the read data interface unit 504a and the write data interface unit 504c, which requires setting the above-mentioned synchronization configuration in both the macroinstruction and the micro-operation code. Accordingly, the macroinstruction controller 502, the read data interface unit 504a and the write data interface unit 504c can first check the status of the read address/write address before executing the macroinstruction issuance, reading data and writing data, and determine whether to perform the corresponding operation according to the status of the read address/write address.
- when the result of the synchronization check by the macroinstruction controller 502 is that the condition is not met, the current macroinstruction can be stored in the instruction waiting queue and the next macroinstruction processed; after waiting for a period of time, the macroinstructions in the instruction waiting queue can undergo the synchronization check again, improving the concurrent processing capability for macroinstructions.
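The check-or-park behavior of the macroinstruction controller can be sketched as follows (the status-flag layout and all field names are assumptions made for illustration):

```python
from collections import deque

def dispatch(macros, status, max_rounds=10):
    # status maps a synchronization ID to its readable/writable bits,
    # e.g. {"ID0": {"readable": True, "writable": False}}.
    waiting = deque(macros)
    issued = []
    for _ in range(max_rounds):
        for _ in range(len(waiting)):
            m = waiting.popleft()
            s = status[m["sync_id"]]
            ok = ((not m["check_read"] or s["readable"]) and
                  (not m["check_write"] or s["writable"]))
            if ok:
                issued.append(m["name"])  # send to the parsing engine
            else:
                waiting.append(m)         # park in the instruction waiting queue
        if not waiting:
            break
    return issued

macros = [
    {"name": "A", "sync_id": "ID0", "check_read": True,  "check_write": False},
    {"name": "B", "sync_id": "ID1", "check_read": False, "check_write": True},
]
status = {"ID0": {"readable": True,  "writable": False},
          "ID1": {"readable": False, "writable": False}}
dispatch(macros, status)  # → ["A"]  (B stays parked: ID1 is not writable)
```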
- Figure 10 shows a structure diagram of a macro instruction provided by an exemplary embodiment of the present application.
- the present application provides a DMA macro instruction for expressing multi-dimensional complex data structures and data operations such as transport, format conversion, and calculation.
- Source data block structure information: includes the data type, such as half-precision floating point (Floating Point 16, FP16), 8-bit integer (8bit integer, INT8), or single-precision floating point (Floating Point 32, FP32).
- Data dimension information: the sizes of dimension 0, dimension 1, dimension 2, dimension 3, etc.; that is, a data block is no longer described in traditional bytes but as a data structure.
- Destination data structure: also includes data type and data dimension information, but its type and size may differ from those of the source. That is, it expresses the operation of taking part of a multi-dimensional data block out of a multi-dimensional data block and performing data type conversion, an operation particularly common in AI inference and training.
- Format conversion operation configuration: including but not limited to matrix transposition, data broadcasting, data mirroring, data padding, data shifting, and data splicing.
- Operation configuration: including but not limited to data type conversion (such as FP32 to FP16), data accumulation, and activation (ReLU) operations.
- Number of loops: a macro instruction can be executed once or multiple times.
- Read step and write step: in conjunction with the loop operation, the read and write addresses are stepped in each loop.
- Synchronization configuration: mainly includes the synchronization check enable, synchronization update enable, and the synchronization relationship IDs of the source and destination.
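The fields above can be modeled as a record. The field names and defaults below are illustrative only, not the patent's actual bit layout:

```python
from dataclasses import dataclass, field

@dataclass
class MacroInstruction:
    src_addr: int
    dst_addr: int
    src_dtype: str = "FP16"                        # source data type
    src_dims: tuple = (1,)                         # dim0, dim1, dim2, dim3 ...
    dst_dtype: str = "FP16"                        # destination data type
    dst_dims: tuple = (1,)
    fmt_cfg: dict = field(default_factory=dict)    # format conversion configuration
    op_cfg: dict = field(default_factory=dict)     # operation configuration
    loops: int = 1                                 # number of loops
    read_step: int = 0                             # read-address step per loop
    write_step: int = 0                            # write-address step per loop
    sync: dict = field(default_factory=dict)       # check/update enables + IDs

m = MacroInstruction(src_addr=0x1000, dst_addr=0x2000,
                     src_dims=(2, 4), loops=2, read_step=32)
```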
- FIG. 11 shows a schematic diagram of the process of parsing the macro instruction to generate a micro operation code, which specifically includes the following steps:
- Step 1101: parse the macro instruction loop operation and update the read and write start addresses according to the read and write steps;
- Step 1102: parse the multi-dimensional data structure (corresponding to the multi-dimensional structure data), extract the one-dimensional vector (corresponding to the one-dimensional structure data), and calculate the vector read and write addresses and size;
- Step 1103: update other format conversion configurations according to the parsed one-dimensional vector;
- Step 1104: generate the DMA micro-operation code (uOP).
- updating other format conversion configurations refers to updating the size of the data involved in the format conversion, as well as updating the dimension, because the complex data at this time has been reduced from multi-dimensional to one-dimensional.
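Steps 1101-1104 can be sketched for the simple case of a 2-D block with contiguous rows (a behavioral model under assumed field names; real address arithmetic depends on strides and encodings the patent does not fix here):

```python
def parse_macro(m):
    # Steps 1101-1104 in miniature: loop over executions, stepping the
    # base addresses (step 1101), peel off each row as one one-dimensional
    # vector and compute its addresses and size (step 1102), then emit one
    # uOP per vector carrying its configuration (steps 1103-1104).
    rows, row_elems = m["dims"]
    row_bytes = row_elems * m["elem_size"]
    uops = []
    for loop in range(m["loops"]):
        read_base = m["read_addr"] + loop * m["read_step"]
        write_base = m["write_addr"] + loop * m["write_step"]
        for r in range(rows):
            uops.append({
                "read_addr": read_base + r * row_bytes,
                "write_addr": write_base + r * row_bytes,
                "size": row_bytes,
                "fmt_cfg": m.get("fmt_cfg", {}),
            })
    return uops

uops = parse_macro({"read_addr": 0x1000, "write_addr": 0x2000,
                    "dims": (2, 4), "elem_size": 2,
                    "loops": 1, "read_step": 0, "write_step": 0})
# two uOPs: reads at 0x1000 and 0x1008, writes at 0x2000 and 0x2008, 8 bytes each
```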
- Figure 12 shows a micro-op code provided by an exemplary embodiment of the present application.
- the present application describes a DMA micro-op code, which is similar to the content of the instruction or descriptor of the traditional DMA, consisting of "read and write address + data size", and newly adds format conversion configuration and operation configuration fields, which are used to indicate the format conversion pre/post processing module and data operation module respectively.
- FIG. 13 shows a working timing diagram of a data handling path for streaming processing involved in the present application.
- the micro-op code generated by the parsing engine only needs to be passed to the read data interface module of the first-level link.
- in each subsequent link, the module extracts only the field segments it needs to use and, after processing, passes the micro-op code and data (D) synchronously to the next-level module. This resolves the front-to-back dependency of traditional DMA, in which the control unit must pass the control word to each module in the data channel and coordinate the working rhythm of each module.
- the DMA system may further support multiple virtual channels to support simultaneous scheduling of multiple processors.
- Figure 14 shows a structural diagram of a direct memory access system provided by another exemplary embodiment of the present application.
- the macroinstruction controller 502 is used to receive macroinstructions respectively sent by multiple processors through multiple virtual channels 505, and send the macroinstructions respectively sent by multiple processors to the macroinstruction parsing engine 503 in sequence according to the priorities of the multiple virtual channels 505.
- the DMA system involved in the embodiment of the present application can also add virtual channels to support simultaneous use by multiple users.
- multiple processors can issue instructions to the DMA system at the same time, each processor can register a DMA virtual channel, and the controller of the DMA system can execute the corresponding trigger instructions in sequence according to the priority of each virtual channel.
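The priority-ordered issue across virtual channels can be sketched as follows (the channel structure and a static numeric priority are assumptions; a real controller would arbitrate dynamically):

```python
def schedule(channels):
    # channels: {channel_id: {"priority": int, "macros": [...]}}
    # Higher-priority channels are served first; within a channel,
    # program order is preserved.
    order = sorted(channels, key=lambda c: -channels[c]["priority"])
    issued = []
    for c in order:
        issued.extend(channels[c]["macros"])
    return issued

schedule({"ch0": {"priority": 1, "macros": ["m0a", "m0b"]},
          "ch1": {"priority": 5, "macros": ["m1a"]}})
# → ["m1a", "m0a", "m0b"]
```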
- the DMA system provided in the present application has the following characteristics.
- the processor sends a macro instruction to the DMA system, which describes a multi-dimensional complex data structure and its data operations such as transportation, format conversion, and calculation.
- the processor can issue macro instructions of the DMA system in batches and write them into the macro instruction memory of the DMA system.
- the processor issues a trigger instruction to instruct the DMA system to start executing the macro instruction.
- the content of the trigger instruction may be the starting execution address in the macro instruction memory of the DMA system and the number of instructions.
- the DMA system can execute multiple macroinstructions in batches. Since each macroinstruction may have data dependencies with other components in the chip, each macroinstruction can be configured with synchronization information. The DMA system checks the synchronization relationship and makes its own decisions on the timing of executing each instruction.
- the macroinstruction controller of the DMA system receives the trigger instruction of the processor, reads the macroinstruction from the macroinstruction memory, performs pre-analysis, checks the synchronization relationship, and determines whether to send the instruction to the analysis engine or store it in the waiting queue.
- the DMA system has a built-in macro-instruction parsing engine that parses complex instructions through hardware circuits, rather than having the processor execute a program to parse complex instructions.
- the macro-instruction parsing engine mainly handles the parsing of data structures, decomposing abstract multi-dimensional data structures into multiple micro-operation codes that can be recognized by downstream data paths, namely "read and write address + data size + configuration".
- a format conversion module and a data operation module can be newly added.
- the purpose is to perform format conversion and some calculation operations along the data transmission path to achieve data processing without bandwidth loss.
- the present application provides a streaming processing DMA system. After the module processing of each link is completed, the load (data or instruction) and control word (synchronization relationship, instruction or micro-operation code) are passed to the next link module, and then new processing operations can be received. There is no dependency between the modules in the previous and next links.
- FIG. 15 shows the overall processing flow of the DMA system for processing data transfer provided by the present application. Referring to FIG. 15, the specific process is as follows:
- Step 1501: receive a DMA trigger instruction;
- Step 1502: read the DMA macro instruction according to the address carried by the trigger instruction;
- Step 1503: pre-parse the macro instruction and perform the synchronization relationship check according to the synchronization configuration; if the check fails, the macro instruction is transferred to the instruction waiting queue and the synchronization check is repeated until it succeeds;
- Step 1504: parse the macro instruction into micro-operation codes;
- Step 1505: generate a read operation to the bus according to the instruction read address and data size, such as an AXI read transfer, or an APB, AHB or ACE read operation;
- Step 1506: perform format conversion pre-processing according to the format conversion configuration;
- Step 1507: write the data and micro-op code into the cache;
- Step 1508: perform format conversion post-processing according to the format conversion configuration;
- Step 1509: execute data computing operations according to the computing configuration;
- Step 1510: generate a write operation to the bus according to the instruction write address and data size, such as an AXI write transfer, or an APB, AHB or ACE write operation;
- after the write operation, report instruction completion to instruct the controller to update the synchronization relationship, and return to step 1503.
- FIG. 16 shows a flow chart of a data transfer method provided by an exemplary embodiment of the present application.
- the method is used in the above-mentioned direct memory access system, and the method is executed by a macro instruction controller in the direct memory access system, and the method includes:
- Step 1601: receive a trigger instruction sent by a processor;
- Step 1602: read a macro instruction from the macro instruction memory according to the trigger instruction; the macro instruction is used to instruct the transfer of multi-dimensional structure data;
- Step 1603: send the macro instruction to the macro instruction parsing engine, so that the macro instruction parsing engine parses the macro instruction into micro-operation codes and sends them to the data read and write transmission path, which reads and writes one-dimensional structure data to and from the bus according to the micro-operation codes; a micro-operation code is used to indicate the transfer of one-dimensional structure data in the multi-dimensional structure data.
- the macroinstruction and the micro-operation code further include a synchronization configuration, the synchronization configuration being used to instruct to perform a status check on a target address, the target address being at least one of a read address and a write address of the data;
- the above sending of the macro instruction to the macro instruction parsing engine includes:
- checking the state of the target address according to the synchronization configuration in the macro instruction, and sending the macro instruction to the macro instruction parsing engine when the state of the target address meets the specified conditions;
- the specified conditions include:
- the state of the read address is readable
- the state of the write address is writable.
- receiving a trigger instruction sent by a processor includes: receiving macro instructions respectively sent by multiple processors through multiple virtual channels;
- sending the macro instruction to the macro instruction parsing engine includes:
- sending the macro instructions respectively sent by the multiple processors to the macro instruction parsing engine in sequence according to the priorities of the multiple virtual channels.
- the processor sends a macro instruction to the DMA system, which is used to instruct the transport of multi-dimensional structure data;
- the macro instruction controller in the DMA system sends the macro instruction to the macro instruction parsing engine;
- the macro instruction parsing engine parses the macro instruction into a micro operation code for instructing the transport of one-dimensional structure data in the multi-dimensional structure data, and sends it to the data read and write transmission path, and the data read and write transmission path executes the relocation of the one-dimensional structure data according to the micro operation code.
- the processor directly instructs the DMA system to transport the multi-dimensional structure data, and then the macro instruction parsing engine in the DMA system disassembles the transport operation of the multi-dimensional structure data into multiple transport operations of one-dimensional structure data. Since the speed at which the DMA system performs the analysis of the macro instruction is much faster than the speed at which the processor disassembles the transport operation of the multi-dimensional structure data, the above DMA system can greatly reduce the time required for the processor to generate instructions for transporting one-dimensional structure data, thereby improving the efficiency of data transport.
- FIG. 17 shows a schematic diagram of a data handling device provided by an exemplary embodiment of the present application.
- the device is used for a macro instruction controller in the above direct memory access system, and the device comprises:
- the trigger instruction receiving module 1701 is used to receive the trigger instruction sent by the processor
- the macro instruction reading module 1702 is used to read the macro instruction from the macro instruction storage according to the trigger instruction; the macro instruction is used to instruct the transport of multi-dimensional structure data;
- the macroinstruction sending module 1703 is used to send macroinstructions to the macroinstruction parsing engine so that the macroinstruction parsing engine parses the macroinstructions into micro-operation codes and sends the micro-operation codes to the data read and write transmission path, and the data read and write transmission path reads and writes one-dimensional structure data to the bus according to the micro-operation codes; the micro-operation codes are used to indicate the movement of one-dimensional structure data in multi-dimensional structure data.
- the macroinstruction and the micro-operation code further include a synchronization configuration, the synchronization configuration being used to instruct to perform a status check on a target address, the target address being at least one of a read address and a write address of the data;
- the macro instruction sending module 1703 is used to check the status of the target address according to the synchronization configuration in the macro instruction, and send the macro instruction to the macro instruction parsing engine when the status of the target address meets the specified conditions;
- the specified conditions include:
- the state of the read address is readable
- the state of the write address is writable.
- the trigger instruction receiving module 1701 is used to receive macro instructions respectively sent by multiple processors through multiple virtual channels;
- the macro instruction sending module 1703 is used to send the macro instructions sent by the multiple processors to the macro instruction parsing engine in sequence according to the priorities of the multiple virtual channels.
- the processor sends a macro instruction to the DMA system, which is used to instruct the transport of multi-dimensional structure data;
- the macro instruction controller in the DMA system sends the macro instruction to the macro instruction parsing engine;
- the macro instruction parsing engine parses the macro instruction into a micro operation code for instructing the transport of one-dimensional structure data in the multi-dimensional structure data, and sends it to the data read and write transmission path, and the data read and write transmission path executes the relocation of the one-dimensional structure data according to the micro operation code.
- the processor directly instructs the DMA system to transport the multi-dimensional structure data, and then the macro instruction parsing engine in the DMA system disassembles the transport operation of the multi-dimensional structure data into multiple transport operations of one-dimensional structure data. Since the speed at which the DMA system performs the analysis of the macro instruction is much faster than the speed at which the processor disassembles the transport operation of the multi-dimensional structure data, the above DMA system can greatly reduce the time required for the processor to generate instructions for transporting one-dimensional structure data, thereby improving the efficiency of data transport.
- An embodiment of the present application further provides a computer device, which includes a direct memory access system as shown in any one of FIG. 5 , FIG. 7 , FIG. 8 and FIG. 14 .
- the embodiment of the present application further provides a chip, which includes the direct memory access system shown in any one of Figures 5, 7, 8 and 14.
- the chip can be a chip other than a processor for executing a direct memory access process, for example, the chip can be implemented as a DMA controller, or the chip can include one or more DMA controllers; the DMA controller includes the direct memory access system shown in any one of Figures 5, 7, 8 and 14.
- An embodiment of the present application further provides a computer device, which may include the above chip, and the chip includes the direct memory access system shown in any one of Figures 5, 7, 8 and 14.
- An embodiment of the present application also provides a computer-readable storage medium, in which at least one program is stored, and the at least one program is loaded and executed by a controller to implement the above-mentioned data transfer method; the controller is a macro instruction controller in the direct memory access system shown in any of the above-mentioned Figures 5, 7, 8 and 14.
- the computer readable storage medium may include: a read-only memory (ROM), a random access memory (RAM), a solid state drive (SSD) or an optical disk, etc.
- the random access memory may include a resistance random access memory (ReRAM) and a dynamic random access memory (DRAM).
Description
本申请要求于2023年09月13日提交的、申请号为202311177686.7、发明名称为“直接内存访问系统、数据搬运方法、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims priority to Chinese patent application filed on September 13, 2023, with application number 202311177686.7 and invention name “Direct Memory Access System, Data Transfer Method, Device and Storage Medium”, the entire contents of which are incorporated by reference into this application.
本申请涉及计算机技术领域,特别涉及一种直接内存访问系统、数据搬运方法、设备及存储介质。The present application relates to the field of computer technology, and in particular to a direct memory access system, a data transfer method, a device and a storage medium.
直接内存访问(Direct Memory Access,DMA)系统又称直接数据搬运硬件单元,其能够在不需要中央处理器(Central Processing Unit,CPU)参与的情况下实现数据搬运。The Direct Memory Access (DMA) system is also known as the direct data transfer hardware unit, which can realize data transfer without the participation of the Central Processing Unit (CPU).
相关技术中,在执行数据搬运时,计算机设备中的处理器可以向DMA系统发送数据搬运的指令,通过该指令将要搬迁的数据读取地址、数据长度以及写入地址指示给DMA系统,由DMA系统根据读取地址和数据长度从总线读取数据,并将读取的数据从写入地址开始依次写回总线。In the related technology, when executing data transfer, the processor in the computer device can send a data transfer instruction to the DMA system, and through the instruction, the read address, data length and write address of the data to be transferred are indicated to the DMA system. The DMA system reads the data from the bus according to the read address and data length, and writes the read data back to the bus in sequence starting from the write address.
然而,上述方案中,处理器需要先根据要搬迁的数据的结构,生成包含读取地址、数据长度以及写入地址的指令,该指令生成过程需要消耗较长的时间,从而影响数据搬运的效率。However, in the above solution, the processor needs to first generate an instruction including a read address, a data length, and a write address according to the structure of the data to be moved. The instruction generation process takes a long time, thereby affecting the efficiency of data transfer.
发明内容Summary of the invention
本申请实施例提供了一种直接内存访问系统、数据搬运方法、设备及存储介质,能够提高数据搬运的效率。所述技术方案如下。The embodiments of the present application provide a direct memory access system, a data transfer method, a device and a storage medium, which can improve the efficiency of data transfer. The technical solution is as follows.
一方面,提供一种直接内存访问系统,所述系统包括:宏指令存储器、宏指令控制器、宏指令解析引擎以及数据读写传输通路;In one aspect, a direct memory access system is provided, the system comprising: a macro instruction memory, a macro instruction controller, a macro instruction parsing engine, and a data read and write transmission path;
所述宏指令存储器和所述宏指令控制器分别与计算机设备中的处理器相连,且所述宏指令存储器与所述宏指令控制器相连;The macro instruction memory and the macro instruction controller are respectively connected to a processor in a computer device, and the macro instruction memory is connected to the macro instruction controller;
所述宏指令控制器与所述宏指令解析引擎相连,所述宏指令解析引擎与所述数据读写传输通路相连,所述数据读写传输通路与所述计算机设备的总线相连;The macro instruction controller is connected to the macro instruction parsing engine, the macro instruction parsing engine is connected to the data reading and writing transmission path, and the data reading and writing transmission path is connected to the bus of the computer device;
所述宏指令存储器,用于存储所述处理器下发的宏指令;所述宏指令用于指示搬运多维结构数据;The macro instruction memory is used to store the macro instructions issued by the processor; the macro instructions are used to instruct the transport of multi-dimensional structure data;
所述宏指令控制器,用于接收处理器下发的触发指令,根据所述触发指令从所述宏指令存储器中读取所述宏指令,并将所述宏指令发送给所述宏指令解析引擎;The macro instruction controller is used to receive the trigger instruction sent by the processor, read the macro instruction from the macro instruction memory according to the trigger instruction, and send the macro instruction to the macro instruction parsing engine;
所述宏指令解析引擎,用于将所述宏指令解析为微操作码,并将所述微操作码发送给所述数据读写传输通路;所述微操作码用于指示搬运所述多维结构数据中的一维结构数据;The macro instruction parsing engine is used to parse the macro instruction into a micro operation code and send the micro operation code to the data read and write transmission path; the micro operation code is used to instruct the transfer of one-dimensional structure data in the multi-dimensional structure data;
所述数据读写传输通路,用于根据所述微操作码向所述总线读取和写入所述一维结构数据。The data read/write transmission path is used to read and write the one-dimensional structure data to the bus according to the micro-operation code.
另一方面,提供了一种数据搬运方法,所述方法用于如上所述的直接内存访问系统,且所述方法由所述直接内存访问系统中的宏指令控制器执行,所述方法包括:On the other hand, a data handling method is provided, the method is used in the direct memory access system as described above, and the method is executed by a macro instruction controller in the direct memory access system, the method comprising:
接收处理器下发的触发指令;Receive the trigger instruction sent by the processor;
根据所述触发指令从宏指令存储器中读取所述宏指令;所述宏指令用于指示搬运多维结构数据;Reading the macro instruction from the macro instruction storage according to the trigger instruction; the macro instruction is used to instruct the transport of multi-dimensional structure data;
将所述宏指令发送给宏指令解析引擎，以便所述宏指令解析引擎将所述宏指令解析为微操作码，并将所述微操作码发送给数据读写传输通路，由所述数据读写传输通路根据所述微操作码向总线读取和写入一维结构数据；所述微操作码用于指示搬运所述多维结构数据中的所述一维结构数据。The macro instruction is sent to a macro instruction parsing engine so that the macro instruction parsing engine parses the macro instruction into a micro-operation code and sends the micro-operation code to a data read-write transmission path, and the data read-write transmission path reads and writes one-dimensional structure data to the bus according to the micro-operation code; the micro-operation code is used to instruct the transfer of the one-dimensional structure data in the multi-dimensional structure data.
另一方面,提供了一种数据搬运装置,所述装置用于如上所述的直接内存访问系统中的宏指令控制器,所述装置包括:On the other hand, a data handling device is provided, the device is used for a macro instruction controller in the direct memory access system as described above, the device comprising:
触发指令接收模块,用于接收处理器下发的触发指令;A trigger instruction receiving module, used to receive a trigger instruction sent by a processor;
宏指令读取模块,用于根据所述触发指令从宏指令存储器中读取所述宏指令;所述宏指令用于指示搬运多维结构数据;A macro instruction reading module, used for reading the macro instruction from the macro instruction storage according to the trigger instruction; the macro instruction is used for instructing the transport of multi-dimensional structure data;
宏指令发送模块,用于将所述宏指令发送给宏指令解析引擎,以便所述宏指令解析引擎将所述宏指令解析为微操作码,并将所述微操作码发送给数据读写传输通路,由所述数据读写传输通路根据所述微操作码向总线读取和写入一维结构数据;所述微操作码用于指示搬运所述多维结构数据中的所述一维结构数据。A macroinstruction sending module is used to send the macroinstruction to a macroinstruction parsing engine so that the macroinstruction parsing engine parses the macroinstruction into a micro-operation code and sends the micro-operation code to a data read-write transmission path, and the data read-write transmission path reads and writes one-dimensional structure data to a bus according to the micro-operation code; the micro-operation code is used to indicate the transfer of the one-dimensional structure data in the multi-dimensional structure data.
另一方面,提供了一种计算机设备,所述计算机设备包括如上所述的直接内存访问系统。On the other hand, a computer device is provided, wherein the computer device comprises the direct memory access system as described above.
另一方面,提供了一种计算机可读存储介质,所述存储介质中存储有至少一段程序,所述至少一段程序由控制器加载并执行以实现如上所述的数据搬运方法;所述控制器是如上所述的直接内存访问系统中的宏指令控制器。On the other hand, a computer-readable storage medium is provided, in which at least one program is stored. The at least one program is loaded and executed by a controller to implement the data transfer method described above; the controller is a macro instruction controller in the direct memory access system described above.
另一方面,提供了一种计算机程序产品,包括计算机程序,所述计算机程序被控制器执行时实现如上所述的数据搬运方法;所述控制器是如上所述的直接内存访问系统中的宏指令控制器。On the other hand, a computer program product is provided, comprising a computer program, wherein when the computer program is executed by a controller, the data transfer method described above is implemented; the controller is a macro instruction controller in the direct memory access system described above.
另一方面,提供了一种芯片,所述芯片包括如上所述的直接内存访问系统。On the other hand, a chip is provided, wherein the chip includes the direct memory access system as described above.
另一方面,提供了一种计算机设备,所述计算机设备包括如上所述的芯片。On the other hand, a computer device is provided, wherein the computer device comprises the chip as described above.
本申请实施例提供的技术方案带来的有益效果至少包括:The beneficial effects brought by the technical solution provided by the embodiment of the present application include at least:
处理器下发给DMA系统的是一种宏指令,用于指示搬运多维结构数据;DMA系统中的宏指令控制器将宏指令发送给宏指令解析引擎;宏指令解析引擎将宏指令解析为用于指示搬运多维结构数据中的一维结构数据的微操作码,并发送给数据读写传输通路,由数据读写传输通路根据微操作码执行一维结构数据的搬迁。在上述方案中,处理器直接向DMA系统指示了多维结构数据的搬运,后续由DMA系统中的宏指令解析引擎将多维结构数据的搬运操作拆解为多个一维结构数据的搬运操作,由于DMA系统对宏指令执行解析的速度要远快于处理器拆解多维结构数据的搬运操作的速度,因此,通过上述DMA系统,能够极大地降低处理器生成一维结构数据的搬运的指令所需要的时间,并且降低处理器将上述指令下发至DMA系统执行的配置开销,从而提高数据搬运的效率。What the processor sends to the DMA system is a macro instruction for instructing the transport of multi-dimensional structure data; the macro instruction controller in the DMA system sends the macro instruction to the macro instruction parsing engine; the macro instruction parsing engine parses the macro instruction into a micro operation code for instructing the transport of one-dimensional structure data in the multi-dimensional structure data, and sends it to the data read-write transmission path, and the data read-write transmission path executes the relocation of the one-dimensional structure data according to the micro operation code. In the above scheme, the processor directly instructs the DMA system to transport the multi-dimensional structure data, and then the macro instruction parsing engine in the DMA system disassembles the transport operation of the multi-dimensional structure data into multiple transport operations of one-dimensional structure data. Since the speed at which the DMA system performs the analysis of the macro instruction is much faster than the speed at which the processor disassembles the transport operation of the multi-dimensional structure data, the above DMA system can greatly reduce the time required for the processor to generate the instruction for transporting the one-dimensional structure data, and reduce the configuration overhead of the processor sending the above instruction to the DMA system for execution, thereby improving the efficiency of data transport.
图1是一种DMA系统的示意图;FIG1 is a schematic diagram of a DMA system;
图2是一种支持分散-聚集离散搬运的DMA系统的示意图;FIG2 is a schematic diagram of a DMA system supporting scatter-gather discrete transfers;
图3是一种DMA指令或描述符的格式示意图;FIG. 3 is a schematic diagram of the format of a DMA instruction or descriptor;
图4是一种DMA实现数据搬运的流程图;FIG4 is a flow chart of a DMA implementation of data transfer;
图5是本申请一个示例性实施例提供的直接内存访问系统的结构示意图;FIG5 is a schematic diagram of the structure of a direct memory access system provided by an exemplary embodiment of the present application;
图6是本申请涉及的硬件电路解析宏指令的一种举例示意图;FIG6 is a schematic diagram showing an example of a hardware circuit parsing macro instruction involved in the present application;
图7是本申请又一个示例性实施例提供的直接内存访问系统的结构示意图;FIG7 is a schematic diagram of the structure of a direct memory access system provided by another exemplary embodiment of the present application;
图8是本申请另一个示例性实施例提供的直接内存访问系统的结构示意图;FIG8 is a schematic diagram of the structure of a direct memory access system provided by another exemplary embodiment of the present application;
图9是本申请涉及的格式转换的一种举例示意图;FIG9 is a schematic diagram showing an example of format conversion involved in the present application;
图10是本申请一个示例性实施例提供的一种宏指令的结构图;FIG10 is a structural diagram of a macro instruction provided by an exemplary embodiment of the present application;
图11是对宏指令执行解析生成微操作码的过程示意图;FIG11 is a schematic diagram of the process of parsing and generating micro-op codes for macro instructions;
图12是本申请一个示例性实施例提供的一种微操作码;FIG12 is a micro-operation code provided by an exemplary embodiment of the present application;
图13是本申请涉及的一种流式处理的数据搬运通路的工作时序图; FIG13 is a timing diagram of a data transport path for stream processing involved in the present application;
图14是本申请再一个示例性实施例提供的直接内存访问系统的结构示意图;FIG14 is a schematic diagram of the structure of a direct memory access system provided by another exemplary embodiment of the present application;
图15是本申请提供的DMA系统处理数据搬运的整体处理流程图;15 is an overall processing flow chart of the DMA system for processing data transfer provided by the present application;
图16是本申请一个示例性实施例提供的一种数据搬运方法的流程图;FIG16 is a flow chart of a data transport method provided by an exemplary embodiment of the present application;
图17是本申请一个示例性实施例提供的一种数据搬运装置的示意图。FIG. 17 is a schematic diagram of a data handling device provided by an exemplary embodiment of the present application.
直接内存访问(Direct Memory Access,DMA)系统又称直接数据搬运硬件单元,是一种不需要中央处理器(Central Processing Unit,CPU)参与,就能实现数据搬运的技术。DMA传输将数据从一个地址空间复制到另一个地址空间,提供在外设和存储器之间或者存储器和存储器之间的高速数据传输。Direct Memory Access (DMA) system, also known as direct data transfer hardware unit, is a technology that can realize data transfer without the participation of the Central Processing Unit (CPU). DMA transfer copies data from one address space to another, providing high-speed data transfer between peripherals and memory or between memory and memory.
CPU具有转移数据、计算、控制程序转移等诸多功能,是系统运作的核心,CPU时刻处理着大量的事务,但有些事务没有那么重要,比如数据的复制和存储数据,如果能够把这部分的CPU资源节约出来,用于CPU处理其他的复杂事务,便能够更好地利用CPU的资源。The CPU has many functions such as transferring data, calculating, and controlling program transfer. It is the core of the system operation. The CPU is always processing a large number of transactions, but some transactions are not that important, such as data copying and storing data. If this part of the CPU resources can be saved and used by the CPU to process other complex transactions, the CPU resources can be better utilized.
其中,转移数据(尤其是转移大量数据)是可以不需要CPU参与的。比如希望计算机设备的外部设备A的数据拷贝到外部设备B,只要外部设备A和外部设备B提供一条DMA数据通道,就能够直接将数据由外部设备A拷贝到外部设备B,该过程不需要经过CPU的处理。因此,DMA能够在一定程度上解放CPU,对于实现高效嵌入式系统与加速网络数据处理有极其重要的作用。Among them, data transfer (especially large amounts of data transfer) does not require CPU involvement. For example, if you want to copy data from external device A of a computer device to external device B, as long as external device A and external device B provide a DMA data channel, the data can be directly copied from external device A to external device B, and this process does not require CPU processing. Therefore, DMA can liberate the CPU to a certain extent, which plays an extremely important role in realizing efficient embedded systems and accelerating network data processing.
在实现DMA传输时,由DMA控制器直接掌管总线,因此,存在着一个总线控制权转移问题。即DMA传输前,CPU要把总线控制权交给DMA控制器,而在结束DMA传输后,DMA控制器应把总线控制权再交回给CPU。一个完整的DMA传输过程通常需要经过DMA请求、DMA响应、DMA传输、DMA结束4个步骤。When implementing DMA transmission, the DMA controller directly controls the bus, so there is a problem of bus control transfer. That is, before DMA transmission, the CPU must hand over the bus control to the DMA controller, and after the DMA transmission ends, the DMA controller should hand over the bus control back to the CPU. A complete DMA transmission process usually requires four steps: DMA request, DMA response, DMA transmission, and DMA end.
图1示出了一种基础的DMA系统的示意图,如图1所示:FIG1 shows a schematic diagram of a basic DMA system, as shown in FIG1 :
1)处理器向DMA系统下发指令,指示DMA系统执行数据搬运操作,DMA系统与总线执行直接的数据搬运的交互,实现一段一维结构数据从一个存储地址搬运到另一个存储地址的操作;1) The processor sends instructions to the DMA system, instructing the DMA system to perform data transfer operations. The DMA system and the bus perform direct data transfer interactions to achieve the operation of transferring a one-dimensional structured data from one storage address to another storage address;
2)DMA系统由控制器、读数据接口、先进先出(First Input First Output,FIFO)缓存和写数据接口组成。其中,控制器负责接收处理器下发的指令并解析,指示读数据接口向总线产生读数据请求,并指示写数据接口向总线产生写数据请求;FIFO缓存则用于读数据操作及写数据操作异步执行的数据缓存;2) The DMA system consists of a controller, a read data interface, a First Input First Output (FIFO) cache, and a write data interface. The controller is responsible for receiving and parsing instructions from the processor, instructing the read data interface to generate a read data request to the bus, and instructing the write data interface to generate a write data request to the bus; the FIFO cache is used for data caching for asynchronous execution of read and write data operations;
3)处理器向DMA系统一条条地下发搬运指令,指示DMA系统执行搬运操作,一个DMA系统一次只对应一条指令。3) The processor sends transfer instructions to the DMA system one by one, instructing the DMA system to perform the transfer operation. One DMA system only corresponds to one instruction at a time.
图2示出了一种支持分散-聚集(Scatter-Gather)离散搬运的DMA系统的示意图,如图2所示:FIG. 2 shows a schematic diagram of a DMA system supporting scatter-gather discrete transport, as shown in FIG. 2 :
1)该方案中DMA系统用于批量搬运多个离散的数据块,降低处理器干预的次数,从而提高搬运效率,主要运用于碎片数据的搬运场景;1) In this solution, the DMA system is used to batch transfer multiple discrete data blocks, reducing the number of processor interventions, thereby improving transfer efficiency. It is mainly used in the transfer scenario of fragmented data;
2)该方案中将一串的搬运指令定义为描述符,处理器预先将指令批量地下发进DMA系统的描述符链表中,并以链表的结构表达指令间的次序关系;2) In this scheme, a series of transport instructions are defined as descriptors. The processor sends the instructions in batches to the descriptor linked list of the DMA system in advance, and expresses the order relationship between the instructions in a linked list structure;
3)处理器指示DMA系统工作的指令由明确的搬运指令变为触发指令,触发DMA系统从描述符链表中的某个位置开始执行。3) The instructions that the processor uses to instruct the DMA system to work are changed from clear transfer instructions to trigger instructions, which trigger the DMA system to start execution from a certain position in the descriptor list.
图3示出了一种DMA指令或描述符的格式示意图。FIG. 3 shows a schematic diagram of the format of a DMA instruction or descriptor.
如上述图1和图2所示,处理器下发的DMA指令或描述符,可以指示DMA直接发起数据读写操作。它们的共同特点是,均只能处理简单的一维数据结构的搬运,其中,上述一维结构数据是指通过起始点(比如起始地址)和长度指示的数据块。也就是说,上述一维结构数据是地址连续的数据。 As shown in Figures 1 and 2 above, the DMA instructions or descriptors issued by the processor can instruct the DMA to directly initiate data read and write operations. Their common feature is that they can only handle the transfer of simple one-dimensional data structures, where the one-dimensional structure data refers to a data block indicated by a starting point (such as a starting address) and a length. In other words, the one-dimensional structure data is data with continuous addresses.
基于上述图1和图2所示的DMA系统结构,请参考图4,其示出了一种DMA实现数据搬运的流程图,如图4所示,上述图1和图2所示的DMA系统结构实现数据搬运的步骤都是一致的:Based on the DMA system structures shown in FIG. 1 and FIG. 2 , please refer to FIG. 4 , which shows a flow chart of DMA implementing data transfer. As shown in FIG. 4 , the steps of implementing data transfer in the DMA system structures shown in FIG. 1 and FIG. 2 are consistent:
步骤401,接收指令,根据DMA系统的状态判断是否执行指令;Step 401, receiving an instruction and determining whether to execute the instruction according to the state of the DMA system;
步骤402,在确定执行指令的情况下,根据指令中的读地址及数据大小向总线发起读操作,如高级可扩展接口(Advanced eXtensible Interface,AXI)读传输、高级外围总线(Advanced Peripheral Bus,APB)、高级高性能总线(Advanced High-performance Bus,AHB)、AXI缓存扩展接口(AXI Coherence Extension,ACE)等读操作;Step 402, when it is determined that the instruction is to be executed, a read operation is initiated to the bus according to the read address and data size in the instruction, such as Advanced eXtensible Interface (AXI) read transmission, Advanced Peripheral Bus (APB), Advanced High-performance Bus (AHB), AXI Coherence Extension (ACE) and other read operations;
步骤403,接收总线返回的数据并缓存入FIFO;Step 403, receiving the data returned by the bus and buffering it into FIFO;
步骤404,根据指令中的写地址及数据大小向总线发起写操作,如AXI写传输、APB、AHB、ACE等写操作;Step 404, initiating a write operation to the bus according to the write address and data size in the instruction, such as AXI write transfer, APB, AHB, ACE, etc. write operation;
步骤405,将FIFO中的数据写入总线。Step 405, write the data in the FIFO into the bus.
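Steps 401 to 405 above can be sketched as follows. This is an illustrative software model only (not part of the claimed embodiments): the bus is modeled as a flat address-to-word mapping, and the function name `dma_1d_transfer` is hypothetical.

```python
from collections import deque

def dma_1d_transfer(bus, read_addr, write_addr, size):
    """Model of steps 402-405: read `size` words starting at `read_addr`
    into a FIFO, then drain the FIFO onto the bus starting at `write_addr`.
    `bus` is modeled as a dict mapping address -> word."""
    fifo = deque()
    for i in range(size):
        fifo.append(bus[read_addr + i])      # steps 402-403: bus read, buffer into FIFO
    for i in range(size):
        bus[write_addr + i] = fifo.popleft() # steps 404-405: write FIFO contents to the bus

# Example: copy 4 words from address 0 to address 4
bus = {addr: addr * 10 for addr in range(8)}
dma_1d_transfer(bus, read_addr=0, write_addr=4, size=4)
```

The FIFO decouples the read side from the write side, matching the asynchronous read/write behavior described for the cache above.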
然而,上述图1和图2所示的DMA系统结构,是针对一维结构数据执行搬运的DMA系统结构,也就是说,需要预先由处理器确定要搬迁的一维结构数据的读写地址和数据长度,然后通知给DMA系统。由于处理器确定要搬迁的一维结构数据的读写地址和数据长度的过程是通过程序代码来实现的,对于一条一维结构数据的搬运指令的生成过程,需要消耗较长的时间,因此,处理器下发数据搬运的指令的效率,已经成为影响数据搬运效率的瓶颈。However, the DMA system structure shown in FIG. 1 and FIG. 2 is a DMA system structure for performing transfer of one-dimensional structure data, that is, the processor needs to determine the read/write address and data length of the one-dimensional structure data to be transferred in advance, and then notify the DMA system. Since the process of the processor determining the read/write address and data length of the one-dimensional structure data to be transferred is implemented through program code, it takes a long time to generate a transfer instruction for a one-dimensional structure data. Therefore, the efficiency of the processor issuing data transfer instructions has become a bottleneck affecting the data transfer efficiency.
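The software decomposition described above can be sketched as follows, purely for illustration (the function and field names are hypothetical): for an n-row block, the processor must compute and emit one 1D instruction per row, which is the per-instruction cost that becomes the bottleneck.

```python
def decompose_2d(read_base, write_base, rows, row_len, src_stride, dst_stride):
    """Processor-side decomposition of an n x m block into n one-dimensional
    (read_addr, write_addr, data_size) transfer instructions -- the work the
    macro-instruction DMA system of this application moves into hardware."""
    instrs = []
    for r in range(rows):
        instrs.append({
            "read_addr":  read_base  + r * src_stride,  # rows strided in the source
            "write_addr": write_base + r * dst_stride,  # rows packed in the destination
            "data_size":  row_len,
        })
    return instrs

# Example: a 3 x 16 block whose source rows are 64 addresses apart
instrs = decompose_2d(0x1000, 0x2000, rows=3, row_len=16,
                      src_stride=64, dst_stride=16)
```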
请参考图5,其示出了本申请一个示例性实施例提供的直接内存访问系统的结构示意图,该系统可以包括:宏指令存储器501、宏指令控制器502、宏指令解析引擎503以及数据读写传输通路504。Please refer to FIG. 5 , which shows a schematic diagram of the structure of a direct memory access system provided by an exemplary embodiment of the present application. The system may include: a macro instruction memory 501 , a macro instruction controller 502 , a macro instruction parsing engine 503 , and a data read and write transmission path 504 .
其中,宏指令存储器501和宏指令控制器502分别与计算机设备中的处理器相连,且宏指令存储器501与宏指令控制器502相连。The macro instruction memory 501 and the macro instruction controller 502 are respectively connected to the processor in the computer device, and the macro instruction memory 501 is connected to the macro instruction controller 502 .
其中,上述处理器可以是标量处理器。The above processor may be a scalar processor.
上述处理器可以是通过运行计算机指令或代码来实现运算功能的计算器件,比如,上述处理器可以是计算机设备的中央处理器CPU等。The processor may be a computing device that implements computing functions by running computer instructions or codes. For example, the processor may be a central processing unit (CPU) of a computer device.
上述宏指令存储器501和宏指令控制器502分别与计算机设备中的处理器相连,是指宏指令存储器501和宏指令控制器502分别与计算机设备中的处理器电性相连,宏指令存储器501和宏指令控制器502可以分别接收处理器发送的指令/信号/数据。The above-mentioned macroinstruction memory 501 and macroinstruction controller 502 are respectively connected to the processor in the computer device, which means that the macroinstruction memory 501 and macroinstruction controller 502 are respectively electrically connected to the processor in the computer device, and the macroinstruction memory 501 and macroinstruction controller 502 can respectively receive instructions/signals/data sent by the processor.
上述宏指令存储器501与宏指令控制器502相连,是指宏指令存储器501与宏指令控制器502电性相连,宏指令控制器502可以从宏指令存储器501读取指令/数据。The above-mentioned macroinstruction memory 501 is connected to the macroinstruction controller 502 , which means that the macroinstruction memory 501 is electrically connected to the macroinstruction controller 502 , and the macroinstruction controller 502 can read instructions/data from the macroinstruction memory 501 .
上述宏指令存储器501可以是随机存取记忆体(Random Access Memory,RAM),比如电阻式随机存取记忆体(Resistance Random Access Memory,ReRAM)或者动态随机存取存储器(Dynamic Random Access Memory,DRAM)等等。The above-mentioned macroinstruction memory 501 can be a random access memory (Random Access Memory, RAM), such as resistance random access memory (ReRAM) or dynamic random access memory (DRAM), etc.
宏指令控制器502与宏指令解析引擎503相连,宏指令解析引擎503与数据读写传输通路504相连,数据读写传输通路504与计算机设备的总线相连。The macro instruction controller 502 is connected to the macro instruction parsing engine 503, the macro instruction parsing engine 503 is connected to the data reading and writing transmission path 504, and the data reading and writing transmission path 504 is connected to the bus of the computer device.
其中,上述宏指令控制器502与宏指令解析引擎503相连,可以是指宏指令控制器502与宏指令解析引擎503电性相连,宏指令控制器502可以向宏指令解析引擎503发送指令/数据。The above-mentioned macro instruction controller 502 is connected to the macro instruction parsing engine 503 , which may mean that the macro instruction controller 502 is electrically connected to the macro instruction parsing engine 503 , and the macro instruction controller 502 can send instructions/data to the macro instruction parsing engine 503 .
上述宏指令解析引擎503与数据读写传输通路504相连,是指宏指令解析引擎503与数据读写传输通路504电性相连,宏指令解析引擎503可以向数据读写传输通路504发送指令/数据。The above-mentioned macroinstruction parsing engine 503 is connected to the data read/write transmission path 504 , which means that the macroinstruction parsing engine 503 is electrically connected to the data read/write transmission path 504 , and the macroinstruction parsing engine 503 can send instructions/data to the data read/write transmission path 504 .
上述数据读写传输通路504与计算机设备的总线相连，是指数据读写传输通路504与计算机设备的总线接口电性相连，数据读写传输通路504可以从计算机设备的总线读取数据（或者说，通过总线从计算机设备的其它存储器/与计算机设备相连的存储器读取数据），以及，向计算机设备的总线写入数据（或者说，通过总线向计算机设备的其它存储器/与计算机设备相连的存储器写入数据）。The above-mentioned data read/write transmission path 504 being connected to the bus of the computer device means that the data read/write transmission path 504 is electrically connected to the bus interface of the computer device; the data read/write transmission path 504 can read data from the bus of the computer device (that is, read data through the bus from other memory of the computer device or memory connected to the computer device), and write data to the bus of the computer device (that is, write data through the bus to other memory of the computer device or memory connected to the computer device).
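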
宏指令存储器501,用于存储处理器下发的宏指令;宏指令用于指示搬运多维结构数据。The macroinstruction memory 501 is used to store the macroinstructions issued by the processor; the macroinstructions are used to instruct the transport of multi-dimensional structured data.
在本申请实施例中,当需要搬运复杂结构(也就是多维结构)的数据时,计算机设备的处理器可以不需要对多维结构数据执行拆解,而是直接生成指示该多维结构数据的读写地址以及数据量等信息的宏指令,并将该宏指令下发给宏指令存储器501,由宏指令存储器501存储。In an embodiment of the present application, when it is necessary to move data of a complex structure (that is, a multi-dimensional structure), the processor of the computer device may not need to disassemble the multi-dimensional structure data, but directly generate a macro instruction indicating information such as the read and write address and data volume of the multi-dimensional structure data, and send the macro instruction to the macro instruction memory 501 for storage by the macro instruction memory 501.
宏指令控制器502,用于接收处理器下发的触发指令,根据触发指令从宏指令存储器501中读取宏指令,并将宏指令发送给宏指令解析引擎503。The macro instruction controller 502 is used to receive the trigger instruction sent by the processor, read the macro instruction from the macro instruction storage 501 according to the trigger instruction, and send the macro instruction to the macro instruction parsing engine 503.
其中,计算机设备需要搬运某一多维结构数据时,可以通过触发指令触发宏指令控制器502指定对应该多维结构数据的搬运操作的宏指令。When the computer device needs to transfer certain multi-dimensional structure data, the trigger instruction can be used to trigger the macro instruction controller 502 to specify a macro instruction corresponding to the transfer operation of the multi-dimensional structure data.
其中,该触发指令可以指示宏指令存储器501中已存储的宏指令,比如,该触发指令可以包含宏指令存储器501中已存储的宏指令的地址。The trigger instruction may indicate a macro instruction stored in the macro instruction memory 501 . For example, the trigger instruction may include an address of a macro instruction stored in the macro instruction memory 501 .
在一些实施例中,一条触发指令可以指示宏指令控制器502提取单条宏指令,或者,一条触发指令可以指示宏指令控制器502提取多条宏指令。In some embodiments, one trigger instruction may instruct the macroinstruction controller 502 to extract a single macroinstruction, or one trigger instruction may instruct the macroinstruction controller 502 to extract multiple macroinstructions.
当一条触发指令指示宏指令控制器502提取多条宏指令时,该触发指令中可以包含多条宏指令各自的地址;宏指令控制器502按照多条宏指令各自的地址分别提取多条宏指令,该方案对多条宏指令在宏指令存储器501中的存储位置没有限制,可以扩展多条宏指令的指示场景。When a trigger instruction instructs the macroinstruction controller 502 to extract multiple macroinstructions, the trigger instruction can include the addresses of the multiple macroinstructions respectively; the macroinstruction controller 502 extracts the multiple macroinstructions respectively according to the addresses of the multiple macroinstructions. This solution has no restrictions on the storage locations of the multiple macroinstructions in the macroinstruction memory 501, and can expand the indication scenarios of multiple macroinstructions.
或者,如果该多条宏指令是在宏指令存储器501中连续存储的宏指令,则该触发指令中可以包含多条宏指令的起始地址和指令条数,宏指令控制器502可以从起始地址开始,依次读取多条宏指令,该方案中,触发指令只需要携带一个地址和一个指令条数,即可以指示宏指令控制器502提取多条宏指令,能够节约触发指令的传输资源,提高宏指令的指示效率。Alternatively, if the multiple macroinstructions are macroinstructions stored continuously in the macroinstruction memory 501, the trigger instruction may include the starting addresses and instruction numbers of the multiple macroinstructions, and the macroinstruction controller 502 may read the multiple macroinstructions in sequence starting from the starting address. In this scheme, the trigger instruction only needs to carry an address and an instruction number, that is, it can instruct the macroinstruction controller 502 to extract multiple macroinstructions, which can save the transmission resources of the trigger instruction and improve the indication efficiency of the macro instruction.
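The two trigger-instruction forms described above can be sketched as follows; this is an illustrative model (the slot-size constant and function names are hypothetical), showing that an address list supports arbitrarily placed macro instructions, while a (start address, count) pair suffices for consecutively stored ones.

```python
MACRO_SLOT_SIZE = 1  # hypothetical: one macro instruction per addressable slot

def fetch_by_list(macro_mem, addr_list):
    """Trigger form 1: the trigger instruction carries each macro
    instruction's address explicitly, so placement is unconstrained."""
    return [macro_mem[a] for a in addr_list]

def fetch_consecutive(macro_mem, start_addr, count):
    """Trigger form 2: the trigger instruction carries only a start
    address and an instruction count; macro instructions must be
    stored consecutively."""
    return [macro_mem[start_addr + i * MACRO_SLOT_SIZE] for i in range(count)]

# Example macro instruction memory: three consecutive entries plus one isolated entry
macro_mem = {0: "macro_A", 1: "macro_B", 2: "macro_C", 7: "macro_D"}
```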
宏指令解析引擎503,用于将宏指令解析为微操作码,并将微操作码发送给数据读写传输通路504;微操作码用于指示搬运多维结构数据中的一维结构数据。The macroinstruction parsing engine 503 is used to parse the macroinstructions into micro-operation codes and send the micro-operation codes to the data read/write transmission path 504; the micro-operation codes are used to instruct the transfer of one-dimensional structure data in the multi-dimensional structure data.
在本申请实施例中,宏指令解析引擎503,可以通过硬件电路来解析宏指令,以将一条宏指令解析为多条微操作码,并将微操作码发送给数据读写传输通路504。In the embodiment of the present application, the macroinstruction parsing engine 503 can parse the macroinstructions through hardware circuits to parse one macroinstruction into multiple micro-operation codes, and send the micro-operation codes to the data read and write transmission path 504.
其中,多维结构数据是由多条一维结构数据组成的数据。其中,该多维结构数据的地址可以是连续的;或者,上述多维结构数据的地址也可以是不连续的,比如,组成多维结构数据的多条一维结构数据中,每条一维结构数据的地址是连续的,不同的一维结构数据的地址是不连续的。The multidimensional structure data is composed of multiple one-dimensional structure data. The addresses of the multidimensional structure data may be continuous; or the addresses of the multidimensional structure data may be discontinuous, for example, among the multiple one-dimensional structure data constituting the multidimensional structure data, the addresses of each one-dimensional structure data are continuous, and the addresses of different one-dimensional structure data are discontinuous.
比如,假设多维结构数据是一个n×m的二维结构数据,其可以视为n个一维结构数据组成的矩阵,每个一维结构数据的长度为m;或者,该n×m的二维矩阵数据也可以视为m个一维结构数据组成的矩阵,每个一维结构数据的长度为n。For example, assuming that the multidimensional structure data is an n×m two-dimensional structure data, it can be regarded as a matrix composed of n one-dimensional structure data, and the length of each one-dimensional structure data is m; or, the n×m two-dimensional matrix data can also be regarded as a matrix composed of m one-dimensional structure data, and the length of each one-dimensional structure data is n.
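The two views of an n×m block can be made concrete for the common row-major storage case. This sketch is illustrative only: in row-major layout, each of the n rows is itself a contiguous 1D block of length m, whereas the elements of a column are strided by m and are therefore not address-contiguous.

```python
def row_segments(base_addr, n, m):
    """View a row-major n x m block as n contiguous 1-D segments,
    each described by (start address, length)."""
    return [(base_addr + i * m, m) for i in range(n)]

def col_segments(base_addr, n, m):
    """View the same block as m columns; element addresses within a
    column are strided by m, so a column is NOT a contiguous 1-D block."""
    return [[base_addr + i * m + j for i in range(n)] for j in range(m)]

rows = row_segments(0x100, 3, 0x10)  # three contiguous segments of 16 elements
cols = col_segments(0x0, 2, 3)       # columns of a 2 x 3 block
```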
请参考图6,其示出了本申请涉及的硬件电路解析宏指令的一种举例示意图,如图6所示,一条二维数据块的填充指令,其指令配置包括:Please refer to FIG. 6 , which shows an example schematic diagram of a hardware circuit parsing macro instruction involved in the present application. As shown in FIG. 6 , a filling instruction of a two-dimensional data block, the instruction configuration of which includes:
基础配置:Basic configuration:
data_type=INT32,dimen_0_size=0xA,dimen_1_size=0x2,read_addr=0x0,write_addr=0x0;data_type=INT32, dimen_0_size=0xA, dimen_1_size=0x2, read_addr=0x0, write_addr=0x0;
扩展配置:Extended configuration:
op_type=pad_op,left_pad_size=0x1,right_pad_size=0x1,top_pad_size=0x1,buttom_pad_size=0x1,internal_pad_size=0x1;op_type=pad_op, left_pad_size=0x1, right_pad_size=0x1, top_pad_size=0x1, buttom_pad_size=0x1, internal_pad_size=0x1;
其中,0x表示16进制数值。Among them, 0x represents a hexadecimal value.
宏指令解析引擎503对这条二维数据块填充指令执行解析，解析为五条一维微操作码，即三条一维常数填充微操作码（micro operations，uOP）和两条一维搬运微操作码，具体包括：The macro instruction parsing engine 503 parses this two-dimensional data block fill instruction into five one-dimensional micro operation codes, namely three one-dimensional constant-filling micro operation codes (micro operations, uOP) and two one-dimensional transport micro operation codes, specifically including:
uOP 1:
uop_type=constant_filling, data_type=INT32, data_size=left_pad_size+dimen_0_size+right_pad_size+left_pad_size=0xD, read_addr=NA, write_addr=0x0;
uOP 2:
uop_type=read_write, data_type=INT32, data_size=dimen_0_size=0xA, read_addr=0x0, write_addr=uop1_write_addr+uop1_data_size=0xD;
uOP 3:
uop_type=constant_filling, data_type=INT32, data_size=left_pad_size+left_pad_size+dimen_0_size+right_pad_size+right_pad_size=0xE, read_addr=NA, write_addr=uop2_write_addr+uop2_data_size=0xD+0xA=0x17;
uOP 4:
uop_type=read_write, data_type=INT32, data_size=dimen_0_size=0xA, read_addr=uop2_read_addr+uop2_data_size=0xA, write_addr=uop3_write_addr+uop3_data_size=0x17+0xE=0x25;
uOP 5:
uop_type=constant_filling, data_type=INT32, data_size=right_pad_size+left_pad_size+dimen_0_size+right_pad_size=0xD, read_addr=NA, write_addr=uop4_write_addr+uop4_data_size=0x25+0xA=0x2F;
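The decomposition above can be checked with a short software sketch. This is an assumed model (the real parsing engine is a hardware circuit, and names such as `dimen_0_size` and `parse_pad_macro` mirror the patent's configuration fields rather than any defined API): each fill uOP covers the padding between two data rows, including any full padding rows.

```python
def parse_pad_macro(dimen_0_size, dimen_1_size, read_addr, write_addr,
                    left_pad, right_pad, top_pad, bottom_pad, internal_pad):
    """Split a 2-D pad macro instruction into 1-D fill/copy micro-ops."""
    padded_row = left_pad + dimen_0_size + right_pad  # width of one padded row
    uops = []
    w = write_addr
    # Top padding row(s) plus the left padding of the first data row.
    size = top_pad * padded_row + left_pad
    uops.append(("constant_filling", size, None, w)); w += size
    r = read_addr
    for row in range(dimen_1_size):
        uops.append(("read_write", dimen_0_size, r, w))  # copy one data row
        r += dimen_0_size; w += dimen_0_size
        if row < dimen_1_size - 1:
            # Right pad + internal padding row(s) + next row's left pad.
            size = right_pad + internal_pad * padded_row + left_pad
        else:
            # Right pad of the last data row + bottom padding row(s).
            size = right_pad + bottom_pad * padded_row
        uops.append(("constant_filling", size, None, w)); w += size
    return uops

# The FIG. 6 example: a 10x2 INT32 block with all pad sizes equal to 1.
uops = parse_pad_macro(0xA, 0x2, 0x0, 0x0, 1, 1, 1, 1, 1)
for kind, size, r, w in uops:
    print(kind, hex(size), None if r is None else hex(r), hex(w))
```

Running this reproduces the five uOPs above, with fill sizes 0xD, 0xE, 0xD and write addresses 0x0, 0xD, 0x17, 0x25, 0x2F.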
The macro instruction parsing engine 503 can parse macro instructions using hardware circuits: from the data structure, read/write addresses, and data size of the multi-dimensional structure data carried in the macro instruction, it determines the read/write addresses and data size of each piece of one-dimensional structure data (whose data structure is fixed at one dimension), thereby generating the corresponding micro operation codes.
The data read/write transfer path 504 is configured to read and write one-dimensional structure data on the bus according to the micro operation codes.
For example, a micro operation code contains the read address, write address, and data size of a piece of one-dimensional structure data, where the read address may be the starting address for reading and the write address may be the starting address for writing. The data read/write transfer path 504 first reads, over the bus and starting from the read address, one-dimensional structure data of the length indicated by the data size, and caches it within the path; it then writes the cached data sequentially over the bus starting from the write address, completing the transfer of the one-dimensional structure data from the read address to the write address. For the multiple micro operation codes parsed from one macro instruction, once all of them have been processed, the transfer of the multi-dimensional structure data corresponding to that macro instruction can be considered complete.
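The read-cache-write behaviour of one uOP can be modelled in a few lines. This is a toy software analogue under the assumption of a flat, word-addressable memory standing in for the bus; the tuple layout matches the sketch above.

```python
def execute_uop(memory, uop):
    """Execute one micro-op against a flat memory modelling the bus."""
    kind, data_size, read_addr, write_addr = uop
    if kind == "read_write":
        # Bus read: stage data_size units starting at read_addr in the cache.
        buf = memory[read_addr:read_addr + data_size]
    else:  # constant_filling: no bus read, the path supplies a constant
        buf = [0] * data_size
    # Bus write: drain the cached data sequentially from write_addr.
    memory[write_addr:write_addr + data_size] = buf

mem = list(range(16)) + [None] * 48
execute_uop(mem, ("read_write", 4, 0, 16))
print(mem[16:20])  # [0, 1, 2, 3] -- the four units copied from address 0
```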
In summary, in the scheme shown in this embodiment of the present application, what the processor issues to the DMA system is a macro instruction instructing the transfer of multi-dimensional structure data. The macro instruction controller in the DMA system sends the macro instruction to the macro instruction parsing engine; the macro instruction parsing engine parses it into micro operation codes, each instructing the transfer of a piece of one-dimensional structure data within the multi-dimensional structure data, and sends them to the data read/write transfer path, which performs the one-dimensional transfers accordingly. In this scheme, the processor directly instructs the DMA system to transfer the multi-dimensional structure data, and the macro instruction parsing engine inside the DMA system then decomposes that transfer into multiple one-dimensional transfers. Because the DMA system parses macro instructions far faster than the processor can decompose multi-dimensional transfer operations itself, this DMA system greatly reduces both the time the processor would need to generate one-dimensional transfer instructions and the configuration overhead of issuing those instructions to the DMA system, thereby improving data transfer efficiency.
With the DMA system structures shown in FIG. 1 or FIG. 2 above, the transfer of complex data structures in a computing system is generally first decomposed by the processor into individual transfers, each with a specific size and source/destination addresses (corresponding to the read and write addresses above), which are then issued to the DMA for execution. Such a system cannot satisfy the memory-compute bandwidth that heterogeneous processors (for example, AI processors or image/video coding chips) require for parallel data transfer, nor their need to transfer complex data structures.
Specifically, the DMA system structures shown in FIG. 1 or FIG. 2 can only handle the transfer of simple one-dimensional data structures, that is, data blocks described only by a starting point and a length. In today's popular AI inference and training processors, which face complex and diverse data structures, such a DMA cannot process the data directly; software must intervene, that is, the internal processor uses scalar instructions one by one to decompose the data structure into blocks of exact length with explicit read/write addresses before they can be issued to the DMA for transfer. This typically consumes a large amount of the processor's hardware resources and time, degrading the effective performance of the other application programs the processor is running.
Meanwhile, the powerful compute capability of artificial intelligence (AI) inference and training processors can only be exploited with very high data transfer bandwidth behind it. The more serious consequence of the problem above is that the conventional flow, in which the processor first decomposes the data structure and only then produces transfer instructions, takes too long, so the achieved data transfer bandwidth falls short of expectations; the bottleneck is not the DMA's transfer performance but the stage in which the processor decomposes the complex data structure and generates DMA instructions. The specific reason is that decomposing a complex data structure may be implemented by the processor as a program segment, which must pass through a series of stages such as instruction fetch, decode, issue, execution pipeline, write-back, and retirement, and must also handle dependencies between successive instructions. As a result, executing the program segment that decomposes a complex data structure and generates DMA instructions can take the processor dozens or even hundreds of clock cycles.
In contrast, the technical solution based on the embodiment shown in FIG. 5 of the present application provides a DMA system that can parse complex instruction sequences by itself and transfer data efficiently, and that supports data synchronization with the processor and other subsystems:
1) the DMA system can be applied in AI inference and training chips, for transferring complex data structures and in scenarios where high compute capability demands high transfer bandwidth;
2) the DMA system can also be applied in image codec chips, which face the same usage scenario as AI inference and training, namely the transfer of complex data structures.
Based on the embodiment shown in FIG. 5, please refer to FIG. 7, which shows a schematic structural diagram of a direct memory access system provided by another exemplary embodiment of the present application. As shown in FIG. 7, in the DMA system of FIG. 5, the data read/write transfer path 504 contains multiple processing units, including at least a read data interface unit 504a, a cache unit 504b, and a write data interface unit 504c.
The macro instruction parsing engine 503 is connected to the read data interface unit 504a and is configured to send the micro operation codes to the read data interface unit 504a;
the micro operation codes are passed sequentially from one processing unit to the next.
In this embodiment of the present application, the macro instruction parsing engine 503 being connected to the read data interface unit 504a means that the two are electrically connected, so that the macro instruction parsing engine 503 can send signals to the read data interface unit 504a and thereby deliver the micro operation codes to it.
In this embodiment of the present application, the multiple processing units in the data read/write transfer path 504 are connected in sequence. During a data transfer, starting from the read data interface unit 504a, that unit receives a micro operation code sent by the macro instruction parsing engine 503, reads one-dimensional structure data from the bus according to the micro operation code, and forwards the data together with the micro operation code to the next processing unit. Each processing unit, on receiving the one-dimensional structure data and micro operation code from its predecessor, passes them on to the next unit; if the micro operation code instructs the current unit to apply some processing to the data, the unit may apply that processing first and then forward the processed data together with the micro operation code. Finally, the write data interface unit 504c, on receiving the one-dimensional structure data and micro operation code from its predecessor, writes the data to the bus according to the micro operation code.
In this arrangement, the macro instruction parsing engine 503 only needs to send each micro operation code to the read data interface unit 504a, rather than to every processing unit that needs it, which reduces the design and wiring difficulty of the DMA system, lowers the bandwidth required between the macro instruction parsing engine 503 and the data read/write transfer path 504, and improves the efficiency with which micro operation codes are delivered to the path.
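The pass-along behaviour can be sketched as a chain of unit functions through which the data and the uOP travel together. This is purely illustrative (the units and the `double` flag are hypothetical); the point is that only the first unit receives the uOP from the parsing engine, and every later unit gets it from its predecessor.

```python
def run_pipeline(units, uop, data):
    """Each unit: fn(data, uop) -> data; the uOP travels with the data."""
    for unit in units:
        data = unit(data, uop)
    return data

# Hypothetical units: a pre-processing step the uOP may enable, and a
# pass-through cache stage.
def pre_process(data, uop):
    return [x * 2 for x in data] if uop.get("double") else data

def cache(data, uop):
    return list(data)

out = run_pipeline([pre_process, cache], {"double": True}, [1, 2, 3])
print(out)  # [2, 4, 6]
```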
Based on the embodiments shown in FIG. 5 or FIG. 7, in some embodiments the macro instruction parsing engine 503 is configured, when parsing a macro instruction into micro operation codes, to parse the data structure of the multi-dimensional structure data and determine the read/write addresses and data size of each piece of one-dimensional structure data within it, and to generate the micro operation codes from those read/write addresses and data sizes.
The macro instruction may carry information such as the read address, write address, data structure, and data size of the multi-dimensional structure data. Combining the data structure and data size, the macro instruction parsing engine 503 can determine the data size of each piece of one-dimensional structure data within the multi-dimensional structure data; combining this with the read and write addresses of the multi-dimensional structure data, it can then determine the read address and write address of each piece. The macro instruction parsing engine 503 can then generate micro operation codes containing the read address, write address, and data size of each piece of one-dimensional structure data, with each piece corresponding to one micro operation code.
The data structure of the multi-dimensional structure data may include its original data structure (also called the source data structure) and may further include its data structure after the transfer (also called the destination data structure). If only the source data structure is present, the data structure of the multi-dimensional structure data can be considered unchanged by the transfer; if a destination data structure is also present, the data structure is considered to change as a result of the transfer.
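When the source and destination structures differ, the per-piece addresses differ too. A minimal sketch of that derivation, with assumed field names (e.g. a destination row stride wider than the source row, as when copying a block into a larger padded buffer):

```python
def row_uops(n_rows, row_size, read_addr, write_addr, dst_row_stride):
    """Derive one read_write uOP per row; source rows are packed back to
    back, destination rows start every dst_row_stride units."""
    return [("read_write", row_size,
             read_addr + i * row_size,          # contiguous source rows
             write_addr + i * dst_row_stride)   # strided destination rows
            for i in range(n_rows)]

rows = row_uops(2, 0xA, 0x100, 0x200, 0x10)
for uop in rows:
    print(uop)
```

Here the second row is read from 0x10A but written to 0x210 because the destination layout reserves 0x10 units per row.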
Based on the embodiment shown in FIG. 7, please refer to FIG. 8, which shows a schematic structural diagram of a direct memory access system provided by yet another exemplary embodiment of the present application. As shown in FIG. 8, in the DMA system of FIG. 7, the macro instruction is further used to instruct format conversion processing of the multi-dimensional structure data, and the micro operation codes are further used to instruct format conversion processing of the one-dimensional structure data;
the multiple processing units further include at least one of a pre-format-conversion processing unit 504d and a post-format-conversion processing unit 504e, the pre-format-conversion processing unit 504d being located before the cache unit and the post-format-conversion processing unit 504e being located after it. Here, "before the cache unit" means that, in the order in which the multiple processing units are connected, the pre-format-conversion processing unit precedes the cache unit; correspondingly, "after the cache unit" means that the post-format-conversion processing unit follows the cache unit in that order.
The pre-format-conversion processing unit 504d is configured to apply pre-conversion processing to the one-dimensional structure data according to the format conversion method indicated by the micro operation code, and to write the resulting data into the cache unit 504b;
the post-format-conversion processing unit 504e is configured to apply post-conversion processing to the data in the cache unit 504b according to the format conversion method indicated by the micro operation code, and then send the result to the next processing unit.
Format conversion of data may include one or a combination of spatial transformations such as data type conversion, shifting, concatenation, removal, interleaving, mirroring, and padding. In traditional DMA systems these are usually implemented by the processor through combinations of program instructions; in this embodiment of the present application, at least one of the pre-format-conversion and post-format-conversion processing units is placed before and/or after the cache unit so that format conversion is performed by hardware circuits, improving the efficiency of data format conversion.
Some format conversion operations need to be completed in several steps. Therefore, in this embodiment of the present application, a "pre-processing, cache, post-processing" structure is provided in the data read/write transfer path 504 to carry out the format conversion step by step. For example, a "mirror + tail padding" conversion requires the pre-processing stage to segment and reverse a block of data and store it in the cache; the post-processing stage then reads the data out in reverse order and finally appends tail padding before outputting it downstream. There are many kinds of format conversion operations, and each may correspond to a different pre-format-conversion processing unit 504d or post-format-conversion processing unit 504e; in general, one DMA structure can support a limited set of conversion types.
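The two-stage split for the mirror + tail padding example can be sketched as follows. The exact semantics are an assumption (a single-segment mirror with zero tail padding): the pre-processing stage stages the data into the cache, the post-processing stage drains it in reverse and appends the padding.

```python
def mirror_pad(data, pad_value, pad_len):
    """Mirror a block then append tail padding, split into the two stages
    of the pre-processing -> cache -> post-processing structure."""
    cache = list(data)            # pre-processing: stage the data in the cache
    out = cache[::-1]             # post-processing: drain in reverse order
    out += [pad_value] * pad_len  # append tail padding before output
    return out

print(mirror_pad([1, 2, 3], 0, 2))  # [3, 2, 1, 0, 0]
```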
Please refer to FIG. 9, which shows an example of the format conversion involved in the present application. As shown in FIG. 9, the format conversion operation broadcasts a one-dimensional array (7, 8, 9) into a two-dimensional array ((7, 7, 7), (8, 8, 8), (9, 9, 9)); the hardware circuit involved includes at least the following circuit units: a multiplexer, a counter, and a replication unit.
The counter counts from 0 to N−1 (where N is the number of elements in the input one-dimensional array) and serves as the select signal of the multiplexer, which selects each element of the incoming data in turn; the replication unit then copies each selected element and writes the copies into the cache unit, the number of copies being the broadcast count configured by the instruction.
As shown in FIG. 9, broadcasting the one-dimensional array (7, 8, 9) into the two-dimensional array ((7, 7, 7), (8, 8, 8), (9, 9, 9)) is completed through one read operation and three copy operations, specifically:
Step 1: perform a read operation to feed the one-dimensional data (7, 8, 9) into the pre-format-conversion processing unit, then perform the first copy operation to obtain (7, 7, 7);
Step 2: perform the second copy operation to obtain ((7, 7, 7), (8, 8, 8));
Step 3: perform the third copy operation to finally obtain the two-dimensional array ((7, 7, 7), (8, 8, 8), (9, 9, 9)).
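A software analogue of the counter + multiplexer + replication circuit of FIG. 9 (an illustrative model, not the circuit itself) makes the three steps concrete: the counter drives the mux to select one element per step, and the replication unit writes the configured number of copies into the cache.

```python
def broadcast(data, times):
    """Broadcast a 1-D array into a 2-D array, one copy operation per step."""
    cache = []
    for i in range(len(data)):           # counter steps 0..N-1, driving the mux
        element = data[i]                # multiplexer selects one element
        cache.append([element] * times)  # replication unit emits `times` copies
    return cache

print(broadcast([7, 8, 9], 3))  # [[7, 7, 7], [8, 8, 8], [9, 9, 9]]
```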
In some embodiments, the macro instruction contains a first format conversion configuration, which configures format conversion of the multi-dimensional structure data; for example, the first format conversion configuration may indicate the manner in which the multi-dimensional structure data is to be format-converted.
The macro instruction parsing engine 503 is configured, when generating the micro operation codes from the read/write addresses and data sizes of the one-dimensional structure data, to generate a second format conversion configuration from the first; the second format conversion configuration configures format conversion of the one-dimensional structure data, for example by indicating the manner in which the one-dimensional structure data is to be format-converted.
The micro operation codes are then generated from the read/write addresses and data sizes of the one-dimensional structure data together with the second format conversion configuration.
In this embodiment of the present application, when the DMA system supports data format conversion, a format conversion configuration can be carried in the micro operation codes to support control over the conversion operation and improve the flexibility of its application.
To carry a format conversion configuration in the micro operation codes, it must be indicated in the macro instruction. Since the macro instruction targets the multi-dimensional structure data, its format conversion configuration likewise applies to the multi-dimensional structure data. Therefore, when parsing the macro instruction into micro operation codes, the macro instruction parsing engine 503 must derive the format conversion configuration for the one-dimensional structure data carried by the micro operation codes from the configuration set for the multi-dimensional structure data in the macro instruction; in other words, it updates the format conversion configuration set in the macro instruction to obtain the one-dimensional format conversion configuration of the micro operation codes.
In one possible implementation, when deriving the one-dimensional format conversion configuration of the micro operation codes from the multi-dimensional configuration in the macro instruction, the macro instruction parsing engine 503 can look up a pre-stored correspondence between first and second format conversion configurations to determine the second format conversion configuration corresponding to the first.
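One way to realise that pre-stored correspondence is a simple lookup table. The table contents below are entirely hypothetical (the patent does not enumerate the configurations); only the lookup mechanism itself is the point.

```python
# Hypothetical correspondence table: first (multi-dimensional) format
# conversion configuration -> second (one-dimensional) configuration.
FIRST_TO_SECOND = {
    "broadcast_2d": "replicate_row",
    "mirror_pad_2d": "reverse_then_pad_row",
}

def second_config(first_config):
    """Look up the per-row configuration for a macro-level configuration."""
    return FIRST_TO_SECOND[first_config]

print(second_config("broadcast_2d"))  # replicate_row
```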
In some embodiments, as shown in FIG. 8, in the DMA system of FIG. 7 the macro instruction is further used to instruct arithmetic processing of the multi-dimensional structure data, and the micro operation codes are further used to instruct arithmetic processing of the one-dimensional structure data;
the multiple processing units further include a data operation unit 504f;
the data operation unit is configured to apply arithmetic processing to the input data according to the operation indicated by the micro operation code, and then send the result to the next processing unit.
The data operations above may include data type conversion, for example converting a 32-bit integer (INT32) to FP32, and data computation, for example an arithmetic logic unit (Arithmetic and Logic Unit, ALU) performing addition, subtraction, multiplication, and division. These operations can be implemented by hardware circuits, allowing some data computation to be performed in the DMA system in advance so as to ease subsequent data processing, for example subsequent AI inference or training, or subsequent image encoding/decoding.
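A data operation unit of this kind can be sketched in software. The operation names are hypothetical; the INT32-to-FP32 conversion below shows the bit-level effect of the type conversion (the integer 1 becomes the FP32 bit pattern 0x3F800000), and the ALU helper applies an element-wise arithmetic operation.

```python
import struct

def int32_to_fp32_bits(x):
    """Return the FP32 bit pattern representing the INT32 value x."""
    return struct.unpack("<I", struct.pack("<f", float(x)))[0]

def alu(data, op, operand):
    """Apply one element-wise ALU operation selected by the uOP's config."""
    ops = {"add": lambda v: v + operand,
           "sub": lambda v: v - operand,
           "mul": lambda v: v * operand}
    return [ops[op](v) for v in data]

print(alu([1, 2, 3], "add", 10))   # [11, 12, 13]
print(hex(int32_to_fp32_bits(1)))  # 0x3f800000
```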
In some embodiments, the macro instruction and the micro operation codes each contain an operation configuration, which configures the arithmetic processing. For example, the operation configuration may indicate the processing to be performed, such as one or a combination of addition, subtraction, multiplication, and division.
In this embodiment of the present application, when the DMA system supports data operations, an operation configuration can be carried in the micro operation codes to support control over the operations and improve the flexibility of their application. For an operation configuration to be carried in the micro operation codes, it must be indicated in the macro instruction.
In some embodiments, as shown in FIG. 8, in the DMA system of FIG. 7 the macro instruction and micro operation codes further contain a synchronization configuration, which instructs a status check on a target address; the target address is at least one of the read address and the write address of the data;
the macro instruction controller 502 is configured, when sending a macro instruction to the macro instruction parsing engine 503, to check the status of the target address according to the synchronization configuration in the macro instruction, and to send the macro instruction to the macro instruction parsing engine 503 when the status of the target address satisfies a specified condition; the specified condition includes: when the target address includes a read address, the status of the read address is readable; when the target address includes a write address, the status of the write address is writable;
the read data interface unit 504a is configured to read the one-dimensional structure data from the read address when the target address includes a read address and the status of the read address is readable;
the write data interface unit 504c is configured to write data to the write address when the target address includes a write address and the status of the write address is writable.
The synchronization state mentioned above refers to the state of the source (the external device or memory from which the DMA is about to read) and of the sink (the external device or memory to which the DMA is about to write), that is, whether the source is readable and whether the sink is writable. Each state is generally represented by a 1-bit status flag. These status flags are stored in the processor and are globally visible; there are generally multiple groups of them, for example 32 groups, to satisfy the read/write requirements of multiple DMA channels or computing units. Optionally, each group of states may have an identifier (Identity Document, ID).
For example, suppose a macro instruction is configured as "ID0, check the read state, do not check the write state". Before the DMA system reads from the address corresponding to this macro instruction, it first checks whether ID0 is readable, and initiates the read operation only after confirming that it is. After the read completes, the DMA system notifies the processor to update the read state. For the write operation, since no check is configured, the write is initiated directly. With this processing, macro instructions whose data transfers can be executed immediately are issued first, improving the efficiency of data transfer.
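As an illustration, the check-before-read behavior described above can be sketched in software. This is a minimal model, not the hardware implementation; the 32-group status table and the function and variable names are assumptions for illustration only:

```python
# Minimal software model of the DMA synchronization check described above.
# The processor holds globally visible 1-bit status flags, e.g. 32 groups,
# each group having a readable bit and a writable bit.
NUM_GROUPS = 32
readable = [False] * NUM_GROUPS
writable = [True] * NUM_GROUPS

def try_read(sync_id, check_read=True):
    """Return True if the read may be issued for this sync group."""
    if check_read and not readable[sync_id]:
        return False          # source not ready: defer the macro instruction
    # ... issue the bus read here ...
    if check_read:
        readable[sync_id] = False   # notify processor to update the read state
    return True

# Macro instruction configured as "ID0, check read state, do not check write state":
readable[0] = False
assert try_read(0) is False      # source not yet readable -> read deferred
readable[0] = True
assert try_read(0) is True       # readable -> read issued, state consumed
assert readable[0] is False      # state updated after the read completes
```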
The above synchronization check can be performed separately in the macro instruction controller 502, the read data interface unit 504a, and the write data interface unit 504c. This requires that the synchronization configuration be carried in both the macro instruction and the micro-operation code. Accordingly, before issuing a macro instruction, reading data, or writing data, the macro instruction controller 502, the read data interface unit 504a, and the write data interface unit 504c can first check the state of the read address or write address and decide, based on that state, whether to perform the corresponding operation.
For the macro instruction controller 502, when the synchronization check fails, the current macro instruction can be stored in an instruction waiting queue and the next macro instruction processed instead. After waiting for a period of time, the synchronization check is performed again on the macro instructions in the waiting queue, which improves the concurrency of macro instruction processing.
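The park-and-retry scheduling just described can be sketched as a simple loop. This is illustrative only; the queue handling and the `sync_ok` callback are assumptions, not the controller's actual interface:

```python
from collections import deque

def schedule(macros, sync_ok):
    """Issue macros whose sync check passes; park the rest and retry later.

    `macros` is an iterable of macro IDs; `sync_ok(m)` models the
    synchronization check and may start returning True on a later retry.
    """
    wait_queue = deque()
    issued = []
    for m in macros:
        if sync_ok(m):
            issued.append(m)      # issue immediately
        else:
            wait_queue.append(m)  # park it and move on to the next macro
    # after waiting a while, re-check the parked macros
    for _ in range(len(wait_queue)):
        m = wait_queue.popleft()
        if sync_ok(m):
            issued.append(m)
        else:
            wait_queue.append(m)
    return issued, list(wait_queue)

# Macro 2 is not ready on the first pass but becomes ready on the retry.
ready_after = {1: 0, 2: 1, 3: 0}
def sync_ok(m):
    if ready_after[m] == 0:
        return True
    ready_after[m] -= 1
    return False

issued, waiting = schedule([1, 2, 3], sync_ok)
assert issued == [1, 3, 2]   # macro 2 is overtaken, then issued on retry
assert waiting == []
```

Note how macro 3 overtakes the stalled macro 2, which is exactly the concurrency gain the waiting queue provides.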
Please refer to Figure 10, which shows a structural diagram of a macro instruction provided by an exemplary embodiment of the present application. As shown in Figure 10, the present application provides a DMA macro instruction that expresses multi-dimensional complex data structures and data operations on them such as transfer, format conversion, and computation.
Apart from the read and write addresses, which, as in ordinary DMA instructions, describe the start and destination addresses of the transfer, the contents of the other fields are as follows.
1) Source data structure:
a. Source data block structure information, which includes the data type, such as half-precision floating point (Floating Point 16, FP16), 8-bit integer (8bit integer, INT8), or single-precision floating point (Floating Point 32, FP32);
b. Data dimension information, i.e., the sizes of dimension 0, dimension 1, dimension 2, dimension 3, and so on; that is, a data block is no longer described in the traditional unit of bytes, but by a data structure.
2) Sink data structure: the sink data structure information likewise includes data type and data dimension information, but its type and sizes may differ from those of the source. This expresses the operation of extracting part of a multi-dimensional data block from another multi-dimensional data block and performing data type conversion, which is particularly common in AI inference and training.
3) Format conversion operation configuration: including but not limited to matrix transposition, data broadcasting, data mirroring, data padding, data shifting, and data concatenation.
4) Computation operation configuration: including but not limited to data type conversion (e.g., FP32 to FP16), data accumulation, and activation (ReLU) operations.
5) Loop count: a single macro instruction can be executed in a loop one or more times.
6) Read step and write step: used together with the loop operation; the amounts by which the read and write addresses are stepped on each loop iteration.
7) Synchronization configuration: mainly includes the synchronization check enables for the source and the sink, the synchronization update enables, and the synchronization relationship ID.
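Taken together, fields 1) through 7) suggest a macro instruction layout along the following lines. This is a sketch only; the concrete field names, types, and widths are assumptions, not the patent's encoding:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataStructure:
    dtype: str            # e.g. "FP16", "INT8", "FP32"
    dims: List[int]       # sizes of dimension 0, 1, 2, 3, ...

@dataclass
class MacroInstruction:
    read_addr: int        # transfer start address
    write_addr: int       # transfer destination address
    src: DataStructure    # 1) source data structure
    dst: DataStructure    # 2) sink data structure (type/sizes may differ)
    fmt_ops: List[str] = field(default_factory=list)   # 3) e.g. transpose, broadcast
    calc_ops: List[str] = field(default_factory=list)  # 4) e.g. FP32->FP16, RELU
    loops: int = 1        # 5) loop count
    read_step: int = 0    # 6) read address step per loop
    write_step: int = 0   # 6) write address step per loop
    sync_check: bool = False   # 7) synchronization check enable
    sync_update: bool = False  # 7) synchronization update enable
    sync_id: int = 0           # 7) synchronization relationship ID

m = MacroInstruction(
    read_addr=0x1000, write_addr=0x8000,
    src=DataStructure("FP32", [4, 16]),
    dst=DataStructure("FP16", [4, 16]),
    calc_ops=["FP32->FP16"], loops=2, read_step=256, write_step=128,
)
assert m.src.dims == [4, 16] and m.loops == 2
```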
Based on the macro instruction shown in Figure 10, please refer to Figure 11, which shows a schematic diagram of the process of parsing the macro instruction to generate micro-operation codes. The process includes the following steps:
Step 1101: parse the loop operation of the macro instruction and update the read/write start addresses according to the read/write steps;
Step 1102: parse the multi-dimensional data structure (corresponding to the multi-dimensional structure data above), extract one-dimensional vectors (corresponding to the one-dimensional structure data above), and calculate the read/write address and size of each vector;
Step 1103: update the other format conversion configurations according to the extracted one-dimensional vectors;
Step 1104: generate the DMA micro-operation codes (uOPs).
In step 1103 above, updating the other format conversion configurations refers to updating the size and dimensions of the data involved in the format conversion, because at this point the complex data has been reduced from multi-dimensional to one-dimensional.
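Steps 1101–1104 can be illustrated with a small software model that walks the outer dimensions of a source block and emits one uOP per innermost contiguous row. This is a sketch under simple row-major and byte-stride assumptions; the function and parameter names are illustrative, and a real parsing engine would perform this in hardware:

```python
from itertools import product

def parse_macro(read_addr, write_addr, dims, elem_size,
                src_strides, dst_strides, loops=1,
                read_step=0, write_step=0):
    """Decompose a multi-dimensional transfer into 1-D uOPs.

    dims[-1] is the innermost (contiguous) dimension; strides are in bytes,
    one per outer dimension. Each uOP is (read address, write address, size).
    """
    uops = []
    row_bytes = dims[-1] * elem_size
    for i in range(loops):                       # step 1101: loop + address step
        r_base = read_addr + i * read_step
        w_base = write_addr + i * write_step
        for idx in product(*(range(d) for d in dims[:-1])):  # step 1102
            r = r_base + sum(k * s for k, s in zip(idx, src_strides))
            w = w_base + sum(k * s for k, s in zip(idx, dst_strides))
            uops.append((r, w, row_bytes))       # step 1104: emit the uOP
    return uops

# A 2x3-element FP32 block: source rows 256 bytes apart, packed at the sink.
uops = parse_macro(0x1000, 0x8000, dims=[2, 3], elem_size=4,
                   src_strides=[256], dst_strides=[12])
assert uops == [(0x1000, 0x8000, 12), (0x1000 + 256, 0x8000 + 12, 12)]
```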
Please refer to Figure 12, which shows a micro-operation code provided by an exemplary embodiment of the present application. As shown in Figure 12, the present application describes a DMA micro-operation code that is similar in content to the instructions or descriptors of a traditional DMA, consisting of "read/write address + data size", with newly added format conversion configuration and computation configuration fields, which are used to drive the format conversion pre-/post-processing modules and the data computation module, respectively.
Based on the micro-operation code shown in Figure 12, please refer to Figure 13, which shows a working timing diagram of the streaming data transfer path involved in the present application. As shown in Figure 13, the micro-operation code generated by the parsing engine only needs to be passed to the read data interface module at the first stage of the path. At each subsequent stage, the module extracts only the fields it needs; after processing, it passes the micro-operation code and the data (D) on to the next-stage module in step. This removes the dependency problem of traditional DMA, in which the control unit must deliver control words to each module in the data path individually and coordinate the working rhythm of all the modules.
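The field-extraction behavior of Figure 13 can be sketched as a chain of stages, each consuming only the uOP fields it needs and forwarding the uOP plus data unchanged. This is an illustrative software model; the stage and field names are assumptions:

```python
def read_stage(uop, bus):
    data = bus[uop["read_addr"]]          # uses only the read fields
    return uop, data

def convert_stage(uop, data):
    if uop.get("fmt") == "reverse":       # uses only the format field
        data = data[::-1]
    return uop, data

def write_stage(uop, data, bus):
    bus[uop["write_addr"]] = data         # uses only the write fields
    return uop

bus = {0x100: [1, 2, 3]}
uop = {"read_addr": 0x100, "write_addr": 0x200, "fmt": "reverse"}
# Each stage hands the uOP and data to the next; no central coordination.
uop, d = read_stage(uop, bus)
uop, d = convert_stage(uop, d)
write_stage(uop, d, bus)
assert bus[0x200] == [3, 2, 1]
```

Because every stage receives the full uOP alongside the data, no stage depends on a central controller to configure it, which is the property the timing diagram demonstrates.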
Based on the embodiments shown in Figure 5, Figure 7, or Figure 8, the above DMA system may further support multiple virtual channels so that multiple processors can schedule it simultaneously.
In some embodiments, based on the embodiment shown in Figure 8, please refer to Figure 14, which shows a structural diagram of a direct memory access system provided by yet another exemplary embodiment of the present application. As shown in Figure 14, in the DMA system shown in Figure 8, the macro instruction controller 502 is configured to receive, through multiple virtual channels 505, the macro instructions sent by multiple processors, and to send those macro instructions to the macro instruction parsing engine 503 in turn according to the priorities of the multiple virtual channels 505.
As shown in Figure 14, the DMA system of the embodiments of the present application can thus add virtual channels to support simultaneous use by multiple users. Multiple processors can issue instructions to the DMA system at the same time; each processor can register a DMA virtual channel, and the controller of the DMA system executes the corresponding trigger instructions in turn according to the priority of each virtual channel.
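Priority-ordered issue across virtual channels can be sketched as follows. This is purely illustrative; a real controller would arbitrate in hardware, and the channel representation is an assumption:

```python
import heapq

def arbitrate(channels):
    """Drain macro instructions from virtual channels in priority order.

    `channels` maps channel id -> (priority, [macro, ...]); a lower number
    means higher priority. Returns the macros in the order they would be
    issued to the parsing engine.
    """
    order = []
    heap = [(prio, ch) for ch, (prio, _) in channels.items()]
    heapq.heapify(heap)
    while heap:
        prio, ch = heapq.heappop(heap)
        order.extend(channels[ch][1])     # issue this channel's macros
    return order

channels = {
    "cpu0": (1, ["m0", "m1"]),   # registered by processor 0, higher priority
    "cpu1": (2, ["m2"]),         # registered by processor 1, lower priority
}
assert arbitrate(channels) == ["m0", "m1", "m2"]
```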
As shown in any of Figure 5, Figure 7, Figure 8, and Figure 14, the DMA system provided by the present application has the following characteristics.
1) In the present application, what the processor issues to the DMA system is a macro instruction, which describes a multi-dimensional complex data structure and data operations on it such as transfer, format conversion, and computation.
2) The processor can issue the DMA system's macro instructions in batches, writing them into the macro instruction memory of the DMA system.
3) The processor issues a trigger instruction to instruct the DMA system to start executing macro instructions. The content of the trigger instruction may be the starting execution address in the macro instruction memory of the DMA system and the number of instructions.
4) The DMA system can execute multiple macro instructions in batches. Since each macro instruction may have data dependencies on other components in the chip, each macro instruction can be configured with synchronization information. By checking these synchronization relationships, the DMA system decides on its own when to execute each instruction.
5) The macro instruction controller of the DMA system receives the processor's trigger instruction, reads the macro instruction from the macro instruction memory, performs pre-parsing, checks the synchronization relationship, and decides whether to issue the instruction to the parsing engine or store it in the waiting queue.
6) The DMA system has a built-in macro instruction parsing engine that parses complex instructions with hardware circuits, rather than by having a processor run a program to do so. The macro instruction parsing engine mainly handles the parsing of data structures, decomposing an abstract multi-dimensional data structure into multiple micro-operation codes recognizable by the downstream data path, i.e., "read/write address + data size + configuration".
7) In the data read/write transfer path of the DMA system of the present application, in addition to the read/write interface modules and the cache module of a traditional DMA, a format conversion module and a data computation module can be added, so that format conversion and some computation operations are performed along the data transfer path, achieving data processing without bandwidth loss.
8) The present application provides a streaming DMA system. After a module at one stage finishes processing, it passes both the payload (data or instruction) and the control word (synchronization relationship, instruction, or micro-operation code) to the module at the next stage, and can then accept a new processing operation; there is no dependency between modules at successive stages.
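Characteristic 7) — performing conversion and computation "along the path" — can be illustrated by fusing the operations into the copy itself, so the data is touched only once. This is a software analogy for the zero-bandwidth-loss claim; the operation set and function name are assumptions:

```python
import struct

def fused_copy(src_bytes, calc_ops):
    """Copy FP32 data while applying along-path operations in one pass."""
    vals = list(struct.unpack(f"<{len(src_bytes) // 4}f", src_bytes))
    for op in calc_ops:
        if op == "RELU":
            vals = [max(0.0, v) for v in vals]       # activation on the fly
        elif op == "FP32->FP16":
            # half-precision pack: data type conversion during the transfer
            return struct.pack(f"<{len(vals)}e", *vals)
    return struct.pack(f"<{len(vals)}f", *vals)

src = struct.pack("<4f", -1.0, 0.5, 2.0, -3.0)
out = fused_copy(src, ["RELU", "FP32->FP16"])
assert struct.unpack("<4e", out) == (0.0, 0.5, 2.0, 0.0)
```

A separate compute pass would read and write the block twice; folding the operation into the copy is what avoids the extra bandwidth.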
The overall processing flow of data transfer in the DMA system provided by the present application may be as shown in Figure 15. Referring to Figure 15, the specific flow is as follows:
Step 1501: receive a DMA trigger instruction;
Step 1502: read the DMA macro instruction according to the address carried in the trigger instruction;
Step 1503: pre-parse the macro instruction and perform the synchronization relationship check according to the synchronization configuration;
if synchronization fails, move the macro instruction to the instruction waiting queue and repeat the synchronization check until it succeeds;
if synchronization succeeds, continue to the next step;
Step 1504: parse the macro instruction into micro-operation codes;
Step 1505: issue a read operation to the bus according to the read address and data size in the instruction, such as an AXI read transfer or an APB, AHB, or ACE read operation;
Step 1506: perform format conversion pre-processing according to the format conversion configuration;
Step 1507: write the data and the micro-operation codes into the cache;
Step 1508: perform format conversion post-processing according to the format conversion configuration;
Step 1509: perform data computation operations according to the computation configuration;
Step 1510: issue a write operation to the bus according to the write address and data size in the instruction, such as an AXI write transfer or an APB, AHB, or ACE write operation;
finally, return to step 1503 and report instruction completion, instructing the controller to update the synchronization relationship.
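The flow of steps 1501–1510 can be exercised end to end with a toy model. Everything here — the bus dictionary, the single ReLU computation, and the field names — is an illustrative assumption, not the patent's interface:

```python
def run_dma(trigger, macro_mem, bus, sync):
    """Toy model of the Figure 15 flow for one macro instruction."""
    macro = macro_mem[trigger["addr"]]                 # steps 1501-1502
    wait_queue = []
    if not sync[macro["sync_id"]]["readable"]:         # step 1503: sync check
        wait_queue.append(macro)                       # sync failed: park it
        return wait_queue
    uop = {"read": macro["read_addr"], "write": macro["write_addr"],
           "size": macro["size"]}                      # step 1504: emit uOP
    data = bus[uop["read"]][:uop["size"]]              # step 1505: bus read
    cache = data                                       # steps 1506-1507: cache
    data = [max(0, v) for v in cache]                  # steps 1508-1509: RELU
    bus[uop["write"]] = data                           # step 1510: bus write
    sync[macro["sync_id"]]["readable"] = False         # completion report
    return wait_queue

bus = {0x100: [-1, 2, -3, 4]}
macro_mem = {0: {"read_addr": 0x100, "write_addr": 0x200,
                 "size": 4, "sync_id": 0}}
sync = {0: {"readable": True}}
assert run_dma({"addr": 0}, macro_mem, bus, sync) == []
assert bus[0x200] == [0, 2, 0, 4]
assert sync[0]["readable"] is False
```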
Please refer to Figure 16, which shows a flow chart of a data transfer method provided by an exemplary embodiment of the present application. The method is used in the above direct memory access system and is executed by the macro instruction controller in the direct memory access system. The method includes:
Step 1601: receive a trigger instruction issued by a processor;
Step 1602: read a macro instruction from the macro instruction memory according to the trigger instruction, the macro instruction being used to instruct the transfer of multi-dimensional structure data;
Step 1603: send the macro instruction to the macro instruction parsing engine, so that the macro instruction parsing engine parses the macro instruction into micro-operation codes and sends them to the data read/write transfer path, which reads and writes one-dimensional structure data on the bus according to the micro-operation codes; a micro-operation code is used to instruct the transfer of one-dimensional structure data within the multi-dimensional structure data.
In some embodiments, the macro instruction and the micro-operation code further contain a synchronization configuration, which is used to instruct a state check on a target address, the target address being at least one of the read address and the write address of the data;
the above sending of the macro instruction to the macro instruction parsing engine includes:
checking the state of the target address according to the synchronization configuration;
sending the macro instruction to the macro instruction parsing engine when the state of the target address satisfies a specified condition;
wherein the specified condition includes:
when the target address includes a read address, the state of the read address is readable;
when the target address includes a write address, the state of the write address is writable.
In some embodiments, receiving the trigger instruction issued by the processor includes:
receiving, through multiple virtual channels, macro instructions sent by multiple processors respectively;
and sending the macro instruction to the macro instruction parsing engine includes:
sending the macro instructions sent by the multiple processors to the macro instruction parsing engine in turn according to the priorities of the multiple virtual channels.
In summary, in the solution shown in the embodiments of the present application, what the processor issues to the DMA system is a macro instruction used to instruct the transfer of multi-dimensional structure data; the macro instruction controller in the DMA system sends the macro instruction to the macro instruction parsing engine; the macro instruction parsing engine parses the macro instruction into micro-operation codes used to instruct the transfer of the one-dimensional structure data within the multi-dimensional structure data, and sends them to the data read/write transfer path, which performs the transfer of the one-dimensional structure data according to the micro-operation codes. In this solution, the processor directly instructs the DMA system to transfer the multi-dimensional structure data, and the macro instruction parsing engine in the DMA system subsequently decomposes the transfer of the multi-dimensional structure data into multiple transfers of one-dimensional structure data. Since the DMA system parses macro instructions much faster than the processor can decompose the transfer of multi-dimensional structure data, the above DMA system can greatly reduce the time the processor needs to generate the instructions for transferring one-dimensional structure data, thereby improving the efficiency of data transfer.
Please refer to Figure 17, which shows a schematic diagram of a data transfer apparatus provided by an exemplary embodiment of the present application. The apparatus is used for the macro instruction controller in the above direct memory access system, and comprises:
a trigger instruction receiving module 1701, configured to receive a trigger instruction issued by a processor;
a macro instruction reading module 1702, configured to read a macro instruction from the macro instruction memory according to the trigger instruction, the macro instruction being used to instruct the transfer of multi-dimensional structure data;
a macro instruction sending module 1703, configured to send the macro instruction to the macro instruction parsing engine, so that the macro instruction parsing engine parses the macro instruction into micro-operation codes and sends them to the data read/write transfer path, which reads and writes one-dimensional structure data on the bus according to the micro-operation codes; a micro-operation code is used to instruct the transfer of one-dimensional structure data within the multi-dimensional structure data.
In some embodiments, the macro instruction and the micro-operation code further contain a synchronization configuration, which is used to instruct a state check on a target address, the target address being at least one of the read address and the write address of the data;
the macro instruction sending module 1703 is configured to check the state of the target address according to the synchronization configuration;
and to send the macro instruction to the macro instruction parsing engine when the state of the target address satisfies a specified condition;
wherein the specified condition includes:
when the target address includes a read address, the state of the read address is readable;
when the target address includes a write address, the state of the write address is writable.
In some embodiments, the trigger instruction receiving module 1701 is configured to receive, through multiple virtual channels, macro instructions sent by multiple processors respectively;
and the macro instruction sending module 1703 is configured to send the macro instructions sent by the multiple processors to the macro instruction parsing engine in turn according to the priorities of the multiple virtual channels.
In summary, in the solution shown in the embodiments of the present application, what the processor issues to the DMA system is a macro instruction used to instruct the transfer of multi-dimensional structure data; the macro instruction controller in the DMA system sends the macro instruction to the macro instruction parsing engine; the macro instruction parsing engine parses the macro instruction into micro-operation codes used to instruct the transfer of the one-dimensional structure data within the multi-dimensional structure data, and sends them to the data read/write transfer path, which performs the transfer of the one-dimensional structure data according to the micro-operation codes. In this solution, the processor directly instructs the DMA system to transfer the multi-dimensional structure data, and the macro instruction parsing engine in the DMA system subsequently decomposes the transfer of the multi-dimensional structure data into multiple transfers of one-dimensional structure data. Since the DMA system parses macro instructions much faster than the processor can decompose the transfer of multi-dimensional structure data, the above DMA system can greatly reduce the time the processor needs to generate the instructions for transferring one-dimensional structure data, thereby improving the efficiency of data transfer.
The embodiments of the present application further provide a computer device, which includes the direct memory access system shown in any of Figure 5, Figure 7, Figure 8, and Figure 14.
The embodiments of the present application further provide a chip, which includes the direct memory access system shown in any of Figure 5, Figure 7, Figure 8, and Figure 14. The chip may be a chip outside the processor that is used to perform the direct memory access process; for example, the chip may be implemented as a DMA controller, or the chip may include one or more DMA controllers, each containing the direct memory access system shown in any of Figure 5, Figure 7, Figure 8, and Figure 14.
The embodiments of the present application further provide a computer device, which may include the above chip, the chip including the direct memory access system shown in any of Figure 5, Figure 7, Figure 8, and Figure 14.
The embodiments of the present application further provide a computer-readable storage medium storing at least one program, the at least one program being loaded and executed by a controller to implement the above data transfer method; the controller is the macro instruction controller in the direct memory access system shown in any of Figure 5, Figure 7, Figure 8, and Figure 14.
Optionally, the computer-readable storage medium may include: a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a solid-state drive (Solid State Drive, SSD), an optical disc, or the like. The random access memory may include a resistive random access memory (Resistance Random Access Memory, ReRAM) and a dynamic random access memory (Dynamic Random Access Memory, DRAM). The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
The embodiments of the present application further provide a computer program product, including a computer program, which, when executed by a controller, implements the above data transfer method; the controller may be the macro instruction controller in the direct memory access system shown in any of Figure 5, Figure 7, Figure 8, and Figure 14.
Claims (19)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311177686.7A (CN116909628B) | 2023-09-13 | 2023-09-13 | Direct memory access system, data handling method, apparatus and storage medium |
| CN202311177686.7 | 2023-09-13 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025055399A1 true WO2025055399A1 (en) | 2025-03-20 |
Family
ID=88356935
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/097048 (WO2025055399A1, pending) | Direct memory access system, data transport method, device and storage medium | 2023-09-13 | 2024-06-03 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN116909628B (en) |
| WO (1) | WO2025055399A1 (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116909628B (en) * | 2023-09-13 | 2023-12-26 | 腾讯科技(深圳)有限公司 | Direct memory access system, data handling method, apparatus and storage medium |
| CN117971496B (en) * | 2024-03-14 | 2025-04-25 | 上海壁仞科技股份有限公司 | Operator task execution method, artificial intelligence chip and electronic device |
| CN118170702B (en) * | 2024-05-13 | 2024-08-30 | 北京壁仞科技开发有限公司 | DMA controller and data handling method for broadcasting |
| CN119292768B (en) * | 2024-09-13 | 2025-11-14 | 山东云海国创云计算装备产业创新中心有限公司 | A chip configuration method, apparatus, computer device, and storage medium |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4878174A (en) * | 1987-11-03 | 1989-10-31 | Lsi Logic Corporation | Flexible ASIC microcomputer permitting the modular modification of dedicated functions and macroinstructions |
| US20060161720A1 (en) * | 2005-01-17 | 2006-07-20 | Vimicro Corporation | Image data transmission method and system with DMAC |
| CN101329622A (en) * | 2008-02-08 | 2008-12-24 | 威盛电子股份有限公司 | Microprocessor and macro instruction execution method |
| CN108416431A (en) * | 2018-01-19 | 2018-08-17 | 上海兆芯集成电路有限公司 | Neural network microprocessor and macro instruction processing method |
| CN116909628A (en) * | 2023-09-13 | 2023-10-20 | 腾讯科技(深圳)有限公司 | Direct memory access system, data handling method, apparatus and storage medium |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113127177B (en) * | 2019-12-30 | 2023-11-14 | 澜起科技股份有限公司 | Processing device and distributed processing system |
| US20220171622A1 (en) * | 2020-11-27 | 2022-06-02 | Electronics And Telecommunications Research Institute | Multi-dimension dma controller and computer system including the same |
- 2023-09-13: CN application CN202311177686.7A filed; granted as patent CN116909628B (Active)
- 2024-06-03: PCT application PCT/CN2024/097048 filed; published as WO2025055399A1 (Pending)
Also Published As
| Publication number | Publication date |
|---|---|
| CN116909628B (en) | 2023-12-26 |
| CN116909628A (en) | 2023-10-20 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2025055399A1 (en) | Direct memory access system, data transport method, device and storage medium |
| US12235773B2 (en) | Two address translations from a single table look-aside buffer read |
| TWI747933B (en) | Hardware accelerators and methods for offload operations |
| US6272616B1 (en) | Method and apparatus for executing multiple instruction streams in a digital processor with multiple data paths |
| CN113743599B (en) | Computing device and server of convolutional neural network |
| US11403104B2 (en) | Neural network processor, chip and electronic device |
| CN1230740C (en) | Digital signal processing apparatus |
| US20220043770A1 (en) | Neural network processor, chip and electronic device |
| CN108205448B (en) | Streaming engine with selectable multidimensional circular addressing in each dimension |
| CN102804135A (en) | A data processing apparatus and method for handling vector instructions |
| TW200403583A (en) | Controlling compatibility levels of binary translations between instruction set architectures |
| US20140143524A1 (en) | Information processing apparatus, information processing apparatus control method, and a computer-readable storage medium storing a control program for controlling an information processing apparatus |
| CN112148251A (en) | System and method for skipping meaningless matrix operations |
| CN110806900A (en) | Memory access instruction processing method and processor |
| US12341534B2 (en) | Butterfly network on load data return |
| CN115640052A (en) | Multi-core multi-pipeline parallel execution optimization method for graphics processors |
| US20210166156A1 (en) | Data processing system and data processing method |
| CN117616407A (en) | Computing architecture |
| JP2884831B2 (en) | Processing equipment |
| WO2021115149A1 (en) | Neural network processor, chip and electronic device |
| TWI722009B (en) | Hardware mechanism for performing atomic actions on remote processors |
| US7143268B2 (en) | Circuit and method for instruction compression and dispersal in wide-issue processors |
| CN114327634A (en) | Apparatus and method for low latency decompression acceleration via a single job descriptor |
| JPH02306361A (en) | Microprocessor |
| JP2007034392A (en) | Information processor and data processing method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24864124; Country of ref document: EP; Kind code of ref document: A1 |