WO2022266842A1 - Multi-thread data processing method and apparatus - Google Patents
- Publication number
- WO2022266842A1 (application PCT/CN2021/101533)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- thread
- data
- threads
- source
- src1
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present application relates to the technical field of parallel computing, and in particular to a multi-thread data processing method and device.
- the traditional parallel processor solutions are divided into software solutions and hardware solutions.
- the software solution uses shared on-chip storage: the data is stored in the shared on-chip storage, the thread address is modified, and the data is then fetched back into the core registers, thereby realizing the exchange of data between threads.
- the software solution involves frequent memory access operations, resulting in inefficient execution and higher power consumption.
- the hardware solution generally uses a complex crossbar (cross network), in which the data of each output thread can come from any input thread, thereby achieving thread data exchange.
- hardware solutions require higher hardware costs.
- the present application provides a multi-thread data processing method and device, which can improve execution performance and realize cross-thread operations involved in parallel computing at a lower hardware cost.
- an embodiment of the present application provides a multi-threaded data processing method, the method including: acquiring a first operation instruction.
- the first operation instruction includes the following parameters: a first operation code, which indicates the data transfer mode between N threads, where N is an integer greater than or equal to 2; a first source operand, which indicates the first source data of the N threads; and a second source operand, which is used to determine the thread offset corresponding to the data transfer mode. The first source data of the N threads is moved according to the first operation instruction to obtain the moved first data on each of the N threads.
- the efficient cross-thread operation of a parallel computing processor is realized by a single instruction. This is simpler than a crossbar network, does not require frequent memory access, and can accelerate applications that operate across threads in a high-performance parallel computing processor at lower hardware or signaling overhead.
- the data transfer method is the first transfer method
- moving the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I1 to the thread numbered i; where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I1 = (i + SRC1) mod N; SRC1 represents the second source operand and is a positive integer.
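As an illustrative sketch only (not the patent's hardware), the first transfer method can be modeled in Python; the function name and the list representation of the threads' data are our own assumptions, with src holding the first source data of the N threads:

```python
def cross_down(src, src1):
    """Circular transfer: the thread numbered i receives the first source
    data of the thread numbered I1 = (i + src1) mod N."""
    n = len(src)
    return [src[(i + src1) % n] for i in range(n)]

# N = 8 threads, thread offset SRC1 = 3: thread 0 receives the data of
# thread 3, and thread 5 wraps around and receives the data of thread 0.
moved = cross_down([10, 11, 12, 13, 14, 15, 16, 17], 3)
# moved == [13, 14, 15, 16, 17, 10, 11, 12]
```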
- the data transfer method is the second transfer method
- moving the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I2 to the thread numbered i; where the N threads are numbered 0 to (N-1) and i ranges over 0 to (N-1); I2 is the XOR of i and SRC1, SRC1 represents the second source operand, and SRC1 is a positive integer.
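Likewise, the second (cross) transfer method can be sketched as an XOR-indexed exchange; again this is a hedged model, not the patent's implementation:

```python
def cross_butterfly(src, src1):
    """Cross transfer: the thread numbered i receives the first source
    data of the thread numbered I2 = i XOR src1, so threads exchange
    data pairwise (a butterfly pattern)."""
    n = len(src)
    return [src[i ^ src1] for i in range(n)]

# SRC1 = 1 exchanges neighbors within each pair; SRC1 = 2 exchanges pairs.
# cross_butterfly([10, 11, 12, 13], 1) == [11, 10, 13, 12]
# cross_butterfly([10, 11, 12, 13], 2) == [12, 13, 10, 11]
```

Applying the same offset twice restores the original order, which is what makes this transfer a mutual exchange between two threads.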
- the data transfer method is the third transfer method
- moving the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I3 to the thread numbered i; where the N threads are numbered 0 to (N-1) and i ranges over 0 to (N-1); the value of I3 is determined by SRC1 and n, SRC1 represents the second source operand and is a positive integer, and n is a positive integer that divides N.
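The exact expression for I3 does not survive in the text above. As a loudly-hedged assumption only, a common one-to-many pattern consistent with a group size n dividing N is to broadcast, within each group of n threads, the data of the thread at offset SRC1 inside the group, i.e. I3 = (i div n)·n + SRC1; the sketch below encodes only that assumption:

```python
def cross_broadcast(src, src1, n):
    """HYPOTHETICAL one-to-many transfer: within each group of n threads,
    every thread receives the data of the thread at offset src1 inside
    its own group, i.e. I3 = (i // n) * n + src1. This formula is an
    assumption, not taken from the patent text."""
    assert len(src) % n == 0 and 0 <= src1 < n
    return [src[(i // n) * n + src1] for i in range(len(src))]

# N = 8 threads in groups of n = 4, SRC1 = 2: each quad is filled with
# the data of its own element number 2.
moved = cross_broadcast([10, 11, 12, 13, 14, 15, 16, 17], 2, 4)
# moved == [12, 12, 12, 12, 16, 16, 16, 16]
```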
- the first operation instruction further includes a second operation code, and the second operation code is used to indicate an operation type; the method further includes: for a first thread among the N threads, executing the operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
- each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation.
- the moved first data on the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the computing operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the computing operation.
- the first operation instruction further includes a destination operand, where the destination operand is used to indicate a storage location of a corresponding operation result of the first thread.
- the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads. The method further includes: acquiring a second operation instruction, the second operation instruction including the following parameters: the first operation code; the second source operand; and a third source operand, the third source operand indicating the second source data of the N threads, the second source data coming from the remaining N consecutive threads in the parallel computing processor; and moving the second source data of the N threads according to the second operation instruction to obtain the moved second data on each of the N threads.
- the method further includes: exchanging the moved first data on a third thread with the moved second data on the third thread; where the third thread is the thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1) and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and r is less than (N - SRC1 mod N).
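This exchange rule is what turns two N-thread circular transfers into one circular transfer over all 2N threads: after both halves are moved by SRC1, exactly the threads whose moved data wrapped around the half boundary swap with each other. A Python sketch under that reading (function and variable names are ours, not the patent's):

```python
def rotate_2n(data, src1, n):
    """Compose a cyclic transfer over 2N threads out of two N-thread
    circular transfers (the first and second operation instructions)
    plus the exchange on the threads numbered r described above."""
    assert len(data) == 2 * n
    low, high = data[:n], data[n:]
    a = [low[(i + src1) % n] for i in range(n)]   # moved first data
    b = [high[(i + src1) % n] for i in range(n)]  # moved second data
    k = src1 % n
    # if SRC1 < N: exchange r in [N - SRC1, N); else r in [0, N - SRC1 mod N)
    swap = range(n - k, n) if src1 < n else range(0, n - k)
    for r in swap:
        a[r], b[r] = b[r], a[r]
    return a + b

# 2N = 16 threads moved by 3 behave like one 16-thread circular transfer:
# rotate_2n(list(range(16)), 3, 8) == [(i + 3) % 16 for i in range(16)]
```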
- an embodiment of the present application provides a multi-threaded data processing device, which includes: an instruction acquisition module, configured to acquire a first operation instruction, the first operation instruction including the following parameters: a first operation code, which indicates the data transfer mode between N threads, where N is an integer greater than or equal to 2; a first source operand, which indicates the first source data of the N threads; and a second source operand, which is used to determine the thread offset corresponding to the data transfer mode; and a processing module, configured to move the first source data of the N threads according to the first operation instruction to obtain the moved first data on each of the N threads.
- the efficient cross-thread operation of a parallel computing processor is realized by a single instruction. This is simpler than a crossbar network, does not require frequent memory access, and can accelerate applications that operate across threads in a high-performance parallel computing processor at lower hardware or signaling overhead.
- the data transfer method is a first transfer method
- the processing module is specifically configured to: move the first source data of the thread numbered I1 to the thread numbered i; where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I1 = (i + SRC1) mod N; SRC1 represents the second source operand and is a positive integer.
- the data transfer method is the second transfer method, and the processing module is specifically configured to: move the first source data of the thread numbered I2 to the thread numbered i; where the N threads are numbered 0 to (N-1) and i ranges over 0 to (N-1); I2 is the XOR of i and SRC1, SRC1 represents the second source operand, and SRC1 is a positive integer.
- the data transfer method is the third transfer method, and the processing module is specifically configured to: move the first source data of the thread numbered I3 to the thread numbered i; where the N threads are numbered 0 to (N-1) and i ranges over 0 to (N-1); the value of I3 is determined by SRC1 and n, SRC1 represents the second source operand and is a positive integer, and n is a positive integer that divides N.
- the first operation instruction further includes a second operation code, and the second operation code is used to indicate an operation type;
- the processing module is further configured to: for a first thread among the N threads, execute the operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
- each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation.
- the moved first data on the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the computing operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the computing operation.
- the first operation instruction further includes a destination operand, where the destination operand is used to indicate a storage location of a corresponding operation result of the first thread.
- the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads;
- the instruction acquisition module is further configured to acquire a second operation instruction, the second operation instruction including the following parameters: the first operation code; the second source operand; and a third source operand, the third source operand indicating the second source data of the N threads, the second source data coming from the remaining N consecutive threads in the parallel computing processor;
- the processing module is further configured to move the second source data of the N threads according to the second operation instruction to obtain the moved second data on each of the N threads.
- the processing module is further configured to: exchange the moved first data on a third thread with the moved second data on the third thread; where the third thread is the thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N-SRC1) and r is less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and r is less than (N - SRC1 mod N).
- the present application provides a communication device, including a processor coupled to a memory, the memory being used to store computer programs or instructions, and the processor being used to execute the computer programs or instructions to perform the various implementations of any one of the above first to fourth aspects.
- the memory may be located within the device or external to the device.
- the number of the processors is one or more.
- the present application also provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a computer, the computer executes the method described in the above first aspect and each possible implementation of the first aspect.
- the present application further provides a computer program product including instructions, which, when run on a computer, cause the computer to execute the method described in the above first aspect and each possible implementation manner of the first aspect.
- the present application also provides a computer chip, the chip being connected to a memory and used to read and execute a software program stored in the memory, performing the method of the above first aspect and each possible implementation of the first aspect.
- FIG. 1 is a schematic diagram of a circuit structure coupled by a crossover network.
- FIG. 2 is a schematic diagram of the shifting of elements inside a thread.
- FIG. 3 is a schematic diagram of a SIMD parallel computing processor system architecture provided by an embodiment of the present application.
- FIG. 4 is a schematic diagram of circular transfer provided by an embodiment of the present application.
- FIG. 5 is a schematic structural diagram of a CROSS-DOWN cross-thread processing unit provided by an embodiment of the present application.
- FIG. 6 is one of the schematic diagrams of thread flag bits provided by an embodiment of the present application.
- FIG. 7 is a schematic diagram of cross transfer provided by an embodiment of the present application.
- FIG. 8 is a schematic structural diagram of a CROSS-QUAD-BUTTERFLY cross-thread processing unit provided by an embodiment of the present application.
- FIG. 9 is one of the schematic diagrams of thread flag bits provided by an embodiment of the present application.
- FIG. 10 is a schematic diagram of one-to-many transfer provided by an embodiment of the present application.
- FIG. 11 is a schematic structural diagram of a CROSS-QUAD-BROADCAST cross-thread processing unit provided by an embodiment of the present application.
- FIG. 12 is one of the schematic diagrams of thread flag bits provided by an embodiment of the present application.
- FIG. 13 is one of the schematic diagrams of the cross-thread data processing flow provided by an embodiment of the present application.
- FIG. 14 is a schematic diagram of a source of source data provided by an embodiment of the present application.
- FIG. 15 is another schematic diagram of circular transfer provided by an embodiment of the present application.
- FIG. 16 is a schematic diagram of data exchange provided by an embodiment of the present application.
- FIG. 17 is one of the schematic diagrams of the cross-thread data processing flow provided by an embodiment of the present application.
- FIG. 18 is one of the schematic diagrams of the cross-thread data processing flow provided by an embodiment of the present application.
- FIG. 19 is a schematic flowchart of the multi-threaded data processing method provided by an embodiment of the present application.
- FIG. 20 is a schematic structural diagram of a multi-threaded data processing device provided by an embodiment of the present application.
- FIG. 21 is a schematic structural diagram of a communication device provided by an embodiment of the present application.
- thread data is read from the core registers by software and stored in memory such as shared on-chip storage. The thread address of the data is modified, and the data is fetched back into the core registers according to the modified thread address. In this way, the same thread address corresponds both to the original data read from the core register and to the data fetched back into it, realizing the exchange of data between threads.
- Such a method involves frequent memory access operations, resulting in low execution efficiency and high power consumption.
- each quadrant contains one vector processor (execution pipelines) and two crossbar (cross network) chips for performing cross-thread data movement operations.
- the four quadrants are respectively recorded as the first quadrant, the second quadrant, the third quadrant and the fourth quadrant.
- the first quadrant comprises vector processor 455, cross network chip 410A (also written cross bar 410A), and cross network chip 410B;
- the second quadrant comprises vector processor 460, cross network chip 420A, and cross network chip 420B;
- the third quadrant comprises vector processor 465, cross network chip 430A, and cross network chip 430B;
- the fourth quadrant comprises vector processor 470, cross network chip 440A, and cross network chip 440B.
- cross bars 410A, 410B, 420A, 420B, 430A, 430B, 440A, and 440B, together with the various vector processors, can realize cross-thread operation with a small number of threads; the combinations of cross bars used to achieve cross-thread operation with a larger number of threads are shown in Table 1 below.
- Table 1:
  Vector processor | Available cross bars
  455 | 410A, 420A, 430B, 440A
  460 | 410B, 420B, 430A, 440B
  465 | 410B, 420B, 430A, 440B
  470 | 410A, 420A, 430B, 440A
- Each of the aforementioned cross bars has 8 input channels and 8 output channels, that is, an 8 ⁇ 8 cross network.
- the combination of 4 cross bars can realize 16 input channels and 16 output channels.
- one cross-thread operation instruction can control the permutation of 16 channels.
- to perform a 32×32 permutation, two back-to-back cross-thread operation instructions can be used, that is, two cross-thread operation instructions consecutive in time.
- the two cross-thread operation instructions are recorded as the first permutation instruction and the second permutation instruction.
- the first permutation instruction controls the combined cross network to take 16 thread data as input, perform the permutation operation, output the result, and write it back to the vector register file.
- the second permutation instruction controls the combined cross network to take 16 thread data as input and output them after the permutation operation.
- the output of the first permutation instruction is then read back and combined with the output of the second permutation instruction to produce the final result of the 32×32 permutation.
- each cross bar is shared by two vector processors and can be used by only one of them at a time. Therefore, while one vector processor is using a cross bar, another vector processor that needs the same cross bar is blocked.
- in addition, two back-to-back cross-thread instructions must be used: the result of the first instruction needs to be written into a register and then read out again, which consumes additional power.
- VADDREDUCEPS is a vector reduction instruction.
- 310 is a vector register containing 4 threads, and each thread contains 4 elements.
- the data in each thread is shifted to the right by the bit width of one element; the rightmost element in each thread is not shifted out and is instead added to, subtracted from, or multiplied with the shifted element; the leftmost element in each thread is filled with 0; and the shift operation does not cross the thread boundary.
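A sketch of the shift-and-combine step just described, for a single 4-element thread; the zero fill on the left and the element-wise add follow the description, while the exact combining order is our interpretation:

```python
def vadd_reduce_step(elems):
    """One intra-thread reduction step: shift the thread's elements right
    by one element position, fill the leftmost slot with 0, and add the
    shifted vector element-wise to the unshifted one. Nothing crosses
    the thread boundary."""
    shifted = [0.0] + elems[:-1]  # shift right by one element, zero fill
    return [a + b for a, b in zip(elems, shifted)]

# One thread holding [1.0, 2.0, 3.0, 4.0]:
# vadd_reduce_step([1.0, 2.0, 3.0, 4.0]) == [1.0, 3.0, 5.0, 7.0]
```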
- 310 is thereby changed to 320.
- using the vector reduction instruction can only perform shift operations on the data within each thread and does not involve real cross-thread operation. Although reduction calculation can be realized, the efficiency is low, and the approach is only applicable to processors with few threads. For SIMD processors, since the number of threads is large and the bit width of the registers within a thread is small, this technique cannot operate across threads and has poor applicability. In addition, this technique can only perform partial reduction calculations and cannot be applied to differential calculations in graphics.
- the embodiments of the present application provide a multi-threaded data processing method and device, which can improve execution performance, realize cross-thread operations involved in parallel computing at a lower hardware cost, and effectively accelerate data processing of parallel computing.
- the multi-threaded data processing method provided in the embodiment of the present application can be applied to reduction algorithms in parallel computing, differential computing in graphics processing, and the like.
- FIG. 3 shows a schematic diagram of a SIMD parallel computing processor system architecture.
- the multi-thread data processing method provided by the embodiment of the present application can be applied to the SIMD parallel computing processor system.
- SIMD parallel computing processor systems can be deployed in devices such as personal computers, laptops, smart phones, smart set-top boxes, in-vehicle smart systems, smart wearable devices, and more.
- the SIMD parallel computing processor system is mainly used to process applications with a large amount of data: it takes as input the compiled binary instruction code and the corresponding data to be processed, and finally outputs the data processed by the program to external storage.
- a typical example is a graphics processing unit (GPU), which inputs a large amount of 3D model vertex data and the rendering program instruction code compiled by the compiler, and finally outputs the rendered data to the video memory.
- the SIMD parallel computing processor system mainly includes one or more processor cores, and one SIMD processor core is schematically shown in FIG. 3 .
- each processor core contains multiple arithmetic logic units (arithmetic logic unit, ALU), general-purpose register (GPR) units, and one or more instruction-processing-related units such as an instruction scheduler, an instruction decoder, and a source operand collector.
- the instruction scheduler is used to read the instruction code compiled by the compiler from the memory, and distribute the instruction code according to the degree of idleness of the arithmetic logic unit (ALU) and the degree of resource usage.
- the instruction encoding is an encoding in binary format; optionally, the instruction encoding may also be referred to as an operation instruction.
- an instruction encoding may contain one or more of the following parameters: one or more operation codes, used to indicate the behavior of the instruction encoding; source operands, used to indicate the source data required by the operation code, which may be register address encodings or immediate encodings; and a destination operand, used to indicate the storage location of the result after the instruction's operation code is executed, which may be a register address encoding.
- the embodiment of the present application will describe the instruction encoding in detail in the following content.
- a general-purpose register (GPR) unit is used to store data corresponding to operands involved in instruction calculation, such as data corresponding to source operands and data corresponding to destination operands.
- the general purpose register unit (GPR) uses static random access memory (SRAM).
- the initial data may come from external storage; corresponding to the multithreading of the parallel computing processor, the initial data may be the data of the multiple threads of the SIMD processor core.
- the instruction decoder is configured to receive and parse the instruction code, and instruct the general purpose register unit (GPR) to prepare for reading the source data according to the instruction code.
- the source operand collector is used to receive multiple source data returned by the general-purpose register and, based on these source data, perform a cross-thread data movement operation before outputting the data to the arithmetic logic unit. Specifically, a set number of threads are deployed in the source operand collector; the source operand collector uses the multiple source data returned by the general-purpose register as the source data of this set number of threads, one thread corresponding to one source datum, and performs the data movement operation among the set number of threads. In the embodiment of the present application, the source operand collector may also output the multiple source data to the ALU; alternatively, the ALU may directly receive the multiple source data returned by the general-purpose register.
- the arithmetic logic unit, which includes multi-stage pipelines, can complete instruction calculations of various operation types, such as floating-point addition FADD, floating-point multiplication FMUL, floating-point comparison FMIN/FMAX, signed integer addition IADDS, unsigned integer addition IADDU, signed integer subtraction ISUBS, unsigned integer subtraction ISUBU, signed integer multiplication IMULS, unsigned integer multiplication IMULU, signed comparison IMINS, unsigned comparison IMINU, logical XOR operation XOR, logical AND operation AND, logical OR operation OR, and other floating-point, integer, and logical operations.
- each SIMD processor core can contain multiple ALUs to achieve high computing throughput.
- an independent 1-bit flag can be set for each ALU unit, and the value of the flag indicates whether the ALU unit participates in instruction calculation. For example, a flag bit of 1 means the ALU participates in the instruction calculation, while a flag bit of 0 means the ALU does not participate and needs no clock toggling, which saves power.
- the above system provided by the embodiment of the present application does not need a complex cross network or storage accesses to obtain data: by executing a single instruction encoding, it reads data from the general-purpose register once and completes the cross-thread data movement and calculation, which can improve the execution performance of cross-thread operations.
- the instruction encoding may specifically include the following parameters.
- the first operation code is used to indicate the data transfer mode among the set number of threads deployed in the source operand collector; the data transfer mode includes one or more types, which can be defined according to actual requirements. The operation types (indicated by the second operation code) include floating-point addition FADD, floating-point multiplication FMUL, floating-point comparison FMIN/FMAX, signed integer addition IADDS, unsigned integer addition IADDU, signed integer subtraction ISUBS, unsigned integer subtraction ISUBU, signed integer multiplication IMULS, unsigned integer multiplication IMULU, signed comparison IMINS, unsigned comparison IMINU, logical XOR, logical AND, logical OR, and other floating-point, integer, and logical operations.
- the first operation code may also be called a main operation code
- the second operation code may also be called a secondary operation code.
- the data transfer mode may include the following types: circular transfer, cross transfer, and one-to-many transfer.
- the second opcode is used to indicate the operation type.
- circular transfer can be understood as moving the data of every thread by the same thread offset in the same thread-number direction (for example, from higher-numbered threads toward lower-numbered threads); cross transfer can be understood as the mutual exchange of data between two threads; one-to-many transfer, also called diffusion transfer, can be understood as moving the data of one thread to one or more other threads, possibly including that thread itself.
- the first operation code can be CROSS-DOWN, used to indicate circular transfer; or the first operation code can be CROSS-QUAD-BUTTERFLY, used to indicate cross transfer; or the first operation code can be CROSS-QUAD-BROADCAST, used to indicate one-to-many transfer.
- the aforementioned circular transfer, cross transfer, and one-to-many transfer can also be given other names, as long as they can be identified so that the source operand collector can determine which transfer operation to perform according to the first operation code; the embodiments of the present application do not limit this.
- the terms first data transfer method, second data transfer method, and third data transfer method can be used to distinguish the above types of data transfer.
- the first data transfer method indicates circular transfer
- the second data transfer method indicates cross transfer.
- the third data transfer mode indicates one-to-many transfer.
- source operand 1 is used to indicate the source data of the set number of threads; the source data of the set number of threads may come from a parallel computing processor such as a SIMD processor, with the source data of different threads among the set number coming from different threads in the SIMD processor.
- the set number of threads deployed in the aforementioned source operand collector may be consistent with the number of threads of a parallel computing processor such as a SIMD processor, for example both equal to N, N being an integer greater than or equal to 2; alternatively, the set number of threads deployed in the source operand collector may be less than the number of threads of the parallel computing processor, for example the set number being N while the SIMD processor has 2N threads.
- the source operand 1 may specifically be a general-purpose register address or a special-purpose register address.
- source operand 2 is used to determine the thread offset corresponding to the data transfer mode, and source operand 2 can be an immediate value set according to actual computing requirements.
- the destination operand is used to indicate the storage location of the operation result, specifically, it may be a general-purpose register address or a special-purpose register address.
- the instruction decoder can obtain the first operation code, the second operation code, the destination operand, source operand 1, and source operand 2 from the instruction encoding according to the format of the instruction encoding. It instructs the general-purpose register to prepare the corresponding source data according to the first source operand; the general-purpose register returns the source data of the aforementioned set number of threads to the source operand collector, and the source operand collector moves the source data of the set number of threads according to the instruction encoding to obtain the moved data on each thread.
- The source operand collector can send the source data and the moved data of some or all of the set number of threads to the arithmetic logic unit, and the arithmetic logic unit can, for those threads in parallel (simultaneously), execute the operation type indicated by the second operation code to obtain the corresponding operation result, which is stored according to the destination operand.
- The following describes Solutions 1 to 4 in detail, covering cross-thread data movement and calculation under different data movement methods.
- the source operand collector deploys the same number of threads as the parallel computing processor, for example N threads.
- One instruction code can be used to realize the circular transfer of data among N threads.
- The instruction code is the first operation instruction, which may include the following parameters: the first operation code, the second operation code, the destination operand, the first source operand (that is, source operand 1 in the first operation instruction) and the second source operand (that is, source operand 2 in the first operation instruction).
- the first operation code is CROSSDOWN, indicating that the data movement method among the N threads in this solution is circular transfer, i.e. the first data movement method;
- the second operation code indicates the operation type, such as floating-point addition FADD;
- the destination operand is the general-purpose register R0, and the storage location of the operation result is indicated by the general-purpose register address;
- the first source operand is the general-purpose register R1; the first source data of the N threads is indicated by the general-purpose register address, and the initial data in the general-purpose register R1 is the data of the N threads of the parallel computing processor;
- the second source operand can be an immediate value, which is the thread offset corresponding to the first data movement method. The thread offset can be understood as the span of the cross-thread move: for example, if the immediate value is 2, the thread from which a piece of data is moved and the thread to which it is moved are 2 threads apart.
- the expression of the first operation instruction may be recorded as: CROSSDOWN.FADD R0, R1, 2.
- The source operand collector moves the first source data of the N threads according to the first operation instruction, which can be implemented as follows: the first source data of the thread numbered I1 is moved to the thread numbered i, where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I1 is the remainder of (i+SRC1) divided by N, that is, I1 = (i+SRC1) % N; SRC1 represents the second source operand and is a positive integer.
- Take as an example that both the source operand collector and the parallel computing processor have 32 threads, and SRC1 is 2.
- FIG. 4 is a schematic diagram of circular transfer; the first source data of the thread at the tail of an arrow is moved to the thread at the head of the arrow.
- the data moved on the thread numbered 0 is the first source data on the thread numbered 2;
- the data moved on the thread numbered 2 is the first source data on the thread numbered 4;
- the data moved on the thread numbered 25 is the first source data on the thread numbered 27;
- the data moved on the thread numbered 30 is the first source data on the thread numbered 0, and so on.
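The circular transfer described above can be sketched in software as follows. This is a minimal illustrative model (not the hardware implementation); the function name and placeholder data are assumptions for illustration.

```python
# Sketch (assumed semantics): CROSSDOWN circular move among N threads.
# The thread numbered i receives the first source data of thread (i + SRC1) % N.
def crossdown_move(src, src1):
    n = len(src)
    return [src[(i + src1) % n] for i in range(n)]

N, SRC1 = 32, 2
src = [f"data{i}" for i in range(N)]   # hypothetical per-thread source data
moved = crossdown_move(src, SRC1)

# Matches the figure: thread 0 gets thread 2's data, thread 25 gets thread 27's,
# thread 30 wraps around and gets thread 0's data.
assert moved[0] == src[2]
assert moved[25] == src[27]
assert moved[30] == src[0]
```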
- A CROSSDOWN cross-thread processing unit can be deployed in the source operand collector, such as the CROSSDOWN cross-thread processing unit structure shown in Figure 5, which implements the circular transfer operation using multiple selectors (MUX). Assuming that the data bit width of each of the N threads is M bits, a cascade circuit can be constructed from log2(N) binary selectors with a bit width of 2*M*N bits to perform cascaded data selection.
- The input of the first selector in the cascade circuit is generated from the first source data of the N threads: the concatenated first source data of the N threads, denoted SRC0, is copied to double the bit width, written {SRC0, SRC0}, as the first input of the selector, and the double-width data shifted right by M bits serves as the second input. Bit 0 of SRC1 is the selection bit: if bit 0 of SRC1 is 0, the un-shifted double-width data is output; if bit 0 of SRC1 is 1, the double-width right-shifted data is output (or vice versa, depending on the convention).
- Each subsequent stage-i selector takes one input from the output of the previous-stage selector, and the other input is that same output shifted right by 2^i * M bits; bit i of the binary representation of SRC1 is used as the selection bit. For example, if bit i of SRC1 is 0, the un-shifted data is output; if bit i of SRC1 is 1, the right-shifted data is output (or vice versa, as long as the convention for the value of bit i is the same in every selector stage).
- The last-stage selector uses bit log2(N)-1 of the binary representation of SRC1 as the selection bit, and its output data is sent to the arithmetic logic unit (ALU) as an operand. According to the operation type indicated by the second operation code of the aforementioned first operation instruction, the ALU can compute on the operands before and after the cross-thread move, including but not limited to floating-point multiplication, floating-point addition, integer multiplication, integer addition, floating-point comparison, integer comparison, logical AND, logical OR, logical XOR, etc.
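The cascaded selection can be modeled at the thread level as follows. This is a behavioral sketch under the assumption that stage i either passes its input through or rotates the thread vector down by 2^i positions according to bit i of SRC1; the powers of two selected by SRC1's binary digits compose into a rotation by SRC1.

```python
# Behavioral sketch (assumed) of the log2(N)-stage cascade: each stage i
# passes the data through when bit i of SRC1 is 0, or rotates the N-thread
# vector down by 2**i thread positions (i.e. 2**i * M bits) when bit i is 1.
def cascade_rotate(data, src1):
    n = len(data)
    stages = n.bit_length() - 1          # log2(N) selector stages
    for i in range(stages):
        if (src1 >> i) & 1:
            shift = 1 << i               # stage i rotates by 2**i threads
            data = data[shift:] + data[:shift]
    return data

# The composed result equals a direct circular move by SRC1.
N = 32
src = list(range(N))
for SRC1 in (1, 2, 5, 31):
    assert cascade_rotate(src, SRC1) == [src[(i + SRC1) % N] for i in range(N)]
```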
- For a first thread among the N threads, the arithmetic logic unit ALU may perform the arithmetic operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
- the first thread may include some or all of the N threads.
- A thread flag bit may be configured for each of the N threads; the thread flag bit indicates whether the first source data of the thread participates in the calculation operation. Specifically, when the thread flag bit takes a first value, it indicates that the source data of the thread participates in the operation; when the thread flag bit takes a second value, it indicates that the source data of the thread does not participate in the operation. The first value may be 1, and the second value may be 0. Setting the thread flag bit marks the threads that do not need to be calculated, which saves calculation power consumption.
- The arithmetic logic unit ALU can determine, according to the thread flag bits, whether the data before and after the move on a thread participates in the calculation. Specifically, for the thread numbered i among the N threads (thread i for short), the thread flag bit of thread i and the thread flag bit of the thread numbered ((i+SRC1)%N) determine whether the data before and after the move on thread i participates in the operation. Equivalently, after the data is moved, the thread flag of thread i is updated from the original thread flag of thread i and the thread flag of the source of the moved data, i.e. the thread numbered ((i+SRC1)%N). Letting lanemask[i] denote the original thread flag of thread i and lanemask[(i+SRC1)%N] the flag of the thread numbered ((i+SRC1)%N), the updated flag is new_lanemask[i] = lanemask[i] & lanemask[(i+SRC1)%N]; when the updated thread flag new_lanemask[i] of thread i is 1, the data before and after the move on thread i participates in the operation.
- In other words, the data before and after the move on the first thread participates in the operation only if the following conditions are met: the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the operation, and the thread flag bit associated with the second thread (the source of the moved data) indicates that the first source data of the second thread participates in the operation.
- FIG. 6 is a schematic diagram of thread flag bits. Before the move, the original thread flag bits of thread 1 and thread 28 are 0 (filled in black). After the cross-thread data move operation, i.e. the circular move, the updated thread flag bits of thread 1, thread 26, thread 28 and thread 31 are 0, so the data before and after the move on thread 1, thread 26, thread 28 and thread 31 does not participate in the calculation.
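The flag update can be sketched as follows, reproducing the Figure 6 example; the AND-based update formula and function name are assumptions consistent with the conditions stated above.

```python
# Sketch (assumed formula): after a CROSSDOWN move, thread i's data takes
# part in the operation only if both its own original flag and the flag of
# the source thread (i + SRC1) % N are 1:
#   new_lanemask[i] = lanemask[i] & lanemask[(i + SRC1) % N]
def update_lanemask(lanemask, src1):
    n = len(lanemask)
    return [lanemask[i] & lanemask[(i + src1) % n] for i in range(n)]

N, SRC1 = 32, 2
lanemask = [0 if i in (1, 28) else 1 for i in range(N)]  # Figure 6 example
new = update_lanemask(lanemask, SRC1)

# Threads 1 and 28 stay masked; threads 26 and 31 become masked because
# their source threads (28 and 1) were masked.
assert [i for i in range(N) if new[i] == 0] == [1, 26, 28, 31]
```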
- Solution 1 uses a single instruction to move data circularly across threads, and can be applied to reduction calculations in parallel computing.
- Solution 1 can also be used to realize multi-threaded data accumulation and multiplication operations, for example by constructing multiple levels of instructions, each level using the aforementioned first operation instruction, with the output of each level serving as the input of the next level, finally gathering the multi-threaded data onto the same thread to realize accumulation, multiplication and other operations.
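A multi-level accumulation built from the first operation instruction can be sketched as follows. The composition (doubling the offset at each level so every thread ends up with the full sum) is an assumed arrangement consistent with the description; the function name is illustrative.

```python
# Sketch (assumed composition): each level issues CROSSDOWN.FADD; doubling
# SRC1 each level (1, 2, 4, ...) makes every thread hold the full sum after
# log2(N) levels -- a rotation-based reduction.
def crossdown_fadd_level(values, src1):
    n = len(values)
    moved = [values[(i + src1) % n] for i in range(n)]   # cross-thread move
    return [values[i] + moved[i] for i in range(n)]      # FADD per thread

N = 32
vals = [float(i) for i in range(N)]
src1 = 1
for _ in range(5):                 # log2(32) = 5 levels
    vals = crossdown_fadd_level(vals, src1)
    src1 *= 2

assert all(v == sum(range(N)) for v in vals)   # every thread holds 496.0
```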
- the source operand collector deploys the same number of threads as the parallel computing processor, for example N threads.
- One instruction code can be used to realize the cross movement of data of N threads.
- The instruction code is the first operation instruction, which may include the following parameters: the first operation code, the second operation code, the destination operand, the first source operand (that is, source operand 1 in the first operation instruction) and the second source operand (that is, source operand 2 in the first operation instruction).
- the first operation code is CROSS QUAD BUTTERFLY, indicating that the data movement method among the N threads in Solution 2 is cross transfer, i.e. the second data movement method;
- the second operation code indicates the operation type, such as floating-point addition FADD;
- the destination operand is the general-purpose register R0, and the storage location of the operation result is indicated by the general-purpose register address;
- the first source operand is the general-purpose register R1; the first source data of the N threads is indicated by the general-purpose register address, and the initial data in the general-purpose register R1 is the data of the N threads of the parallel computing processor;
- the second source operand may be an immediate value, such as 2.
- the expression of the first operation instruction may be recorded as: CROSS QUAD BUTTERFLY.FADD R0,R1,2.
- The source operand collector moves the first source data of the N threads according to the first operation instruction, which can be implemented as follows: the first source data of the thread numbered I2 is moved to the thread numbered i, where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I2 is the XOR of i and SRC1, that is, I2 = i XOR SRC1; SRC1 represents the second source operand and is a positive integer.
- Take as an example that both the source operand collector and the parallel computing processor have 32 threads, and SRC1 is 2.
- the data moved on the thread numbered 0 is the first source data on the thread numbered 2; the data moved on the thread numbered 2 is the first source data on the thread numbered 0;
- the data moved on the thread numbered 29 is the first source data on the thread numbered 31; the data moved on the thread numbered 31 is the first source data on the thread numbered 29, and so on.
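The butterfly exchange can be sketched as follows; the function name and placeholder data are illustrative assumptions.

```python
# Sketch (assumed semantics): CROSS QUAD BUTTERFLY moves the first source
# data of thread i XOR SRC1 to thread i; with SRC1 < 4 the exchange stays
# inside each group of four consecutive threads (a QUAD).
def butterfly_move(src, src1):
    return [src[i ^ src1] for i in range(len(src))]

N, SRC1 = 32, 2
src = [f"data{i}" for i in range(N)]   # hypothetical per-thread source data
moved = butterfly_move(src, SRC1)

# Pairwise exchanges, matching the figure: 0 <-> 2, 29 <-> 31.
assert moved[0] == src[2] and moved[2] == src[0]
assert moved[29] == src[31] and moved[31] == src[29]
```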
- The CROSS QUAD BUTTERFLY of Solution 2 groups the N threads, e.g. 32 threads, into QUADs of 4 consecutive threads, and realizes the pairwise data exchange between threads within each QUAD.
- A CROSS QUAD BUTTERFLY cross-thread processing unit can be deployed in the source operand collector; this unit uses multiple four-input selectors (MUX) to realize the cross transfer operation. Assuming that the data bit width of each of the N threads is M bits, parallel data selection can be performed through N four-input selectors with a bit width of M bits.
- Figure 8 shows the i-th four-input selector MUX in the CROSS QUAD BUTTERFLY cross-thread processing unit. The inputs of the i-th four-input selector MUX are the first source data of the four threads of the QUAD to which thread i belongs.
- the numbers of the aforementioned four threads are respectively: 4*floor(i/4), 4*floor(i/4)+1, 4*floor(i/4)+2 and 4*floor(i/4)+3.
- The i-th selector uses the low two bits of the XOR result of i and SRC1 as the selection bits, selects one of the four inputs and outputs it to the arithmetic logic unit ALU. According to the operation type indicated by the second operation code of the aforementioned first operation instruction, the ALU can compute on the operands before and after the cross-thread move, including but not limited to floating-point multiplication, floating-point addition, integer multiplication, integer addition, floating-point comparison, integer comparison, logical AND, logical OR, logical XOR, etc.
- For a first thread among the N threads, the arithmetic logic unit ALU may perform the arithmetic operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
- the first thread may include some or all of the N threads.
- A thread flag bit may be configured for each of the N threads; the thread flag bit indicates whether the first source data of the thread participates in the calculation operation. Specifically, when the thread flag bit takes a first value, it indicates that the source data of the thread participates in the operation; when the thread flag bit takes a second value, it indicates that the source data of the thread does not participate in the operation. The first value may be 1, and the second value may be 0. Setting the thread flag bit marks the threads that do not need to be calculated, which saves calculation power consumption.
- The arithmetic logic unit ALU can determine, according to the thread flag bits, whether the data before and after the move on a thread participates in the calculation. Specifically, for the thread numbered i among the N threads (thread i for short), the thread flag bit of thread i and the thread flag bit of the thread numbered (i XOR SRC1) determine whether the data before and after the move on thread i participates in the operation. Equivalently, after the data is moved, the thread flag of thread i is updated from the original thread flag of thread i and the thread flag of the source of the moved data, i.e. the thread numbered (i XOR SRC1). Letting lanemask[i] denote the original thread flag of thread i and lanemask[i XOR SRC1] the flag of the thread numbered (i XOR SRC1), the updated flag is new_lanemask[i] = lanemask[i] & lanemask[i XOR SRC1]; when the updated thread flag new_lanemask[i] of thread i is 1, the data before and after the move on thread i participates in the operation.
- In other words, the data before and after the move on the first thread participates in the operation only if the following conditions are met: the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the operation, and the thread flag bit associated with the second thread (the source of the moved data) indicates that the first source data of the second thread participates in the operation.
- FIG. 9 is a schematic diagram of thread flag bits. Before the move, the original thread flag bits of thread 1 and thread 28 are 0 (filled in black). After the cross-thread data move operation, i.e. the butterfly cross move, the updated thread flag bits of thread 1, thread 3, thread 28 and thread 30 are all 0, so the data before and after the move on thread 1, thread 3, thread 28 and thread 30 does not participate in the calculation.
- Solution 2 uses a single instruction to achieve cross-thread data movement, and can confine the data exchange between threads to the small range of a QUAD, which can be applied to difference calculations in image processing, such as comparing two pixels at nearby positions.
- the source operand collector deploys the same number of threads as the parallel computing processor, for example N threads.
- One instruction code can be used to realize the cross movement of data of N threads.
- The instruction code is the first operation instruction, which may include the following parameters: the first operation code, the second operation code, the destination operand, the first source operand (that is, source operand 1 in the first operation instruction) and the second source operand (that is, source operand 2 in the first operation instruction).
- the first operation code is CROSS QUAD-BROADCAST, indicating that the data movement method among the N threads in Solution 3 is one-to-many transfer or diffusion transfer, i.e. the third data movement method;
- the second operation code indicates the operation type, such as floating-point addition FADD;
- the destination operand is the general-purpose register R0, and the storage location of the operation result is indicated by the general-purpose register address;
- the first source operand is the general-purpose register R1; the first source data of the N threads is indicated by the general-purpose register address, and the initial data in the general-purpose register R1 is the data of the N threads of the parallel computing processor;
- the second source operand can be an immediate value, such as 2.
- the expression of the first operation instruction may be recorded as: CROSS QUAD-BROADCAST.FADD R0,R1,2.
- The source operand collector moves the first source data of the N threads according to the first operation instruction, which can be implemented as follows: the first source data of the thread numbered I3 is moved to the thread numbered i, where the N threads are numbered 0 to (N-1), i ranges over 0 to (N-1), and I3 = n*floor(i/n) + SRC1; SRC1 represents the second source operand and is a positive integer, n is a positive integer that divides N, and floor(·) indicates rounding down.
- Letting SRC0[i] denote the first source data on the thread numbered i, the result after the move satisfies the expression: the data moved on thread i = SRC0[n*floor(i/n) + SRC1], for i ∈ [0, N-1].
- Take as an example that both the source operand collector and the parallel computing processor have 32 threads, SRC1 is 2, and n is 4.
- FIG. 10 is a schematic diagram of one-to-many transfer; the first source data of the thread at the tail of an arrow is moved to the thread at the head of the arrow.
- the first source data of the thread numbered 2 is moved to the thread numbered 0, the thread numbered 1, the thread numbered 2, and the thread numbered 3.
- the data moved on the thread numbered 0 is the first source data on the thread numbered 2; the data moved on the thread numbered 1 is the first source data on the thread numbered 2; the data moved on the thread numbered 2 is still the first source data on the thread numbered 2; the data moved on the thread numbered 3 is the first source data on the thread numbered 2, and so on.
- The CROSS QUAD-BROADCAST of Solution 3 groups the N threads, e.g. 32 threads, into QUADs of 4 consecutive threads, and then, for each thread, moves to it the first source data of the thread whose index within its QUAD is SRC1.
- A CROSS QUAD-BROADCAST cross-thread processing unit can be deployed in the source operand collector; this unit uses multiple four-input selectors (MUX) to realize the broadcast transfer operation. Assuming that the data bit width of each of the N threads is M bits, parallel data selection can be performed through N four-input selectors with a bit width of M bits.
- Figure 11 shows the i-th four-input selector MUX in the CROSS QUAD-BROADCAST cross-thread processing unit; the inputs of the i-th four-input selector MUX are the first source data of the four threads of the QUAD to which thread i belongs, whose numbers are respectively 4*floor(i/4), 4*floor(i/4)+1, 4*floor(i/4)+2 and 4*floor(i/4)+3.
- The i-th selector uses SRC1 as the selection bits, selects one of the four inputs and outputs it to the arithmetic logic unit ALU; the ALU can compute on the operands before and after the cross-thread move, including but not limited to floating-point multiplication, floating-point addition, integer multiplication, integer addition, floating-point comparison, integer comparison, logical AND, logical OR, logical XOR, etc.
- For a first thread among the N threads, the arithmetic logic unit ALU may perform the arithmetic operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
- the first thread may include some or all of the N threads.
- A thread flag bit may be configured for each of the N threads; the thread flag bit indicates whether the first source data of the thread participates in the calculation operation. Specifically, when the thread flag bit takes a first value, it indicates that the source data of the thread participates in the operation; when the thread flag bit takes a second value, it indicates that the source data of the thread does not participate in the operation. The first value may be 1, and the second value may be 0. Setting the thread flag bit marks the threads that do not need to be calculated, which saves calculation power consumption.
- The arithmetic logic unit ALU can determine, according to the thread flag bits, whether the data before and after the move on a thread participates in the calculation. Specifically, for the thread numbered i among the N threads (thread i for short), the thread flag bit of thread i and the thread flag bit of the broadcasting thread, numbered 4*floor(i/4)+SRC1%4, determine whether the data before and after the move on thread i participates in the operation. Equivalently, after the data is moved, the thread flag of thread i is updated from the original thread flag of thread i and the thread flag of the source of the moved data. Letting lanemask[i] denote the original thread flag of thread i and lanemask[4*floor(i/4)+SRC1%4] the flag of the broadcasting thread, the updated flag is new_lanemask[i] = lanemask[i] & lanemask[4*floor(i/4)+SRC1%4]; when the updated thread flag new_lanemask[i] of thread i is 1, the data before and after the move on thread i participates in the operation.
- In other words, the data before and after the move on the first thread participates in the operation only if the following conditions are met: the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the operation, and the thread flag bit associated with the second thread (the source of the moved data) indicates that the first source data of the second thread participates in the operation.
- FIG. 12 is a schematic diagram of thread flag bits. Before the move, the original thread flag bits of thread 1 and thread 30 are 0 (filled in black). After the cross-thread data move operation, i.e. the broadcast move, the updated thread flag bits of thread 1 and threads 28-31 are all 0, so the data before and after the move on thread 1 and threads 28-31 does not participate in the calculation.
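The flag update for the broadcast case can be sketched as follows, reproducing the Figure 12 example; the AND-based update formula and function name are assumptions consistent with the conditions stated above.

```python
# Sketch (assumed formula): after a QUAD broadcast, thread i's data
# participates only if both its own flag and the flag of the broadcasting
# thread 4*floor(i/4) + SRC1 are 1.
def update_lanemask_broadcast(lanemask, src1, n=4):
    return [lanemask[i] & lanemask[(i // n) * n + src1 % n]
            for i in range(len(lanemask))]

N, SRC1 = 32, 2
lanemask = [0 if i in (1, 30) else 1 for i in range(N)]  # Figure 12 example
new = update_lanemask_broadcast(lanemask, SRC1)

# Thread 1 stays masked; the whole QUAD 28-31 becomes masked because its
# broadcasting thread 30 was masked.
assert [i for i in range(N) if new[i] == 0] == [1, 28, 29, 30, 31]
```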
- Solution 3 uses a single instruction to achieve cross-thread data movement, and can broadcast the data of a given thread within the small range of a QUAD, which can be applied to difference calculations in image processing, such as smoothing four adjacent pixels based on one of them.
- Solution 1 and Solution 3 can also be combined, with the calculation result of each thread in Solution 1 used as the source data of the corresponding thread in Solution 3 to perform the one-to-many transfer operation.
- The embodiment of the present application also provides a cross-thread data processing flow, which can be executed cooperatively by the units in a parallel computing processor, and mainly includes the following steps.
- (1) The instruction scheduler inputs instructions.
- The parallel computing program or graphics rendering program is compiled by a compiler into binary instruction code and then configured into the SIMD processor.
- The instruction code serves as the input of the instruction scheduler; before instruction issue begins, the data to be processed is configured into storage by software and initialized into the registers as the data input of the register module.
- (2) The instruction decoder parses the instruction code to obtain operands (such as source operand 1, source operand 2, and the destination operand) and operation codes, such as the first operation code and the second operation code.
- (3) After the instruction decoder parses the source operands, it sends a source operand read request to the general-purpose register, and the general-purpose register returns the data corresponding to the source operands to the collector.
- (4) The source operand collector judges whether the first operation code is a CROSS-type instruction. If not, step (5) is performed to send the data to the downstream ALU for calculation. If it is a CROSS-type instruction, after judging whether it is a CROSS DOWN instruction, a CROSS QUAD BUTTERFLY instruction or a CROSS QUAD BROADCAST instruction and performing the cross-thread data move operation through the corresponding processing unit, step (5) is performed to send the data to the downstream ALU for calculation.
- (5) The ALU performs the corresponding calculation according to the second operation code, and the result is sent to the next module for processing.
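The dispatch in step (4) can be sketched as follows. The opcode strings, function names and return convention are illustrative assumptions; the point is only that non-CROSS data passes straight through while CROSS data is first moved by the matching processing unit.

```python
# Sketch (assumed names) of the collector's dispatch on the first opcode.
def crossdown(src, src1):      return [src[(i + src1) % len(src)] for i in range(len(src))]
def quad_butterfly(src, src1): return [src[i ^ src1] for i in range(len(src))]
def quad_broadcast(src, src1): return [src[(i // 4) * 4 + src1] for i in range(len(src))]

CROSS_UNITS = {"CROSSDOWN": crossdown,
               "CROSS_QUAD_BUTTERFLY": quad_butterfly,
               "CROSS_QUAD_BROADCAST": quad_broadcast}

def collect_and_move(opcode1, src, src1):
    """Step (4): return (source data, moved data) handed to the ALU."""
    if opcode1 in CROSS_UNITS:                     # CROSS-type instruction
        return src, CROSS_UNITS[opcode1](src, src1)
    return src, None                               # non-CROSS: pass through

src = list(range(8))
_, moved = collect_and_move("CROSSDOWN", src, 2)
assert moved == [2, 3, 4, 5, 6, 7, 0, 1]
_, moved = collect_and_move("CROSS_QUAD_BUTTERFLY", src, 1)
assert moved == [1, 0, 3, 2, 5, 4, 7, 6]
```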
- the source operand collector deploys fewer threads than the parallel computing processor. For example, the source operand collector deploys N threads while the parallel computing processor has 2N threads. Two instruction codes can then be used to realize the circular transfer of data among the 2N threads of the parallel computing processor.
- the codes of the two instructions can be recorded as the first operation instruction and the second operation instruction.
- the first operation instruction can refer to the definition of Scheme 1.
- the source data sources indicated by source operand 1 differ between the first operation instruction and the second operation instruction.
- the first source data of the N threads indicated by the first source operand in the first operation instruction comes from N consecutive threads of the parallel computing processor.
- source operand 1 of the second operation instruction indicates (in the source operand collector) the second source data of N threads, which comes from the remaining N consecutive threads of the parallel computing processor.
- the second operation instruction may also include other parameters that are the same as those of the first operation instruction, such as a first operation code, a second operation code, a destination operand, and a second source operand.
- The source operand collector moves the first source data of the N threads according to the first operation instruction, and moves the second source data of the N threads according to the second operation instruction; the specific implementation can follow Solution 1 and is not repeated in this embodiment of the present application.
- For example, the source operand collector deploys 32 threads, the parallel computing processor includes 64 threads, and SRC1 is 2.
- The first operation instruction is issued earlier than the second operation instruction; denote the issue interval between the two instructions as m, where m = 1 means the first operation instruction and the second operation instruction are issued back to back.
- Each instruction processes N threads.
- Figure 14 shows a schematic diagram of source data sources.
- the N threads of the source operand collector are numbered from 0 to 31.
- the first source data of the N threads indicated by the first operation instruction comes from the threads numbered 32-63 of the parallel computing processor: the first source data of thread 0 in the source operand collector comes from thread 32 of the parallel computing processor, the first source data of thread 1 comes from thread 33, and so on, up to the first source data of thread 31 coming from thread 63. The second source data of the N threads indicated by the second operation instruction comes from the threads numbered 0-31 of the parallel computing processor: the second source data of thread 0 in the source operand collector comes from thread 0, the second source data of thread 1 comes from thread 1, and so on, up to the second source data of thread 31 coming from thread 31.
- the embodiment of the present application provides another schematic diagram of circular transfer.
- The source operand collector first obtains the first operation instruction; after executing the circular cross-thread move according to the first operation instruction, in the source operand collector: the data moved on thread 0 is the first source data on thread 2; the data moved on thread 1 is the first source data on thread 3; ...; the data moved on thread 30 is the first source data on thread 0; the data moved on thread 31 is the first source data on thread 1.
- the source operation data collector inputs the migration result corresponding to the first operation instruction, recorded as the first data of N threads, to the ALU.
- The source operand collector then obtains the second operation instruction; after executing the circular cross-thread move according to the second operation instruction, in the source operand collector: the data moved on thread 0 is the second source data on thread 2; the data moved on thread 1 is the second source data on thread 3; ...; the data moved on thread 30 is the second source data on thread 0; the data moved on thread 31 is the second source data on thread 1.
- the source operation data collector inputs the moving result corresponding to the second operation instruction, recorded as the second data of N threads, to the ALU.
- The first operation instruction arrives earlier than the second operation instruction. Suppose the second operation instruction arrives at stage I of the ALU while the first operation instruction is at stage I+m, where I is an arbitrary stage of the ALU. The arithmetic logic unit ALU can then exchange the moved first data and the moved second data on a third thread, where the third thread is the thread numbered r among the N threads of the source operand collector.
- r can be determined as follows: if SRC1 is less than N, then r is greater than or equal to (N - SRC1) and less than N; if SRC1 is greater than or equal to N, then r is greater than or equal to 0 and less than (N - SRC1 % N).
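As an illustrative, non-normative sketch (the embodiments do not prescribe an implementation language; Python and the function name `exchange_threads` are used here only to make the rule concrete), the set of exchanged thread numbers r can be computed as:

```python
def exchange_threads(N, SRC1):
    """Thread numbers r whose moved data is exchanged between the two
    operation instructions (ALU stage I and stage I+m).
    If SRC1 < N:  N - SRC1 <= r < N.
    If SRC1 >= N: 0 <= r < N - SRC1 % N."""
    if SRC1 < N:
        return list(range(N - SRC1, N))
    return list(range(0, N - SRC1 % N))

# With N = 32 and SRC1 = 2 (the example above), only threads 30 and 31
# are exchanged.
print(exchange_threads(32, 2))  # [30, 31]
```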
- refer to FIG. 16 for a schematic diagram of data exchange. As shown in FIG. 15 and FIG. 16, when the first operation instruction arrives at stage 0 of the ALU, the moved first data on thread 30 is exchanged with the moved second data on thread 30, and the moved first data on thread 31 is exchanged with the moved second data on thread 31, thereby realizing the circular transfer operation of data among the 64 threads of the parallel computing processor.
- the ALU then performs the corresponding calculation operation, according to the second operation code in the first operation instruction/second operation instruction, based on the result of circularly transferring data among the 64 threads of the parallel computing processor.
- the specific computing operations may be performed according to actual requirements, which is not limited in this embodiment of the present application.
- a thread flag bit can also be configured for each of the N threads deployed in the source operand collector; the thread flag bit is used to indicate whether the first source data of the thread participates in the calculation operation.
- the specific implementation can be carried out with reference to the manner in Solution 1, and is not repeated in this embodiment of the present application. As an example, black filling in FIG. 16 indicates that the data before and after transfer on the threads numbered 30 and 31 does not participate in the calculation operation.
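The behaviour of flagged-off lanes is not fully specified here; as one possible reading (an assumption for illustration, not the patent's definition — the function name and the choice that an inactive lane keeps its own value are both hypothetical), the flag bits can be sketched as:

```python
def masked_rotate(src, flags, SRC1):
    """Circular transfer with per-thread flag bits (sketch).
    Assumption: a lane receives moved data only when both it and the
    source lane have flag 1; otherwise the lane keeps its own value."""
    N = len(src)
    out = []
    for i in range(N):
        j = (i + SRC1) % N  # source lane of the circular transfer
        out.append(src[j] if flags[i] and flags[j] else src[i])
    return out

# With all flags set, this degenerates to the plain circular transfer.
print(masked_rotate([0, 1, 2, 3], [1, 1, 1, 1], 1))  # [1, 2, 3, 0]
```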
- in the fourth solution, a source operand collector with fewer threads is combined with data exchange processing in the ALU to realize efficient circular transfer of cross-thread data at a higher SIMD width, which can be applied to reduction calculations in parallel computing.
- solution 4 can also be used to realize multi-threaded data accumulation and multiplication operations, for example by constructing multi-level instructions: each level of instruction uses the aforementioned first operation instruction, and the output result of each level of instruction can be used as the input of the next level, so that the multi-threaded data is finally moved to the same thread to realize accumulation, multiplication and other operations.
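To make the multi-level construction concrete, the following sketch (Python for illustration only; the function name is hypothetical and N is assumed to be a power of two) builds a sum reduction from repeated rotation-style first operation instructions, halving the offset at each level:

```python
def reduce_sum_by_rotation(values):
    """Multi-level reduction: at each level every thread adds the value
    rotated in from SRC1 = s threads away; after log2(N) levels every
    thread (including thread 0) holds the total."""
    vals = list(values)
    N = len(vals)  # assumed to be a power of two
    s = N // 2
    while s >= 1:
        moved = [vals[(i + s) % N] for i in range(N)]  # circular transfer
        vals = [a + b for a, b in zip(vals, moved)]    # add (second opcode)
        s //= 2
    return vals[0]

print(reduce_sum_by_rotation(range(8)))  # 28
```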
- the embodiment of the present application provides a cross-thread data processing flow, which can be executed cooperatively by various units in a parallel computing processor. Mainly include the following steps.
- the instruction scheduler inputs instructions.
- the parallel computing program or graphics rendering program is compiled into a binary instruction code by a compiler and then configured into the SIMD processor.
- the instruction code is used as the input of the instruction scheduler, and the data to be processed is configured into storage by software and initialized into the registers, as the data input of the register module, before the instructions start to be issued.
- the instruction decoder parses the instruction code to obtain operands (such as source operand 1, source operand 2, destination operand, etc.) and operation codes, such as the first operation code and the second operation code.
- after the instruction decoder parses the source operands, it sends a source operand read request to the general register, and the general register returns the data corresponding to the source operands to the collector.
- the source operand collector judges whether the first opcode is a CROSS type instruction. If not, step (10) is executed to send the data to the next module. If so, and it is determined to be a CROSS DOWN instruction, step (5) is executed.
- the source operand collector performs CROSS DOWN data processing such as circular transfer operations.
- at ALU stage I, it is judged whether the instruction is the second CROSS DOWN instruction (i.e. the aforementioned second operation instruction). If not, step (10) is executed to send the data to the next module; if yes, step (7) is executed.
- the ALU judges whether the value of SRC1 is smaller than N; if yes, execute step (8); if not, execute step (9).
- FIG. 18 illustrates a cross-thread data processing flow, which mainly includes the following steps.
- the instruction scheduler inputs instructions.
- the parallel computing program or graphics rendering program is compiled into a binary instruction code by a compiler and then configured into the SIMD processor.
- the instruction code is used as the input of the instruction scheduler, and the data to be processed is configured into storage by software and initialized into the registers, as the data input of the register module, before the instructions start to be issued.
- the instruction scheduler judges whether SIMD 2N mode is used, that is, whether the number of SIMD threads is twice the number of threads in the source operand collector, and accordingly executes step (3), in which the instruction is issued only once, or step (4), in which the instruction is issued twice.
- the instruction decoder parses the instruction code to obtain operands (such as source operand 1, source operand 2, destination operand, etc.) and operation codes, such as the first operation code and the second operation code.
- the instruction decoder parses the source operand, it sends a source operand read request to the general register, and the general register returns the data corresponding to the source operand to the collector.
- the source operand collector executes the moving operation of the first source data between the N threads indicated by the first operation instruction according to the above schemes 1 to 4.
- solution 1, solution 2, solution 3, and solution 4 for carrying out SIMD N are as described above.
- step (6): the ALU judges whether it is the CROSS DOWN instruction of SIMD 2N, i.e. whether the second CROSS DOWN instruction (the aforementioned second operation instruction) has been received. If not, step (8) is executed to send the data to the next module; if yes, step (7) is executed.
- the data on the threads whose thread numbers are greater than or equal to (N - SRC1) and less than N is exchanged between ALU stage I and ALU stage I+m.
- the data on the threads whose thread numbers are greater than or equal to 0 and less than (N - SRC1 % N) is exchanged between ALU stage I and ALU stage I+m.
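The two N-wide passes plus the stage-I/stage-I+m exchange can be checked against a single 2N-wide rotation. The following sketch (Python, illustrative only; function names are hypothetical) demonstrates the equivalence under the assumption that the circular transfer is the first moving method:

```python
def rotate(src, s):
    """Plain circular transfer: lane i receives src[(i + s) % n]."""
    n = len(src)
    return [src[(i + s) % n] for i in range(n)]

def cross_down_2n(data, SRC1):
    """SIMD 2N CROSS DOWN built from two N-wide circular transfers plus
    the ALU exchange on the threads r described in the two steps above."""
    N = len(data) // 2
    lo = rotate(data[:N], SRC1)  # result of the first operation instruction
    hi = rotate(data[N:], SRC1)  # result of the second operation instruction
    if SRC1 < N:
        exch = range(N - SRC1, N)      # thread numbers >= N - SRC1, < N
    else:
        exch = range(0, N - SRC1 % N)  # thread numbers >= 0, < N - SRC1 % N
    for r in exch:
        lo[r], hi[r] = hi[r], lo[r]
    return lo + hi

# Equivalent to one circular transfer across all 2N threads:
assert cross_down_2n(list(range(64)), 2) == rotate(list(range(64)), 2)
```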
- an embodiment of the present application provides a multi-thread data processing method, as shown in FIG. 19 .
- the method mainly includes the following processes.
- the first operation instruction includes the following parameters: a first operation code, the first operation code is used to indicate the data transfer mode between N threads, and N is an integer greater than or equal to 2; the first source operand, the The first source operand is used to indicate the first source data of the N threads; the second source operand is used to determine the thread offset corresponding to the data movement mode;
- the efficient cross-thread operation of the parallel computing processor is realized by a single instruction, which is simpler than a crossbar network and does not require frequent memory accesses, so that accelerated processing of applications with cross-thread operations in a parallel computing processor can be realized with lower hardware or signaling overhead.
- the data moving method is the first moving method
- the moving of the first source data of the N threads according to the first operation instruction includes: moving the first source data of the thread numbered I1 to the thread numbered i; wherein the N threads are numbered 0 to (N-1), i takes 0 to (N-1), and I1 is the remainder of (i + SRC1) divided by N; SRC1 represents the second source operand, and SRC1 is a positive integer.
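For illustration (Python chosen arbitrarily; the function name is hypothetical and not part of the claimed method), the first moving method amounts to:

```python
def first_moving_method(src, SRC1):
    """Thread i receives the first source data of the thread numbered
    I1 = (i + SRC1) mod N: a circular transfer across the N threads."""
    N = len(src)
    return [src[(i + SRC1) % N] for i in range(N)]

# N = 8, SRC1 = 2: thread 0 receives thread 2's data, thread 6 receives
# thread 0's data, and so on.
print(first_moving_method([10, 11, 12, 13, 14, 15, 16, 17], 2))
# [12, 13, 14, 15, 16, 17, 10, 11]
```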
- the data moving method is a second moving method
- the moving the first source data of the N threads according to the first operation instruction includes:
- the data moving method is a third offset method
- the moving the first source data of the N threads according to the first operation instruction includes:
- the first operation instruction further includes a second operation code, and the second operation code is used to indicate an operation type; the method further includes:
- each of the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation.
- the moved first data on the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the calculation operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the calculation operation.
- the first operation instruction further includes a destination operand, and the destination operand is used to indicate a storage location of a corresponding operation result of the first thread.
- the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads, and the method further includes:
- the second operation instruction includes the following parameters: the first operation code; the second source operand; and a third source operand, the third source operand indicating the second source data of the N threads, where the second source data of the N threads comes from the remaining N consecutive threads in the parallel computing processor;
- the method also includes:
- the third thread is the thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N - SRC1) and less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and less than (N - SRC1 % N).
- the embodiment of the present application also provides a multi-thread data processing device 2000, the multi-thread data processing device includes:
- the instruction acquisition module 2001 is configured to acquire a first operation instruction, the first operation instruction including the following parameters: a first operation code, used to indicate a data transfer mode between N threads, where N is an integer greater than or equal to 2; a first source operand, used to indicate the first source data of the N threads; and a second source operand, used to determine the thread offset corresponding to the data transfer mode.
- the processing module 2002 is configured to move the first source data of the N threads according to the first operation instruction, and obtain the moved first data on each of the N threads.
- the efficient cross-thread operation of the parallel computing processor is realized by a single instruction, which is simpler than a crossbar network and does not require frequent memory accesses, so that accelerated processing of applications with cross-thread operations in a parallel computing processor can be realized with lower hardware or signaling overhead.
- the data transfer method is the first transfer method
- the processing module 2002 is specifically configured to: move the first source data of the thread numbered I1 to the thread numbered i; wherein the N threads are numbered 0 to (N-1), i takes 0 to (N-1), and I1 is the remainder of (i + SRC1) divided by N; SRC1 represents the second source operand, and SRC1 is a positive integer.
- the data transfer method is the second transfer method
- the processing module 2002 is specifically configured to: move the first source data of the thread numbered I2 to the thread numbered i; wherein the N threads are numbered 0 to (N-1), and i takes 0 to (N-1); I2 is the XOR value of i and SRC1, SRC1 represents the second source operand, and SRC1 is a positive integer.
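As an illustrative sketch of the second transfer method (Python used only for demonstration; the function name is hypothetical), the XOR indexing produces a butterfly-style exchange:

```python
def second_transfer_method(src, SRC1):
    """Thread i receives the first source data of the thread numbered
    I2 = i XOR SRC1. The mapping is its own inverse, so applying it
    twice restores the original data."""
    N = len(src)  # assumed to be a power of two so i ^ SRC1 stays in range
    return [src[i ^ SRC1] for i in range(N)]

# SRC1 = 1 swaps neighbouring pairs; SRC1 = 4 swaps the two halves of 8.
print(second_transfer_method(list(range(8)), 1))  # [1, 0, 3, 2, 5, 4, 7, 6]
print(second_transfer_method(list(range(8)), 4))  # [4, 5, 6, 7, 0, 1, 2, 3]
```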
- the data moving method is a third offset method
- the processing module 2002 is specifically configured to: move the first source data of the thread numbered I3 to the thread numbered i; wherein the N threads are numbered 0 to (N-1), and i takes 0 to (N-1); the value of I3 is determined from i, SRC1 and n, where SRC1 represents the second source operand, SRC1 is a positive integer, and n is a positive integer that divides N.
- the first operation instruction further includes a second operation code, and the second operation code is used to indicate the operation type;
- the processing module 2002 is further configured to: for a first thread among the N threads, execute an operation corresponding to the operation type based on the first source data of the first thread and the moved first data on the first thread.
- each thread in the N threads is associated with a thread flag bit, and the thread flag bit is used to indicate whether the first source data of the thread participates in an operation.
- the moved first data on the first thread comes from a second thread among the N threads; the thread flag bit associated with the first thread indicates that the first source data of the first thread participates in the calculation operation, and the thread flag bit associated with the second thread indicates that the first source data of the second thread participates in the calculation operation.
- the first operation instruction further includes a destination operand, and the destination operand is used to indicate a storage location of a corresponding operation result of the first thread.
- the first source data of the N threads comes from N consecutive threads of a parallel computing processor, and the parallel computing processor includes 2N threads;
- the instruction acquisition module 2001 is further configured to obtain a second operation instruction, the second operation instruction including the following parameters: the first operation code; the second source operand; and a third source operand, the third source operand indicating the second source data of the N threads, where the second source data of the N threads comes from the remaining N consecutive threads in the parallel computing processor;
- the processing module 2002 is further configured to perform the second operation according to the The instruction moves the second source data of the N threads to obtain the moved second data on each of the N threads.
- the processing module 2002 is further configured to: exchange the moved first data on the third thread with the moved second data on the third thread; wherein the third thread is the thread numbered r among the N threads; if SRC1 is less than N, r is greater than or equal to (N - SRC1) and less than N; if SRC1 is greater than or equal to N, r is greater than or equal to 0 and less than (N - SRC1 % N).
- the communication device 2100 may be a chip or a chip system.
- the system-on-a-chip may be composed of chips, or may include chips and other discrete devices.
- the communication device 2100 may include at least one processor 2110, and the processor 2110 is coupled to a memory.
- the memory may be located within the device, the memory may be integrated with the processor, or the memory may be located outside the device.
- the communication device 2100 may further include at least one memory 2120 .
- the memory 2120 stores the computer programs, configuration information, instructions and/or data necessary for implementing any of the above embodiments; the processor 2110 may execute the computer program stored in the memory 2120 to complete the method in any of the above embodiments.
- the coupling in the embodiments of the present application is an indirect coupling or a communication connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.
- Processor 2110 may cooperate with memory 2120 .
- the specific connection medium between the processor 2110 and the memory 2120 is not limited in this embodiment of the present application.
- the communication device 2100 may further include a communication interface 2130, and the communication device 2100 may perform information exchange with other devices through the communication interface 2130.
- the communication interface 2130 may be a transceiver, a circuit, a bus, a module or other types of communication interfaces.
- the communication interface 2130 in the device 2100 may also be an input/output circuit, which can input information (i.e. receive information) and output information (i.e. send information); the processor may be an integrated processor, a microprocessor, an integrated circuit or a logic circuit, and the processor can determine the output information according to the input information.
- the communication interface 2130 , the processing module 2110 and the memory 2120 are connected to each other through a bus 2140 .
- the bus 2140 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
- the bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in FIG. 21, but this does not mean that there is only one bus or one type of bus.
- the processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application.
- a general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
- the memory may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM).
- a memory is, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
- the memory in the embodiment of the present application may also be a circuit or any other device capable of implementing a storage function, and is used for storing program instructions and/or data.
- an embodiment of the present application further provides a computer program, which, when the computer program is run on a computer, causes the computer to execute the above multi-threaded data processing method.
- the embodiments of the present application also provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a computer, the computer executes the method described in the above-mentioned method embodiments.
- the storage medium may be any available medium that can be accessed by a computer.
- computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
- the embodiment of the present application also provides a computer chip; the chip is connected to a memory, and the chip is used to read and execute the software program stored in the memory, so as to implement the multi-thread data processing method provided in the above method embodiments.
- an embodiment of the present application provides a chip system
- the chip system includes a processor, configured to support a computer device to implement the functions of the multi-threaded data processing method in the above method embodiments.
- the chip system further includes a memory, and the memory is used to store necessary programs and data of the computer device.
- the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
- the technical solutions provided by the embodiments of the present application may be fully or partially implemented by software, hardware, firmware or any combination thereof.
- when implemented using software, the technical solutions may be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- when the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
- the computer may be a general computer, a special computer, a computer network, a network device, a terminal device or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g. coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g. infrared, radio, microwave) means.
- the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
- the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (digital video disc, DVD)), or a semiconductor medium.
- the various embodiments may refer to each other; for example, the methods and/or terms between the method embodiments may refer to each other, the functions and/or terms between the apparatus embodiments may refer to each other, and the functions and/or terms between the apparatus embodiments and the method embodiments may refer to each other.
- these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, which implement the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.
Abstract
The present application provides a multi-thread data processing method and apparatus, which are used to solve the problem that cross-thread computation is complicated and incurs large overheads. The method comprises: acquiring a first operation instruction, the first operation instruction comprising the following parameters: a first operation code, used to indicate a data transfer mode between N threads, N being an integer greater than or equal to 2; a first source operand, used to indicate first source data of the N threads; and a second source operand, used to determine a thread offset corresponding to the data transfer mode; and transferring the first source data of the N threads according to the first operation instruction, so as to obtain the transferred first data on each of the N threads.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202180099704.7A CN117561501A (zh) | 2021-06-22 | 2021-06-22 | 一种多线程数据处理方法及装置 |
| PCT/CN2021/101533 WO2022266842A1 (fr) | 2021-06-22 | 2021-06-22 | Procédé et appareil de traitement de données multi-fil |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2021/101533 WO2022266842A1 (fr) | 2021-06-22 | 2021-06-22 | Procédé et appareil de traitement de données multi-fil |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022266842A1 true WO2022266842A1 (fr) | 2022-12-29 |
Family
ID=84543861
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/101533 Ceased WO2022266842A1 (fr) | 2021-06-22 | 2021-06-22 | Procédé et appareil de traitement de données multi-fil |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN117561501A (fr) |
| WO (1) | WO2022266842A1 (fr) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117389731A (zh) * | 2023-10-20 | 2024-01-12 | 上海芯高峰微电子有限公司 | 数据处理方法和装置、芯片、设备及存储介质 |
| CN119697069A (zh) * | 2025-02-21 | 2025-03-25 | 山东浪潮科学研究院有限公司 | 一种通信错误检测方法、装置、设备及介质 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120723467B (zh) * | 2025-08-15 | 2025-11-11 | 上海壁仞科技股份有限公司 | 线程束内规约计算的优化方法、装置、计算机设备和可读存储介质 |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105094749A (zh) * | 2009-12-22 | 2015-11-25 | 英特尔公司 | Simd向量的同步化 |
| CN105302749A (zh) * | 2015-10-29 | 2016-02-03 | 中国人民解放军国防科学技术大学 | Gpdsp中面向单指令多线程模式的dma传输方法 |
| US20180157598A1 (en) * | 2016-12-05 | 2018-06-07 | Intel Corporation | Apparatuses, methods, and systems to share translation lookaside buffer entries |
| US10761741B1 (en) * | 2016-04-07 | 2020-09-01 | Beijing Baidu Netcome Science and Technology Co., Ltd. | Method and system for managing and sharing data using smart pointers |
Non-Patent Citations (1)
| Title |
|---|
| CHEN YILE LELE: "Hystrix of Spring Cloud passes data across threads", CSDN BLOG, CSDN, CN, CN, XP009542253, Retrieved from the Internet <URL:https://blog.csdn.net/myle69/article/details/83512576> * |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117389731A (zh) * | 2023-10-20 | 2024-01-12 | 上海芯高峰微电子有限公司 | 数据处理方法和装置、芯片、设备及存储介质 |
| CN117389731B (zh) * | 2023-10-20 | 2024-04-02 | 上海芯高峰微电子有限公司 | 数据处理方法和装置、芯片、设备及存储介质 |
| CN119697069A (zh) * | 2025-02-21 | 2025-03-25 | 山东浪潮科学研究院有限公司 | 一种通信错误检测方法、装置、设备及介质 |
| CN119697069B (zh) * | 2025-02-21 | 2025-05-09 | 山东浪潮科学研究院有限公司 | 一种通信错误检测方法、装置、设备及介质 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN117561501A (zh) | 2024-02-13 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21946344 Country of ref document: EP Kind code of ref document: A1 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 202180099704.7 Country of ref document: CN |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 21946344 Country of ref document: EP Kind code of ref document: A1 |