WO2024250758A1 - Procédé de traitement de données pour données complexes et dispositif associé - Google Patents
Procédé de traitement de données pour données complexes et dispositif associé Download PDFInfo
- Publication number
- WO2024250758A1 WO2024250758A1 PCT/CN2024/079961 CN2024079961W WO2024250758A1 WO 2024250758 A1 WO2024250758 A1 WO 2024250758A1 CN 2024079961 W CN2024079961 W CN 2024079961W WO 2024250758 A1 WO2024250758 A1 WO 2024250758A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- memory
- data
- complex
- storage
- substructure
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/4806—Computations with complex numbers
Definitions
- the present application relates to the field of computer technology, and in particular to a complex data processing method and related equipment.
- FFT Fast Fourier Transform
- the present application provides a complex data processing method, which is applied to an electronic device that performs fast Fourier transform operations, the method comprising: obtaining a first storage instruction; in response to the first storage instruction, storing a first data sequence to a memory according to a first structural format, wherein the first data sequence includes two or more complex numbers related to the fast Fourier operation;
- the first structural format includes at least one storage structure, the storage structure corresponds to K adjacent complex numbers in the data sequence to be stored, the storage structure includes a first substructure and a second substructure, the first substructure corresponds to K real parts of the adjacent K complex numbers, the second substructure corresponds to K imaginary parts of the adjacent K complex numbers, the K real parts in the first substructure correspond to a continuous address space on the memory, the K imaginary parts in the second substructure correspond to a continuous address space on the memory, and K is an integer greater than or equal to 2.
- the first data sequence (i.e., the complex data sequence to be stored) is stored, that is, for the adjacent K complex numbers in the first data sequence, the K real parts of the K complex numbers are packed together, and the K imaginary parts of the K complex numbers are packed together, and the packed K real parts are continuously present in the memory (memory), and the packed K imaginary parts are continuously present in the memory.
- the storage efficiency of the data can be improved, and the reading efficiency of the data can be improved by reading the packed K real parts (imaginary parts) together.
- the two or more complex numbers included in the first data sequence can be derived from one or more of the following data: the input data required for the FFT operation (calculation), the intermediate data output during the FFT operation, the butterfly factor matrix, and the output data finally output after the FFT operation, which can improve the processing efficiency of the complex data involved in the FFT operation, especially the reading and writing efficiency of the complex numbers involved in the FFT operation.
- the parts of the data sequence corresponding to each storage structure are different. That is, all complex numbers in the complex data sequence to be stored are stored using the first structural format to improve the read and write efficiency of the complex data sequence.
- the first structural format includes N first substructures and N second substructures, and the first substructures and second substructures in the first structural format are arranged alternately.
- the first substructure and the second substructure are arranged alternately, that is, the first substructure and the second structure belonging to the same storage structure are adjacent, the first substructure is first and the second substructure is later, or the second substructure is first and the first substructure is later. Storing complex numbers in the first structural format can improve the read and write efficiency of data.
- storing a first data sequence in a memory according to a first structure format includes: determining K adjacent complex numbers in the first data sequence corresponding to the storage structure; based on the K adjacent complex numbers in the first data sequence, storing K real parts corresponding to a first substructure of the storage structure in a continuous address space on the memory; and based on the K adjacent complex numbers in the first data sequence, storing K imaginary parts corresponding to a second substructure of the storage structure in another continuous address space on the memory.
- the K complex numbers exist continuously in the memory.
- the K real parts of the K complex numbers are used as the first substructure of the storage structure, and the first substructure indicates that the corresponding K real parts exist continuously in the memory.
- the K imaginary parts of the K complex numbers are used as the second substructure of the storage structure, and the first substructure indicates that the corresponding K imaginary parts exist continuously in the memory.
- storing data by the first substructure and the second substructure can improve the reading and writing efficiency of data.
- the address space in the memory and the address space corresponding to the K imaginary parts are adjacent or non-adjacent in the address space.
- storing the first data sequence to the memory according to the first structure format includes: determining K adjacent complex numbers in the first data sequence corresponding to each storage structure and a continuous first address space on the memory corresponding to each storage structure; for each storage structure, storing K real parts of the K complex numbers corresponding to the storage structure to a continuous sub-address space in the corresponding first address space, and storing K imaginary parts of the K complex numbers corresponding to the storage structure to another continuous sub-address space in the corresponding first address space.
- the first address space is used to store the complex numbers corresponding to the storage structure.
- Each storage structure has its corresponding first address space, and the first address space corresponding to each storage structure is used to store its corresponding K complex numbers.
- the storage structure indicates that the K complex numbers corresponding to it are continuously present in the first address space corresponding to it, specifically, the K real parts of the K complex numbers are continuously present in a continuous sub-address space in the first address space, and the K imaginary parts of the K complex numbers are continuously present in another continuous sub-address space in the first address space. Storing complex numbers in the first structure format can improve the reading and writing efficiency of data.
- the size of K is set according to the length of the register of the electronic device and the bit width of the data to be stored, so as to ensure that the K real parts or K imaginary parts corresponding to a substructure can be stored in the same register, thereby improving the efficiency of data reading and writing.
- the complex numbers related to the fast Fourier operation come from one or more of the following data: input data, intermediate data in the operation process, output data, and butterfly factor matrix.
- the method when the complex numbers related to the fast Fourier operation come from the butterfly factor matrix, and the butterfly factor matrix is L rows and M columns, the method also includes: dividing the adjacent K butterfly factors in each column of the butterfly factor matrix into a first data sequence, and each data sequence obtained corresponds to a different part of the butterfly factor matrix, and the number of the first data sequence obtained is M*L/K, L is an integer greater than 2, and M is an integer greater than 1.
- L is an integer greater than 2
- M is an integer greater than 1.
- the method further includes: obtaining a first read instruction; in response to the first read instruction, storing the K real parts in the memory into a register of the electronic device, and loading the K imaginary parts in the memory into another register of the electronic device.
- the K real parts continuously stored in the memory can be subsequently loaded into a register based on the first read instruction (such as the LD1 instruction), and the K imaginary parts continuously stored in the memory can be stored into a register based on another LD1 instruction, thereby improving the data reading efficiency.
- the present application provides an electronic device, which includes a memory, a register, and at least one processor, wherein the memory is used to store instructions and complex numbers, the register is used to store complex numbers, and the processor is used to execute instructions to implement any of the above methods.
- the present application provides a computer program product, which includes computer-readable instructions, and implements any of the above methods when the computer-readable instructions are executed by one or more processors.
- FIG5 is a schematic diagram of a vectorized addition operation provided in an embodiment of the present application.
- FIG13A is a schematic diagram of a butterfly factor matrix provided in an embodiment of the present application.
- FIG. 13B is a schematic diagram of a storage of a butterfly factor matrix provided in an embodiment of the present application.
- FIG. 13C is another storage diagram of the butterfly factor matrix provided in an embodiment of the present application.
- FIG. 16 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
- Fourier transform is a tool for converting signals between time domain and frequency domain. It is widely used in different fields such as physics, medicine, wireless communication and integrated circuits. Exemplarily, the frequency components in the one-dimensional signal shown in Figure 1A are extracted using Fourier transform, and the extracted main frequency components are shown in Figure 1B. For another example, after the image signal of Figure 2A is subjected to a two-dimensional Fourier transform, the frequency domain result obtained is shown in Figure 2B. From the frequency domain results shown in Figure 2B, it can be seen that Fourier transform can realize functions such as filtering and noise reduction in the frequency domain.
- Discrete Fourier Transform DFT is a special form in which the input signal is a discrete value, and its mathematical expression is as follows:
- Cooley-Tukey algorithm based on the divide-and-conquer strategy, which successfully reduces the computational complexity to O(N logN).
- the general form of the Cooley-Tukey algorithm can be expressed as follows:
- N is the signal length
- N N 1 N 2
- n N 2 n 1 +n 2 , 0 ⁇ n 1 ⁇ N 1 -1, 0 ⁇ n 2 ⁇ N 2 -1
- k N 1 k 2 +k 1 , 0 ⁇ k 1 ⁇ N 1 -1, 0 ⁇ k 2 ⁇ N 2 -1.
- Fast Fourier transform is widely used in various fields (such as signal processing). It is one of the most important algorithms in the field of high performance computing (HPC), but its computational efficiency is low and difficult to optimize.
- HPC high performance computing
- an 8-point radix-2 FFT calculation process is introduced as an example, in which the main operations include complex multiplication and addition operations and data handling.
- the problems involved in the calculation of fast Fourier transform can be divided into two categories: compute-bound (computational bottleneck area) and memory-bound (bandwidth bottleneck area). The two areas of compute-bound and memory-bound are introduced in detail through the roofline model shown in Figure 4.
- the difficulty of FFT optimization lies in how to efficiently transfer data between the computing module and the storage module, that is, how to efficiently read and store (write) data.
- the current mainstream research direction is to use single instruction multiple data technology to implement access and calculation operations with fewer instructions to reduce running delays.
- the vectorized addition operation under the ARM NEON (SIMD instruction set under the ARM platform) extension is introduced by way of example.
- the corresponding elements in two vectors are added by a single instruction vadd (vectorized addition) and the result is stored in the target vector, which can realize four integer additions.
- An instruction operates on multiple data at the same time in an instruction-level (SIMD) manner.
- the number of data operated at the same time is determined by the length of the vector register and the data type. If the length of the NEON SIMD register is 128 bits and the operand is a 32-bit floating point number, the SIMD instruction can operate on 4 numbers at the same time.
- the alternating manner of the real and imaginary parts is, for example, to first store the real part of complex number A and then the imaginary part of complex number A in a continuous storage block of the memory, then store the real part of complex number B and finally store the imaginary part of complex number B.
- the storage of complex numbers in the FFT calculation library is also in the format of alternating real and imaginary parts.
- the complex number storage used by the complex FFT transformation represented by FFTW is in the format of alternating real and imaginary parts, and its complex number storage format is defined as follows:
- fftw_complex consists of two doubles, where the [0] element stores the real part of the complex number and the [1] element stores the imaginary part of the complex number.
- typedef double fftw_complex[2] arranges the two double elements in memory in order, with the real part in front and the imaginary part in the back.
- the input data and output data in the FFT operation are stored in the format of the fftw_complex[2] array. For this alternating storage of the real and imaginary parts, there are two read and write implementation methods under the NEON instruction set.
- the first method uses the gather-load instruction and scatter-store instruction in the NEON instruction set for reading and writing.
- the gather-load instruction includes the LDx instruction
- the scatter-store instruction includes the STx instruction.
- the x in the LDx instruction and the STx instruction can be any number from 1 to 4.
- LDx/STx can load independent vectors from the memory data of the structure type.
- the LD1 instruction loads data from the memory to a vector register.
- the ST1 instruction stores the data in a vector register into the memory.
- the LD2 instruction loads data from the memory to two vector registers.
- the ST2 instruction stores the data in two vector registers into the memory.
- the LD2 instruction can load 128 bits of data from the memory at one time and save it to two vector registers, where the data with the memory address relative to address 0 is loaded into vector register 0, and the data with the memory address relative to address 1 is loaded into vector register 1.
- the LD1 instruction does not have a deinterleaving function, and the LD1 instruction can be used to process non-interleaved data arrays.
- LD2, LD3 and LD4 all have a deinterleaving function, the difference being that the corresponding numbers of registers are different.
- the real and imaginary parts of complex numbers are stored alternately in the memory, i.e., [real part r0, imaginary part i0, real part r1, imaginary part i1, real part r2, imaginary part i2, ...].
- the real part r0 and imaginary part i0 of complex number A exist continuously in the memory (i.e., the real part r0 and imaginary part i0 are continuously located in the memory).
- the address space corresponding to the real part r0 and imaginary part i0 of complex number A is continuous.
- the real part r1 of complex number A and the real part r1 of complex number B do not exist continuously in the memory (i.e., the real part r0 and the real part r1 are non-continuously located in the memory).
- the address space corresponding to the real part r0 of complex number A and the real part r1 of complex number B is not continuous.
- the storage block storing the real part r0 is continuous with the storage block storing the imaginary part i0
- the storage block storing the real part r0 is not continuous with the storage block storing the real part r1
- the storage block storing the real part r0 and the storage block storing the real part r1 are separated by a storage block storing the imaginary part i0.
- the real part r0 is stored in the V0 register (register) through the LD2 instruction
- the imaginary part i0 is stored in the V1 register
- the real part r1 is stored in the V0 register
- the imaginary part i1 is stored in the V1 register.
- the second method is to read and write based on continuous read and write instructions and data exchange instructions in registers.
- the loading method indicated by the LD1 instruction is continuous loading or sequential reading.
- the data exchange instructions in the register include ZIP1 instruction, ZIP2 instruction, UZIP1 instruction and UZIP2 instruction, among which ZIP1 instruction and ZIP2 instruction indicate interleaving the elements of two vectors (vector compression), and UZIP1 instruction and UZIP2 instruction indicate reverse interleaving the elements of two vectors (vector decompression).
- the complex numbers in FIG7 and FIG6 are stored alternately in memory with real and imaginary parts.
- the difference between FIG7 and FIG6 is that in FIG7, when complex numbers A and B are read from memory to registers, the real part r0 and imaginary part i0 that are continuously in memory are stored in V0 register through an LD1 instruction, and the real part r1 and imaginary part i1 that are continuously in memory are stored in V1 register through another LD1 instruction. And through ZIP1 instruction and ZIP2 instruction, the real part r0 in V0 register and the real part r1 in V1 register are stored in V2 register, that is, the real part r0 and the real part r1 are packed.
- the imaginary part i0 in V0 register and the imaginary part i1 in V1 register are stored in V3 register, that is, the imaginary part i0 and the imaginary part i1 are packed.
- the real part r0 in register V2 is stored in register V0
- the real part r1 is stored in register V1
- the imaginary part i0 in register V3 is stored in register V0
- the imaginary part i1 is stored in register V1 through UZIP2 and UZIP1 instructions.
- the real part r0 and imaginary part i0 in register V0 are stored continuously to memory through an ST1 instruction
- the real part r1 and imaginary part i1 in register V1 are stored continuously to memory through another ST1 instruction.
- the above two reading and writing methods have low reading and writing efficiency.
- the first method involves reading and writing data across strides, and the second method involves data exchange operations within registers.
- the complex number is split into real and imaginary parts, and the real and imaginary parts are stored separately. That is, the real and imaginary parts of the complex number are stored separately.
- the split format is introduced in the guru interface:
- ri and ii represent the real and imaginary arrays of the input data, respectively
- ro and io represent the real and imaginary arrays of the output data, respectively.
- the LD1 instruction and ST1 instruction can be used to perform continuous read and write operations on each split array (real array and imaginary array).
- a single complex array is split into two arrays, the real part array and the imaginary part array, which has a high splitting cost. At the same time, it causes a significant increase in the number of function parameters, reducing the robustness of the program. Since a complex array is split into two arrays of real and imaginary parts, two independent pointers are required to access them and there are problems such as memory alignment. The memory overhead of the real and imaginary arrays is greater than that of a complex array, which will aggravate the memory bound problem in large-size FFT operations. In addition, because the two arrays of real and imaginary parts are located in different locations in the memory, different cache lines may need to be accessed simultaneously during the FFT calculation process, resulting in additional read and write overhead between the cache and registers.
- the embodiment of the present application provides a complex data processing method and related equipment, which are applied to complex data processing, especially the storage and reading of complex data.
- the embodiment of the present application proposes a new structural format (i.e., the first structural format below), the first structural format includes at least one storage structure, the storage structure corresponds to K adjacent complex numbers in a complex data sequence, the storage structure includes a first substructure and a second substructure, the first substructure corresponds to K real parts of the adjacent K complex numbers, the second substructure corresponds to K imaginary parts of the adjacent K complex numbers, the first substructure indicates that the address space of the corresponding K real parts on the memory is continuous, and the second substructure indicates that the address space of the corresponding K imaginary parts on the memory is continuous, and K is an integer greater than or equal to 2.
- the FFT calculation library 800 includes an FFT plan module 801 and an FFT execute module 802.
- the input of the FFT plan module 801 is the input data length, and the output is three kinds of auxiliary data.
- the three kinds of auxiliary data include the total number of decomposition levels (n levels as shown in FIG8 ), the radix size of each level, and the butterfly factor required for each level.
- the three kinds of auxiliary data are used by the FFT execution module 802.
- the input data of the FFT execution module 802 is the complex array to be processed, and the intermediate data is generated during the operation process.
- the output data is the complex array after FFT transformation.
- the internal execution process of the FFT execution module 802 depends on the auxiliary data generated by the FFT plan module 801.
- the FFT plan module 801 can be an offline module, and the FFT execution module 802 can be an online execution module.
- the complex data processing provided by the present application is not limited to the FFTE library, and the present application does not make specific limitations on this.
- the complex data processing provided by the present application is of general significance for the construction of the FFT calculation library.
- the first structural format provided in the embodiment of the present application can be used to store complex numbers related to fast Fourier operations, that is, to store complex numbers involved in the FFT operation process, including using the first structural format to store the input data, intermediate data, output data of the FFT execution module 802, and the butterfly factors required in the FFT operation process.
- the following describes a method for processing complex data in an embodiment of the present application, which can be applied to an electronic device that performs a fast Fourier transform operation, wherein the fast Fourier transform operation includes any one or more of the above operations, or other complex operations, which are not specifically limited in the present application.
- the method can be executed by the electronic device or by a processor of the electronic device. The following description is given by taking the processor execution as an example.
- the processor After the processor executes each calculation stage, the processor obtains a storage instruction input by the user, and the storage instruction instructs that the final calculation result (i.e., output data) after performing the FFT operation be stored in the memory in a first structural format. Then the storage instruction input by the user is the first storage instruction, and the output data is the first data sequence to be stored.
- the storage instruction input by the user is the first storage instruction
- the output data is the first data sequence to be stored.
- Step S102 the processor stores the first data sequence to the memory according to the first structure format in response to the first storage instruction, wherein the first structure format includes at least one storage structure, the storage structure corresponds to K adjacent complex numbers in the data sequence to be stored, the storage structure includes a first substructure and a second substructure, the first substructure corresponds to K real parts of the adjacent K complex numbers, the second substructure corresponds to K imaginary parts of the adjacent K complex numbers, the K real parts in the first substructure correspond to the continuous address space on the memory, and the K imaginary parts in the second substructure correspond to the continuous address space on the memory.
- a continuous address space on the memory where K is an integer greater than or equal to 2.
- the size of K is set according to the length of the register of the electronic device. In some embodiments, the size of K is set according to the register of the electronic device and the length of the data type stored by the register (i.e. the bit width of the data to be stored). If the real part and the imaginary part stored in the register are both 32 bits, and the register is 128 bits, K can be set to any number from 2 to 4, that is, the setting of K can ensure that the real part or imaginary part corresponding to a substructure can be stored in the same register. In some embodiments, while ensuring that the setting of K can ensure that the real part or imaginary part corresponding to a substructure can be stored in the same register, the size of K can be configured according to actual conditions (such as user settings).
- the setting of K makes one register correspond to one substructure, that is, one register just stores the real part (or imaginary part) corresponding to a full substructure, such as the above example, preferably setting K to 4, then one register just stores four real parts (or imaginary parts) of a substructure.
- the first structure format includes storage structure a, storage structure b and storage structure c
- the first structure format is: [the first substructure of storage structure a, the second substructure of storage structure a, the first substructure of storage structure b, the second substructure of storage structure b, the first substructure of storage structure c, the second substructure of storage structure c], which is specifically expanded to: [the real part of complex number A, the real part of complex number B, the imaginary part of complex number A, the imaginary part of complex number B, the real part of complex number C, the real part of complex number D, the imaginary part of complex number C, the imaginary part of complex number D, the real part of complex number E, the real part of complex number F , the imaginary part of complex number E, the imaginary part of complex number F], or the first structure format is: [the second substructure of storage structure a, the first substructure of storage structure a, the second substructure of storage structure b, the first substructure of storage structure b, the second substructure of storage structure c
- the processor stores the first data sequence to the memory according to the first structure format, including: the processor determines the K adjacent complex numbers in the first data sequence corresponding to the storage structure, the processor stores the K real parts corresponding to the first substructure of the storage structure to a continuous address space on the memory according to the K adjacent complex numbers in the first data sequence, and the processor stores the K imaginary parts corresponding to the second substructure of the storage structure to another continuous address space on the memory according to the K adjacent complex numbers in the first data sequence.
- the K real parts corresponding to the first substructure of the storage structure are continuously stored in the memory
- the second substructure of the storage structure is stored in the memory.
- the K imaginary parts corresponding to the structure are stored continuously in the memory.
- the K complex numbers corresponding to the storage structure are stored in a continuous address space in the memory, so that the K complex numbers corresponding to the storage structure exist continuously in the memory.
- the K complex numbers are continuously located in the memory according to the first substructure and the second substructure of the storage structure, that is, the K real parts of the K complex numbers are continuously located in a storage address space of the memory according to the first substructure, and the K imaginary parts of the K complex numbers are continuously located in another storage address space of the memory according to the second substructure.
- the processor when the number N of storage structures is greater than or equal to 2, stores a first data sequence to a memory according to a first structure format, including: determining K adjacent complex numbers in the first data sequence corresponding to each storage structure and a continuous first address space on the memory corresponding to each storage structure; for each storage structure, storing the K real parts of the K complex numbers corresponding to the storage structure in a continuous sub-address space in the corresponding first address space, and storing the K imaginary parts of the K complex numbers corresponding to the storage structure in another continuous sub-address space in the corresponding first address space.
- the number of storage structures in the corresponding first structure format is determined to be N according to its length and K, and the first data sequence is divided into N parts, and each storage structure corresponds to a part of the sequence. If the first data sequence includes 10 complex numbers and K is 2, it can be determined that N is 5, that is, the first data sequence is divided into 5 partial sequences according to K.
- a corresponding first address space is allocated to each storage structure, that is, a corresponding first address space is allocated to each of the N partial sequences, and N first address spaces are allocated to the first data sequence.
- the N first address spaces may be adjacent or non-adjacent in the address space, and the sub-address space storing the real part and the sub-address space storing the imaginary part in the same first address space may be adjacent or non-adjacent in the address space.
- the first data sequence includes complex numbers A to F arranged in sequence
- the first structure format is: [the first substructure of storage structure a, the second substructure of storage structure a, the first substructure of storage structure b, the second substructure of storage structure b, the first substructure of storage structure c, the second substructure of storage structure c]
- the processor determines that the portion of the first data sequence corresponding to storage structure a includes complex numbers A and B, and the continuous first address space on the memory allocated to storage structure a is G1.
- the processor determines that the portion of the first data sequence corresponding to storage structure b includes complex numbers C and D, and the continuous first address space on the memory allocated to storage structure b is G2.
- the processor determines that the portion of the first data sequence corresponding to storage structure c includes complex numbers E and F, and the continuous first address space on the memory allocated to storage structure c is G3. According to the substructures of each storage structure, K complex numbers are continuously located in the allocated first address space. Specifically, the real part of complex number A and the real part of complex number B are used as the real part corresponding to the first substructure of storage structure a, and the real part of complex number A and the real part of complex number B are continuously located in a continuous subaddress space G11 of the first address space G1.
- the imaginary part of complex number A and the imaginary part of complex number B are used as the imaginary part corresponding to the second substructure of storage structure a, and the imaginary part of complex number A and the imaginary part of complex number B are continuously located in a continuous subaddress space G12 of the first address space G1.
- the real part of complex number C and the real part of complex number D are continuously located in a continuous subaddress space G21 of the first address space G2, and the imaginary part of complex number D and the imaginary part of complex number D are continuously located in a continuous subaddress space G22 of the first address space G2. The rest is analogous and will not be repeated here.
- the complex data processing method of the present application when the complex source butterfly factor matrix related to the fast Fourier operation, that is, the above-mentioned first data sequence belongs to the butterfly factor matrix, and the butterfly factor matrix has L rows and M columns, the complex data processing method of the present application also includes: dividing the adjacent K butterfly factors in each column of the butterfly factor matrix into a first data sequence, and the parts of the butterfly factor matrix corresponding to each obtained data sequence are different from each other, and the number of obtained first data sequences is M*L/K, L is an integer greater than 2, and M is an integer greater than 1.
- the processor obtains a first read instruction, and in response to the first read instruction, loads the K real parts in the memory into a register of the electronic device, and records the K imaginary parts in the memory into another register of the electronic device.
- the first storage instruction can be implemented as a SIMD instruction, such as an ST1 instruction
- the first read instruction can be implemented as a SIMD instruction, such as an LD1 instruction.
- the K real parts of the adjacent K complex numbers in the data sequence to be stored are stored in a continuous address space in the memory through the ST1 instruction
- the K imaginary parts of the adjacent K complex numbers in the data sequence to be stored are stored in another continuous address space in the memory through another ST1 instruction, thereby realizing efficient data storage.
- the K real parts continuously stored in the memory can be read into a register based on the LD1 instruction, and the K imaginary parts continuously stored in the memory can be read into another register based on another LD1 instruction, thereby improving the data reading efficiency.
- the one-dimensional array can be any of the input data, output data and intermediate data in the FFT operation process.
- K as 2
- the first structural format provided by the embodiment of the present application is exemplified.
- the first structural format is: [real part, real part, imaginary part, imaginary part, real part, real part, ...].
- the first structural format indicates that the two real parts corresponding to the two adjacent complex numbers in the first data sequence are packed together, followed by the two imaginary parts corresponding to the two adjacent complex numbers being packed together.
- the NEON instruction implementation corresponding to the data storage structure format of [real part, real part, imaginary part, imaginary part, real part, real part, ...] is shown in Figure 11.
- an LD1 instruction two consecutive real parts are taken from the memory and stored in the V0 register.
- another LD1 instruction two consecutive real parts are taken from the memory and stored in the V0 register.
- Two consecutive imaginary parts are taken out from the memory and stored in the V1 register.
- Two consecutive real parts in the V0 register are stored in a consecutive address space in the memory through one ST1 instruction, and two consecutive imaginary parts in the V1 register are stored in another consecutive address space in the memory through another ST1 instruction.
- the first structural format is: [real part, real part, real part, real part, imaginary part, imaginary part, imaginary part, imaginary part, ...].
- the first structural format indicates that the four real parts corresponding to the four adjacent complex numbers in the first data sequence are packed together, followed by the four imaginary parts corresponding to the four adjacent complex numbers being packed together.
- the NEON instruction implementation corresponding to the structural format of data storage of [real part, real part, real part, real part, imaginary part, imaginary part, imaginary part, ...] is shown in Figure 12.
- Four consecutive real parts are taken out from the memory and stored in the V0 register through an LD1 instruction, and four consecutive imaginary parts are taken out from the memory and stored in the V1 register through another LD1 instruction.
- the four consecutive real parts in the V0 register are stored in a continuous address space in the memory through an ST1 instruction
- the four consecutive imaginary parts in the V1 register are stored in another continuous address space in the memory through another ST1 instruction.
- data can be efficiently read using SIMD instructions such as the LD1 instruction, and data can be efficiently stored using SIMD instructions such as the ST1 instruction, completely avoiding the cross-stride reading and writing of data in the memory and the exchange of data between registers, and also avoiding the operation of splitting the complex number into two arrays.
- SIMD instructions such as the LD1 instruction
- SIMD instructions such as the ST1 instruction
- the two-dimensional array includes a butterfly factor array (matrix), that is, the complex number to be stored is a butterfly factor.
- matrix matrix
- the calculation process of the butterfly factor is introduced as an example.
- the radix in FFTE_NEON is generated according to the Stockham FFT algorithm. In Stockham FFT, the calculation process of the butterfly factor is as follows:
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Mathematics (AREA)
- Computing Systems (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Human Computer Interaction (AREA)
- Complex Calculations (AREA)
Abstract
La présente demande s'applique au domaine technique des ordinateurs. Un procédé de traitement de données en nombres complexes et un dispositif associé sont décrits, qui sont appliqués à un dispositif électronique pour effectuer une opération de transformation de Fourier rapide. Le procédé consiste à : acquérir une première instruction de stockage ; et en réponse à la première instruction de stockage, stocker une première séquence de données dans une mémoire selon un premier format de structure, la première séquence de données comprenant au moins deux nombres complexes liés à une opération de Fourier rapide, et le premier format de structure comprenant au moins une structure de stockage, qui correspond à K nombres complexes adjacents dans une séquence de données à stocker et comprend en outre une première sous-structure et une seconde sous-structure, la première sous-structure correspondant à K parties réelles des K nombres complexes adjacents, la seconde sous-structure correspondant à K parties imaginaires des K nombres complexes adjacents, les K parties réelles dans la première sous-structure correspondant à des espaces d'adresse consécutifs dans la mémoire, et les K parties imaginaires dans la seconde sous-structure correspondant à des espaces d'adresse consécutifs dans la mémoire. L'efficacité de lecture-écriture de données peut être améliorée au moyen du stockage de nombres complexes dans le premier format de structure.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310661606.9 | 2023-06-05 | ||
| CN202310661606.9A CN119088283A (zh) | 2023-06-05 | 2023-06-05 | 复数数据处理方法以及相关设备 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024250758A1 true WO2024250758A1 (fr) | 2024-12-12 |
Family
ID=93698097
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/079961 Pending WO2024250758A1 (fr) | 2023-06-05 | 2024-03-04 | Procédé de traitement de données pour données complexes et dispositif associé |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN119088283A (fr) |
| WO (1) | WO2024250758A1 (fr) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2000231552A (ja) * | 1999-02-08 | 2000-08-22 | Nec Corp | 高速フーリエ変換方法 |
| US20050198473A1 (en) * | 2003-12-09 | 2005-09-08 | Arm Limited | Multiplexing operations in SIMD processing |
| US20110040822A1 (en) * | 2009-08-17 | 2011-02-17 | International Business Machines Corporation | Complex Matrix Multiplication Operations with Data Pre-Conditioning in a High Performance Computing Architecture |
| CN103547328A (zh) * | 2012-05-22 | 2014-01-29 | 深圳市英威腾电气股份有限公司 | 谐波检测方法及相关装置 |
| CN115982092A (zh) * | 2021-10-15 | 2023-04-18 | 华为技术有限公司 | 一种存算一体电路、芯片系统及电子设备 |
-
2023
- 2023-06-05 CN CN202310661606.9A patent/CN119088283A/zh active Pending
-
2024
- 2024-03-04 WO PCT/CN2024/079961 patent/WO2024250758A1/fr active Pending
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2000231552A (ja) * | 1999-02-08 | 2000-08-22 | Nec Corp | 高速フーリエ変換方法 |
| US20050198473A1 (en) * | 2003-12-09 | 2005-09-08 | Arm Limited | Multiplexing operations in SIMD processing |
| US20110040822A1 (en) * | 2009-08-17 | 2011-02-17 | International Business Machines Corporation | Complex Matrix Multiplication Operations with Data Pre-Conditioning in a High Performance Computing Architecture |
| CN103547328A (zh) * | 2012-05-22 | 2014-01-29 | 深圳市英威腾电气股份有限公司 | 谐波检测方法及相关装置 |
| CN115982092A (zh) * | 2021-10-15 | 2023-04-18 | 华为技术有限公司 | 一种存算一体电路、芯片系统及电子设备 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119088283A (zh) | 2024-12-06 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US8412917B2 (en) | Data exchange and communication between execution units in a parallel processor | |
| US9110655B2 (en) | Performing a multiply-multiply-accumulate instruction | |
| US8984043B2 (en) | Multiplying and adding matrices | |
| RU2263947C2 (ru) | Целочисленное умножение высокого порядка с округлением и сдвигом в архитектуре с одним потоком команд и множеством потоков данных | |
| US7389404B2 (en) | Apparatus and method for matrix data processing | |
| US8239438B2 (en) | Method and apparatus for implementing a multiple operand vector floating point summation to scalar function | |
| CN110321525A (zh) | 用于稀疏-密集矩阵乘法的加速器 | |
| EP3623941A2 (fr) | Systèmes et procédés d'exécution d'instructions spécifiant des operations ternaires de pavé logique | |
| CN107077334A (zh) | 从多维阵列预取多维元素块的硬件装置和方法 | |
| CN107533460B (zh) | 紧缩有限冲激响应(fir)滤波处理器、方法、系统和指令 | |
| CN107729989A (zh) | 一种用于执行人工神经网络正向运算的装置及方法 | |
| CN107077327A (zh) | 用于可扩展宽操作数指令的系统和方法 | |
| EP3623940A2 (fr) | Systèmes et procédés d'exécution d'opérations horizontales de pavé | |
| CN104115115A (zh) | 用于多精度算术的simd整数乘法累加指令 | |
| US7836116B1 (en) | Fast fourier transforms and related transforms using cooperative thread arrays | |
| US7640284B1 (en) | Bit reversal methods for a parallel processor | |
| CN112434256B (zh) | 矩阵乘法器和处理器 | |
| CN112506468B (zh) | 支持高吞吐多精度乘法运算的risc-v通用处理器 | |
| US20090182990A1 (en) | Method and Apparatus for a Pipelined Multiple Operand Minimum and Maximum Function | |
| WO2021250392A1 (fr) | Instruction de tailles d'éléments mélangées | |
| KR20230109791A (ko) | 패킹된 데이터 정렬 플러스 계산 명령어, 프로세서,방법, 및 시스템 | |
| EP4462249A2 (fr) | Transposition de matrice et multiplication | |
| US9582473B1 (en) | Instruction set to enable efficient implementation of fixed point fast fourier transform (FFT) algorithms | |
| WO2024250758A1 (fr) | Procédé de traitement de données pour données complexes et dispositif associé | |
| CN112230993A (zh) | 数据处理方法及装置、电子设备 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 24818295 Country of ref document: EP Kind code of ref document: A1 |