CN119536819B - Instruction scheduling method and device, electronic equipment and medium

Info

Publication number
CN119536819B
CN119536819B (application CN202510088774.2A)
Authority
CN
China
Prior art keywords
instruction
target
unit
scheduling
tensor block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510088774.2A
Other languages
Chinese (zh)
Other versions
CN119536819A (en)
Inventor
Request not to publish name
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yizhu Technology Chengdu Co ltd
Suzhou Yizhu Intelligent Technology Co ltd
Original Assignee
Yizhu Technology Chengdu Co ltd
Suzhou Yizhu Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yizhu Technology Chengdu Co ltd, Suzhou Yizhu Intelligent Technology Co ltd filed Critical Yizhu Technology Chengdu Co ltd
Priority to CN202510088774.2A priority Critical patent/CN119536819B/en
Publication of CN119536819A publication Critical patent/CN119536819A/en
Application granted granted Critical
Publication of CN119536819B publication Critical patent/CN119536819B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

The disclosure provides an instruction scheduling method and device, electronic equipment and medium. The method comprises: obtaining a target instruction through a first instruction scheduling unit and identifying a target unit for executing the target instruction, wherein the target unit is one of an execution unit and a first functional unit; when the target unit is the first functional unit, determining a target tensor block corresponding to the target instruction and a target work group corresponding to the target tensor block, wherein the target work group is used for indicating a memory address range of the target tensor block; and scheduling the target instruction to the target work group through the first instruction scheduling unit and scheduling the target work group to the target unit, so that the target unit reads the target tensor block based on the memory address range and executes the target instruction on the target tensor block. Based on the instruction scheduling method of the disclosure, the degree of adaptation between the hardware layer and block-based GPU programming languages such as triton can be improved.

Description

Instruction scheduling method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a method and apparatus for scheduling instructions, an electronic device, and a medium.
Background
Conventional graphics processing units are often provided with a plurality of computing units, each computing unit is in turn provided with a plurality of execution units, and the data processing of the graphics processing unit is completed through the execution units. With the development of artificial intelligence technology, in order to accelerate the training and reasoning of neural network models, prior-art graphics processing units integrate in the computing unit a number of special hardware modules dedicated to specific data processing tasks, such as the Tensor Core and the Copy Engine; when certain specific tasks need to be performed, these special hardware modules can be invoked directly, and they support executing the corresponding data processing tasks with higher parallelism. Meanwhile, in order to reduce the difficulty of GPU programming, some block-based GPU programming languages exist in the prior art, such as the triton programming language, based on which a user can flexibly decompose the tensor data to be processed into a plurality of tensor blocks of a specified size and then distribute the tensor blocks to the computing units for execution.
In the prior art, the computing unit schedules instructions to the corresponding threads through the instruction scheduler in the execution unit, and then schedules the threads loaded with the instructions to the corresponding hardware modules for execution, so as to complete the corresponding data processing tasks. The execution unit performs instruction scheduling in units of thread bundles, that is, each time instruction scheduling is performed, a single instruction is scheduled to each thread in a single thread bundle, and the thread bundle is then scheduled to a hardware unit for execution. Although the upper-layer application decomposes the data processing task and the corresponding tensor data into the granularity specified by the user and distributes them to the computing units for execution, at the hardware layer the computing unit still executes the tasks issued by the upper-layer application with the thread bundle as the granularity, and the number of threads contained in a single thread bundle is always fixed. The granularity of instruction scheduling performed by the hardware layer is therefore often inconsistent with the task decomposition granularity specified by the user; the hardware layer frequently needs to schedule the same instruction to a plurality of thread bundles so that they cooperatively complete a single subtask allocated by the upper-layer application, and because the instruction scheduling and execution processes of different thread bundles are mutually independent, the user needs to perform additional programming to manage the operation of the thread bundles when writing the kernel. For example, when different hardware modules need to execute different instructions on different tensor blocks, the instruction scheduler needs to schedule multiple instructions to multiple thread bundles respectively, and the user needs to perform Warp specialization to indicate which instruction each thread bundle is used for executing, so that the execution unit can schedule the instructions to the corresponding thread bundles accurately. The adaptation between this instruction scheduling pattern and block-based GPU programming languages such as triton is poor.
Disclosure of Invention
The embodiments of the disclosure provide an instruction scheduling method, electronic equipment and a storage medium, which can improve the degree of adaptation between the hardware layer and block-based GPU programming languages such as triton.
According to an aspect of the present disclosure, an instruction scheduling method is presented, applied to a computing unit, the computing unit including a first instruction scheduling unit, at least one execution unit, and at least one first functional unit, the first functional unit including at least one of a replication engine, a tensor core, and a scalar processing unit, the method comprising:
acquiring a target instruction through the first instruction scheduling unit;
identifying a target unit for executing the target instruction, the target unit being one of the execution unit or the first functional unit;
if the target unit is the first functional unit, determining a target tensor block corresponding to the target instruction and a target work group corresponding to the target tensor block, the target work group being used to indicate a memory address range of the target tensor block;
scheduling the target instruction to the target work group through the first instruction scheduling unit, and scheduling the target work group to the target unit, so that the target unit reads the target tensor block based on the memory address range and executes the target instruction on the target tensor block.
Optionally, the first instruction scheduling unit is provided with a corresponding instruction buffer, the computing unit further includes a command processor, and before the target instruction is acquired by the first instruction scheduling unit, the method further includes:
Acquiring a target command, wherein the target command is used for indicating the computing unit to start a target kernel corresponding to the target command, and the target kernel is composed of a plurality of instructions;
Analyzing the target command through the command processor to determine the target kernel to be started, determining a plurality of instructions to be executed based on the target kernel, and writing the instructions to be executed into an instruction cache area of the first instruction scheduling unit;
the obtaining, by the first instruction scheduling unit, the target instruction includes:
and acquiring the target instruction from the corresponding instruction cache region through the first instruction scheduling unit.
Optionally, the target kernel at least includes a main function kernel portion, a scalar kernel portion and a vector kernel portion, the scalar kernel portion includes a plurality of preset scalar operation kernels, the vector kernel portion includes a plurality of preset vector operation kernels, and the main function kernel portion calls the scalar operation kernels and the vector operation kernels through a call instruction.
Optionally, the identifying a target unit for executing the target instruction includes:
Identifying an instruction type corresponding to the target instruction, wherein the instruction type is used for representing the type of a data processing task corresponding to the target instruction, and the execution unit and each first functional unit respectively correspond to different instruction types;
A target unit for executing the target instruction is determined based on the instruction type.
Optionally, before the identifying a target unit for executing the target instruction, the method further comprises:
creating a first instruction set, wherein the first instruction set comprises a plurality of preset instructions, and the target instruction is one of the preset instructions;
Dividing the preset instructions into a plurality of instruction subsets, wherein each instruction subset comprises at least one preset instruction, and preset instructions in a single instruction subset correspond to the same instruction type;
the identifying the instruction type corresponding to the target instruction comprises the following steps:
and identifying an instruction subset corresponding to the target instruction to determine the instruction type corresponding to the target instruction.
Optionally, the instruction type includes at least one of:
A tensor operation instruction indicating an instruction executed by the tensor core;
A scalar operation instruction that indicates an instruction executed by the scalar processing unit;
A vector operation instruction indicating an instruction executed by the execution unit;
and a data replication instruction, the data replication instruction being an instruction executed by the replication engine.
Optionally, before the determining the target tensor block corresponding to the target instruction and the target work group corresponding to the target tensor block, the method further includes:
acquiring work group creation parameters, wherein the work group creation parameters include a work group size and a work group number parameter, and the work group size represents the size, in each dimension, of the tensor block corresponding to a single work group;
creating a plurality of work groups based on the work group number parameter, allocating work group identifiers to the work groups, and dividing the tensor to be processed into a plurality of tensor blocks based on the work group size, wherein the work groups and the tensor blocks are in one-to-one correspondence.
Optionally, each of the execution units includes a second instruction scheduling unit, the method further comprising:
If the target unit is the execution unit, determining a target working group distributed to the target unit, and creating a plurality of threads corresponding to the target working group, wherein each thread corresponds to data in a tensor block corresponding to the target working group;
the first instruction scheduling unit sends the target instruction to a second instruction scheduling unit in the execution unit;
in the execution unit, organizing the created threads into a plurality of thread bundles, wherein each thread bundle includes a second number of threads;
and in the execution unit, the target instruction is respectively scheduled to threads in the thread bundles through the second instruction scheduling unit.
According to an aspect of the present disclosure, there is provided an instruction scheduling apparatus, the apparatus including:
the instruction acquisition unit is used for acquiring a target instruction through the first instruction scheduling unit;
A first identifying unit for identifying a target unit for executing the target instruction, the target unit being one of an executing unit or a first functional unit;
A target work group determining unit configured to determine, if the target unit is the first functional unit, a target tensor block corresponding to the target instruction and a target work group corresponding to the target tensor block, where the target work group is used to indicate a memory address range of the target tensor block;
And a first scheduling unit, configured to schedule the target instruction to the target work group through the first instruction scheduling unit, and schedule the target work group to the target unit, so that the target unit reads the target tensor block based on the memory address range and executes the target instruction on the target tensor block.
According to an aspect of the present disclosure, there is provided an electronic device comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling a connected communication between the processor and the memory, the program when executed by the processor implementing the instruction scheduling method as claimed in any one of the above.
According to an aspect of the present disclosure, there is provided a computer readable storage medium storing one or more programs executable by one or more processors to implement the instruction scheduling method as set forth in any one of the above.
According to one aspect of the disclosure, an instruction scheduling method, an electronic device and a storage medium are provided. A first instruction scheduling unit and at least one first functional unit for executing specific data processing tasks are integrated directly in the computing unit. When the computing unit needs to execute a data processing task, the target instruction to be executed is first obtained through the first instruction scheduling unit, and the target unit for executing the target instruction is identified; then, when the target unit is the first functional unit, the target tensor block required for executing the target instruction and the target work group corresponding to the target tensor block are determined, the target instruction is scheduled to the target work group, and the target work group loaded with the target instruction is scheduled to the target unit, so that the target unit can determine the address of the target tensor block in the memory based on the target work group, read the target tensor block accordingly, and execute the data processing task corresponding to the target instruction on the target tensor block. In this embodiment, when the target instruction is an instruction executed by a first functional unit such as the tensor core or the replication engine, the first instruction scheduling unit directly performs instruction scheduling in units of work groups, so that the target unit reads the target tensor block based on the memory address range indicated by the target work group and executes the target instruction on the target tensor block. The granularity at which the computing unit, as the underlying hardware, performs instruction scheduling is therefore consistent with the task decomposition granularity set by the user in the kernel function. Since the first instruction scheduling unit schedules the target instruction directly to the target work group instead of to threads or thread bundles, multiple threads and thread bundles do not need to be created, operations for managing thread bundles do not need to be set, and Warp specialization is not needed to indicate which instructions each thread bundle executes, so that the instruction scheduling mode of the computing unit is better adapted to kernels written in programming languages such as triton.
Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the disclosure. The objectives and other advantages of the disclosure will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosed embodiments and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain, without limitation, the disclosed embodiments.
FIG. 1 is an architectural diagram of a computing unit to which embodiments of the present disclosure are applied;
FIG. 2 is a main flow diagram of an instruction scheduling method of one embodiment of the present disclosure;
Fig. 3 is a flowchart of step S201 and its associated steps in fig. 2;
FIG. 4 is a schematic diagram of the structure of a target kernel of one embodiment of the present disclosure;
FIG. 5 is a sub-flowchart of step S202 of FIG. 2;
fig. 6 is a flowchart of step S501 and its associated steps in fig. 5;
FIG. 7 is a flow chart of creating a workgroup and dividing tensors into tensor blocks corresponding to the workgroup according to one embodiment of the present disclosure;
FIG. 8 is a flow chart of an instruction scheduling method of another embodiment of the present disclosure;
Fig. 9 is an architectural diagram of an instruction dispatcher of an embodiment of the present disclosure;
Fig. 10 is an architectural diagram of an electronic device of an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present disclosure.
Before proceeding to further detailed description of the disclosed embodiments, the terms and terms involved in the disclosed embodiments are described, which are applicable to the following explanation:
Threads (Thread) are the smallest unit of execution of a graphics processing unit when performing data processing tasks; each thread independently performs the same kind of processing on different data.
A work group (Work Group) is a subtask entity corresponding to a tensor block of a user-specified size in block-based programming languages such as triton. Specifically, in a programming language such as triton, when writing a kernel function the user may set meta-parameters such as block_size to divide the tensor to be processed into a plurality of tensor blocks, where block_size is used to indicate the number of elements contained in a single tensor block, and each tensor block corresponds to one work group. For example, if the tensor to be processed is a three-dimensional tensor with shape (32, 256, 128) and block_size is (2, 32, 128), the tensor to be processed is divided into 128 tensor blocks with shape (2, 32, 128), each tensor block contains 2×32×128 elements, and 128 work groups are created, each work group corresponding to one tensor block. Specifically, each work group may carry information for indicating the address of the corresponding tensor block in the memory, for example an address index or an address offset of the tensor block in the memory.
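The following is a minimal sketch, not taken from the patent, of the work-group bookkeeping just described; the function name and the assumption that every tensor dimension is evenly divisible by the block size are illustrative only.

```python
# Minimal illustrative sketch (not from the patent): how many work groups the
# example above produces and how many elements each tensor block covers.
from math import prod

def plan_work_groups(tensor_shape, block_size):
    # assumes every dimension of the tensor is evenly divisible by block_size
    grid = [s // b for s, b in zip(tensor_shape, block_size)]
    num_work_groups = prod(grid)          # one work group per tensor block
    elems_per_block = prod(block_size)    # elements covered by a single work group
    return num_work_groups, elems_per_block

# The example from the text: a (32, 256, 128) tensor with block_size (2, 32, 128)
print(plan_work_groups((32, 256, 128), (2, 32, 128)))   # (128, 8192) -> 128 work groups
```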
Triton is a programming language for writing kernel functions of a graphics processing unit. Unlike CUDA, under the triton language a user may more flexibly and actively decompose the tensors to be processed into tensor blocks of a specified size, thereby decomposing the data processing task to be performed into subtasks of a user-specified size, and allocate the respective tensor blocks to the respective computing units so as to allocate the respective subtasks to the computing units for execution.
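As an illustration of this block-based style, the following is a generic triton vector-add kernel; it does not appear in the patent, and the kernel and variable names are illustrative. BLOCK_SIZE is the user-chosen tensor block size, and each program instance handles one block.

```python
# Generic triton-style kernel (illustration only, not part of the patent):
# the user picks BLOCK_SIZE, and each program instance handles one tensor block.
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                    # which tensor block this instance owns
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Launch with one program instance per block, e.g.:
# grid = (triton.cdiv(n_elements, 1024),)
# add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
```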
Thread bundles (Warp) are the basic unit of data processing by an execution unit and typically include a fixed number of 32 threads, although in some graphics processing units the number of threads in a single thread bundle may be 16, 64, or 128. For any given graphics processing unit, however, the number of threads contained in a single thread bundle is fixed. A thread bundle executes in single-instruction multiple-thread (SIMT) mode, i.e., threads in the same thread bundle are used to execute the same instruction in parallel on different data.
Synchronization mechanism: a mechanism for managing multi-threaded operation. In a multithreading environment, a plurality of threads often execute independently, and one part of the threads may finish execution on a hardware unit before another part; in the instruction scheduling mode of the related art, the resources occupied by a thread that completes execution first are released in advance and used for executing the next instruction, which may cause conditions such as race conditions that are difficult to predict. Therefore, in a multithreading environment, the operation of multiple threads often needs to be managed through a synchronization mechanism. Specifically, a synchronization node is set for multiple threads; when some threads reach the synchronization node first, they suspend execution and wait for the other threads to reach the synchronization node, and only after all threads for which the synchronization mechanism is set have reached the synchronization node does execution of the subsequent processing continue. In the graphics processing unit of the related art, since instruction scheduling is performed by the instruction scheduler in the execution unit in units of thread bundles, when a data processing task is completed cooperatively by a plurality of thread bundles, the user needs, when programming, to set a synchronization mechanism both at the level of different threads within the same thread bundle and at the level of different thread bundles, and the synchronization mechanism becomes extremely complex. If the synchronization nodes are set improperly, some threads need to wait for a long time for other threads in the same thread bundle or for other thread bundles to reach the synchronization node, which substantially reduces computation efficiency; this greatly increases the difficulty of setting the synchronization mechanism.
A race condition (Race Condition) refers to multiple threads attempting to modify data at the same location at the same time. When the same data processing task is completed through multiple thread bundles, because each thread bundle cannot access the memory occupied by other thread bundles, the thread bundles need to cooperate with each other through shared memory. However, when instruction scheduling is performed in units of thread bundles, the process of scheduling instructions to different thread bundles is independent and the execution process of each thread bundle is independent, so one part of the thread bundles may finish before another part. The instruction scheduler may then schedule the next instruction to the thread bundles that have already completed the previous instruction; when executing the next instruction, those thread bundles may need to access the shared memory to modify data generated by the other thread bundles while executing the previous instruction, but at this time the other thread bundles have not actually finished executing the previous instruction, which may cause threads of the two groups of thread bundles to write data to the same location at the same time, thereby causing a race condition.
System architecture description of the application of embodiments of the present disclosure
Fig. 1 is a system architecture diagram of a computing unit to which the instruction scheduling method of an embodiment of the present disclosure is applied; the computing unit includes a first instruction scheduling unit 110, at least one execution unit 120, and at least one first functional unit.
The first functional units may include at least one of a tensor core 131, a replication engine 132, and a scalar processing unit 133, each for executing a particular type of instruction. It will be appreciated that execution unit 120, as a general purpose hardware unit, may itself execute a variety of different instructions, with the difference between execution unit 120 and replication engine 132, tensor core 131, scalar processing unit 133 being the efficiency in performing a particular type of task.
In this embodiment, the first functional unit integrated by the computing unit may include only one of the tensor core 131, the replication engine 132, or the scalar processing unit 133. For example, the computing unit integrates only the tensor core 131 as the first functional unit, and does not integrate the scalar processing unit 133 and the replication engine 132, and an instruction of data replication and scalar operation is executed by the execution unit 120. Similarly, the computing unit may integrate only the replication engine 132 as the first functional unit, without integrating the tensor core 131 and the scalar processing unit 133. Of course, the tensor core 131, the replication engine 132, and the scalar processing unit 133 may be integrated together in the computing unit as first functional units for executing different instructions, respectively. Furthermore, the computing unit may also integrate dedicated hardware modules for performing certain specific data processing tasks, in addition to the tensor core 131, the replication engine 132 and the scalar processing unit 133, which dedicated hardware modules also serve as the first functional unit, in particular, the first functional unit in the computing unit needs to be determined depending on the actual hardware architecture of the computing unit.
The first instruction scheduling unit 110 is configured to obtain a target instruction to be executed, and identify a target unit for executing the target instruction, where the target unit may be one of the execution unit 120 or the first functional unit.
The first instruction scheduling unit 110 is further configured to determine, when the target unit is the first functional unit, a target tensor block corresponding to the target instruction and a target work group corresponding to the target tensor block, and to schedule the target instruction into the target work group, where the target tensor block is the tensor block constituted by the data used for executing the target instruction. For example, if the target instruction is a matrix multiplication instruction, the target tensor blocks are the matrix blocks of the two matrices required for performing the matrix multiplication operation. Thus, when the target unit is special hardware other than the execution unit, instruction scheduling can be performed in units of work groups by the first instruction scheduling unit integrated outside the execution unit, without instruction scheduling being performed in units of thread bundles by the instruction scheduler in the execution unit.
Of course, referring to fig. 1, a local memory may be integrated in the computing unit, and the computing unit may be coupled to an external on-chip memory and an off-chip memory, where when the replication engine 132 is integrated in the computing unit, the replication engine 132 may be integrated in a position closer to the local memory, the on-chip memory, and the off-chip memory than the execution unit, so as to shorten a physical path length when the replication engine replicates data stored in one memory space to another memory space, and improve efficiency when the replication engine is used to execute a corresponding data replication instruction.
Integral implementation of instruction scheduling method of embodiments of the present disclosure
An embodiment of the present disclosure proposes an instruction scheduling method, applied to a computing unit as shown in fig. 1, referring to fig. 2, the instruction scheduling method includes:
Step S201, a target instruction is acquired through a first instruction scheduling unit;
Step S202, identifying a target unit for executing a target instruction, wherein the target unit is one of an execution unit or a first functional unit;
Step S203, when the target unit is the first functional unit, determining a target tensor block corresponding to the target instruction and a target work group corresponding to the target tensor block, wherein the target work group is used for indicating the memory address range of the target tensor block;
In step S204, the first instruction scheduling unit schedules the target instruction to the target work group and schedules the target work group to the target unit, so that the target unit reads the target tensor block based on the memory address range and executes the target instruction on the target tensor block.
In this embodiment, the Tensor Core (Tensor Core) is a dedicated hardware module in the computing unit dedicated to performing Tensor operations, and in some architectures, is also referred to as a Tensor engine (Tensor engine) for processing tensors in two or more dimensions. That is, for instructions executed by the tensor core, at least one of the data involved in the operation is operated on in the form of a two-dimensional or higher-dimensional tensor. In this embodiment, the two-dimensional and above dimensions refer to dimensions of data when used for performing operations, and not dimensions organized when data is stored. For example, when each matrix row of a matrix is normalized separately, although the matrix itself is two-dimensional, the actual operation is normalized separately for each matrix row, i.e. the normalization process is actually performed in the form of a one-dimensional vector, and accordingly, the instruction for normalizing each matrix row of the matrix is not an instruction executed by the tensor core, and for example, when two matrices are added, the elements at the same position in the two matrices are actually added separately, each addition operation actually involves only two elements at the same position in the two matrices, the process is actually performed in the form of a scalar of zero dimension, and accordingly, the instruction for summing the two matrices is not an instruction executed by the tensor core.
Specifically, in this embodiment, the tensor core may be used to execute at least a matrix multiply instruction and a matrix multiply-accumulate (matrix multiply accumulate, MMA) instruction. The tensor core may include a general matrix multiplication (General Matrix Multiplication, GEMM) unit for matrix-multiplying the input matrices and an accumulation buffer for accumulating the matrix multiplication results of the general matrix multiplication unit. For example, when executing a matrix multiply-accumulate instruction, the two matrices to be multiplied are divided into a plurality of matrix blocks according to a preset rule, the matrix blocks are then input into the tensor core in a certain order, the general matrix multiplication unit performs the matrix multiplication operation on the input matrix blocks, and the accumulation buffer accumulates, according to a certain period, the block multiplication results output by the general matrix multiplication unit, thereby completing the matrix multiply-accumulate operation.
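As a software-level illustration of this block-wise multiply-accumulate flow (this models the behavior only, not the patent's hardware; the function name and tile size are assumptions):

```python
# Software-level illustration (not the patent's hardware) of block-wise
# matrix multiply-accumulate: matrix blocks are multiplied one pair at a time
# and the partial products are summed in an accumulation buffer.
import numpy as np

def tiled_mma(A, B, tile=32):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and K % tile == 0
    acc = np.zeros((M, N), dtype=A.dtype)          # plays the role of the accumulation buffer
    for k in range(0, K, tile):
        # each iteration is one matrix-block multiplication on the GEMM unit
        acc += A[:, k:k + tile] @ B[k:k + tile, :]
    return acc
```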
The Copy Engine (Copy Engine) is a hardware module in the computing unit dedicated to data copying and handling. Graphics processing units often have three levels of memory, namely global memory, local memory, and registers, and the memory structure is relatively complex. The global memory is arranged outside the computing units and can be accessed by all the computing units, but the access speed of the global memory is slower than that of the local memory and the registers, the registers are arranged inside the executing units in the computing units and can only be accessed by threads in the corresponding executing units, and the access speed of the registers is faster than that of the global memory and the local memory. Because of this complex memory structure, when the gpu performs training and reasoning of the neural network model, in order to improve efficiency, a large number of data copying operations are performed, so that data required by the hardware modules in each computing unit when performing the corresponding computing task is copied into a memory space that can be accessed quickly in advance, for example, data required by the execution unit when performing the corresponding computing task is copied from the global memory into a register in the corresponding execution unit, and the copying engine is a hardware unit dedicated to performing such data copying instructions.
A scalar processing unit (Scalar Unit) is a dedicated hardware module in the graphics processing unit used to perform scalar operations, and may be used to perform basic arithmetic and logical operations on data: for example, relatively simple basic arithmetic instructions such as element-wise addition, subtraction, multiplication and division of two matrices, or taking the exponent or logarithm of each element in a single matrix, and basic logical instructions such as logical AND, logical OR, bitwise AND, bitwise OR, and exclusive OR. The scalar processing unit may be an arithmetic logic unit (arithmetic and logic unit, ALU).
The execution unit (execution unit) is a general data processing module in the computing unit, and the hardware structure of the execution unit itself is not optimized for certain specific data processing tasks, so that the efficiency of the execution unit in executing the specific tasks is lower than that of special hardware such as tensor cores, replication engines and the like. However, the execution unit, as a general-purpose data processing module, may execute most of the instructions in the neural network model, and for some instructions that are not suitable for execution by dedicated hardware, such as tensor cores, replication engines, etc., may be allocated to the execution unit for execution.
In step S201, the target instruction is an instruction that the computing unit needs to execute subsequently, which indicates that the computing unit needs to perform a data processing task, such as performing an arithmetic operation or a logical operation on the corresponding data, accessing a specific memory, or the like.
It can be understood that when the graphic processing unit performs training or reasoning tasks of the neural network model, each hardware module in the computing unit is driven by instructions, and the computing unit controls each executing unit or the first functional unit to execute data processing tasks corresponding to the instructions according to the instructions so as to complete reasoning or computing tasks of the neural network model. For example, the graphics processing unit needs to convolve intermediate data generated by a certain neural network layer, where the graphics processing unit needs to distribute a corresponding task to each computing unit, so that each computing unit is used to convolve a part of data, and each computing unit obtains an instruction for instructing an executing unit or a certain first functional unit in the computing units to convolve a part of data allocated to the computing unit.
In particular, the first instruction scheduling unit often maintains an instruction queue that includes a plurality of instructions to be executed by the computing unit, and the target instruction is in fact one of the instructions in this queue. When the first instruction scheduling unit detects that the execution condition of an instruction in the instruction queue is met, the instruction is acquired from the instruction queue and used as the target instruction for the subsequent instruction scheduling operation. Specifically, the first instruction scheduling unit may monitor whether the hardware unit for executing each instruction in the instruction queue is idle and whether the data required for executing the instruction is ready; when the corresponding hardware unit is idle and the corresponding data is ready, the first instruction scheduling unit acquires the corresponding instruction from the instruction queue as the target instruction. For example, the first instruction scheduling unit may determine via a scoreboard mechanism whether the data required for each instruction in the instruction queue is ready and whether the hardware unit executing the instruction is idle.
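The following is a purely illustrative software analogy of that readiness check; the actual scoreboard is a hardware structure, and the function names here are hypothetical.

```python
# Purely illustrative software analogy of the readiness check (the real
# scoreboard is a hardware structure; names here are hypothetical).
def pick_target_instruction(instruction_queue, data_ready, unit_idle):
    for instr in instruction_queue:
        # schedule only when the operands are ready and the executing unit is idle
        if data_ready(instr) and unit_idle(instr):
            instruction_queue.remove(instr)
            return instr        # this becomes the target instruction
    return None                 # nothing eligible yet; keep monitoring
```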
In step S202, the target instruction indicates a data processing task that the computing unit subsequently needs to perform on the data, for example, indicates that the data needs to be subjected to convolution, matrix multiplication, normalization, or the like. In this embodiment, each first functional unit is used to perform a specific type of data processing task, for example, the tensor core is used to operate on tensors in two dimensions and above, and the replication engine is used to perform data replication operations. And data processing tasks other than the specific type of data processing tasks performed by the first functional units are performed by the execution units as general purpose processing units. The target instruction is used as a hardware module for driving the graphics processing unit, and is actually one of a plurality of preset instructions, and the preset instructions are respectively used for instructing the hardware in the graphics processing unit to execute different data processing tasks. Based on this, in this embodiment, a mapping relationship between each preset instruction and the execution unit or the first functional unit may be preset, and when the target instruction is acquired, the target unit for executing the target instruction may be identified based on the mapping relationship. Details of step S202 will be described later, and are not described here.
In step S203, the target tensor block is the tensor block that needs to be used to execute the data processing task corresponding to the target instruction, and the target work group is the work group corresponding to the target tensor block; the target work group itself carries the memory address range characterizing the target tensor block. Specifically, in the graphics processing unit, the elements in one tensor are stored at consecutive addresses, each address corresponds to one element of the tensor, and one tensor block often contains consecutive elements of the tensor, that is, each tensor block is in fact formed by a plurality of elements stored at a segment of consecutive addresses in the memory. Based on this, in the present embodiment, the target work group may indicate the memory address of the target tensor block by indicating the interval constituted by the segment of consecutive addresses corresponding to the plurality of elements constituting the tensor block.
Specifically, when the kernel function is written in a block-based programming language such as triton, the user can set the relevant parameters of the work groups through specific code statements in the kernel function to instruct the graphics processing unit how to divide the tensor to be processed into a plurality of tensor blocks, thereby dividing the computing task to be executed by the graphics processing unit into a plurality of subtasks and distributing them to the computing units for parallel execution. For example, in triton the size of a single tensor block, that is, the number of elements in a single tensor block, can be set through parameters such as block_size and grid_size. When the graphics processing unit starts a kernel function written in a programming language such as triton, a plurality of work groups are created based on the grid_size parameter set by the user, the work groups are organized into an N-dimensional grid, and each work group is allocated corresponding N-dimensional coordinates. The block_size parameter indicates the step size of the work group in each dimension, and the addresses of the elements in the tensor block corresponding to a work group can be determined based on the coordinates of the work group, the block_size parameter, and the base address corresponding to the first element, so that the tensor to be processed is divided into the tensor blocks corresponding to the respective work groups. Illustratively, if the base address is A, the coordinates corresponding to a certain work group are (4, 0), and the step sizes of the two dimensions given by the block_size parameter are (128, 1), then the elements stored at addresses A+512 to A+639 form the tensor block corresponding to that work group.
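A minimal sketch of that address computation follows; it is not from the patent and assumes, as in the example above, that each tensor block occupies one contiguous run of element addresses and that the per-dimension step equals the block extent in elements.

```python
# Minimal sketch (not from the patent) of the address-range computation in the
# example above; assumes each tensor block occupies one contiguous run of
# element addresses and the per-dimension step equals the block extent.
from math import prod

def work_group_address_range(base_addr, coords, block_size):
    # offset of the block's first element: coordinate times step size per dimension
    start = base_addr + sum(c * s for c, s in zip(coords, block_size))
    count = prod(block_size)               # elements covered by this work group
    return start, start + count - 1        # inclusive address range

# Example from the text: base address A (taken as 0), coordinates (4, 0), steps (128, 1)
print(work_group_address_range(0, (4, 0), (128, 1)))   # (512, 639) -> A+512 .. A+639
```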
It should be noted that the work groups are only used to indicate the address range of one tensor block; a work group may contain no thread or only one thread, and the work groups operate in the SPMD (Single-Program Multiple-Data) manner, i.e., each work group may be used to execute the same program (i.e., instruction) on the multiple elements contained in the corresponding tensor block.
It will be appreciated that the number of work groups is determined by the kernel function written in a programming language such as triton, while the number of computing units in a single graphics processing unit is fixed. Based on this, when the work groups created based on the kernel function are allocated to the computing units, each computing unit may correspond to a plurality of work groups, that is, each computing unit may be allocated a plurality of tensor blocks.
In one embodiment, after dividing the tensor to be processed into a plurality of tensor blocks and creating a plurality of work groups, the work groups may be directly assigned to the respective computing units such that each computing unit is configured to process a fixed number of tensor blocks.
In addition, referring to step S201, the first instruction scheduling unit maintains an instruction queue, and may acquire the target instruction from the instruction queue for instruction scheduling after detecting that the hardware unit for executing an instruction in the instruction queue is idle and that the data required for executing the instruction is ready. That is, in this embodiment, the first instruction scheduling unit may fetch the target instruction from the instruction queue for instruction scheduling after determining the target tensor block and the target unit.
In step S204, the first instruction scheduling unit is integrated outside the execution unit, and after determining the target work group corresponding to the target unit, it may directly schedule instructions in units of work groups, thereby scheduling the target instruction into the corresponding target work group. Specifically, in this embodiment, after the multiple work groups are created, each work group is given a corresponding work group identifier, i.e. a block id. Based on this, after determining the target work group allocated to the target unit for execution, the first instruction scheduling unit may directly schedule the target instruction to the target work group based on the work group identifier corresponding to the target work group. In this way the first instruction scheduling unit can schedule instructions in units of work groups. In this instruction scheduling mode, referring to step S203, a single work group corresponds to at most one thread and the work groups operate in single-program multiple-data mode; that is, in this embodiment, the target unit executes the target instruction on all elements in the tensor block corresponding to one work group through at most one thread, so the situation in which multiple threads cooperatively complete the same computing task does not arise, and no synchronization mechanism is required to manage multiple threads.
Furthermore, during operation of a neural network model, the graphics processing unit often needs to execute multiple instructions simultaneously. For example, in flash attention, each computing unit needs to perform data replication tasks and matrix multiplication computing tasks on different data at the same time. In the related art, when instruction scheduling is performed by the instruction scheduler in the execution unit in units of thread bundles, since each thread bundle includes only 32 threads, one data processing task may need to be completed by multiple thread bundles. When the computing unit needs to execute multiple data processing tasks simultaneously, the instruction scheduler in the same execution unit needs to schedule different instructions to different thread bundles, and the number of thread bundles each instruction needs to be scheduled to differs according to the number of threads executing a single data processing task. The instruction scheduler itself cannot intelligently identify how many thread bundles the instruction corresponding to each data processing task needs to be scheduled to, which requires the user to perform Warp specialization when programming, that is, to additionally allocate a thread bundle identifier to each thread bundle and preset the instruction executed by each thread bundle; when performing instruction scheduling, the instruction scheduler detects the thread bundle identifiers to determine which thread bundles are used for completing which data processing tasks, and thereby determines which instruction needs to be scheduled to which thread bundles. Moreover, when performing Warp specialization it is often necessary to take into account features of the underlying hardware of the computing unit in order to maintain the efficiency of instruction scheduling and thread bundle execution, which greatly increases the complexity of user programming. In the instruction scheduling method of this embodiment, the first instruction scheduling unit performs instruction scheduling in units of work groups; the user does not need to perform extra Warp specialization to tell the first instruction scheduling unit which thread bundles each instruction needs to be scheduled to, so the complexity of user programming can be greatly reduced.
Further, in the related art, since instruction scheduling is performed in units of thread bundles, the thread bundles need to execute following the SIMT mode, i.e., all threads in the same thread bundle can only be used to execute the same instruction on different data. When the user programs with triton or another thread-block-based programming framework, the size of a single work group is specified in the kernel function by setting block_size, which characterizes the number of elements corresponding to a single work group. If the value of block_size is not an integer multiple of the number of threads in a thread bundle, the data corresponding to the work group is allocated to a plurality of thread bundles, and at least one of those thread bundles has threads to which no data to be processed is allocated; because a single thread bundle must follow the SIMT mode, i.e., the threads in one thread bundle can only execute the same instruction on different data at the same time, the threads to which no elements are allocated cannot be used for executing other instructions, so part of the threads in part of the thread bundles are wasted. For example, if block_size is set to 158, i.e., one work group contains 158 units of data, then when instruction scheduling is performed in units of thread bundles, the instruction scheduler in the execution unit allocates the 158 units of data to different threads, divides the threads into 5 thread bundles (160 threads in total) and schedules the instruction to the 5 thread bundles respectively; 158 of these threads are allocated data, while the remaining 2 threads are allocated no data, yet because of the SIMT mode they cannot be used to execute other instructions and remain idle. Therefore, when writing kernel functions, the user needs to ensure that the data corresponding to a single work group can be divided evenly among thread bundles in order to guarantee the utilization of the computing resources of the graphics processing unit, which to a certain extent limits the flexibility of the user in setting block_size. In addition, when writing operators based on the triton programming framework, the user controls the allocation of the computing resources of the graphics processing unit in units of work groups, but inside the computing unit the thread-bundle-based instruction scheduling mode is not friendly to thread-block-based programming frameworks such as triton.
In the method disclosed in this embodiment, the first instruction scheduling unit integrated outside the execution unit performs instruction scheduling in units of work groups: when performing instruction scheduling, the target instruction is scheduled directly to the work group corresponding to the tensor block required for executing the target instruction, and that work group is scheduled to the corresponding first functional unit for execution, without an instruction scheduler in the execution unit creating threads and performing instruction scheduling in units of thread bundles. Even if the amount of data contained in a single work group is not an integer multiple of 32, no threads are left idle, which improves the user's flexibility in setting block_size and is friendlier to thread-block-based programming frameworks such as triton.
In the embodiment disclosed in steps S201 to S204, with the first instruction scheduling unit integrated outside the execution unit, when instruction scheduling is performed, the target unit for executing the target instruction is first identified; when the target unit is dedicated hardware such as a tensor core, a replication engine or a scalar processing unit, the target tensor block corresponding to the target instruction and the target work group corresponding to the target tensor block are determined, the target work group indicating the memory address range of the target tensor block. The first instruction scheduling unit then performs instruction scheduling in units of work groups, directly schedules the target instruction to the target work group, and schedules the target work group loaded with the target instruction to the target unit, so that the target unit reads the target tensor block based on the memory address range indicated by the target work group and executes the target instruction on the target tensor block. In this way, the granularity at which the computing unit, as underlying hardware, performs instruction scheduling is consistent with the task decomposition granularity set by the user in the kernel function; since the first instruction scheduling unit schedules the target instruction directly to the target work group instead of to threads or thread bundles, multiple threads and thread bundles do not need to be created, no thread-bundle management or Warp specialization needs to be set up, and the instruction scheduling mode is therefore better adapted to the execution of kernels written in block-based programming languages such as triton.
Detailed description of step S201
In one embodiment, prior to step S201, the method further comprises:
Step S301, a target command is obtained, wherein the target command is used for indicating a computing unit to start a target kernel corresponding to the target command, and the target kernel is composed of a plurality of instructions;
Step S302, analyzing a target command through a command processor to determine a target kernel to be started, determining a plurality of instructions to be executed based on the target kernel, and writing the instructions to be executed into an instruction cache area of a first instruction scheduling unit;
step S301 includes:
in step S303, the first instruction scheduling unit obtains the target instruction from the corresponding instruction buffer.
In step S301, the target command is a command in the command stream that the host side generates based on the program code of the neural network model and issues to the graphics processing unit when executing that program code; the host side may be a central processing unit. It will be appreciated that the graphics processing unit itself is formed of a large number of relatively simple computing units and is suited to tasks requiring massively parallel computation, but cannot process some relatively complex control logic and computing tasks. In the related art, the neural network model is therefore run by the central processing unit, which executes the program code of the neural network model, performs the relatively complex processing itself, and schedules the relatively simple tasks that require massively parallel execution to the graphics processing unit for execution. At this time, the central processing unit generates a corresponding command stream based on the executed program code of the neural network model and issues a plurality of commands to the graphics processing unit; the target command is one of these commands and indicates that the graphics processing unit is required to perform a data processing task.
Specifically, the program code of a neural network model is usually formed by a plurality of operators, each of which completes a specific computing task. Each operator can in fact be regarded as a packaged kernel function, and the kernel function of an operator can be converted into a series of instructions that need to be executed in order to complete the computing task corresponding to that operator. When the central processing unit executes an operator in the program code of the neural network model, it generates a corresponding target command and issues it to the graphics processing unit to indicate which operator kernel function the graphics processing unit needs to start.
Illustratively, the neural network model is provided with an average pooling layer. When executing the corresponding program code, the central processor issues a command to the graphics processing unit to perform average pooling on a matrix. Average pooling essentially averages each region of the matrix according to a given pooling window. Based on this, after the command processor parses the command, it generates data copy instructions for copying the data of each region of the matrix stored in global memory to the local memory of each computing unit according to the pooling window, and mean computation instructions for instructing the computing unit to compute the mean of the data of each region of the matrix.
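To make the decomposition concrete, the following C++ sketch expands an average-pooling command over a matrix into per-region copy and mean instructions. The instruction names, the Instr structure and the non-overlapping window traversal are assumptions made for illustration only; they do not reproduce the command processor's actual instruction encoding.

```cpp
// Hypothetical sketch: expanding an "average pooling" command into per-region
// data copy and mean computation instructions, as described above.
#include <cstdio>
#include <vector>

struct Instr { const char* op; int row, col, h, w; };  // one region of the matrix

std::vector<Instr> expand_avg_pool(int rows, int cols, int win) {
    std::vector<Instr> out;
    for (int r = 0; r + win <= rows; r += win) {
        for (int c = 0; c + win <= cols; c += win) {
            out.push_back({"copy_global_to_local", r, c, win, win});  // data copy instruction
            out.push_back({"mean",                 r, c, win, win});  // mean computation instruction
        }
    }
    return out;
}

int main() {
    for (const Instr& i : expand_avg_pool(4, 4, 2))
        std::printf("%s region=(%d,%d) size=%dx%d\n", i.op, i.row, i.col, i.h, i.w);
}
```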
In step S302, the command processor is a module that parses the target command issued by the central processor into instructions that the hardware modules in the graphics processing unit can understand and execute. It will be appreciated that the underlying hardware in the graphics processing unit only understands machine language and cannot directly understand the program code of the neural network model written by the user. The graphics processing unit therefore parses the commands in the command stream issued by the central processing unit through the command processor, converting them into instructions that its hardware can understand and execute. Specifically, by parsing the target command, the command processor determines the target kernel corresponding to the target command, that is, the kernel function that needs to be started when the target command is executed. As described for step S301, this kernel function is formed by a series of instructions required to complete the computing task of the target command; once the kernel function to be started is determined, the instructions that the computing unit needs to execute, that is, the instructions to be executed, are determined. These instructions often need to be executed in order, and some of them can only begin after other instructions have completed. Based on this, after the plurality of instructions to be executed is obtained, they are written into the instruction cache region corresponding to the first instruction scheduling unit to form an instruction queue, so that the first instruction scheduling unit can read each instruction to be executed from the corresponding instruction cache region in turn as the target instruction and schedule it.
In step S303, referring to the description of step S201, it can be understood that the first instruction scheduling unit monitors in real time whether the execution condition of each instruction to be executed in the instruction cache region is satisfied, that is, whether an idle work group is available and whether the data required for executing the instruction is ready. When the execution condition of an instruction in the instruction cache region is met, the first instruction scheduling unit reads the corresponding instruction to be executed from the instruction cache region as the target instruction to be scheduled.
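A minimal sketch of this readiness check follows. The structure names and the simple idle-work-group counter are assumptions introduced for illustration; the actual hardware interface is not specified by this embodiment.

```cpp
// Sketch: the first instruction scheduling unit polls its instruction buffer and
// only picks an instruction as the target instruction once an idle work group
// exists and the instruction's input data is ready.
#include <cstdio>
#include <deque>
#include <optional>

struct PendingInstr { int id; bool data_ready; };

struct FirstInstructionScheduler {
    std::deque<PendingInstr> instruction_buffer;   // instruction cache region (instruction queue)
    int idle_work_groups = 0;                      // work groups currently not executing anything

    // Returns the next target instruction, or nothing if no execution condition is met.
    std::optional<PendingInstr> poll() {
        if (idle_work_groups == 0 || instruction_buffer.empty()) return std::nullopt;
        PendingInstr front = instruction_buffer.front();
        if (!front.data_ready) return std::nullopt;    // still waiting on producing instructions
        instruction_buffer.pop_front();
        --idle_work_groups;                            // the work group stays busy until it finishes
        return front;
    }
};

int main() {
    FirstInstructionScheduler s;
    s.instruction_buffer = {{0, true}, {1, false}};
    s.idle_work_groups = 1;
    if (auto instr = s.poll()) std::printf("scheduling instruction %d\n", instr->id);
}
```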
In the embodiment disclosed in steps S301 to S303, the target command issued by the central processing unit is acquired and parsed by the command processor to determine the target kernel to be started; the target kernel corresponding to the target command is then started, and each instruction contained in it is written, as an instruction to be executed, into the instruction cache region corresponding to the first instruction scheduling unit. In this way, the target command issued by the central processing unit is converted into machine-code instructions that the hardware modules in the computing unit can understand and execute, and the first instruction scheduling unit can read each instruction to be executed from the instruction cache region in turn as the target instruction and schedule it to the corresponding work group.
In one embodiment, the target kernel includes at least a main function kernel portion, a scalar kernel portion and a vector kernel portion. The scalar kernel portion includes a plurality of preset scalar operation kernels, the vector kernel portion includes a plurality of preset vector operation kernels, and the main function kernel portion includes call instructions for calling at least one of the scalar operation kernels and at least one of the vector operation kernels. Each scalar operation kernel is formed by at least one scalar operation instruction, and each vector operation kernel is formed by at least one vector operation instruction.
Illustratively, referring to FIG. 4, FIG. 4 is a schematic diagram of an operator kernel of one embodiment of the present disclosure. The operator kernel includes a main function kernel portion, a scalar kernel portion and a vector kernel portion. In the scalar kernel portion, lines 0 through 12 constitute the first scalar operation kernel of the scalar kernel portion, namely kernel0, where the ret instruction on line 12 indicates that kernel0 has finished executing and control returns to the main function kernel portion, and so on. Specifically, as shown in FIG. 4, the instruction with serial number 38 in the main function kernel, namely launch vector 100, is a call instruction for calling the instruction with serial number 100 in the vector kernel portion. When the main function kernel portion executes the instruction with serial number 38, execution jumps to the instruction with serial number 100 in the vector kernel portion, and the subsequent instructions in the vector kernel portion are executed in order until a ret instruction is executed, which indicates that kernel1 in the vector kernel portion has finished executing and control returns to the main function kernel. Similarly, when the instruction with serial number 70 in the main function kernel is executed, execution jumps to the instruction with serial number 13 in the scalar kernel portion (namely kernel1 in FIG. 4) to call kernel1 of the scalar kernel portion, and the subsequent instructions in the scalar kernel portion are executed in order until a ret instruction is executed, which indicates that kernel1 in the scalar kernel portion has finished executing; control then returns to the main function kernel portion and the instruction with serial number 71 in the main function kernel is executed.
Based on this, the scalar kernel portion and the vector kernel portion are arranged in the target kernel, and the preset scalar operation kernels and vector operation kernels are added to them respectively. When a user writes the kernel function of an operator, only the main function kernel portion needs to be written; whenever the operator needs to perform a scalar operation or a vector operation, it jumps directly to the corresponding position in the scalar kernel portion or the vector kernel portion through a call instruction.
Meanwhile, when the same scalar operation or vector operation needs to be executed at several points, it is only necessary to add, at the corresponding positions in the main function kernel, call instructions that jump to the corresponding positions in the scalar kernel portion or the vector kernel portion, instead of repeatedly writing instructions for the same type of operation in the main function kernel. For example, in the example shown in FIG. 4, the 38th instruction in the main function kernel is used to call kernel1 of the vector operation kernels; if, in the operator, the vector operation task corresponding to kernel1 needs to be executed again after the 48th instruction in the main function kernel has been executed, the 49th instruction in the main function kernel may be written as launch vector 100, so that when the 49th instruction is executed, the vector operation kernel1 is called again. In this way, the complexity of the operator's kernel function can be further reduced, lowering the difficulty of writing it for the user.
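The following C++ sketch models this call structure: a main-function instruction list reuses scalar and vector sub-kernels through call instructions that run a sub-kernel from a start index until its ret. The enum values, indices and printed notes are invented for illustration and do not reproduce the listing in FIG. 4.

```cpp
// Sketch: a main-function kernel calling scalar/vector operation kernels by
// jumping to a start index and running until a ret instruction.
#include <cstdio>
#include <string>
#include <vector>

enum class Op { CALL_SCALAR, CALL_VECTOR, RET, WORK };

struct Instr { Op op; int target; std::string note; };

// Runs one sub-kernel: executes from 'start' until the first RET, then returns.
void run_subkernel(const std::vector<Instr>& part, int start) {
    for (int pc = start; part[pc].op != Op::RET; ++pc)
        std::printf("  %s\n", part[pc].note.c_str());
}

int main() {
    std::vector<Instr> scalar_part = {{Op::WORK, 0, "scalar add"}, {Op::RET, 0, ""}};
    std::vector<Instr> vector_part = {{Op::WORK, 0, "vector multiply"}, {Op::RET, 0, ""}};
    // The same vector sub-kernel is called twice without rewriting its instructions.
    std::vector<Instr> main_part = {
        {Op::CALL_VECTOR, 0, "launch vector"},
        {Op::CALL_SCALAR, 0, "launch scalar"},
        {Op::CALL_VECTOR, 0, "launch vector again"},
    };
    for (const Instr& i : main_part) {
        std::printf("%s\n", i.note.c_str());
        if (i.op == Op::CALL_SCALAR) run_subkernel(scalar_part, i.target);
        if (i.op == Op::CALL_VECTOR) run_subkernel(vector_part, i.target);
    }
}
```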
Furthermore, it will be appreciated that the graphics processing unit is typically used for computing tasks that are relatively simple but massively parallel, whereas operations on tensors of two or more dimensions are already relatively complex, and operations on such tensor data in a neural network model are fundamentally matrix based. The hardware structure of the tensor core is therefore designed and optimized with the goal of accelerating matrix multiplication; for example, referring to the above embodiments, the tensor core is composed of a general matrix multiplication unit and an accumulation buffer. The tensor core is thus typically used only to perform matrix multiplication and matrix multiply-accumulate computing tasks, that is, the instructions executed by the tensor core actually include only the matrix multiply instruction and the matrix multiply-accumulate instruction, so the instructions it executes are relatively uniform. Based on this, in this embodiment the target kernel does not need a dedicated portion filled with tensor operation kernels; similarly, the replication engine is used to execute data copy instructions, and the types of instruction it executes are likewise relatively uniform.
Detailed description of step S202
In one embodiment, referring to fig. 5, step S202 includes:
step S501, identifying an instruction type corresponding to a target instruction;
In step S502, a target unit for executing the target instruction is determined based on the instruction type.
In step S501, the instruction type is used to characterize the type of the data processing task to which the target instruction corresponds. It will be appreciated that each first functional unit is underlying hardware whose structure is designed for certain specific types of data processing tasks; for example, the tensor core is provided with a general matrix multiplication unit and an accumulation buffer and is designed to perform matrix multiplication and matrix multiply-accumulate computing tasks. Each first functional unit is therefore only used to execute instructions corresponding to one or more specific types of data processing tasks; for example, the scalar processing unit executes basic arithmetic operation instructions such as addition, subtraction, multiplication and division, and basic logical operation instructions such as AND, OR and NOT. Based on this, in this embodiment, the plurality of preset instructions can be divided into different instruction types based on the instructions executed by each first functional unit, with each instruction belonging to one of the instruction types.
Specifically, in one embodiment, the instruction types include at least one of a tensor operation instruction, a scalar operation instruction, a vector operation instruction and a data copy instruction. The tensor operation instruction is an instruction to be scheduled to the tensor core for execution and is used when operating on tensors of two or more dimensions; in this embodiment, the tensor operation instruction includes at least a matrix multiply instruction or a matrix multiply-accumulate instruction. The scalar operation instruction is an instruction to be scheduled to the scalar processing unit for execution and is used when operating on zero-dimensional scalar data, for example basic arithmetic instructions such as addition, subtraction, multiplication, division, exponentiation and averaging, and basic logical operation instructions such as AND, OR and NOT. The data copy instruction is an instruction to be scheduled to the replication engine to instruct it to copy data stored at one memory address to another memory address, for example copying data that the computing unit needs to process from global memory to a register or the local memory of the computing unit, or copying data produced by the computing unit's local computation to global memory for access by other computing units in the graphics processing unit. The vector operation instruction is an instruction to be scheduled to the execution unit for operating on one-dimensional vector data, such as normalization or vector dot product. Based on this, according to the kinds of first functional units integrated in the computing unit, the preset instructions are divided into a plurality of instruction types corresponding respectively to the execution unit and to each first functional unit, so that the target unit for executing the target instruction can be determined based on the instruction type of the target instruction.
In one possible embodiment, it is understood that each first functional unit can only execute instructions of a specific type, and the first functional units in the computing unit include at least one of a tensor core, a replication engine and a scalar processing unit; that is, which first functional units are integrated depends on the hardware structure of the computing unit itself. Based on this, in this embodiment, the division of instruction types is determined by the first functional units integrated in the computing unit, and the instruction types may include only one or some of the tensor operation instruction, scalar operation instruction, vector operation instruction and data copy instruction described above, depending on the hardware configuration of the computing unit. For example, if, apart from the execution unit, only the tensor core is integrated as a first functional unit and no scalar processing unit or replication engine is integrated, the instruction types include only the tensor operation instruction and the vector operation instruction; the tensor core is determined to be the target unit if and only if the instruction type of the target instruction is the tensor operation instruction, and the execution unit is determined to be the target unit when the target instruction is a vector operation instruction or any other instruction whose type has not been divided. Similarly, when a tensor core, a scalar processing unit and a replication engine are all integrated in the computing unit, the instruction types may include the tensor operation instruction, scalar operation instruction, vector operation instruction and data copy instruction. Of course, if the computing unit integrates dedicated hardware other than the tensor core, the replication engine and the scalar processing unit, an instruction type corresponding to that dedicated hardware may be defined based on the data processing tasks it executes, and the instructions to be scheduled to it may be divided into that instruction type.
In another possible embodiment, the preset instructions may be divided directly into a plurality of instruction types such as the tensor operation instruction, scalar operation instruction, vector operation instruction and data copy instruction, without considering which first functional units are integrated in the computing unit when the instruction types are set. When the first instruction scheduling unit performs instruction scheduling, after the instruction type of the target instruction is identified, the target unit corresponding to the target instruction is determined according to the first functional units actually integrated in the computing unit: when the first functional unit corresponding to the instruction type is integrated in the computing unit, that first functional unit is determined to be the target unit; when it is not, the execution unit is determined to be the target unit. For example, suppose only the tensor core and the replication engine are integrated in the computing unit as first functional units, while the preset instruction types include the tensor operation instruction, scalar operation instruction, vector operation instruction and data copy instruction; if the instruction type of the target instruction is the scalar operation instruction, the execution unit is determined to be the target unit because the corresponding first functional unit is not integrated in the computing unit.
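A minimal sketch of this fallback rule follows, assuming fixed instruction types and per-unit presence flags; the enum names and the ComputingUnit structure are illustrative assumptions rather than the patent's actual interfaces.

```cpp
// Sketch: pick the target unit from the first functional units actually
// integrated in the computing unit, falling back to the execution unit.
#include <cstdio>

enum class InstrType { Tensor, Scalar, Vector, DataCopy };
enum class Unit { TensorCore, ScalarProcessingUnit, CopyEngine, ExecutionUnit };

struct ComputingUnit {
    bool has_tensor_core;
    bool has_scalar_unit;
    bool has_copy_engine;

    Unit target_unit_for(InstrType t) const {
        switch (t) {
            case InstrType::Tensor:   return has_tensor_core ? Unit::TensorCore : Unit::ExecutionUnit;
            case InstrType::Scalar:   return has_scalar_unit ? Unit::ScalarProcessingUnit : Unit::ExecutionUnit;
            case InstrType::DataCopy: return has_copy_engine ? Unit::CopyEngine : Unit::ExecutionUnit;
            case InstrType::Vector:   return Unit::ExecutionUnit;   // vector operations stay on the execution unit
        }
        return Unit::ExecutionUnit;
    }
};

int main() {
    // Only a tensor core and a replication engine are integrated: scalar ops fall back to the execution unit.
    ComputingUnit cu{true, false, true};
    std::printf("scalar op handled by execution unit: %d\n",
                cu.target_unit_for(InstrType::Scalar) == Unit::ExecutionUnit);
}
```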
In step S502, as described for step S501, the execution unit and each first functional unit correspond respectively to different instruction types, so once the instruction type corresponding to the target instruction is determined, the corresponding target unit can be determined based on that instruction type.

Illustratively, the target instruction is a basic arithmetic operation instruction for adding two data items, so its instruction type is the scalar operation instruction; when a scalar processing unit is integrated in the computing unit, the scalar operation instruction is scheduled to the scalar processing unit for processing, that is, the target unit corresponding to the target instruction is the scalar processing unit.

In the embodiment disclosed in steps S501 to S502, instruction types corresponding to the execution unit and to each first functional unit are preset; after the first instruction scheduling unit obtains the target instruction to be scheduled, it identifies the instruction type corresponding to the target instruction and determines the target unit for executing the target instruction based on that type. On this basis, the first instruction scheduling unit can automatically determine, after acquiring the target instruction, to which hardware module in the computing unit the target instruction needs to be scheduled for execution.
In one embodiment, referring to fig. 6, prior to step S501, the method further comprises:
step S601, a first instruction set is created;
step S602, dividing the preset instructions into a plurality of instruction subsets, wherein each instruction subset comprises at least one preset instruction, and the preset instructions in a single instruction subset correspond to the same instruction type;
Step S501 includes:
in step S603, an instruction subset corresponding to the target instruction is identified to determine an instruction type corresponding to the target instruction.
In step S601, the first instruction set is a preset instruction set that includes a plurality of preset instructions, where the preset instructions are machine-code instructions, preset by the user, that the underlying hardware of the graphics processing unit can understand and execute. It will be appreciated that each hardware module in the computing unit can only understand and execute machine-code instructions and cannot directly understand program code written by the user. In this embodiment, a first instruction set formed by machine-code instructions that each hardware module can understand and execute is therefore created in advance, so that by converting the program code written by the user into a corresponding series of instructions and scheduling them for execution, each hardware module in the computing unit can complete the corresponding data processing task.

In step S602, the preset instructions cover the various data processing instructions that the computing unit can execute. As described for step S501, when first functional units are integrated in the computing unit, the instruction type of each instruction needs to be identified in order to determine to which hardware module in the computing unit it should be scheduled. Based on this, in this embodiment, after the first instruction set is created, the preset instructions in the first instruction set can be divided according to the different instruction types, so as to obtain the instruction subsets corresponding respectively to each instruction type.
In step S603, since the first instruction set has been divided into instruction subsets corresponding to the instruction types, each preset instruction belongs to exactly one instruction subset, and the instruction type of the target instruction can be determined from the instruction subset to which it belongs and the correspondence between instruction subsets and instruction types.
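A small sketch of such a lookup follows; the opcode strings and the map-based representation of the instruction subsets are assumptions introduced for illustration only.

```cpp
// Sketch: a first instruction set divided into instruction subsets, one per
// instruction type; the subset containing an opcode gives its instruction type.
#include <cstdio>
#include <string>
#include <unordered_map>

enum class InstrType { Tensor, Scalar, Vector, DataCopy };

// Built once when the first instruction set is created and divided into subsets.
const std::unordered_map<std::string, InstrType> kSubsetOfOpcode = {
    {"matmul", InstrType::Tensor},   {"mma",  InstrType::Tensor},
    {"sadd",   InstrType::Scalar},   {"sand", InstrType::Scalar},
    {"vnorm",  InstrType::Vector},   {"vdot", InstrType::Vector},
    {"copy",   InstrType::DataCopy},
};

// Each preset instruction belongs to exactly one subset, so the subset gives the type.
InstrType type_of(const std::string& opcode) { return kSubsetOfOpcode.at(opcode); }

int main() {
    std::printf("vdot is a vector operation instruction: %d\n",
                type_of("vdot") == InstrType::Vector);
}
```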
In the embodiment disclosed in step S601 to step S603, after the first instruction set is created, the preset instructions in the first instruction set are divided into instruction subsets corresponding to different instruction types. Therefore, when the target instruction is required to be scheduled subsequently, the instruction type corresponding to the target instruction can be determined based on the instruction subset to which the target instruction belongs.
Detailed description of step S204
In one embodiment, referring to fig. 7, prior to step S204, the method further comprises:
Step S701, acquiring work group creation parameters, wherein the work group creation parameters include a work group size and a work group number parameter, and the work group size represents the size, in each dimension, of the tensor block corresponding to a single work group;

In step S702, a plurality of work groups are created based on the work group number parameter, work group identifiers are assigned to the respective work groups, and the tensor to be processed is divided into a plurality of tensor blocks based on the work group size, wherein the work groups and the tensor blocks are in one-to-one correspondence.
In step S701, the work group creation parameters are used to instruct the graphics processing unit how to divide the tensor and create the work groups corresponding to the tensor blocks, so that the computing resources of the graphics processing unit can be allocated to the work groups and the same instruction or different instructions can be executed on different data in parallel. In this embodiment, the work group creation parameters include at least a work group size and a work group number parameter. The work group size indicates the size, in each dimension, of the tensor block corresponding to a single work group, that is, the number of elements the work group contains in each dimension, from which the total number of elements in a single work group can be determined; the work group number parameter indicates the number of work groups. In one embodiment, the work group number parameter may be a value that directly characterizes the number of work groups to be created, or an array giving the dimensions of an N-dimensional grid formed by the work groups. For example, the work group number parameter may be (M, N, K), meaning that M x N x K work groups need to be created and organized into a three-dimensional grid of shape (M, N, K). It should be noted that, in this embodiment, the product of the work group size and the number of work groups may correspond to the total number of elements in the tensor to be processed, so that each tensor block obtained by dividing the tensor to be processed corresponds to one work group.
In another embodiment, the work group number parameter may be a set of parameters characterizing the shape of the N-dimensional grid formed by the work groups. Specifically, referring to step S203, the user may control the grid shape formed by the work groups by setting grid_size in the kernel function, so as to organize the work groups into a work group grid. Illustratively, the user may set grid_size to (a2, b2, c2) in the kernel function, indicating that the work group grid includes a2 work groups along the x dimension, b2 work groups along the y dimension and c2 work groups along the z dimension, in which case the number of work groups to be created is a2 x b2 x c2.
In step S702, the work group number parameter represents the number of work groups to be created, and the work group size represents the number of elements contained in each dimension of the tensor block corresponding to a single work group. The work groups are created based on the work group number parameter, and by dividing the tensor to be processed based on the work group size, the plurality of work groups and the tensor block corresponding to each work group are obtained.

It will be appreciated that after the plurality of work groups is created, each corresponding to a different tensor block, a work group identifier needs to be assigned to each work group to distinguish them, and the identifier of each work group is different. For example, each work group may be assigned a serial number as its identifier.

In another possible implementation, the plurality of work groups may be organized into an N-dimensional work group grid, each work group being one grid point. Each work group then has corresponding coordinates representing its position in the grid, and since the coordinates of each work group in the grid are different, the coordinates of a work group can be used as its work group identifier. For example, if the work groups form a three-dimensional grid, the work group identifier of each work group may be expressed as (x1, y1, z1).
In this case, referring to the description of step S203, the memory address range of the elements in the tensor block corresponding to a work group can be indicated based on the work group identifier, the address of the first element of the tensor to be processed, and the work group size.
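The following C++ sketch makes this concrete for a two-dimensional tensor: work groups are identified by grid coordinates, and a work group's address range is derived from its identifier, the address of the tensor's first element and the work group size. The row-major layout and the specific range convention (from the block's first element to one past its last element) are assumptions for illustration, not requirements of the embodiment.

```cpp
// Sketch: deriving a work group's tensor-block address range from its
// identifier, the tensor's first-element address, and the work group size.
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct WorkGroup { int gx, gy; };                      // work group identifier = grid coordinates

struct AddressRange { std::uintptr_t begin, end; };    // spans the block's first to last element

// 2-D tensor stored row-major, split into block_h x block_w tensor blocks, one per work group.
AddressRange block_range(std::uintptr_t first_elem_addr, int tensor_w,
                         int block_h, int block_w, WorkGroup wg, std::size_t elem_size) {
    std::uintptr_t row0 = static_cast<std::uintptr_t>(wg.gy) * block_h;
    std::uintptr_t col0 = static_cast<std::uintptr_t>(wg.gx) * block_w;
    std::uintptr_t begin = first_elem_addr + (row0 * tensor_w + col0) * elem_size;
    std::uintptr_t end   = first_elem_addr +
                           ((row0 + block_h - 1) * tensor_w + col0 + block_w) * elem_size;
    return {begin, end};
}

int main() {
    // 8x8 tensor of 4-byte elements divided into 4x4 blocks -> a 2x2 work group grid.
    AddressRange r = block_range(0x1000, 8, 4, 4, WorkGroup{1, 0}, 4);
    std::printf("block of work group (1,0): [%#llx, %#llx)\n",
                static_cast<unsigned long long>(r.begin),
                static_cast<unsigned long long>(r.end));
}
```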
It should be noted that, referring to step S301 above, the target kernel often contains a plurality of instructions that need to be executed in order, and when executing the kernel function corresponding to one operator, the work groups need only be created once; each work group corresponds to one tensor block of the operator's original input tensor and to the intermediate data generated from that tensor block by the executed instructions. These work groups can be allocated to execute the instructions in the kernel function: after an instruction is scheduled to a work group, the work group is allocated to the corresponding hardware module for execution, and when that hardware module finishes executing the work group, the work group enters an idle state until the next instruction to be executed is scheduled to it. If, as in the related art, instruction scheduling is performed in units of thread bundles, the data corresponding to one work group is split across a plurality of thread bundles that are scheduled separately; once one thread bundle finishes the previous instruction before the others, its computing resources are released early and the instruction scheduler may schedule the next instruction to it in advance, so that threads within one work group execute different instructions at the same time, which can cause race conditions or other hard-to-predict situations. When the instruction scheduling method of the present disclosure is used and the first instruction scheduling unit schedules instructions in units of work groups, there is no need to create a large number of threads and thread bundles: the instruction is scheduled directly to the work group, and the work group is then scheduled to the target unit for execution. Before the work group finishes executing, it does not enter the idle state, its computing resources are not released, and the first instruction scheduling unit does not schedule the next instruction to it; only after the work group has finished executing, that is, after the data processing task corresponding to the target instruction has been performed on all elements of the corresponding tensor block, does the first instruction scheduling unit schedule the next instruction to that work group.
In the embodiment disclosed in steps S701 to S702, the work group creation parameters are acquired to create a plurality of work groups, the tensor to be processed is divided, based on the work group size, into the tensor blocks corresponding to the respective work groups, and the work group identifier of each work group is determined, so that the memory address range of the elements in the tensor block corresponding to each work group can be determined from its work group identifier.
Detailed description of compatibility with the execution unit through instruction scheduling in units of thread bundles
In one embodiment, a second instruction dispatch unit is further provided in each execution unit, and referring to fig. 8, the method further includes:
step S801, if the target unit is an execution unit, determining a target working group allocated to the target unit, and creating a plurality of threads corresponding to the target working group, wherein each thread corresponds to data in a tensor block corresponding to the target working group;
Step S802, a first instruction scheduling unit sends a target instruction to a second instruction scheduling unit in an execution unit;
step S803, in the execution unit, determining the created thread as a plurality of thread bundles, wherein each thread bundle comprises a second number of threads;
in step S804, in the execution unit, the target instruction is respectively scheduled to the threads in the plurality of thread bundles by the second instruction scheduling unit.
In step S801, if the target unit is the execution unit, the hardware structure of the execution unit is designed around performing instruction scheduling in units of thread bundles and executing thread bundles concurrently; that is, the execution unit itself is better suited to scheduling and concurrently executing threads in units of thread bundles. Meanwhile, the memory and computing resources of a single execution unit are relatively limited, whereas the work group size is set by the user when writing the kernel function and is variable. When a single work group is large, that is, when the tensor block corresponding to the work group contains many elements, the computing and memory resources of the execution unit may not be sufficient to execute the target instruction concurrently on all elements of that tensor block; in other words, the execution unit may not support instruction scheduling and execution in units of work groups. Based on this, in this embodiment, if the target unit is the execution unit, instruction scheduling is not performed by the first instruction scheduling unit; instead, a plurality of threads is created so that the second instruction scheduling unit in the execution unit can subsequently perform instruction scheduling in units of thread bundles. Specifically, the number of threads created may be determined based on the number of elements in the target tensor block, for example one thread per element of the target tensor block.
In step S802, as described for step S801, the execution unit itself is better suited to executing data processing tasks in units of thread bundles, while the first instruction scheduling unit performs instruction scheduling in units of work groups. If the first instruction scheduling unit were to schedule the target instruction directly to the work group formed by the threads allocated to the execution unit and then schedule the work group loaded with the target instruction to the execution unit for execution, the execution unit would have difficulty executing and managing the threads in units of thread bundles. Based on this, in this embodiment, when the target unit is the execution unit, the first instruction scheduling unit sends the target instruction directly to the second instruction scheduling unit in the execution unit, and the second instruction scheduling unit schedules the target instruction to the threads in units of thread bundles, so that the execution unit's own mode of managing and executing threads in units of thread bundles is better accommodated.
In step S803, the second number is the number of threads in a single thread bundle and may be, for example, 16, 32 or 64; it is determined by the graphics processing unit itself and is fixed for a given graphics processing unit. As described for steps S801 and S802, the execution unit is better suited to managing and executing threads in units of thread bundles, so in this embodiment, after the plurality of threads is scheduled to the execution unit, the threads allocated to the execution unit are divided into a plurality of thread bundles, and the second instruction scheduling unit in the execution unit can then perform instruction scheduling in units of thread bundles.
In step S804, the second instruction scheduling unit may be a warp scheduler, which performs instruction scheduling in units of thread bundles, that is, it schedules one instruction at a time to the threads of the same thread bundle (for example, 32 threads). The process by which the second instruction scheduling unit schedules instructions in units of thread bundles, after the threads allocated to the execution unit have been divided into thread bundles, may follow the related art and is not described here. It should be emphasized that, in the embodiments of the present disclosure, instruction scheduling is performed in units of thread bundles by the second instruction scheduling unit in the execution unit if and only if the target instruction needs to be allocated to the execution unit for completion; when the target instruction is executed by the tensor core, the replication engine or the scalar processing unit, instruction scheduling is instead performed in units of work groups by the first instruction scheduling unit integrated outside the execution unit.
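A minimal sketch of the thread bundle division follows, assuming one thread per element of the target tensor block and a fixed bundle width; the names and the simple contiguous grouping are illustrative assumptions rather than the hardware's actual bookkeeping.

```cpp
// Sketch: grouping the threads created for a work group into thread bundles
// before the second instruction scheduling unit dispatches an instruction to
// each bundle.
#include <algorithm>
#include <cstdio>
#include <vector>

struct ThreadBundle { int first_thread, thread_count; };

std::vector<ThreadBundle> make_bundles(int num_elements, int warp_width /* second number, e.g. 32 */) {
    std::vector<ThreadBundle> bundles;
    for (int t = 0; t < num_elements; t += warp_width)
        bundles.push_back({t, std::min(warp_width, num_elements - t)});
    return bundles;
}

int main() {
    // 100 elements in the target tensor block -> 4 bundles of at most 32 threads each.
    for (const ThreadBundle& b : make_bundles(100, 32))
        std::printf("bundle: threads [%d, %d)\n", b.first_thread, b.first_thread + b.thread_count);
}
```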
In the embodiment disclosed in steps S801 to S804, when the target unit is the execution unit, the first instruction scheduling unit does not schedule instructions in units of work groups; instead, threads corresponding to the target work group are created in the execution unit and the second instruction scheduling unit in the execution unit schedules instructions in units of thread bundles. This better matches the data processing mode of the execution unit, so that the instruction scheduling method of the present disclosure remains compatible with the hardware characteristics of the execution unit itself.
Apparatus and device descriptions of embodiments of the present disclosure
Referring to fig. 9, fig. 9 is a schematic structural diagram of an instruction scheduling apparatus 900 according to the present disclosure, where the apparatus includes:
An instruction acquisition unit 910 configured to acquire a target instruction through the first instruction scheduling unit;
a first identifying unit 920 for identifying a target unit for executing the target instruction, the target unit being one of the executing unit or the first functional unit;
A target work group determining unit 930, configured to determine, if the target unit is the first functional unit, a target tensor block corresponding to the target instruction and a target work group corresponding to the target tensor block, where the target work group is used to indicate a memory address range of the target tensor block;

A first scheduling unit 940, configured to schedule, through the first instruction scheduling unit, the target instruction to the target work group and schedule the target work group to the target unit, so that the target unit reads the target tensor block based on the memory address range and executes the target instruction on the target tensor block.
The specific processing procedure of the instruction scheduling apparatus of the present disclosure for executing the instruction scheduling method of the above embodiment is the same as that of the instruction scheduling method of the above embodiment, and will not be repeated here.
The disclosed embodiments also provide an electronic device 1000, comprising:
At least one processor, and
A memory communicatively coupled to the at least one processor, wherein,
The memory stores instructions that are executed by the at least one processor to cause the at least one processor to perform a method according to any of the above-described embodiments of the application when the instructions are executed.
The hardware configuration of the electronic device will be described in detail with reference to fig. 10. The electronic device includes a processor 1010, memory 1020, input/output interfaces 1030, communication interfaces 1040, and a bus 1050.
The processor 1010 may be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs so as to implement the technical solutions provided by the embodiments of the present disclosure;
The Memory 1020 may be implemented in the form of a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a random access Memory (Random Access Memory, RAM). Memory 1020 may store an operating system and other application programs, and when the technical solutions provided by the embodiments of the present disclosure are implemented in software or firmware, relevant program codes are stored in memory 1020 and are called by processor 1010 to execute the instruction scheduling method of the embodiments of the present disclosure;
an input/output interface 1030 for implementing information input and output;
communication interface 1040 for implementing communication interaction between the device and other devices, which may be implemented by wired means (e.g. USB, network cable, etc.), or wireless means (e.g. mobile network, WIFI, bluetooth, etc.), and
A bus 1050 that transfers information between the various components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040);
Wherein processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 implement communication connections therebetween within the device via a bus 1050.
The embodiment of the application also provides a computer readable storage medium, which stores one or more programs, and the one or more programs can be executed by one or more processors to implement the instruction scheduling method of the above embodiment, which is not described herein.
The specific processing procedure of the computing unit of the present disclosure for executing the instruction scheduling method according to the above embodiment is the same as that of the instruction scheduling method according to the above embodiment, and will not be described herein again.
The terms "first," "second," "third," "fourth," and the like in the description of the present disclosure and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this disclosure, "at least one" means one or more, and "a plurality" means two or more. "and/or" is used to describe an association relationship of an associated object, and indicates that three relationships may exist, for example, "a and/or B" may indicate that only a exists, only B exists, and three cases of a and B exist simultaneously, where a and B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one of a, b or c may represent a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
It should be understood that in the description of the embodiments of the present disclosure, the meaning of a plurality (or multiple) is two or more, and that greater than, less than, exceeding, etc. is understood to not include the present number, and that greater than, less than, within, etc. is understood to include the present number.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
It should also be appreciated that the various implementations provided by the embodiments of the present disclosure may be arbitrarily combined to achieve different technical effects.
The above is a specific description of the embodiments of the present disclosure, but the present disclosure is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present disclosure, and are included in the scope of the present disclosure as defined in the claims.

Claims (11)

1. An instruction scheduling method, characterized by being applied to a computing unit, the computing unit comprising a first instruction scheduling unit, at least one execution unit and at least one first functional unit, the first functional unit comprising at least one of a replication engine, a tensor core and a scalar processing unit, the method comprising:
acquiring a target instruction through the first instruction scheduling unit;
Identifying a target unit for executing the target instruction, the target unit being one of the execution unit or the first functional unit;
determining a target tensor block corresponding to the target instruction and a target work group corresponding to the target tensor block, the target work group being used to indicate a memory address range of the target tensor block, if the target unit is the first functional unit;

scheduling the target instruction to the target work group through the first instruction scheduling unit, and scheduling the target work group to the target unit, so that the target unit reads the target tensor block based on the memory address range and executes the target instruction on the target tensor block.
2. The instruction scheduling method according to claim 1, wherein the first instruction scheduling unit is provided with a corresponding instruction cache region, the computing unit further comprising a command processor, the method further comprising, before the target instruction is acquired by the first instruction scheduling unit:
Acquiring a target command, wherein the target command is used for indicating the computing unit to start a target kernel corresponding to the target command, and the target kernel is composed of a plurality of instructions;
Analyzing the target command through the command processor to determine the target kernel to be started, determining a plurality of instructions to be executed based on the target kernel, and writing the instructions to be executed into an instruction cache area of the first instruction scheduling unit;
the obtaining, by the first instruction scheduling unit, the target instruction includes:
and acquiring the target instruction from the corresponding instruction cache region through the first instruction scheduling unit.
3. The instruction scheduling method according to claim 2, wherein the target kernel includes at least a main function kernel portion, a scalar kernel portion, and a vector kernel portion, the scalar kernel portion including a plurality of preset scalar operation kernels, the vector kernel portion including a plurality of preset vector operation kernels, the main function kernel portion calling the scalar operation kernels and the vector operation kernels by a call instruction.
4. The instruction scheduling method of claim 1, wherein the identifying a target unit for executing the target instruction comprises:
Identifying an instruction type corresponding to the target instruction, wherein the instruction type is used for representing the type of a data processing task corresponding to the target instruction, and the execution unit and each first functional unit respectively correspond to different instruction types;
A target unit for executing the target instruction is determined based on the instruction type.
5. The instruction scheduling method of claim 4, wherein prior to said identifying a target unit for executing said target instruction, said method further comprises:
creating a first instruction set, wherein the first instruction set comprises a plurality of preset instructions, and the target instruction is one of the preset instructions;
Dividing the preset instructions into a plurality of instruction subsets, wherein each instruction subset comprises at least one preset instruction, and preset instructions in a single instruction subset correspond to the same instruction type;
the identifying the instruction type corresponding to the target instruction comprises the following steps:
and identifying an instruction subset corresponding to the target instruction to determine the instruction type corresponding to the target instruction.
6. The instruction scheduling method of claim 4, wherein the instruction type comprises at least one of:
A tensor operation instruction indicating an instruction executed by the tensor core;
A scalar operation instruction that indicates an instruction executed by the scalar processing unit;
A vector operation instruction indicating an instruction executed by the execution unit;
and a data replication instruction, the data replication instruction being an instruction executed by the replication engine.
7. The instruction scheduling method of claim 1, wherein prior to said determining the target tensor block corresponding to the target instruction and the target work group corresponding to the target tensor block, the method further comprises:

acquiring work group creation parameters, wherein the work group creation parameters comprise a work group size and a work group number parameter, and the work group size represents the size, in each dimension, of the tensor block corresponding to a single work group;

creating a plurality of work groups based on the work group number parameter, allocating work group identifiers to the work groups, and dividing the tensor to be processed into a plurality of tensor blocks based on the work group size, wherein the work groups and the tensor blocks are in one-to-one correspondence.
8. The instruction scheduling method of claim 1, wherein each of the execution units comprises a second instruction scheduling unit, the method further comprising:
If the target unit is the execution unit, determining a target working group distributed to the target unit, and creating a plurality of threads corresponding to the target working group, wherein each thread corresponds to data in a tensor block corresponding to the target working group;
the first instruction scheduling unit sends the target instruction to a second instruction scheduling unit in the execution unit;
In the execution unit, determining the created threads as a plurality of thread bundles, wherein each thread bundle comprises a second number of threads;
and in the execution unit, the target instruction is respectively scheduled to threads in the thread bundles through the second instruction scheduling unit.
9. An instruction scheduling apparatus, the apparatus comprising:
the instruction acquisition unit is used for acquiring a target instruction through the first instruction scheduling unit;
A first identifying unit for identifying a target unit for executing the target instruction, the target unit being one of an executing unit or a first functional unit;
a target work group determining unit configured to determine, if the target unit is the first functional unit, a target tensor block corresponding to the target instruction and a target work group corresponding to the target tensor block, where the target work group is used to indicate a memory address range of the target tensor block;

and a first scheduling unit, configured to schedule the target instruction to the target work group through the first instruction scheduling unit, and schedule the target work group to the target unit, so that the target unit reads the target tensor block based on the memory address range and executes the target instruction on the target tensor block.
10. An electronic device comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling a connection communication between the processor and the memory, the program when executed by the processor implementing the instruction scheduling method of any one of claims 1 to 8.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, the one or more programs are executable by one or more processors to implement the instruction scheduling method of any one of claims 1 to 8.