WO2024232775A1

WO2024232775A1 - Method, processor, device, and program product for processing instruction cell

Info

Publication number: WO2024232775A1
Application number: PCT/RU2023/000141
Authority: WO
Inventors: Yegor BUGAYENKO
Original assignee: Huawei Cloud Computing Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2023-05-10
Filing date: 2023-05-10
Publication date: 2024-11-14
Anticipated expiration: 2025-11-10

Abstract

Embodiments of the present disclosure relate to a method, a processor, an electronic device, and a computer program product for processing an instruction cell. The method comprises performing, by a first thread running on a processor, a first execution step on an instruction cell during a first time period. The method further comprises performing, by a second thread running on the processor, a second execution step from the set of execution steps on the instruction cell during a second time period. The first execution step and the second execution step are from a set of execution steps, and the second time period overlaps at least partially with the first time period. In this way, the execution steps for an instruction cell can be parallelized, thereby the execution time can be reduced and available threads can be more intensive and more balanced utilized.

Description

METHOD, PROCESSOR, DEVICE, AND PROGRAM PRODUCT FOR PROCESSING INSTRUCTION CELL

FIELD

[0001] Embodiments of the present disclosure generally relate to the field of computer technology and in particular, to a method, a processor, an electronic device, and a computer program product for processing an instruction cell.

BACKGROUND

[0002] Dataflow architecture is a computer architecture that represents a program as a dataflow graph (DFG). In a DFG, nodes are “operators” and directed edges between them are the “channels” through which immutable values are passed among nodes. Conceptually, an operator can be evaluated once data are available on all its input channels. Then, the completion of evaluation makes the resulting data available to the operators at the ends of its output channels, which is ready to be used as inputs for evaluating further nodes. The order of operator evaluation is nondeterministic during the execution of a DFG.

[0003] In a computer with a dataflow processor, upon determining an operator has all input data available, a thread may fetch the corresponding instruction cell in the memory and compute the instruction cell to generate the result tokens. The results are delivered to other instruction cells that expect the result tokens as input for evaluation.

SUMMARY

[0004] In general, embodiments of the present disclosure provide a solution for optimizing the compilation of a computational graph.

[0005] In a first aspect, there is provided a method. The method comprises: performing, by a first thread running on a processor, a first execution step from a set of execution steps on an instruction cell during a first time period; and performing, by a second thread running on the processor, a second execution step from the set of execution steps on the instruction cell during a second time period, wherein the second time period overlaps at least partially with the first time period. In this way, the execution steps for an instruction cell can be treated as atomic tasks and can be parallelized, thereby the execution time can be reduced and available threads can be more intensive and more balanced utilized. [0006] In some embodiments of the first aspect, the method further comprises: loading the instruction cell and metadata of the instruction cell to a memory, wherein the metadata indicates one of a plurality of execution states. The metadata can be used to determine the state of the instruction cell in an execution workflow, which provides the basis to enable the execution step parallelization.

[0007] In some embodiments of the first aspect, wherein performing the first execution step on the instruction cell comprises: transitioning, based on the metadata of the instruction cell, the instruction cell from a first execution state of a plurality of execution states to a second execution state of the plurality of execution states. In this way, executing an instruction cell per step can be realized by transition among the non-binary states of the cell.

[0008] In some embodiments of the first aspect, wherein transitioning the instruction cell from the first execution state to the second execution state comprises: updating, based on the performing of the first execution step, the metadata of the instruction cell to indicate the second execution state. In this way, the determination and execution of a further step by a thread can be enabled.

[0009] In some embodiments of the first aspect, the method further comprises: selecting, by the first thread and based on the metadata of the instruction cell, the instruction cell from a set of instruction cells for performing the first execution step. In this way, a thread may capture an appropriated cell for execution, independent of if the cell is being processed by another thread.

[0010] In some embodiments of the first aspect, wherein selecting the instruction cell comprises: determining that the metadata of the instruction cell satisfies a requirement to perform the first execution step. In this way, the thread may determine an appropriate step to perform on the captured cell.

[0011] In some embodiments of the first aspect, wherein the metadata of the instruction cell indicates one or more inputs for evaluating the instruction cell, and wherein performing the first execution step on the instruction cell comprises: in response to determining that the one or more inputs are available, evaluating the instruction cell according to the one or more inputs. In this way, the thread may determine the appropriate chance to perform the evaluation of an instruction cell.

[0012] In some embodiments of the first aspect, wherein evaluating the instruction cell comprises: performing an evaluation of an operation corresponding to the instruction cell with the one or more inputs; and storing an output of the evaluation in the metadata of the instruction cell. In this way, the evaluation and delivery can be separated into different execution steps, which refines the granularity of atomic steps in the execution workflow.

[0013] In some embodiments of the first aspect, wherein the metadata of the instruction cell indicates an output of evaluating an operation corresponding to the instruction cell, and wherein performing the first execution step on the instruction cell comprises: in response to determining that the output is available in the metadata, determining, from the set of instruction cells, one or more instruction cells; and storing the output in metadata of the one or more instruction cells as an input for evaluating the one or more instruction cells. This provides the way for performing the delivery step that is applicable to the step-level penalization approach disclosed herein.

[0014] In some embodiments of the first aspect, wherein each of the set of instruction cells corresponds to a node in a dataflow graph (DFG) of a program for execution. In this way, the step-level parallelism can be applied to a program for execution, thereby improving the performance of the program execution.

[0015] In some embodiments of the first aspect, the method further comprises: performing, by a third thread running on a processor, a third execution step from the set of the execution steps on the instruction cell during a third time period; and locking, according to the third execution step, at least the instruction cell during the performing of the third execution step. The stepwise locking strategy, while enhancing the thread safety, can avoid locking a cell only for one thread during the whole workflow.

[0016] In some embodiments of the first aspect, the method further comprises: performing, by a fourth thread running on the processor, execution steps of a same type on a plurality of instruction cells simultaneously. In this way, the overall utilization of threads can be decreased and free time becomes available for other processing, while still maintaining the benefit of step-level parallelism.

[0017] In a second aspect, there is provided a processor configured to: perform, by a first thread running on a processor, a first execution step from a set of execution steps on an instruction cell during a first time period; and perform, by a second thread running on the processor, a second execution step from the set of execution steps on the instruction cell during a second time period, wherein the second time period overlaps at least partially with the first time period.

[0018] In a third aspect, there is provided an electronic device. The electronic device comprises a processor according to the second aspect of the present disclosure and a memory coupled to the processor. The memory stores an instruction cell executed by the processor.

[0019] A computer program product tangibly stored on a computer-readable medium and comprising machine-executable instructions, wherein the machine-executable instructions, when executed, cause a machine to perform the method according to the first aspect of the present disclosure.

[0020] It is to be understood that the Summary section is not intended to identify key or essential features of embodiments of the present disclosure, nor is it intended to be used to limit the scope of the present disclosure. Other features of the present disclosure will become easily comprehensible through the following description.

BRIEF DESCRIPTION OF THE DRAWINGS

[0021] Some embodiments will now be described with reference to the accompanying drawings, where:

[0022] Fig. 1 illustrates a schematic diagram of an example environment in which a plurality of embodiments of the present disclosure can be implemented;

[0023] Fig. 2 illustrates a flowchart of an example method for processing an instruction cell, in accordance with some embodiments of the present disclosure;

[0024] Fig. 3 illustrates an example schematic diagram of decomposition and step parallelization of the workflow for executing an instruction cell, in accordance with some embodiments of the present disclosure;

[0025] Fig. 4 illustrates a conceptual architecture for executing instruction cells of a program, in accordance with some embodiments of the present disclosure;

[0026] Fig. 5 illustrates an example DFG representing a program that can be executed in accordance with some embodiments of the present disclosure;

[0027] Fig. 6 illustrates an example timeline demonstrating an execution process of a program by multiple threads running on a processor, in which the steps of executing a cell are performed by one thread sequentially; [0028] Fig. 7 illustrates an example diagram demonstrating an execution process of multiple threads executing instruction cells of a program, in accordance with some embodiments of the present disclosure;

[0001] Fig. 8 illustrates another example diagram demonstrating an execution process of multiple threads executing instruction cells of a program, in accordance with some embodiments of the present disclosure; and

[0002] Fig. 9 illustrates a schematic block diagram of a device that can be used to implement embodiments of the present disclosure.

[0003] Throughout all the drawings, the same or similar reference numerals represent the same or similar elements.

DETAILED DESCRIPTION

[0004] The principle of the present disclosure will now be described with reference to some embodiments. It is to be understood that these embodiments are described only for the purpose of illustration and to help those skilled in the art to understand and implement the present disclosure, without suggesting any limitation as to the scope of the disclosure. The disclosure described herein can be implemented in various manners other than the ones described below.

[0005] In the following description and claims, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of the ordinary skills in the art to which this disclosure belongs.

[0006] References in the present disclosure to “one embodiment,” “some embodiments,” “an embodiment,” and the like indicate that the embodiment described may include a particular feature, structure, or characteristic, but it is not necessary that every embodiment includes the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with some embodiments, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

[0007] It shall be understood that although the terms “first” and “second” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and similarly, a second element could be termed a first element, without departing from the scope of embodiments. As used herein, the term “and/or” includes any and all combinations of one or more of the listed terms.

[0008] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting to embodiments. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “has”, “having”, “includes” and/or “including”, when used herein, specify the presence of stated features, elements, and/or components, etc., but do not preclude the presence or addition of one or more other features, elements, components and/ or combinations thereof.

[0009] In the dataflow architecture, a program for execution is loaded as a set of instruction cells in a memory, one for each node of the DFG representing the program. Conceptually, when a processor executes a cell of the program, an “arbitration network” captures the cell and checks whether the cell has the data available at all its inputs and thus can be evaluated. Then, the cell is sent to an operation unit, such as an arithmetic logic unit (ALU), for evaluation. When the ALU has computed the evaluation result of the cell, a “distribution network” finds the cell(s) that needs the result, and stores the results into the found cell.

[0010] Although the exact sequence of steps may vary among different implementations of processors, an execution workflow of a single instruction cell is sequential in these implementations. Once an instruction cell is captured for execution, a thread of the processor locks the cell and holds it until the evaluation and the distribution of the result are finished. The cell has to go through the entire workflow, while other free thread(s) must wait till any cells are available to be executed, which could restrict the utilization of the threads.

[0011] Further, the durations of different steps are not homogeneous due to the different complexity. For example, checking the input availability of a cell is much faster than multiplying two floating-point numbers. Besides, the cost (e.g., in CPU ticks) of the evaluation of different cells by the ALU may also vary significantly. Due to this reason, the waiting time may not be equally distributed among threads, which may cause the execution performance not to be optimal.

[0012] Some approaches try to improve the performance of such processors by hardware-level task schedulers, thread criticality predictors, and/or thread accelerators. However, these approaches focus on thread-level parallelism, in which a single instruction cell is treated as an atomic job.

[0013] In accordance with embodiments of the present disclosure, there is provided an approach for processing instruction cells of a program in a dataflow machine. It enables Step-Level Parallelism (SLP) for performing multiple steps of the execution workflow on a cell in parallel. In particular, when a thread running on a processor performs one execution step on an instruction cell during a first time period, another thread running on the processor may perform another execution step on the same instruction cell during a second time period which overlaps at least partially with the first time period. In this way, the execution time can be reduced, and available threads can be more intensive and more balanced utilized, thereby improving the execution performance.

[0014] Principles and embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. Reference is first made to Fig. 1, which illustrates a schematic diagram of an example environment 100 in which a plurality of embodiments of the present disclosure can be implemented.

[0015] The environment 100 includes a computing device 110. Examples of the computing device 110 include, but are not limited to, a desktop computer, a smartphone, a tablet, a laptop, and a server computer. The scope of the present disclosure is not limited in this regard. The computer device 110 includes a processor 120 and a memory 130 coupled to the processor 120, which can be a Random Access Memory (RAM), for example.

[0016] The computer device 110 may use the processor 910 to execute various programs according to the embodiments of the present disclosure. For execution, a program may be parsed into a DFG representation and be loaded into the memory 130 as a set of instruction cells 140-1, 140-2, ..., 140-N (collectively referred to as the instruction cell 140). Each instruction cell corresponds to a node of a DFG, i.e., a basic operation of the program to be executed.

[0017] The computer device 110 may start a plurality of threads 150-1, 150-2, ..., 150-N (collectively referred to as the threads 150 or the thread 150) on the processor 910 to execute the program. As known to those skilled in the art, the threads may divide a physical core of a processor into virtual multiple cores, which enables parallel execution. During an execution process, each thread may capture an available cell to execute it.

[0018] The execution workflow of a cell comprises multiple steps. In the embodiments of the present disclosure, when one of the threads 150 is capturing one cell to perform one step on it, it is possible that another of the threads 150 may capture the same cell for another cell in parallel. Thus, during the execution, when n instruction cells are available for execution in parallel according to the DFG of the program under execution, it is possible that more than n threads are occupied with cells, some of them are processing the same instruction cell.

[0019] It is to be understood that the environment 100 is described for illustrative purposes and there may also be other devices, systems, or components that are not shown in the environment 100. For example, the computing 110 may start another set of threads for executing another program. For example, the computing device 110 may have a co-processor that may execute a program in a similar way as the processor 130.

[0020] Reference is now made to Fig. 2 which illustrates a flowchart of an example method 200 for processing an instruction cell, in accordance with some embodiments of the present disclosure. The example method 200 may be performed, for example, by the computing device 110 shown in Fig. 1. It should be understood that the method 200 may also include additional actions not shown, and the scope of the present disclosure is not limited in this regard. The method 200 is described in detail below in conjunction with the example environment 100 of Fig. 1.

[0021] At 210, perform, by a first thread running on a processor, a first execution step on an instruction cell during a first time period. The first execution step is from a set of execution steps. The execution workflow for an instruction cell by the processor may comprise this set of steps. For example, the computing device 110 may make the thread 150-1 perform an execution step on the instruction cell 140-1 during a first time period.

[0022] Each instruction cell may correspond to a node in a DFG of a program for execution. The set of instruction cells for a program may be loaded by the computing device 110 into the memory 130 for execution, following the principles of a dataflow machine. In some embodiments, each instruction cell can be loaded with additional metadata of the instruction cell. The metadata indicates that this cell is in one of a plurality of execution states. Then, the thread 150-1 may select an instruction cell for performing a step. In some embodiments, the thread 150-1 may determine that the metadata of an instruction cell satisfies the requirement to perform a certain execution step. Upon the determination, the thread 150-1 may select this instruction cell and perform a corresponding step on it.

[0023] At 220, perform, by a second thread running on the processor, a second execution step on the instruction cell during a second time period, the second time period overlaps at least partially with the first time period. The second step is another step from the set of execution steps of the execution workflow. For example, the computing device 110 may make the thread 150-2 perform another execution step on the instruction cell 140-1, while the thread 150-2 is still performing the execution step on the instruction 140-1 at 210. In different examples, the second period may start earlier than the first period, later than the period, or at the same time as the first period. Also, the second period may end earlier than the first period, later than the period, or at the same time as the first period.

[0024] Similar to the thread 150-1, once the thread 150-2 determines that the metadata of the instruction cell 140-1 satisfies the requirement to perform the other execution step (e.g., all preconditions are ready, and no other restrictions exist), it can pick the instruction cell 140-1 to perform the other step on it, no matter if the thread 150-1 is still processing the instruction cell 140-1 as at 210. In some embodiments, it is possible that one cell may be captured by more than two threads in parallel, each may perform an execution step on it. The steps that can be executed in parallel may vary among different implementations, as will be described in more detail in the following text.

[0025] With the method 200, some of the steps for executing the same instruction cell can be performed in parallel by multiple threads, thereby the execution time can be reduced, and available threads for executing instruction cells of a program can be more intensive and more balanced utilized.

[0026] Fig. 3 illustrates a schematic diagram 300 of decomposition and step parallelization of the workflow for executing an instruction cell, in accordance with the present disclosure. For comparison, the diagram 300 presents two example processes for executing the same instruction cell on a timeline 330, with each step in the execution workflow represented as a rectangle. It is to be understood that the diagram 300 is presented in a qualitative manner but may not be quantitatively precise. Further, the number of steps is only illustrative, and different number of steps in an execution workflow may exist in different implementations.

[0027] As shown by the example process 310, the execution workflow of a single instruction cell by a processor in this example comprises a set of steps 311-314. When a thread running on the processor starts to execute the cell at the time tO 331 step by step, the workflow would be completed at the time t2 333.

[0028] Depending on the nature of the steps and the specific implementation, these steps are partially ordered. In other words, some of the steps must follow each other strictly sequentially, while others do not depend on the result of each other and thus may be completed in parallel when other preconditions for execution are satisfied. For example, steps 312 and 312 do not depend on each other, and each of them can be performed once the step 311 is completed. For example, in a dynamic dataflow architecture, the tagging step and the evaluation step of a single cell can be executed in parallel.

[0029] Thus, the step 313 does not have to wait for the completion of the step 312. The example process 320 presents another execution for the same instruction cell by a processor in accordance with some embodiments of the present disclosure. For simplification, assume that each step takes the same time as in the process 310 and the processor 320 also starts at the time tO 331. The difference is that in the process 320, the step 322 and 323 are parallelized. Though this way, the process 320 can be finished at the time tl 332, which is obviously earlier than the time t2 333.

[0030] For illustrating how the steps can be parallelized, reference is now made to Fig. 4, which illustrates an example conceptual architecture 400 for executing instruction cells of a program, in accordance with some embodiments of the present disclosure. For example, the architecture 400 may be implemented on the computing device 110 of Fig. 1.

[0031] In the architecture 400, a set of instruction cells of a program for execution may be loaded to a memory 430, such as the cell 431. Each cell corresponds to a node in a DFG of the program. An arbitration network 410 may receive “operation packets” from the fulfilled cells for execution. A thread running on a processor may perform the action of an arbitration network 410, which captures a cell from the set of instruction cells for execution.

[0032] Instead of going through the whole execution workflow of the captured cell, when a thread of the arbitration network 410 captures a cell, it may perform one of the steps in the workflow. Conceptually, after performing a corresponding execution step, the process executed by the thread may flow to a distribution network 420. The thread may perform the action of the distribution network 410, which modifies the cells as necessary depending on the executed step. Then, the thread may again capture an available cell for performing a certain step on this available cell.

[0033] In the architecture 400, the process may flow to the step of check 441 , where the thread may check the captured cell to determine if the cell is ready for evaluation. The cell that is determined to be ready may be captured in the next round by the same thread or a different thread for executing another step, depending on the implementation.

[0034] In the architecture 400, the process may flow to a step of evaluation 442 when the cell is ready for evaluation. At this step, the thread may send an operation packet 452 of the captured cell to an operation unit (e.g., an ALU in the processor). The operation unit may compute the evaluation result for the operation corresponding to the instruction cell, which turns the operation packet 452 into a data packet 451 comprising the evaluation result. Different from the conventional dataflow machine, the evaluation result may be first stored with the cell, e.g., stored in the metadata 432 for the cell 431, as will be described below in more detail.

[0035] In the architecture 400, the process may flow to a step of delivery 443 for a cell that has got its evaluation result. At this step, the result data is taken from the cell and delivered to the instruction cell(s) that are waiting for the result data. The thread may perform the action of the distribution network 420, which places data or references to data into corresponding instruction cells.

[0036] In the architecture 400, the process may also flow to a step of tagging 444 such as in a dynamic dataflow architecture case, during which the captured cell is tagged. Further, for an instruction cell, the architecture 400 allows the evaluation 442 and the tagging 444 for the same instruction cell to be performed in parallel. For example, when a thread captures the cell 431 and is still going through the flow of tagging 444, another thread may capture the cell 431 and go through the flow of evaluation 442 in parallel. It is to be understood that the specific steps shown in Fig. 4 are only illustrative, and the steps of workflow for executing an instruction cell, as well as the parallelizable steps, may vary among different implementations according to the present disclosure. [0037] To enable a thread to capture a cell to perform an appropriate step, in the architecture 400, each instruction cell may be loaded with metadata of the cell (e.g., the metadata 432 of the cell 431). Thus, each cell has additional read/write memory space for maintaining the meta information about the cell.

[0038] The metadata of the cell may indicate one of a plurality of execution states. When an instruction cell is executed by a thread of the processor on a device that implements the architecture 400, it may be in one of the plurality of execution states. In contrast, when treating the execution of an instruction cell as an atomic job in a conventional machine, the cell may only have a binary execution state, indicating whether the cell is executed.

[0039] Based on the metadata of the instruction cell, a free thread may select an available cell from the set of instruction cells for performing a corresponding step. In some embodiments, when the thread determines that the metadata of the cell satisfies the requirement to perform a corresponding execution step, it will select (namely, capture) the cell for execution. In some embodiments, for a selected cell, a thread may transition the cell from the current state to another state based on the metadata of the cell. Based on performing the execution step, the thread may update the metadata of the cell to indicate the other execution state, thus enabling the cell to be captured for the next execution step.

[0040] In this way, each cell is conceptually turned into a Finite State Machine (FSM) with a number possible of states corresponding to the steps of the execution workflow. Each thread may take a cell (or multiple cells as will be described later) and make an attempt to transition it to one of the next possible states.

[0041] The metadata may indicate one or more inputs for evaluating the cell. For example, the data of an input or a reference to the data may be stored in a slot in the metadata. In some embodiments, the metadata may indicate that the cell has all the inputs available, and the cell is ready for evaluation. Then, the thread may perform the evaluation step on the cell according to the one or more inputs. At the evaluation step, the thread may evaluate an operation corresponding to the instruction cell with the one or more inputs. For example, it may send the operation packets of the cell with the inputs to an ALU to let the ALU compute the output of the operation. In some embodiments, the metadata of the instruction cell may indicate an output of evaluating the operation corresponding to the instruction cell. At the evaluation step, the thread may update the metadata to store the result of the evaluation in the metadata, e.g., at a corresponding slot.

[0042] In some embodiments, the metadata may indicate that the cell is evaluated and the output is available in its metadata, and thus the cell is ready for the delivery step. At this step, the thread may determine one or more instruction cells from the set of instruction cells. For example, the metadata may store the dependency among the instruction cells according to the DGF of the program under execution, and as the function of the distribution step, the thread may determine one or more cells that are waiting for the outputs of the current cell. Then, the thread may store the output in the metadata of the waiting cell(s) as an input.

[0043] The state corresponding to a certain step may comprise multiple aspects, e.g., if the state shows no conflict with other threads for performing this certain step. This may be supported by some stepwise locking mechanism as will be described later. For example, in the case that the tag and evaluation of a cell may be parallelized, if all the inputs are available and the cell is tagged or is being tagged by another thread, then the thread may perform the evaluation step, but not the tagging step.

[0044] Further, other states and corresponding steps other than those presented above may be available. For example, the state may indicate that the cell still waiting for its input(s), and the thread or another thread will check it again later. For example, the state may indicate that the output of this cell has been distributed to other cells, and the thread may perform a delete step to remove the cell from the memory or make it unavailable otherwise.

[0045] For illustration, the states before and after performing a set of example steps in the execution workflow are presented in mathematic formalization below. In the formalization, an instruction cell is defined as a set as shown in formula ( 1 ) : pi 1= [v,ip, ko, ki, ..., km], m = v (1) where v is a reference to an operator in the DFG with the arity of m, \|/ is a reference to the “parent” cell which needs to be evaluated so as to evaluate the present cell, kj is an evaluation slot related to one of the edges of v as shown in (2):

[0046] The where “ — ► ” means that there is no interest yet in the evaluation of this edge,

“ means that this edge may need to be evaluated, “ means that this edge

must be evaluated using the operator v with p as its parent, “ ” means that the data

for this edge will be read from . a slot, as soon as it is evaluated, and “ ” means that the data d is already in the cell and can be propagated to other cells, which are waiting.

[0047] On this basis, a set of example steps in an execution workflow is presented below in formula (3) to (7), where the sub-formula above the mid-line shows the state of a cell that satisfies the requirement to perform a corresponding step, and the sub-formula below the mid-line shows the state that the cell needs to be turned to by performing the step on it.

[0048] The formula (3) presents a “copy” step of a cell

By performing this step, the input of the cell is copied into its input slot(s).

[0049] The formula (4) presents a “deliver” step of a cell 2 . By performing this step, the out of the cell A is delivered to the waiting cell A .

[0050] The formula (5) presents a “delete” step of a cell

By performing this step, the cell is removed from the memory or is made to be unavailable otherwise.

[0051] The formula (6) presents an “evaluation” step. By performing this step, the output of the cell A is computed and stored together with this cell.

[0052] The formula (7) presents an “instantiate” step of a cell P2 . By performing this step, the cell is instantiated upon need.

[0053] In a demonstrative simulation execution process for a program according to the above formulization, the number of CPU ticks spent to calculate the 16th Fibonacci number was 1.6 times smaller than the number of ticks spent by the same simulator to calculate the same Fibonacci number without the step-wise parallelization

[0054] It is to be understood that the above examples are only illustrative. The possible execution states of an instruction cell may vary depend on the operation units, the arbitration network, and the distribution network implemented in a specific processor. The transfer of “control” between steps is orchestrated by the arbitration network. Its threads “activate” the cells, letting them transition to the next state if necessary and possible.

[0055] The execution processes in accordance with the embodiments of the present disclosure are now illustrated in conjunction with a simplified example program in the following text. This non-limiting example program is to calculate a sum of two numbers. One of the numbers is calculated by the division operator and the other is calculated by the multiplication operator. Mathematically, the program is to calculate the y in the following equation:

32, 16, x₃ := 11, y

+ x₂ x x₃

[0056] Fig. 5 illustrates an example DFG 500 representing the above-described program that may be executed in accordance with some embodiments of the present disclosure. As Fig 5 shows, the DFG 500 consists of three nodes (namely, vertices) 510, 520, and 530, each representing an operator of the program. All necessary data are available as constants. The data constants are passed among operators through the edges representing data channels.

[0057] Each operator has input edge(s) indicating the data required to evaluate this operator. The operator further has output edge(s) indicating to which other operator(s) the evaluation result of this operator needs to be passed. For example, the evaluation of the division operator represented by the node 520 (“div”) requires data 541 (“ _x”) and data 542 (“x₂”) as inputs, illustrated by an edge 551 and an edge 552. Its evaluation result needs to be passed to the addition operator represented by the node 530 (“add”) as an evaluation input, as illustrated by an edge 553.

[0058] Overall, the program must evaluate y as represented by data 543, and present it as the output of the node 530. Among these operators, the evaluation of operators represented by the node 510 and the node 520 (i.e., “div” and “mul”) are not dependent on each other. Thus, their evaluation can be parallel. On the other side, the evaluation of the operator represented by the node 530 has to wait, until the evaluation results of the other two operators are available.

[0059] As a comparison, Fig. 6 illustrates an example diagram 600 demonstrating an execution process of the program corresponding to the DFG 500 by multiple threads running on a processor. In the example of Fig. 6, the steps of executing a cell are performed by one thread sequentially.

[0060] The example of Fig.6 shows the execution of DFG 500 by four threads 610 (“Ti”), 620 (“T2”), 630 (“T3”), and 640 (“T4”). Each of the threads is represented by a horizontal sequence of rectangles on a timeline 650. Each rectangle in the sequences represents an execution step of an operator (the instruction cell corresponding to the operator) and occupies some amount of horizontal space reflecting the time spent on performing this step. The horizontal space is measured in discrete increments each representing a CPU tick, as illustrated by the timeline 650.

[0061] It is to be understood that the durations of steps are only illustrative. Further, there is one tick separation between the consecutive two rectangles for the sake of readability. It is to be understood by those skilled in the art that such separations may not necessarily exist in practice.

[0062] As shown in Fig. 6, the thread 610 may start with the check step 611 (labeled as “C”) and may capture the first operator that has all necessary input data. In this example, it may capture the instruction cell corresponding to the multiplication operator represented by node 510 in Fig. 5. Taking the execution process of an instruction cell as an atomic job, the thread 610 may proceed till the process is completed. In this illustrative example, the thread 610 may first continue with a tagging step 612 (labeled as “T”) that tags the cell. Then, it may perform an evaluation step 613 (labeled as “Emui”) on the cell, which calculates the result of an arithmetic multiplication. After that, the thread 610 may perform the delivery step 614 (labeled as “D”) that delivers the result of the step 613 to the cells that were waiting, which will be the cells corresponding to the addition operator shown by node 530 in Fig. 5.

[0063] The thread 620 may capture the instruction cell corresponding to the division operator represented by the node 510. It follows the same routing as the thread 610 processing the cell for multiplication. However, as shown by the evaluation step 623 (labeled as “Ediv”), it takes longer than the evaluation step 613 to evaluate the division due to the different complexity of the evaluation. As a result, the execution of the division by then thread 620 is completed at a later time.

[0064] When the thread 620 is still working on the division operator, the thread 610 may check another instruction cell at the check step 615, in an attempt to capture it for processing. However, the only left cell (i.e., the cell corresponding to the addition operator) needs the evaluation result of the division operator as input, which is not ready yet. Thus, the check step does not lead to processing of any cell.

[0065] After that, the thread 610 may perform another check step 616 which captures the instruction cell corresponding to the addition operator for further processing. During the execution of this program, the other two available threads 630 and 640 remain work-less. They make attempts to check instruction cells and capture an available one (e.g., as shown by steps 631 and 641) but keep failing.

[0066] In the coarse-grained parallelism example shown in Fig. 6, the execution workflow of an instruction cell is considered as an atomic job and performed by one thread sequentially step by step. In such a situation, the maximum number of threads that a processor can utilize equals the number of operators that can be evaluated parallel according to the DFG of the program under execution. Adding more threads to the processor will not speed up the execution of this program.

[0067] Fig. 7 illustrates an example diagram 700 demonstrating an execution process of multiple threads executing instruction cells of a program, in accordance with some embodiments of the present disclosure. For illustration, the example of Fig.7 shows an execution process of DFG 500 also by four threads, namely threads 710 (“Tl”), 720 (“T2”), 730 (“T3”), and 740 (“T4”). These threads are presented on a timeline 750 in a similar form as used by the example in Fig. 6.

[0068] In the example in Fig. 7, a step of the execution workflow of a single instruction cell is treated as an atomic job, instead of the whole workflow as in the example of Fig. 6. Further, some of the steps in a workflow are not dependent on each other and thus their executions may be parallelized. For illustration, in the example implementation used for executing the process in Fig. 7, the tagging step and evaluation step for executing the same instruction cell can be executed in parallel.

[0069] As shown in Fig. 7, the thread 710 may start with the check step 711 and may capture the instruction cell corresponding to the multiplication (“mul”) operator. In this example, the thread 610 keeps this captured for performing a tagging step 712. On the other side, the thread 730 may not able to determine checking results for any cells at the check step 731 in Fig.7, since the available two cells at this moment are checked by the thread 710 and the thread 720.

[0070] Once the cell corresponding to multiplication is determined to be available for execution at the check step 711 , it is available for both the tagging step and the evaluation step. As a result, while the thread 710 is performing the tagging step 712, the thread 730 may also pick up the cell for multiplication to perform the evaluation step 732 on the cell. Thus, the periods for executing the tagging step 712 and the evaluation step 732 of the same instruction cell can be at least overlapped. Similarly, the thread 740 may pick up the cell corresponding to the division operator to perform the evaluation step 742 of the division, while this cell is still being tagged by the thread 720 at the tagging step 722.

[0071] In this example, after the evaluation steps 732 and 742, the threads 730 and 740 may keep the respective cells captured and may further perform the delivery steps 733 and 743. Each of the delivery steps 733 and 743 delivers respective evaluation results to the cell(s) that are waiting, which is the cell corresponding to the addition operator in this example.

[0072] After all the required inputs have been delivered to the cell for addition, the thread 710 may perform the check step 716 on this cell, it may determine that this cell can be further processed. Then, the thread 710 may perform the tagging step 717 for this cell. Similarly, when the cell for addition is still being tagged by the thread 710, the thread 720 may also capture this cell to perform the evaluation step 717 (labelled as “Eadd”) for it.

[0073] The fine-grained parallelism at the step level can speed up the execution process of programs and enable more intensive utilization of the available threads. For example, compared to the process shown in Fig. 6, the process shown in Fig. 7 shortens the time required to execute the same example program from 32 ticks to 26 ticks (including the illustrative pause ticks between steps).

[0074] Consider the utilization of a thread as a ratio between the number of CPU-ticks during which the thread is occupied and the number of CPU-ticks taken for the whole execution. The total utilization of the entire thread pool in the example of Fig. 6 is 36%, while the utilization of the thread pool in Fig. 7 increased to 45% (including the illustrative pause tick when the cell is kept by a thread, e.g. as indicated by the arrow 760). The higher utilization may result in more effective usage of the processor resources, including higher power efficiency.

[0075] The utilization of all threads can be more balanced with such fine-grained parallelism. For example, in the coarse-grained examples of Fig. 6, while the thread 610 is occupied most of the time, the thread 630 and the thread 640 remain work- less during the entire lifecycle of the program. In comparison, the four threads 710-740 in Fig.7 are more balanced-used.

[0076] It is to be understood that the diagram 700 is only illustrative. For example, the thread that captures a certain step of a certain cell may vary among different executions of the program. In another execution, it could be one of the threads 720-740 that performs the tagging step on the cell for addition, while thread 710 may perform the evaluation step in parallel.

[0077] Further, in the example in Fig.7, after capturing a cell, the thread may keep the cell captured to perform the next possible step. This is only illustrative and various variations that parallelize execution steps of the same instruction cell also fall into the scope of the present disclosure. For example, after determining a cell is available for execution, a thread may continue with the evaluation step on the cell, while another thread may capture the cell for tagging in parallel. As another example, it is possible that a cell may be captured by different threads for each of the different execution steps. In other implementations, the execution process of an instruction cell may comprise more steps, fewer steps, and/or different steps than those presented in the examples, and different steps may be parallelized in other implementations.

[0078] In some embodiments, at least some of the steps in the workflow may be implemented as parallel computations (also known as “matrix” or “vector” operations), meaning that they make modifications to multiple instruction cells simultaneously, instead of modifying one cell at a time. In other words, during the execution of a program, a thread may capture multiple instruction cells simultaneously and performs. For example, the thread may determine that the metadata of each of a plurality of instruction cell satisfies the requirement to perform a particular execution step. Upon the determination, the thread may capture the plurality of cells to perform the particular execution step simultaneously.

[0079] Fig. 8 illustrates another example diagram 800 demonstrating the progress of multiple threads executing instruction cells of a program, in accordance with some embodiments of the present disclosure. The example of Fig.8 shows an execution process of DFG 500 by four threads 810 (“Tl”), 820 (“T2”), 830 (“T3”), and 840 (“T4”) and present them on a timeline 850 in a similar form as used by the example in Figs. 6 and 7.

[0080] The example implementation used to perform the execution in Fig. 8 allows parallelism of the tagging step and the evaluation step for executing one instruction cell. Different from the example implementation used to perform the example execution illustrated in Fig. 7, the example implementation used in Fig.8 further allows a thread to perform the check step and tagging step for multiple instruction cells simultaneously.

[0081] As shown, the thread 810 may check both available instruction cells corresponding to the multiplication operator and division operator, as represented by nodes 520 and 510, respectively. The two cells are then available for both the tagging step and the evaluation step. The thread 810 may perform the tagging step on both the cell for multiplication and the cell for division. While the thread 810 is stilling tagging these two cells, the thread 830 may perform the evaluation step 832 for the multiplication cell and the thread 830 may perform the evaluation step 842 for the multiplication cell. Then, after the evaluation results of these two cells have been delivered to the cell for addition, this cell becomes available and is checked by the thread 810 at the check step 816. After that, the tagging step 817 and evaluation step 827 for the addition cell may be performed in parallel. For this example, there are no further cells to be executed when executing the cell for addition.

[0082] With the enablement of processing of a same step on multiple cells simultaneously, the overall utilization of threads can be decreased and free time becomes available for other processing, while still maintaining the benefit of step-level parallelism as illustrated by Fig. 7. For example, suppose that the DFG 500 is revised to have further two available cells when the addition cells become available. Without vector/matrix processing, the threads 830 and 840 would have been occupied for tagging the further cells when the addition cell is tagged by the thread 810 at 817. In comparison, with thread 810 being able to tag all available cells, the threads 830 and 840 would be able to begin to evaluate the two further cells at the time corresponding to the step 836 and 845.

[0083] In the partially ordered execution workflow, some locking/unlocking mechanisms are also needed to prevent collisions or races between threads. In such cases, instead of locking a cell during the whole execution flow, the locking strategy can be applied on the step-level. In some embodiments, when a running thread of a processor is performing an execution step on an instruction cell, at least the instruction cell may be locked according to the execution step. In other words, the locking strategy is determined and applied to an execution step for an instruction cell.

[0084] The specific locking applied to an execution step depends on the nature of the implemented step. For example, for performing an evaluation step, it may be sufficient to lock the cell being evaluated. For example, the locking strategy of a delete step may be setting a binary flag in the corresponding cell marking it as “deleted”. For example, for a step that can be executed in parallel with another step, the locking strategy needs to take the parallelism into account and allows the cell to be captured by a sufficient number of threads but not more than that.

[0085] As another example, a cell execution workflow may comprise an instantiation step. It can be performed when the metadata of an instruction cell indicates that it needs another cell to be instantiated. Then, the responsible thread may perform the step on the instruction cell, which instantiates the required cell and update the metadata accordingly. During the execution of the instantiate step, the entire set of instruction cells may be locked for the thread to find the cell to be instantiated and fill it up with data.

[0086] The above examples of locking the cell at the granularity of an execution step are only illustrative. The locking for the same type of steps may vary among different implementations according to the present disclosure. While avoiding locking a cell only for one thread during the whole workflow, the step-level locking can enhance the thread safety.

[0087] Fig. 9 illustrates a schematic block diagram of a device 900 that may be used to implement embodiments of the present disclosure. The device 900 may be the device or apparatus described in the embodiments of the present disclosure, such as the computing device 110. As shown in Fig. 9, the device 900 includes a processor 901, which may be configured to execute various appropriate actions and processing to perform the methods (e.g., the method 200) of the present disclosure. The processor 901 is implemented in hardware, firmware, or a combination of hardware and software. In addition, although not shown in Fig. 9, the device 900 may also include a co-processor.

[0088] The processor 901 may execute actions and processing to perform the methods of the present disclosure according to computer program instructions. The computer program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, status setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages as well as conventional procedural programming languages. In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), is customized by utilizing status information of the computer-readable program instructions. The electronic circuit may execute the computer-readable program instructions so as to implement various aspects of the present disclosure.

[0089] These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or a further programmable data processing apparatus, thereby producing a machine, such that these instructions, when executed by the processing unit of the computer or the further programmable data processing apparatus, produce means for implementing functions/actions specified in one or more blocks in the flow charts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium, and these instructions cause a computer, a programmable data processing apparatus, and/or other devices to operate in a specific manner; and thus the computer-readable medium having instructions stored includes an article of manufacture that includes instructions that implement various aspects of the functions/actions specified in one or more blocks in the flow charts and/or block diagrams.

[0090] The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatuses, or other devices, such that a series of operating steps may be executed on the computer, the other programmable data processing apparatuses, or the other devices to produce a computer-implemented process, such that the instructions executed on the computer, the other programmable data processing apparatuses, or the other devices may implement the functions/actions specified in one or more blocks in the flow charts and/or block diagrams.

[0091] The computer program instructions may be stored in a Read-Only Memory (ROM) 902 or be loaded onto a Random Access Memory (RAM) 903 from a storage unit 908, for example. The processor 901, the ROM 902, and the RAM 903 are connected to each other via bus 904. An Input/Output (I/O) interface 905 is also connected to bus 904. The various methods or processes described above may be performed by the processor 901.

[0092] A plurality of components in device 900 are connected to the I/O interface 905, including: an input unit 906, such as a keyboard and a mouse; an output unit 907, such as various types of displays and speakers; the storage unit 908, such as a magnetic disk and an optical disc; and a communication unit 909, such as a network card, a modem, and a wireless communication transceiver. The communication unit 909 allows the device 900 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks.

[0093] In some embodiments, the methods and processes described above may be implemented as a computer program product. The computer program product may include a computer-readable storage medium on which computer-readable program instructions for performing various aspects of the present disclosure are loaded.

[0094] The computer-readable storage medium may be a tangible device that may retain and store instructions used by an instruction-executing device. For example, the computer-readable storage medium may be, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical coding device, for example, a punch card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the foregoing. The computer-readable storage medium used herein is not to be interpreted as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber-optic cables), or electrical signals transmitted through electrical wires.

[0095] The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to various computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from a network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.

[0096] The flow charts and block diagrams in the drawings illustrate the architectures, functions, and operations of possible implementations of the devices, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flow charts or block diagrams may represent a module, a program segment, or part of an instruction, and the module, program segment, or part of an instruction includes one or more executable instructions for implementing specified logical functions. In some alternative implementations, functions marked in the blocks may also occur in an order different from that marked in the accompanying drawings. For example, two consecutive blocks may in fact be executed substantially concurrently, and sometimes they may also be executed in a reverse order, depending on the functions involved. It should be further noted that each block in the block diagrams and/or flow charts as well as a combination of blocks in the block diagrams and/or flow charts may be implemented using a dedicated hardware-based system that executes specified functions or actions, or using a combination of special hardware and computer instructions.

[0097] Various embodiments of the present disclosure have been described above. The foregoing description is illustrative rather than exhaustive, and is not limited to the disclosed various embodiments. Numerous modifications and alterations are apparent to persons of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The selection of terms as used herein is intended to best explain the principles and practical applications of the various embodiments or the technical improvements to technologies on the market, or to enable other persons of ordinary skill in the art to understand the various embodiments disclosed herein.

Claims

WHAT IS CLAIMED IS:

1. A method for processing an instruction cell, comprising: performing, by a first thread running on a processor, a first execution step from a set of execution steps on an instruction cell during a first time period; and performing, by a second thread running on the processor, a second execution step from the set of execution steps on the instruction cell during a second time period, wherein the second time period overlaps at least partially with the first time period.

2. The method of claim 1, further comprising: loading the instruction cell and metadata of the instruction cell to a memory, wherein the metadata indicates one of a plurality of execution states.

3. The method of claim 2, wherein performing the first execution step on the instruction cell comprises: transitioning, based on the metadata of the instruction cell, the instruction cell from a first execution state of a plurality of execution states to a second execution state of the plurality of execution states.

4. The methods of claim 3, wherein transitioning the instruction cell from the first execution state to the second execution state comprises: updating, based on the performing of the first execution step, the metadata of the instruction cell to indicate the second execution state.

5. The method of claim 2, further comprising: selecting, by the first thread and based on the metadata of the instruction cell, the instruction cell from a set of instruction cells for performing the first execution step.

6. The method of claim 5, wherein selecting the instruction cell comprises: determining that the metadata of the instruction cell satisfies a requirement to perform the first execution step.

7. The method of claim 5, wherein the metadata of the instruction cell indicates one or more inputs for evaluating the instruction cell, and wherein performing the first execution step on the instruction cell comprises: in response to determining that the one or more inputs are available, evaluating the instruction cell according to the one or more inputs.

8. The method of claim 7, wherein evaluating the instruction cell comprises: performing an evaluation of an operation corresponding to the instruction cell with the one or more inputs; and storing an output of the evaluation in the metadata of the instruction cell.

9. The method of claim 5, wherein the metadata of the instruction cell indicates an output of evaluating an operation corresponding to the instruction cell, and wherein performing the first execution step on the instruction cell comprises: in response to determining that the output is available in the metadata, determining, from the set of instruction cells, one or more instruction cells; and storing the output in metadata of the one or more instruction cells as an input for evaluating the one or more instruction cells.

10. The method of claim 5, wherein each of the set of instruction cells corresponds to a node in a dataflow graph (DFG) of a program for execution.

11. The method of claim 1 , further comprising: performing, by a third thread running on a processor, a third execution step from the set of the execution steps on the instruction cell during a third time period; and locking, according to the third execution step, at least the instruction cell during the performing of the third execution step.

12. The method of claim 1, further comprising: performing, by a fourth thread running on the processor, execution steps of a same type on a plurality of instruction cells simultaneously.

13. A processor configured to: perform, by a first thread running on the processor, a first execution step from a set of execution steps on an instruction cell during a first time period; and perform, by a second thread running on the processor, a second execution step from the set of execution steps on the instruction cell during a second time period, wherein the second time period overlaps at least partially with the first time period.

14. An electronic device, comprising: a processor according to claim 13 ; and a memory coupled to the processor, storing an instruction cell executed by the processor.

15. A computer program product tangibly stored on a computer-readable medium and comprising machine-executable instructions, wherein the machine-executable instructions, when executed, cause a machine to perform the method according to any one of claims 1 to 12.