
WO2024255997A1 - Devices and methods for optimizing parallel execution of a machine learning model - Google Patents

Devices and methods for optimizing parallel execution of a machine learning model Download PDF

Info

Publication number
WO2024255997A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
representation
task
data processing
dependencies
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2023/065907
Other languages
French (fr)
Inventor
Ioannis LAMPROU
Zhen Zhang
Etienne FILHOL
Denis BARTHOU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9027Trees
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/45Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/451Code distribution
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Definitions

  • the present disclosure relates to information processing technology. More specifically, the present disclosure relates to devices and methods for optimizing the parallel execution of a machine learning, ML, model on a plurality of processing devices.
  • BACKGROUND Machine Learning (ML) models, such as artificial neural networks, are being implemented in more and more electronic devices for a variety of different purposes and are usually very demanding with respect to computational resources.
  • For instance, current Deep Neural Networks (DNNs), such as PanGu-α and GPT-3, may have up to billions of parameters. Training such models requires a very large number of computations.
  • CN 112948079 B discloses a task scheduling method to support the execution of a DNN onto a target hardware.
  • US 2021/0255896 A1 discloses a method to process tasks in parallel in the context of AI technologies.
  • US 11,010,313 B2 discloses an apparatus with a plurality of PEs for performing inference acceleration.
  • SUMMARY It is an objective to provide improved devices and methods for optimizing parallel execution of a machine learning, ML, model on a plurality of processing devices.
  • a data processing apparatus for static parallel optimization of a machine learning, ML, model for operation on a plurality of processing devices, wherein each processing device comprises a plurality of processing elements, PEs, configured to perform one or more of a plurality of ML model tasks.
  • the data processing apparatus is configured to generate based on the ML model a computational graph, CG, representation of the ML model, wherein the CG representation comprises a plurality of nodes and a plurality of edges, wherein each of the plurality of nodes is associated with one or more of the plurality of ML model tasks and wherein the plurality of edges defines a plurality of precedence dependencies of the nodes of the CG representation.
  • the data processing apparatus is configured to generate an enhanced CG representation of the ML model by adding to the CG representation of the ML model one or more further dependencies of the plurality of nodes of the CG representation of the ML model.
  • the data processing apparatus is further configured to compile the enhanced CG representation of the ML model for generating code for each of the plurality of processing devices.
  • the data processing apparatus according to the first aspect achieves the intra-device speeding up of a CG by statically adjusting the parallel execution ordering of nodes, thus improving the DNN execution performance.
  • CG nodes may correspond to executable tasks of multiple types, such as tensor computation, vector unit computation, cube unit computation, communication (gather and reduce), and the like.
  • a device comprises multiple PEs.
  • Each PE is associated with a specific task type. Depending on parallelism granularity, a task may saturate all PEs of the same type in order to reach peak performance or just a fraction of them.
  • the data processing apparatus according to the first aspect makes efficient use of PEs in order to enhance the parallel execution of tasks and overall improve end-to-end DNN performance through a global static optimization scheme.
  • the data processing apparatus according to the first aspect is able to support the execution of large DNN models at improved, nearly optimal, training time via the design of a static global scheme for parallel computing optimization of the Computational Graph (CG) corresponding to the DNN model.
  • the data processing apparatus is configured to add the one or more further dependencies to the CG representation of the ML model by adding one or more further edges to the CG representation of the ML model.
  • the data processing apparatus for adding the one or more further dependencies to the CG representation of the ML model the data processing apparatus is configured to: assign each of the plurality of ML model tasks associated with the plurality of nodes of the CG representation of the ML model to a respective PE of the plurality of PEs, wherein the respective ML model task and the respective PE are of the same type; determine for each PE of the plurality of PEs an order relation of the one or more ML model tasks assigned to the PE for obtaining a plurality of order relations; determine a plurality of task dependencies based on the plurality of order relations; and determine the one or more further dependencies of the plurality of nodes of the CG representation of the ML model based on the plurality of task dependencies.
  • the data processing apparatus for assigning each of the plurality of ML model tasks associated with the plurality of nodes of the CG representation of the ML model to the respective PE of the plurality of PEs, is configured to: determine a plurality of precedence constraints among the plurality of ML model tasks associated with the plurality of nodes of the CG representation of the ML model; determine a weight for each ML model task of the plurality of ML model tasks associated with the plurality of nodes of the CG representation of the ML model; and determine for each PE of the plurality of PEs the one or more ML model tasks assigned to the PE and the order relation of the one or more ML model tasks assigned to the PE based on the plurality of precedence constraints among the plurality of ML model tasks and based on the weight for each ML model task of the plurality of ML model tasks.
  • the data processing apparatus is configured to determine the weight for each ML model task of the plurality of ML model tasks based on a cost function, and/or based on ML model operator profiling information, and/or based on information about the set of input tensors and/or output tensors of the respective node of the plurality of nodes of the CG representation of the ML model.
  • the data processing apparatus is configured to determine the assignment of ML model tasks to the plurality of PEs and the plurality of order relations for the plurality of ML model tasks based on integer linear programming, ILP, constraint programming, CP, and/or heuristic algorithm design such that the maximum makespan among the plurality of PEs is minimized.
  • each of the plurality of PEs belongs to one of a plurality of PE types and wherein for determining the plurality of task dependencies based on the plurality of order relations the data processing apparatus is configured to: obtain for each of the plurality of PE types a type order relation for the tasks assigned to the PEs of the respective PE type; and determine for each PE type a plurality of task dependencies based on the type order relation obtained for the PE type.
  • the data processing apparatus is configured to determine the plurality of task dependencies based on the plurality of order relations corresponding to PE types as the set of task dependencies having the smallest number of task dependencies among a plurality of candidate sets of task dependencies.
  • the data processing apparatus for determining the one or more further dependencies of the nodes of the CG representation of the ML model based on the plurality of task dependencies is configured to: remove redundant task dependencies of the plurality of task dependencies for obtaining a reduced plurality of task dependencies; determine a plurality of node dependencies of the plurality of nodes of the CG representation of the ML model based on the reduced plurality of task dependencies; and determine the one or more further dependencies of the plurality of nodes of the CG representation of the ML model based on the plurality of node dependencies.
  • the data processing apparatus for compiling the enhanced CG representation of the ML model is configured to partition the enhanced CG representation of the ML model into a plurality of subgraphs of the enhanced CG representation of the ML model for each of the plurality of processing devices and to compile the respective subgraph of the enhanced CG representation of the ML model for each of the plurality of processing devices.
  • a data processing method for parallel optimization of a machine learning, ML, model for operation on a plurality of processing devices wherein each processing device comprises a plurality of processing elements, PEs, configured to perform one or more of a plurality of ML model tasks.
  • the data processing method comprises the steps of: generating based on the ML model a computational graph, CG, representation of the ML model, wherein the CG representation comprises a plurality of nodes and a plurality of edges, wherein each of the plurality of nodes is associated with one or more of the plurality of ML model tasks and wherein the plurality of edges defines a plurality of precedence dependencies of the nodes of the CG representation; generating an enhanced CG representation of the ML model by adding to the CG representation of the ML model one or more further dependencies of the plurality of nodes of the CG representation of the ML model; and compiling the enhanced CG representation of the ML model for generating code for each of the plurality of processing devices.
  • the data processing method according to the second aspect of the present disclosure can be performed by the data processing apparatus according to the first aspect of the present disclosure.
  • further features of the method according to the second aspect of the present disclosure result directly from the functionality of the apparatus according to the first aspect of the present disclosure as well as its different implementation forms described above and below.
  • a computer program product comprising a computer- readable storage medium for storing program code which causes a computer or a processor to perform the method according to the second aspect when the program code is executed by the computer or the processor.
  • Fig.1 shows a schematic diagram illustrating a data processing apparatus according to an embodiment for enhancing, in particular optimizing an ML model for parallelized operation on a plurality of processing devices
  • Fig.2 shows a schematic diagram illustrating the architecture of a processing device including a plurality of processing elements for performing different ML model tasks
  • Fig.3 shows an exemplary computational graph representation of an ML model as generated and used by a data processing apparatus according to an embodiment
  • Fig.4 shows an exemplary symmetric computational graph representation of a ML model as generated and used by a data processing apparatus according to an embodiment
  • Fig.5 shows a flow diagram, illustrating processing stages implemented at least partially by a data processing apparatus according to an embodiment
  • Fig.6 shows a flow diagram, illustrating further details of processing stages implemented at least partially by a data processing apparatus according to an embodiment
  • Fig.7 illustrates the transformation of a CG representation of an ML model into a task graph as implemented by a data processing apparatus according to an embodiment
  • a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa.
  • a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures.
  • if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures.
  • Figure 1 shows a schematic diagram illustrating a data processing apparatus 10 for static parallel optimization of a machine learning, ML, model for operation on a plurality of processing devices 20.
  • the data processing apparatus 10 may comprise one or more CPUs 11, a volatile memory 12 and a disk storage 13 for storing the ML model.
  • each processing device 20 comprises a plurality of processing elements, PEs, 21a-d configured to perform one or more of a plurality of ML model tasks, such as a plurality of tensor type PEs 21a, a plurality of vector type PEs 21b, a plurality of communication type PEs 21c as well as a plurality of PEs of other types 21d.
  • the plurality of devices 20 may be multi-PE-type AI accelerators.
  • the data processing apparatus 10, in particular the CPU 11 thereof is configured to generate based on the ML model a computational graph, CG, representation 30 of the ML model.
  • Figures 3 and 4 illustrate two exemplary CG representations 30 of an ML model.
  • the CG representation 30 comprises a plurality of nodes 31 and a plurality of edges 32, wherein each of the plurality of nodes 31 is associated with one or more of the plurality of ML model tasks and wherein the plurality of edges 32 define a plurality of precedence dependencies of the nodes 31 of the CG representation 30.
  • the data processing apparatus 10 is further configured to generate an enhanced CG representation of the ML model by adding to the CG representation 30 of the ML model one or more further dependencies of the plurality of nodes 31 of the CG representation 30 of the ML model.
  • the data processing apparatus 10 is further configured to compile the enhanced CG representation of the ML model for generating code for each of the plurality of processing devices 20.
  • the CPU 11 may be configured to generate the CG representation 30 based on neural network model files, e.g., a python file defining a natural language processing, NLP, neural network model.
  • the CG representation 30 may be generated by an upstream parallel module run on a cluster of devices, such as a MindSpore AutoPar optimizer, which can determine a CG representation 30 for each device 20.
  • the CG representation 30 may be composed of many operators on different PE types. Also, it may consist of multiple branches, where each branch can be of any form (often the upstream parallel module may force the branches to have symmetric form, as illustrated in figure 4).
  • Figure 5 shows a flow diagram illustrating processing stages implemented at least partially by the data processing apparatus 10 according to an embodiment for implementing the functionality described above.
  • the data processing apparatus 10 parses the ML model, in particular neural network, which may be provided, for instance, as a python file. Based on this parsing of the ML model, in particular neural network, the CPU 11 is configured to generate a computational graph, CG, representation 30 of the ML model, which comprises a plurality of nodes 31 and edges 32, as illustrated in figures 3 and 4.
  • the plurality of nodes 31 represent one or more ML model tasks (also referred to as neural operators), such as a matrix multiplication, while the plurality of edges 32 represent data-flow (variables) or control-flow between the operators and therefore a plurality of dependencies of the plurality of operators, i.e. nodes 31.
  • the CPU 11 may implement an upstream parallelism module, such as the MindSpore AutoPar optimizer, for partitioning the large CG representation 30 into a plurality of smaller-sized CG representations (for instance on the device-level).
  • the CPU 11 is configured to statically analyse the CG representation 30 of the ML model for adjusting the ordering of the plurality of nodes 31 by generating additional node dependencies and inserting them in the CG representation 30 of the ML model for generating an enhanced CG representation of the ML model.
  • In stage 102 of figure 5 the CPU 11 is configured to compile the enhanced CG representation of the ML model for generating executable code that may be provided to the plurality of processing devices 20.
  • In stage 103 of figure 5 the executable computational graph is executed on the target devices 20, for example for DNN training or inference.
  • A more detailed embodiment of stage 101 of figure 5 is illustrated in figure 6.
  • the stage 101 of figure 5 comprises the stages 200 to 203 of figure 6.
  • the CPU 11 initially is configured to generate based on the CG representation of the ML model an auxiliary data structure in the form of a task graph containing the plurality of ML model tasks.
  • a node of the task graph may correspond to multiple ML model task instances according to the specification of the CG representation. By default, a node may be represented by one task.
  • each task may be assigned a unique identifier, i.e. id.
  • the CPU 11 may be configured to maintain essential information like an identifier of the corresponding node of the CG representation, a processing element type capable of performing the task, the set of parent and children tasks, and information about the shape of its input and output tensors (intermediate variables shipped from one CG node to another over CG edges).
  • an instance representing target hardware PEs may be generated.
  • each task may be fine-grained parallelized, saturating all hardware PEs of its associated type.
  • all hardware PEs of a specific type may be represented by one instance.
  • a Davinci chipset may be equipped with 8 hardware Cube (tensor engine) PEs and each related node of a given CG on such a PE may be fine-grained parallelized and use 8 Cube PEs (those Cube PEs may thus be presented as one instance in this stage). For each PE instance, information about its id, its type, and parallelism granularity may be kept, as discussed above.
  • An exemplary task graph is shown in figure 7.
  • the nodes N0, N1, N2, N3, N6, N7 of the original CG representation shown on the left correspond to the type A tasks T0, T1, T2, T3, T6, T7 of the task graph shown on the right, while the original CG nodes S, N4, N5 correspond to type B tasks TS, T4, T5, respectively.
  • the parents of the task node TE are the task nodes T6 and T7, while the children of task node TS are the task nodes T0 and T1.
  • the CPU 11 may create two instances to represent two PE types, namely PE0 of type A and PE1 of type B.
  • In stage 201 of figure 6 the CPU 11 is configured to determine an order for each task on a PE to which it can be assigned, that is, a PE of the same type as the task itself.
  • Figure 8 shows sub-stages 300, 301 and 302 of the stage 201 of figure 6 according to an embodiment.
  • the CPU 11 is configured to extract the set of precedence constraints among the plurality of ML tasks. If two tasks are of the same type and one is a child of the other, then the CPU 11 is configured to assign the parent task earlier in the order than the child task, in case the two tasks are assigned to the same PE.
  • the CPU 11 may assign a weight value to each task based on a computed cost function, which may depend on the shapes of the input and/or output tensor(s) of a respective ML task and/or other criteria (stage 301 of figure 8).
  • the CPU 11 is configured to determine a task ordering within each PE instance.
  • the CPU 11 may implement to this end an ILP or CP scheme considering the previously computed precedence constraints (stage 300 of figure 8) and having the optimization objective to minimize the maximum makespan among all PE instances, which is a function of the assigned task weights (stage 301 of figure 8).
  • the CPU 11 may be configured to use a heuristic algorithm, which includes the greedy selection of a launchable task, e.g., the one with the fewest parents belonging to a different type and/or the largest bottom level value (capturing the seminal critical path idea), largest depth in the graph, and the like, to place in the first available position of the first available PE instance of the same type.
  • Figure 9 illustrates an example of stage 201 and its output for the task graph and the PE model provided in stage 200 of figure 6.
  • the CPU 11 is configured to convert the order relations (per PE) generated in stage 201 into a set of task-to-task dependencies in the task graph.
  • In stage 400, for each PE type, the CPU 11 identifies the partial order among all tasks associated with the PE type. For the example described above, this step follows directly, since a single PE per type is considered.
  • the CPU 11 is configured to transform the partial order per each PE type into a minimal set of task-to-task dependencies. In the example described above, one obtains dependencies (T0,T2), (T2,T1), (T1,T3), (T3,T6), (T6,T7), (T7,TE), for type A and dependencies (TS,T4), (T4,T5) for type B.
  • Figure 11 shows the generated dependencies for the exemplary task graph.
  • the CPU 11 is configured to insert the generated dependency logic into the original CG representation of the ML model.
  • a more detailed embodiment for implementing this stage 203 by the CPU 11 is illustrated in figure 12.
  • the CPU 11 is configured to identify any redundant dependencies, i.e. those dependencies that introduce an already existent precedence flow between two nodes.
  • dependency (T0,T2) is redundant since there already exists an edge (T0,T2) in the task graph.
  • For the dependency (TS,T4), there already exists a path from TS to T4, meaning that TS is already set to be executed before T4. Hence, the dependency (TS,T4) is also redundant.
  • Such redundant dependencies may be identified by the CPU 11 using a graph traversal scheme, such as BFS or DFS from the dependency source, or otherwise by a global pre-processing step determining all original descendants for each task (which may again be implemented as a standard graph traversal algorithm); a sketch of this filtering step is given after this list.
  • The remaining set of dependencies, that is, the ones found not to be redundant, is transformed in stage 501 of figure 12 by the CPU 11 into corresponding node-to-node dependencies, e.g., (T2,T1) corresponds to (N2,N1) in the input CG representation.
  • In stage 502 of figure 12 the CPU 11 is configured to introduce the node-to-node dependencies as additional edges in the input CG representation of the ML model, as also illustrated in figure 13.
  • the CPU 11 may be configured to use ILP or CP to determine the task ordering on each PE via modelling of a task scheduling process.
  • the below term “occupying” denotes that a task occupies an offset interval in a derived task schedule for a PE.
  • Let children(t) denote the set of tasks to which there is an edge from t.
  • Let type(t) denote the PE type associated with task t.
  • Let weight(t) denote the weight of task t.
  • Precedence constraints among tasks are reduced to simply checking the parent-child relationship, since u ∈ children(t) and v ∈ children(u) imply a precedence constraint (t, v).
  • For each PE type instance p, the set of its PEs is denoted as PEs(p).
  • the set of constraints introduced therefore becomes: (i) each task must be occupying a PE of its associated PE type; (ii) a task can start occupying only after the occupying of each of its parent tasks has ended; and (iii) the occupyings of any two tasks at the same PE cannot overlap each other.
  • a variable C_max is introduced capturing the makespan, that is, the maximum duration a PE is necessary for task processing.
  • the goal is to minimize the value of this variable in order to make optimal use of PE resources, whilst respecting the above stated constraints; a minimal sketch of such an occupying model using an off-the-shelf CP solver is given after this list.
  • the given CP representation may be linearized in order to obtain an ILP representation by modifying the non-linear constraints related to the choice of PE, that is, the constraints involving the PE assignment variables. For this purpose, it may be assumed that PEs take unique integer ids, which are consecutive in case they belong to the same PE type.
  • the absolute values may become obsolete by employing an extra set of binary variables and performing a similar transformation to the one already described above for the PE assignment variables.
  • Optimality can be determined, since by the graph topology all tasks must start after TS is finished and the solution leaves no gap between any two tasks of type A.
  • In stage 400 of figure 10 the order relation of tasks assigned to some type of PEs is computed.
  • the CPU 11 sets u ≺ v, that is, task u comes before task v in the order relation of PE type p, if e(u) ≤ s(v), where s(·) and e(·) denote the start and end of a task's occupying interval.
  • this is a partial ordering, since such a relationship might not be satisfied for each pair of two tasks of type p (in case multiple PEs are available for type p and some tasks execute in parallel on them when the default option for full saturation of PEs is not set).
  • each computed partial order per type is transformed to a minimal set of task-to-task dependencies.
  • this process corresponds to the transitive reduction of the induced dependency graph, i.e., a graph with the same task set containing edges (u, v) only if u ≺ v as defined in stage 400; a small sketch of this reduction is given after this list.
  • Such a logical formula can also be implemented using CP.
  • PE Type A: {(T0, T2), (T2, T1), (T1, T3), (T3, T6), (T6, T7), (T7, TE)}
  • PE Type B: {(TS, T4), (T4, T5)}.
  • An example embodiment is to save for each node all its ancestors anc(·), where u ∈ anc(v) if v ∈ children(u), or there exists w such that u ∈ anc(w) and v ∈ children(w).
  • the CPU 11 may implement heuristic algorithms for obtaining a solution fast.
  • a heuristic algorithm may be composed of a “selection” method and a “fitting” method.
  • the “selection” is to choose a task according to a topological sorting method or an inverse topological sorting method.
  • the “fitting” method determines above task ordering or occupying on the target PE.
  • one or more of the following “selection” methods may be employed by the CPU 11.
  • a task ready to be performed may be selected, that is, a task with all parents already performed in order to respect precedence constraints, by prioritizing tasks with the fewest/largest number of parent tasks of different PE type.
  • a task ready to be performed with the “largest/smallest bottom value” may be selected, i.e., total weight of a path from the task to a sink node.
  • a task ready to be performed with the “largest/smallest top value” may be selected, i.e., total weight of a path from the root node to the task.
  • any combination of the above approaches or other schemes like depth-first, breadth-first or heavier-first may be used by the CPU 11 to decide on task selection.
  • These combinations define a set of heuristic algorithms, which may be used in order to cover as much as possible of the whole search space.
  • the algorithm implemented by the CPU 11 iteratively selects a task ready to be performed by prioritizing tasks with the fewest number of parent tasks of different PE type.
  • Secondary criteria used for tie-breaking include selecting the task with the “largest bottom value” and, at a tertiary level, the task with the “smallest id”. For this example it is assumed that the tasks have the same weights as assigned for the embodiment above.
  • the selected task is then assigned to the eligible PE either i) having been assigned the smallest overall load (total weight) of tasks so far or ii) having the earliest available position for this task among all PEs of this type.
  • Let bl(t) stand for the “bottom level” value of task t, p(t) for the number of parents of different type for task t, and id(t) for a unique id assigned as a last sorting criterion (e.g., it can correspond to a breadth-first-search graph traversal of the task graph). All these values can be precomputed together in a single graph traversal.
  • Round 1: Select TS and assign it in order 0 in PE type B. Set = {T0, T1}.
  • Round 2: Select T0 and assign it in order 0 in PE type A. Set = {T1, T2}.
  • Round 3: Select T2 and assign it in order 1 in PE type A. Set = {T1, T4}.
  • Round 4: Select T1 and assign it in order 2 in PE type A. Set = {T4, T3}.
  • Round 5: Select T3 and assign it in order 3 in PE type A. Set = {T4, T5}.
  • Round 6: Select T4 and assign it in order 1 in PE type B.
  • stage 302 (based on the CP model in the embodiment above) returns the solution shown in figure 15.
  • the top row refers to the type B PE, the other two rows to the type A PEs.
  • the CPU 11 may implement an algorithmic approach for handling stages 400 and 401 as a whole. The idea is to traverse the implied interval graph (per type) from left to right.
  • starting from a task u, the move may be rightward, first ignoring any tasks v such that s(v) < e(u), and then adding a dependency to all tasks v such that e(u) ≤ s(v) < m, where m denotes the minimum end time among all tasks v such that s(v) ≥ e(u).
  • if v* stands for the minimizer among all tasks v with this property, then u ≺ v* and v* ≺ w for any task w such that s(w) ≥ e(v*), so dependencies from u to such later tasks are implied transitively; a sketch of this sweep is given after this list.
  • Step 1.1: Potential dependency TS to T4: s(T4) ≥ e(TS) and s(T4) < e(T4). Add dependency from TS to T4.
  • Step 1.2: Potential dependency TS to T5: s(T5) ≥ e(TS), but s(T5) ≥ e(T4). Do not add dependency. Stop Step 1.
  • Step 1.2: Potential dependency T0 to T2: s(T2) ≥ e(T0), and s(T2) < e(T2). Add dependency from T0 to T2.
  • Step 1.3: Potential dependency T0 to T3: s(T3) ≥ e(T0), and s(T3) < e(T2). Add dependency from T0 to T3.
  • Step 1.4: Potential dependency T0 to T6: s(T6) ≥ e(T0), but s(T6) ≥ e(T2). Do not add dependency. Stop Step 1. The process continues in the same manner.
  • PE Type A: {(T0, T2), (T0, T3), (T1, T2), (T1, T3), (T2, T6), (T2, T7), (T3, T6), (T3, T7), (T6, TE), (T7, TE)}
  • PE Type B: {(TS, T4), (T4, T5)}
  • Figure 17 shows a flow diagram illustrating processing steps of a data processing method 1700 according to an embodiment for enhancing, in particular optimizing a machine learning, ML, model for parallelized operation on the plurality of processing devices 20.
  • each processing device 20 comprises a plurality of processing elements, PEs, such as the PEs 21a-d illustrated in figure 2, which are configured to perform one or more of a plurality of ML model tasks.
  • the data processing method 1700 comprises a step 1701 of generating, based on the ML model, a computational graph, CG, representation 30 of the ML model, wherein the CG representation 30 comprises a plurality of nodes 31 and a plurality of edges 32.
  • each of the plurality of nodes 31 is associated with one or more of the plurality of ML model tasks and the plurality of edges 32 define a plurality of dependencies of the plurality of nodes 31 of the CG representation 30.
  • the method 1700 comprises a step 1703 of generating an enhanced CG representation of the ML model by adding to the CG representation 30 of the ML model one or more further dependencies of the plurality of nodes 31 of the CG representation 30 of the ML model.
  • the method 1700 further comprises a step 1705 of compiling the enhanced CG representation of the ML model.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the described embodiment of an apparatus is merely exemplary.
  • the unit division is merely logical function division and may be another division in an actual implementation.
  • a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces.
  • the indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
  • the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • functional units in the embodiments disclosed herein may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
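Referring to the redundant-dependency filtering of stage 500 mentioned above, the following Python sketch implements the global pre-processing variant that precomputes all original descendants of each task. The helper names are illustrative, and the small children dictionary in the usage example is an assumed fragment consistent with the figure-7 example, not a reproduction of it.

```python
def drop_redundant(dependencies, children):
    """Stage 500: discard dependencies already implied by the original task graph.

    dependencies: iterable of (src, dst) candidate extra dependencies;
    children: dict task -> set of direct children in the original task graph.
    A dependency is redundant if dst is already a descendant of src.
    """
    descendants = {}
    def desc(t):  # all original descendants of t, computed once per task
        if t not in descendants:
            d = set()
            for c in children.get(t, ()):
                d.add(c)
                d |= desc(c)
            descendants[t] = d
        return descendants[t]
    return [(s, d) for (s, d) in dependencies if d not in desc(s)]

# Assumed graph fragment: (T0, T2) is an existing edge and (TS, T4) is reachable
# via a path, so both candidates are filtered out as redundant.
children = {"TS": {"T0", "T1"}, "T0": {"T2"}, "T2": {"T4"}, "T1": {"T3"}}
print(drop_redundant([("T0", "T2"), ("TS", "T4"), ("T2", "T1")], children))
```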
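The occupying model described above for stage 302 can be prototyped with an off-the-shelf constraint solver. The application does not name a solver; the sketch below assumes the open-source OR-Tools CP-SAT solver and, as in the worked example, a single PE instance per type, so constraint (i) is satisfied by construction and only the precedence and no-overlap constraints remain.

```python
# pip install ortools  (assumed; the application itself does not name a solver)
from ortools.sat.python import cp_model

def cp_ordering(tasks, horizon=10_000):
    # tasks: dict id -> {"type": str, "weight": int, "parents": set of ids}
    model = cp_model.CpModel()
    start, end, interval = {}, {}, {}
    for t, info in tasks.items():
        start[t] = model.NewIntVar(0, horizon, f"s_{t}")
        end[t] = model.NewIntVar(0, horizon, f"e_{t}")
        interval[t] = model.NewIntervalVar(start[t], info["weight"], end[t], f"i_{t}")
    # (ii) a task may start occupying only after all its parents have finished
    for t, info in tasks.items():
        for p in info["parents"]:
            model.Add(start[t] >= end[p])
    # (iii) occupyings on the single PE of each type must not overlap
    for ty in {info["type"] for info in tasks.values()}:
        model.AddNoOverlap([interval[t] for t, i in tasks.items() if i["type"] == ty])
    # objective: minimize the makespan C_max
    c_max = model.NewIntVar(0, horizon, "c_max")
    model.AddMaxEquality(c_max, list(end.values()))
    model.Minimize(c_max)
    solver = cp_model.CpSolver()
    solver.Solve(model)
    return {t: (solver.Value(start[t]), solver.Value(end[t])) for t in tasks}

# Tiny example: TS (type B) precedes T0 and T1 (type A), which must be serialized.
print(cp_ordering({"TS": {"type": "B", "weight": 1, "parents": set()},
                   "T0": {"type": "A", "weight": 2, "parents": {"TS"}},
                   "T1": {"type": "A", "weight": 2, "parents": {"TS"}}}))
```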
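For stage 401, the minimal set of task-to-task dependencies equals the transitive reduction of the per-type order relation. A small sketch of that reduction follows, assuming the order is given as explicit (before, after) pairs; function names are illustrative only.

```python
def _reachable(succ, src):
    # all tasks reachable from src via the order relation (iterative DFS)
    seen, stack = set(), [src]
    while stack:
        u = stack.pop()
        for v in succ.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def minimal_dependencies(order_pairs):
    # order_pairs: set of (u, v) pairs with u ordered before v (a partial order)
    succ = {}
    for u, v in order_pairs:
        succ.setdefault(u, set()).add(v)
    reduced = set()
    for u, v in order_pairs:
        # (u, v) is redundant if v is reachable from another direct successor of u
        if not any(v in _reachable(succ, w) for w in succ[u] - {v}):
            reduced.add((u, v))
    return reduced

# A total order collapses to its consecutive pairs, as in the type-A chain above.
pairs = {("T0", "T2"), ("T0", "T1"), ("T2", "T1"), ("T1", "T3")}
print(minimal_dependencies(pairs))  # {("T0", "T2"), ("T2", "T1"), ("T1", "T3")}
```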
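The left-to-right sweep that handles stages 400 and 401 as a whole can be sketched as below. The start and end times in the usage example are assumed durations consistent with figure 15 rather than values taken from the application; with them, the returned dependency set matches the one listed above for PE type A.

```python
def interval_dependencies(schedule):
    # schedule: list of (task, start, end) tuples for one PE type, taken from the
    # occupying intervals of stage 302
    deps = []
    for u, _, e_u in schedule:
        later = [(v, s, e) for (v, s, e) in schedule if s >= e_u]  # tasks right of u
        if not later:
            continue
        m = min(e for (_, _, e) in later)  # minimum end time among those tasks
        deps += [(u, v) for (v, s, _) in later if s < m]
    return deps

# Type-A intervals consistent with figure 15 (durations assumed): T0 and T1 run in
# parallel on two PEs, then T2 and T3, then T6 and T7, then TE.
type_a = [("T0", 0, 2), ("T1", 0, 2), ("T2", 2, 4), ("T3", 2, 4),
          ("T6", 4, 6), ("T7", 4, 6), ("TE", 6, 7)]
print(interval_dependencies(type_a))  # reproduces the type-A dependency set above
```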

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing apparatus (10) for enhancing, in particular optimizing a machine learning, ML, model for parallelized operation on a plurality of processing devices (20) is disclosed, wherein each processing device (20) comprises a plurality of processing elements, PEs, (21a-d) configured to perform one or more of a plurality of ML model tasks. The data processing apparatus (10) is configured to generate based on the ML model a computational graph, CG, representation (30) of the ML model, wherein the CG representation (30) comprises a plurality of nodes (31) and a plurality of edges (32), wherein each of the plurality of nodes (31) is associated with one or more of the plurality of ML model tasks and wherein the plurality of edges (32) define a plurality of dependencies of the plurality of nodes (31) of the CG representation (30). Furthermore, the data processing apparatus (10) is configured to generate an enhanced CG representation of the ML model by adding to the CG representation (30) of the ML model one or more further dependencies of the plurality of nodes (31) of the CG representation (30) of the ML model. The data processing apparatus (10) is further configured to compile the enhanced CG representation of the ML model for generating code for each of the plurality of processing devices (20).

Description

DEVICES AND METHODS FOR OPTIMIZING PARALLEL EXECUTION OF A MACHINE LEARNING MODEL TECHNICAL FIELD The present disclosure relates to information processing technology. More specifically, the present disclosure relates to devices and methods for optimizing the parallel execution of a machine learning, ML, model on a plurality of processing devices. BACKGROUND Machine Learning (ML) models, such as artificial neural networks, are being implemented in more and more electronic devices for a variety of different purposes and are usually very demanding with respect to computational resources. For instance, current Deep Neural Networks (DNNs), such as PanGu-α and GPT-3, may have up to billions of parameters. Training such models requires a very large number of computations. In this respect, there exists specialized hardware, such as AI accelerators, demonstrating a high efficiency on large-size DNN execution. Hence, they are nowadays widely used within the research and industrial community. Nonetheless, the design of more potent AI devices is rapidly approaching the physical limitations of semiconductor technology. Overall, the pace of hardware performance improvement cannot keep up with the growth of modern DNNs' computational requirements. Thus, parallel computing is a necessary and promising means to alleviate the consequences of these trends. Optimizing intra- and inter-device parallel computing can maximize device efficiency, thus allowing improved training speed and facilitating the execution of large DNN models. Optimal parallelization of compute tasks executed onto a set of heterogeneous Processing Elements (PEs) is a research problem of high complexity (NP-hard). There have been several previous attempts to tackle the parallel computing optimization problem in the context of enhancing DNN execution performance. CN 112948079 B discloses a task scheduling method to support the execution of a DNN onto a target hardware. US 2021/0255896 A1 discloses a method to process tasks in parallel in the context of AI technologies. US 11,010,313 B2 discloses an apparatus with a plurality of PEs for performing inference acceleration. SUMMARY It is an objective to provide improved devices and methods for optimizing parallel execution of a machine learning, ML, model on a plurality of processing devices. The foregoing and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures. According to a first aspect a data processing apparatus for static parallel optimization of a machine learning, ML, model for operation on a plurality of processing devices is provided, wherein each processing device comprises a plurality of processing elements, PEs, configured to perform one or more of a plurality of ML model tasks. The data processing apparatus is configured to generate based on the ML model a computational graph, CG, representation of the ML model, wherein the CG representation comprises a plurality of nodes and a plurality of edges, wherein each of the plurality of nodes is associated with one or more of the plurality of ML model tasks and wherein the plurality of edges defines a plurality of precedence dependencies of the nodes of the CG representation.
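To make the CG representation of the first aspect more concrete, the following Python sketch models nodes carrying a task type and weight, and edges encoding precedence dependencies. The class and field names (CGNode, CGraph, pe_type) are illustrative assumptions and do not appear in the application.

```python
from dataclasses import dataclass, field

@dataclass
class CGNode:
    node_id: str          # identifier of the CG node, e.g. "N0"
    pe_type: str          # PE type able to execute the associated task, e.g. "A" or "B"
    weight: float = 1.0   # estimated cost of the associated task

@dataclass
class CGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> CGNode
    edges: set = field(default_factory=set)     # (src, dst) precedence dependencies

    def add_node(self, node: CGNode):
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, dst: str):
        # an edge (src, dst) means dst may only start after src has finished
        self.edges.add((src, dst))

# Tiny fragment mirroring figure 7: S (type B) feeds the type-A tasks N0 and N1.
cg = CGraph()
for nid, ty in [("S", "B"), ("N0", "A"), ("N1", "A")]:
    cg.add_node(CGNode(nid, ty))
cg.add_edge("S", "N0")
cg.add_edge("S", "N1")
```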
Furthermore, the data processing apparatus is configured to generate an enhanced CG representation of the ML model by adding to the CG representation of the ML model one or more further dependencies of the plurality of nodes of the CG representation of the ML model. The data processing apparatus is further configured to compile the enhanced CG representation of the ML model for generating code for each of the plurality of processing devices. Thus, the data processing apparatus according to the first aspect achieves the intra-device speeding up of a CG by statically adjusting the parallel execution ordering of nodes, thus improving the DNN execution performance. Specifically, CG nodes may correspond to executable tasks of multiple types, such as tensor computation, vector unit computation, cube unit computation, communication (gather and reduce), and the like. A device comprises multiple PEs. Each PE is associated with a specific task type. Depending on parallelism granularity, a task may saturate all PEs of the same type in order to reach peak performance or just a fraction of them. The data processing apparatus according to the first aspect makes efficient use of PEs in order to enhance the parallel execution of tasks and overall improve end-to-end DNN performance through a global static optimization scheme. Thus, the data processing apparatus according to the first aspect is able to support the execution of large DNN models at improved, nearly optimal, training time via the design of a static global scheme for parallel computing optimization of the Computational Graph (CG) corresponding to the DNN model. In a further possible implementation form, the data processing apparatus is configured to add the one or more further dependencies to the CG representation of the ML model by adding one or more further edges to the CG representation of the ML model. In a further possible implementation form, for adding the one or more further dependencies to the CG representation of the ML model the data processing apparatus is configured to: assign each of the plurality of ML model tasks associated with the plurality of nodes of the CG representation of the ML model to a respective PE of the plurality of PEs, wherein the respective ML model task and the respective PE are of the same type; determine for each PE of the plurality of PEs an order relation of the one or more ML model tasks assigned to the PE for obtaining a plurality of order relations; determine a plurality of task dependencies based on the plurality of order relations; and determine the one or more further dependencies of the plurality of nodes of the CG representation of the ML model based on the plurality of task dependencies. 
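The multi-PE-type device model underlying the first aspect can be sketched as follows; the type names and granularity values are assumptions chosen only to illustrate that a task may saturate all PEs of its type or just a fraction of them.

```python
from dataclasses import dataclass

@dataclass
class PEInstance:
    pe_id: int
    pe_type: str       # e.g. "cube", "vector", "communication"
    granularity: int   # how many physical PEs of this type one task saturates

# By default a task saturates all PEs of its type, so each type may be represented
# by a single instance; the concrete types and counts below are only illustrative.
device_pes = [
    PEInstance(pe_id=0, pe_type="cube", granularity=8),
    PEInstance(pe_id=1, pe_type="vector", granularity=2),
    PEInstance(pe_id=2, pe_type="communication", granularity=1),
]

def eligible_pes(task_type, pes):
    # a task may only be placed on PE instances of its own type
    return [pe for pe in pes if pe.pe_type == task_type]

print(eligible_pes("cube", device_pes))
```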
In a further possible implementation form, for assigning each of the plurality of ML model tasks associated with the plurality of nodes of the CG representation of the ML model to the respective PE of the plurality of PEs, the data processing apparatus is configured to: determine a plurality of precedence constraints among the plurality of ML model tasks associated with the plurality of nodes of the CG representation of the ML model; determine a weight for each ML model task of the plurality of ML model tasks associated with the plurality of nodes of the CG representation of the ML model; and determine for each PE of the plurality of PEs the one or more ML model tasks assigned to the PE and the order relation of the one or more ML model tasks assigned to the PE based on the plurality of precedence constraints among the plurality of ML model tasks and based on the weight for each ML model task of the plurality of ML model tasks. In a further possible implementation form, the data processing apparatus is configured to determine the weight for each ML model task of the plurality of ML model tasks based on a cost function, and/or based on ML model operator profiling information, and/or based on information about the set of input tensors and/or output tensors of the respective node of the plurality of nodes of the CG representation of the ML model. In a further possible implementation form, the data processing apparatus is configured to determine the assignment of ML model tasks to the plurality of PEs and the plurality of order relations for the plurality of ML model tasks based on integer linear programming, ILP, constraint programming, CP, and/or heuristic algorithm design such that the maximum makespan among the plurality of PEs is minimized. In a further possible implementation form, each of the plurality of PEs belongs to one of a plurality of PE types and wherein for determining the plurality of task dependencies based on the plurality of order relations the data processing apparatus is configured to: obtain for each of the plurality of PE types a type order relation for the tasks assigned to the PEs of the respective PE type; and determine for each PE type a plurality of task dependencies based on the type order relation obtained for the PE type. In a further possible implementation form, the data processing apparatus is configured to determine the plurality of task dependencies based on the plurality of order relations corresponding to PE types as the set of task dependencies having the smallest number of task dependencies among a plurality of candidate sets of task dependencies. In a further possible implementation form, for determining the one or more further dependencies of the nodes of the CG representation of the ML model based on the plurality of task dependencies the data processing apparatus is configured to: remove redundant task dependencies of the plurality of task dependencies for obtaining a reduced plurality of task dependencies; determine a plurality of node dependencies of the plurality of nodes of the CG representation of the ML model based on the reduced plurality of task dependencies; and determine the one or more further dependencies of the plurality of nodes of the CG representation of the ML model based on the plurality of node dependencies. 
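As one possible reading of the weight-determination implementation form, the sketch below derives a task weight from the shapes of its input and output tensors, falling back to profiling data when it is available. The concrete formula is an assumption for illustration, not the cost function of the application.

```python
import math

def task_weight(input_shapes, output_shapes, profiled_latency=None):
    # If operator profiling information exists, use it directly as the weight.
    if profiled_latency is not None:
        return float(profiled_latency)
    # Otherwise approximate the cost by the number of tensor elements moved
    # through the operator (a deliberately simple placeholder formula).
    elems = sum(math.prod(s) for s in input_shapes)
    elems += sum(math.prod(s) for s in output_shapes)
    return float(elems)

# Example: a matrix multiplication consuming two 1024x1024 tensors and producing one.
w = task_weight([(1024, 1024), (1024, 1024)], [(1024, 1024)])
print(w)
```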
In a further possible implementation form, for compiling the enhanced CG representation of the ML model the data processing apparatus is configured to partition the enhanced CG representation of the ML model into a plurality of subgraphs of the enhanced CG representation of the ML model for each of the plurality of processing devices and to compile the respective subgraph of the enhanced CG representation of the ML model for each of the plurality of processing devices. According to a second aspect a data processing method for parallel optimization of a machine learning, ML, model for operation on a plurality of processing devices, wherein each processing device comprises a plurality of processing elements, PEs, configured to perform one or more of a plurality of ML model tasks. The data processing method comprises the steps of: generating based on the ML model a computational graph, CG, representation of the ML model, wherein the CG representation comprises a plurality of nodes and a plurality of edges, wherein each of the plurality of nodes is associated with one or more of the plurality of ML model tasks and wherein the plurality of edges defines a plurality of precedence dependencies of the nodes of the CG representation; generating an enhanced CG representation of the ML model by adding to the CG representation of the ML model one or more further dependencies of the plurality of nodes of the CG representation of the ML model; and compiling the enhanced CG representation of the ML model for generating code for each of the plurality of processing devices. The data processing method according to the second aspect of the present disclosure can be performed by the data processing apparatus according to the first aspect of the present disclosure. Thus, further features of the method according to the second aspect of the present disclosure result directly from the functionality of the apparatus according to the first aspect of the present disclosure as well as its different implementation forms described above and below. According to a third aspect a computer program product is provided, comprising a computer- readable storage medium for storing program code which causes a computer or a processor to perform the method according to the second aspect when the program code is executed by the computer or the processor. Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims. 
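For the implementation form in which the enhanced CG representation is partitioned into one subgraph per processing device before compilation, a minimal partitioning sketch could look as follows; treating cross-device edges as separately returned communication dependencies is an assumption made for illustration.

```python
def partition_for_devices(nodes, edges, device_of):
    """Split an enhanced CG into one subgraph per processing device.

    nodes: iterable of node ids; edges: set of (src, dst) pairs;
    device_of: mapping node id -> device id.
    """
    subgraphs = {}
    for n in nodes:
        sub = subgraphs.setdefault(device_of[n], {"nodes": set(), "edges": set()})
        sub["nodes"].add(n)
    cross_device = set()
    for src, dst in edges:
        if device_of[src] == device_of[dst]:
            subgraphs[device_of[src]]["edges"].add((src, dst))
        else:
            # edges crossing device boundaries are returned separately; they would
            # typically be lowered to communication tasks during compilation
            cross_device.add((src, dst))
    return subgraphs, cross_device

subs, comm = partition_for_devices(
    ["N0", "N1", "N2"], {("N0", "N1"), ("N1", "N2")}, {"N0": 0, "N1": 0, "N2": 1})
print(subs, comm)
```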
BRIEF DESCRIPTION OF THE DRAWINGS In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which: Fig.1 shows a schematic diagram illustrating a data processing apparatus according to an embodiment for enhancing, in particular optimizing an ML model for parallelized operation on a plurality of processing devices; Fig.2 shows a schematic diagram illustrating the architecture of a processing device including a plurality of processing elements for performing different ML model tasks; Fig.3 shows an exemplary computational graph representation of an ML model as generated and used by a data processing apparatus according to an embodiment; Fig.4 shows an exemplary symmetric computational graph representation of a ML model as generated and used by a data processing apparatus according to an embodiment; Fig.5 shows a flow diagram, illustrating processing stages implemented at least partially by a data processing apparatus according to an embodiment; Fig.6 shows a flow diagram, illustrating further details of processing stages implemented at least partially by a data processing apparatus according to an embodiment; Fig.7 illustrates the transformation of a CG representation of an ML model into a task graph as implemented by a data processing apparatus according to an embodiment; Fig.8 shows a flow diagram, illustrating further details of processing stages implemented at least partially by a data processing apparatus according to an embodiment; Fig.9 is a diagram showing an example illustrating the ordering stage of figure 8; Fig.10 shows a flow diagram, illustrating further details of processing stages implemented at least partially by a data processing apparatus according to an embodiment; Fig.11 is a diagram showing an example illustrating the generated set of task-to-task dependencies according to an embodiment; Fig.12 shows a flow diagram, illustrating further details of processing stages implemented at least partially by a data processing apparatus according to an embodiment; Fig.13 is a diagram showing an example illustrating the generated set of task-to-task dependencies according to an embodiment; Fig.14 is a diagram illustrating an optimized task ordering generated by a data processing apparatus according to an embodiment; Fig.15 is a diagram illustrating an optimized task ordering generated by a data processing apparatus for different types of processing elements according to an embodiment; Fig.16 is a diagram illustrating a task graph with a minimal set of dependencies generated by a data processing apparatus according to an embodiment; and Fig.17 shows a flow diagram illustrating processing steps of a data processing method according to an embodiment. In the following, identical reference signs refer to identical or at least functionally equivalent features. DETAILED DESCRIPTION OF THE EMBODIMENTS In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the present disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the present disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims. 
For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise. Figure 1 shows a schematic diagram illustrating a data processing apparatus 10 for static parallel optimization of a machine learning, ML, model for operation on a plurality of processing devices 20. As illustrated in figure 1, the data processing apparatus 10 may comprise one or more CPUs 11, a volatile memory 12 and a disk storage 13 for storing the ML model. As illustrated in figure 2, each processing device 20 comprises a plurality of processing elements, PEs, 21a-d configured to perform one or more of a plurality of ML model tasks, such as a plurality of tensor type PEs 21a, a plurality of vector type PEs 21b, a plurality of communication type PEs 21c as well as a plurality of PEs of other types 21d. The plurality of devices 20 may be multi-PE-type AI accelerators. As will be described in more detail, the data processing apparatus 10, in particular the CPU 11 thereof is configured to generate based on the ML model a computational graph, CG, representation 30 of the ML model. Figures 3 and 4 illustrate two exemplary CG representations 30 of an ML model. As can be taken from figures 3 and 4, the CG representation 30 comprises a plurality of nodes 31 and a plurality of edges 32, wherein each of the plurality of nodes 31 is associated with one or more of the plurality of ML model tasks and wherein the plurality of edges 32 define a plurality of precedence dependencies of the nodes 31 of the CG representation 30. The data processing apparatus 10 is further configured to generate an enhanced CG representation of the ML model by adding to the CG representation 30 of the ML model one or more further dependencies of the plurality of nodes 31 of the CG representation 30 of the ML model. The data processing apparatus 10 is further configured to compile the enhanced CG representation of the ML model for generating code for each of the plurality of processing devices 20. In an embodiment, the CPU 11 may be configured to generate the CG representation 30 based on neural network model files, e.g., a python file defining a natural language processing, NLP, neural network model. 
Alternatively, the CG representation 30 may be generated by an upstream parallel module run on a cluster of devices, such as a MindSpore AutoPar optimizer, which can determine a CG representation 30 for each device 20. As already mentioned above, the CG representation 30 may be composed of many operators on different PE types. Also, it may consist of multiple branches, where each branch can be of any form (often the upstream parallel module may force the branches to have symmetric form, as illustrated in figure 4). Figure 5 shows a flow diagram illustrating processing stages implemented at least partially by the data processing apparatus 10 according to an embodiment for implementing the functionality described above. In stage 100 of figure 5 the data processing apparatus 10, in particular the CPU 11 thereof parses the ML model, in particular neural network, which may be provided, for instance, as a python file. Based on this parsing of the ML model, in particular neural network, the CPU 11 is configured to generate a computational graph, CG, representation 30 of the ML model, which comprises a plurality of nodes 31 and edges 32, as illustrated in figures 3 and 4. The plurality of nodes 31 represent one or more ML model tasks (also referred to as neural operators), such as a matrix multiplication, while the plurality of edges 32 represent data-flow (variables) or control-flow between the operators and therefore a plurality of dependencies of the plurality of operators, i.e. nodes 31. In case of a large ML model, e.g. neural network and, thus, a large CG representation 30 the CPU 11 may implement an upstream parallelism module, such as the MindSpore AutoPar optimizer, for partitioning the large CG representation 30 into a plurality of smaller-sized CG representations (for instance on the device-level). In stage 101 of figure 5 the CPU 11 is configured to statically analyse the CG representation 30 of the ML model for adjusting the ordering of the plurality of nodes 31 by generating additional node dependencies and inserting them in the CG representation 30 of the ML model for generating an enhanced CG representation of the ML model. Several embodiments for generating the enhanced CG representation of the ML model by generating additional node dependencies and inserting them in the CG representation 30 of the ML model will be described further below. In stage 102 of figure 5 the CPU 11 is configured to compile the enhanced CG representation of the ML model for generating executable code that may be provided to the plurality of processing devices 20. In stage 103 of figure 5 the executable computational graph is executed on the target devices 20, for example for DNN training or inference. A more detailed embodiment of stage 101 of figure 5 is illustrated in figure 6. In other words, in an embodiment the stage 101 of figure 5 comprises the stages 200 to 203 of figure 6. In stage 200 of figure 6 the CPU 11 initially is configured to generate based on the CG representation of the ML model an auxiliary data structure in the form of a task graph containing the plurality of ML model tasks. A node of the task graph may correspond to multiple ML model task instances according to the specification of the CG representation. By default, a node may be represented by one task. In an embodiment, each task may be assigned a unique identifier, i.e. id. 
For each task, the CPU 11 may be configured to maintain essential information like an identifier of the corresponding node of the CG representation, a processing element type capable of performing the task, the set of parent and children tasks, and information about the shape of its input and output tensors (intermediate variables shipped from one CG node to another over CG edges). Besides, for each PE type 21a-d, an instance representing target hardware PEs may be generated. For a given CG representation, each task may be fine-grained parallelized and saturating all hardware PEs of its associated type. Thus, in an embodiment, all hardware PEs of a specific type may be represented by one instance. For example, a Davinci chipset may be equipped with 8 hardware Cube (tensor engine) PEs and each related node of a given CG on such a PE may be fine-grained parallelized and use 8 Cube PEs (those Cube PEs may thus be presented as one instance in this stage). For each PE instance, information about its id, its type, and parallelism granularity may be kept, as discussed above. An exemplary task graph is shown in figure 7. In figure 7 the nodes N0, N1, N2, N3, N6, N7 of the original CG representation shown on the left correspond to the type A tasks T0, T1, T2, T3, T6, T7 of the task graph shown on the right, while the original CG nodes S, N4, N5 correspond to type B tasks TS, T4, T5, respectively. For instance, the parents of the task node TE are the task nodes T6 and T7, while the children of task node TS are the task nodes T0 and T1. For the PE model, the CPU 11 may create two instances to represent two PE types, namely PE0 of type A and PE1 of type B. In stage 201 of figure 6 the CPU 11 is configured to determine an order for each task on a PE where it can be assigned, that is, of the same type as the task itself. Figure 8 shows sub-stages 300, 301 and 302 of the stage 201 of figure 6 according to an embodiment. In the stage 300 of figure 8 the CPU 11 is configured to extract the set of precedence constraints among the plurality of ML tasks. If two tasks are of the same type and one is a child of the other, then the CPU 11 is configured to assign the parent task earlier in the order than the child task, in case the two tasks are assigned to the same PE. Subsequently, the CPU 11 may assign a weight value to each task based on a computed cost function, which may depend on the shapes of the input and/or output tensor(s) of a respective ML task and/or other criteria (stage 301 of figure 8). Finally, in the stage 302 of figure 8 the CPU 11 is configured to determine a task ordering within each PE instance. In an embodiment, the CPU 11 may implement to this end an ILP or CP scheme considering the previously computed precedence constraints (stage 300 of figure 8) and having the optimization objective to minimize the maximum makespan among all PE instances, which is a function of the assigned task weights (stage 301 of figure 8). According to an alternative embodiment, the CPU 11 may be configured to use a heuristic algorithm, which includes the greedy selection of a launchable task, e.g., the one with the fewest parents belonging to a different type and/or the largest bottom level value (capturing the seminal critical path idea), largest depth in the graph, and the like, to place in the first available position of the first available PE instance of the same type. Figure 9 illustrates an example of stage 201 and its output for the task graph and the PE model provided in stage 200 of figure 6. 
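By way of non-limiting illustration, the task and PE bookkeeping of stage 200 and the exemplary task graph of figure 7 may be sketched in Python as follows. All identifiers, field names and the exact branch structure (which is inferred from the figure 7 discussion and the bottom-level values used later) are illustrative assumptions and not prescribed by the disclosure.

from dataclasses import dataclass, field

@dataclass
class Task:
    tid: str                                   # unique task identifier
    cg_node: str                               # identifier of the corresponding CG node
    pe_type: str                               # PE type capable of performing the task, e.g. "A" or "B"
    in_shapes: list = field(default_factory=list)    # shapes of input tensors
    out_shapes: list = field(default_factory=list)   # shapes of output tensors
    parents: set = field(default_factory=set)
    children: set = field(default_factory=set)

@dataclass
class PEInstance:
    pid: int
    pe_type: str
    granularity: int = 1                       # 1 = all hardware PEs of the type saturated by one task

# Exemplary task graph of figure 7 (tensor shapes omitted for brevity).
tasks = {t: Task(t, n, p) for t, n, p in [
    ("TS", "S", "B"), ("T0", "N0", "A"), ("T1", "N1", "A"), ("T2", "N2", "A"),
    ("T3", "N3", "A"), ("T4", "N4", "B"), ("T5", "N5", "B"), ("T6", "N6", "A"),
    ("T7", "N7", "A"), ("TE", "E", "A")]}
edges = [("TS", "T0"), ("TS", "T1"), ("T0", "T2"), ("T1", "T3"), ("T2", "T4"),
         ("T3", "T5"), ("T4", "T6"), ("T5", "T7"), ("T6", "TE"), ("T7", "TE")]
for u, v in edges:
    tasks[u].children.add(v)
    tasks[v].parents.add(u)

pe_instances = [PEInstance(0, "A"), PEInstance(1, "B")]   # PE0 of type A, PE1 of type B

The same structures are reused in the further sketches below.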
In stage 202 of figure 6 the CPU 11 is configured to convert the order relations (per PE) generated in stage 201 into a set of task-to-task dependencies in the task graph. In an embodiment this may be achieved by the CPU 11 with the processing stages 400 and 401 illustrated in figure 10. Initially, in stage 400, for each PE type, the CPU 11 identifies the partial order among all tasks associated with the PE type. For the example described above, this stage follows directly, since a single PE per type is considered. In the next stage 401 the CPU 11 is configured to transform the partial order of each PE type into a minimal set of task-to-task dependencies. In the example described above, one obtains dependencies (T0,T2), (T2,T1), (T1,T3), (T3,T6), (T6,T7), (T7,TE) for type A and dependencies (TS,T4), (T4,T5) for type B. The set is minimal in the sense that (TS,T5) is also a valid dependency that could have been inserted, but would have been redundant since there already exists a dependency path from TS to T5 via T4. This is illustrated in more detail in figure 11, which shows the generated dependencies for the exemplary task graph. In stage 203 of figure 6 the CPU 11 is configured to insert the generated dependency logic into the original CG representation of the ML model. A more detailed embodiment for implementing this stage 203 by the CPU 11 is illustrated in figure 12. In stage 500 of figure 12 the CPU 11 is configured to identify any redundant dependencies, i.e. those dependencies that introduce an already existent precedence flow between two nodes. In figure 11, for instance, dependency (T0,T2) is redundant since there already exists an edge (T0,T2) in the task graph. Regarding the dependency (TS,T4), there already exists a path from TS to T4, meaning that TS is already set to be executed before T4. Hence, the dependency (TS,T4) is also redundant. Such redundant dependencies may be identified by the CPU 11 using a graph traversal scheme, such as BFS or DFS from the dependency source, or otherwise by a global pre-processing step determining all original descendants for each task (which may again be implemented as a standard graph traversal algorithm). The remaining set of dependencies, that is, the ones that are found not to be redundant, are transformed in stage 501 of figure 12 by the CPU 11 into corresponding node-to-node dependencies, e.g., (T2,T1) corresponds to (N2,N1) in the input CG representation. Finally, in stage 502 of figure 12 the CPU 11 is configured to introduce the node-to-node dependencies as additional edges in the input CG representation of the ML model, as also illustrated in figure 13.
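By way of illustration, stages 500 to 502 may be sketched as follows, building on the Task structure introduced above; the descendant pre-processing via a standard graph traversal is one of the options mentioned for stage 500, and the function and variable names are illustrative assumptions.

def descendants(tasks, u):
    # Stage 500 pre-processing: collect all tasks reachable from u in the task graph (iterative DFS).
    seen, stack = set(), [u]
    while stack:
        for c in tasks[stack.pop()].children:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def insert_dependencies(cg_edges, tasks, task_deps):
    # Stages 500-502: drop redundant task-to-task dependencies, map the remaining ones
    # to node-to-node dependencies and append them as additional edges of the CG representation.
    reach = {t: descendants(tasks, t) for t in tasks}
    for u, v in task_deps:
        if v in reach[u]:
            continue                                         # precedence flow already exists, e.g. (T0,T2) or (TS,T4)
        cg_edges.append((tasks[u].cg_node, tasks[v].cg_node))  # e.g. (T2,T1) becomes (N2,N1)
    return cg_edges

Applied to the dependency set of figure 11, this sketch filters out (T0,T2) and (TS,T4) and inserts, for instance, the additional CG edge (N2,N1) for the surviving dependency (T2,T1).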
In the following some further embodiments of the CPU 11 for implementing the processing stage 201 of figure 6 will be described in more detail, which as already described above and illustrated in figure 8 may comprise the stages 300, 301 and 302. In an embodiment, the CPU 11 may be configured to use ILP or CP to determine the task ordering on each PE via modelling of a task scheduling process. The below term "occupying" denotes that a task occupies an offset interval in a derived task schedule for a PE. For a task v in the task graph, let children(v) denote the set of tasks to which there is an edge from v. Also, let type(v) denote the PE type associated with task v and weight(v) denote the weight of task v. Precedence constraints among tasks (stage 300 of figure 8) are reduced to simply checking the parent-child relationship, since precedence constraints u ∈ children(v) and w ∈ children(u) imply a precedence constraint (v, w). An embodiment for a weight assignment (stage 301 of figure 8) based on the input/output tensors (intermediate data variables) of task v is

weight(v) = Σ_{x ∈ in(v) ∪ out(v)} unitsize(x) ⋅ Π_{s ∈ shape(x)} s,

wherein in(v) denotes the set of input tensors received by task v and out(v) denotes the set of output tensors provided by task v and wherein unitsize(x) is the default unit size of variable x and shape(x) denotes the set of dimension sizes, e.g., unitsize(x) = 16 and shape(x) = {2,2} would mean a total of 2 ⋅ 2 ⋅ 16 = 64 units of space (bytes) used by variable x. To capture the logic in stage 302 of figure 8 the following constraint program (CP) may be used by the CPU 11, which will be described in more detail below.

Objective
minimize max_end

Constraints
for each task v:
start(v) ≥ 0
end(v) ≥ 0
end(v) = start(v) + weight(v) (1)
PE(v) ∈ PEs(type(v))
end(v) ≤ max_end
for each pair of tasks u, v such that v ∈ children(u): (2)
end(u) ≤ start(v)
for each pair of tasks u, v such that type(u) = type(v): (3)
PE(u) = PE(v) → (end(u) ≤ start(v) ∨ end(v) ≤ start(u))

First, some auxiliary variables for use in the constraint program are defined. It may be assumed that each task v is eventually assigned a start of occupying, namely start(v), and will be assigned to a PE PE(v) (herein also referred to as PE instance) without preemption until its end of occupying end(v) = start(v) + weight(v), that is, for its whole weight. For a PE type t, the set of its PEs is denoted as PEs(t). The set of constraints introduced becomes therefore: (i) each task must be occupying at a PE of its associated PE type; (ii) a task can start occupying only after the occupying of its parent tasks has ended; and (iii) the occupyings of any two tasks at the same PE cannot overlap each other. Finally, a variable max_end is introduced capturing the makespan, that is, the maximum duration a PE is necessary for task processing. In this embodiment, the goal is to minimize the value of this variable in order to make optimal use of PE resources, whilst respecting the above stated constraints.
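By way of non-limiting illustration, the weight assignment of stage 301 and the constraint program above may be sketched in Python using the CP-SAT solver of Google OR-Tools; the choice of solver and a uniform unit size are assumptions made here for the sketch, and the special case of one PE instance per PE type is considered, so that constraint (3) reduces to a no-overlap constraint per type and the PE(v) variables can be omitted.

import math
from ortools.sat.python import cp_model

def weight_of(task, unitsize=16):
    # Stage 301: sum over input/output tensors of unitsize times the product of the dimension sizes.
    return sum(unitsize * math.prod(shape) for shape in task.in_shapes + task.out_shapes)

def order_tasks(tasks, weight):
    # Sketch of constraints (1)-(3) with one PE instance per PE type.
    model = cp_model.CpModel()
    horizon = sum(weight.values())
    start, end, occ = {}, {}, {}
    for t in tasks:
        start[t] = model.NewIntVar(0, horizon, f"start_{t}")
        end[t] = model.NewIntVar(0, horizon, f"end_{t}")
        occ[t] = model.NewIntervalVar(start[t], weight[t], end[t], f"occ_{t}")      # (1) end = start + weight
    for u in tasks:
        for v in tasks[u].children:
            model.Add(end[u] <= start[v])                                           # (2) parent before child
    for pe_type in {tsk.pe_type for tsk in tasks.values()}:
        model.AddNoOverlap([occ[t] for t in tasks if tasks[t].pe_type == pe_type])  # (3) same-PE exclusion
    max_end = model.NewIntVar(0, horizon, "max_end")
    model.AddMaxEquality(max_end, list(end.values()))
    model.Minimize(max_end)                                                         # minimize the makespan
    solver = cp_model.CpSolver()
    solver.Solve(model)
    return {t: (solver.Value(start[t]), solver.Value(end[t])) for t in tasks}

With the exemplary weights introduced below for the task graph of figure 7, such a sketch would be expected to reproduce the makespan of 180 discussed in connection with figure 14.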
In an embodiment, the given CP representation may be linearized in order to obtain an ILP representation by modifying the non-linear constraints related to the choice of PE, that is, the constraints involving the variables PE(v). For this purpose, it may be assumed that PEs take unique integer ids, which are consecutive in case they belong to the same PE type. For some PE type t, its PEs take integer ids within the interval [l_t, u_t], where u_t − l_t + 1 is the number of PEs of type t. It suffices that the start and end values of the intervals are chosen such that for any two PE types t, t′ it holds [l_t, u_t] ∩ [l_t′, u_t′] = ∅. Then, the linearization can be made by transforming the constraints PE(v) ∈ PEs(type(v)) to the equivalent constraints PE(v) ≥ l_t and PE(v) ≤ u_t with t = type(v). Also the constraints PE(u) = PE(v) → (end(u) ≤ start(v) ∨ end(v) ≤ start(u)) may be transformed to the following set of constraints:

end(u) ≤ start(v) + M1 ⋅ b_uv + M2 ⋅ |PE(u) − PE(v)|,
end(v) ≤ start(u) + M1 ⋅ (1 − b_uv) + M2 ⋅ |PE(u) − PE(v)|,

wherein b_uv are binary variables capturing the logical disjunction and M1, M2 are large enough constants such that i) either the top or the bottom constraint becomes irrelevant given the value of b_uv and ii) both constraints become irrelevant in case PE(u) ≠ PE(v). As will be appreciated, the absolute values may become obsolete by employing an extra set of binary variables and performing a similar transformation to the one already described above with the variables b_uv. In order to further illustrate the approach described above, reference is again made to the exemplary task graph shown in figure 7. The following weight assignment of its tasks may be assumed as an example:

weight(TS) = weight(T0) = weight(T1) = weight(TE) = 20
weight(T2) = weight(T3) = 10
weight(T4) = weight(T5) = 30
weight(T6) = weight(T7) = 40

Figure 14 illustrates the optimal solution of makespan max_end = 180 for the case of two PEs, one of each type, with tasks of type A assigned to the bottom row, say PE = 0, and tasks of type B to the top row, say PE = 1. The start() and end() values correspond to the box positions, e.g., start(T2) = 40 and end(T2) = 50. One may verify that all constraints are met. Optimality can be determined, since by the graph topology all tasks must start after TS is finished and the solution leaves no gap between any two tasks of type A. In the following a more detailed embodiment for the implementation of the processing stage 202 of figure 6 by the CPU 11 will be described, which as already mentioned above may comprise the sub-stages 400 and 401 illustrated in figure 10. After execution of the CP for stage 302, values have been assigned to the variables start(v) and end(v) for each task v. These values are made use of to obtain a minimal set of task-to-task dependencies at the end of stage 401, again by employing a CP embodiment. In stage 400 of figure 10 the order relation of the tasks assigned to some type of PEs is computed. Consider a PE type t and any two tasks u, v such that type(u) = type(v) = t. The CPU 11 sets u ≺ v, that is, task u comes before task v in the order relation of PE type t, if end(u) ≤ start(v). As will be appreciated, this is a partial ordering, since such a relationship might not be satisfied for each pair of two tasks of type t (in case multiple PEs are available for type t and some tasks execute in parallel on them when the default option for full saturation of PEs is not set). For the example of figure 14, one obtains TS ≺ T4, TS ≺ T5, T4 ≺ T5, T0 ≺ T2, T0 ≺ T1, …, T0 ≺ TE, T2 ≺ T1, …, T2 ≺ TE, and so on. In stage 401 of figure 10, each computed partial order per type is transformed into a minimal set of task-to-task dependencies. As will be appreciated, in graph-theoretic terms this process corresponds to the transitive reduction of the induced dependency graph, i.e., a graph with the same task set containing edges (u, v) only if u ≺ v as defined in stage 400. For any two tasks u, v of type t, where u ≠ v, a dependency from u to v is introduced, if there does not exist a task w, where type(w) = t, w ≠ u, w ≠ v, such that u ≺ w and w ≺ v. Such a logical formula can also be implemented using CP. For the example of figure 14 (as also illustrated in figure 11), one obtains the following respective dependency set for each type (viewed as ordered pairs of tasks): PE Type A: {(T0, T2), (T2, T1), (T1, T3), (T3, T6), (T6, T7), (T7, TE)}, and PE Type B: {(TS, T4), (T4, T5)}.
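By way of illustration, stages 400 and 401 may be sketched as follows, consuming start and end values such as those returned by the solver sketch above; the quadratic pairwise check directly mirrors the transitive-reduction rule stated above and is a sketch rather than an optimized implementation.

def minimal_dependencies(tasks, start, end):
    # Stage 400: u comes before v (same PE type) if end(u) <= start(v).
    def before(u, v):
        return tasks[u].pe_type == tasks[v].pe_type and end[u] <= start[v]
    # Stage 401: keep (u, v) only if no third task w of the same type lies between them.
    deps = []
    for u in tasks:
        for v in tasks:
            if u == v or not before(u, v):
                continue
            if any(w not in (u, v) and before(u, w) and before(w, v) for w in tasks):
                continue
            deps.append((u, v))
    return deps

For the solution of figure 14 this sketch yields exactly the chains listed above, i.e. (T0,T2), (T2,T1), (T1,T3), (T3,T6), (T6,T7), (T7,TE) for type A and (TS,T4), (T4,T5) for type B.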
With respect to stage 203 and stage 500, a more detailed embodiment for the implementation of stage 500 by the CPU 11 will be described in the following, i.e. an embodiment for identifying redundant task-to-task dependencies. In an embodiment, a dependency is considered redundant if the precedence flow it introduces is already existent in the task graph. Let the sequence of tasks v1, v2, v3, …, vk define a path of length k − 1 in the task graph if for all i ∈ {2, 3, …, k} it holds vi ∈ children(vi−1). A path is called simple if no task is repeated, in other words all vi are distinct. A dependency from u to v is deemed redundant if there exists a simple path of length at least 1 from u to v in the task graph. An example embodiment is to save for each node all its ancestors, where u ∈ ancestors(v) if v ∈ children(u) or there exists w such that it holds u ∈ ancestors(w) and v ∈ children(w). Such a rule can also be implemented by the CPU 11 with the use of constraint programming. For the exemplary task graph described above, ancestors(TS) = ∅, ancestors(T0) = {TS}, ancestors(T2) = {TS, T0}, and so on. In a further embodiment of stage 302 the CPU 11 may implement heuristic algorithms for obtaining a solution fast. Although such algorithms may not come with a theoretical guarantee on solution quality, they operate very well in most practical scenarios. A heuristic algorithm may be composed of a "selection" method and a "fitting" method. The "selection" method chooses a task according to a topological sorting method or an inverse topological sorting method. The "fitting" method determines the above task ordering or occupying on the target PE. In an embodiment, one or more of the following "selection" methods may be employed by the CPU 11. Firstly, in a (or inverse) topological sorting process, a task ready to be performed may be selected, that is, a task with all parents already performed in order to respect precedence constraints, by prioritizing tasks with the fewest/largest number of parent tasks of a different PE type. Secondly, in a (or inverse) topological sorting process, a task ready to be performed with the "largest/smallest bottom value" may be selected, i.e., the total weight of a path from the task to a sink node. Thirdly, in a (or inverse) topological sorting process, a task ready to be performed with the "largest/smallest top value" may be selected, i.e., the total weight of a path from the root node to the task. In general, any combination of the above approaches or other schemes like depth-first, breadth-first or heaviest-first may be used by the CPU 11 to decide on task selection. These combinations define a set of heuristic algorithms, which may be used in order to cover as much as possible of the whole search space. To illustrate this, further reference is made again to the exemplary task graph of figure 7. In an embodiment, the algorithm implemented by the CPU 11 iteratively selects a task ready to be performed by prioritizing tasks with the fewest number of parent tasks of a different PE type. Secondary criteria used for tie-breaking include selecting the task with the "largest bottom value", otherwise at the tertiary level the "smallest id". For this example it is assumed that the tasks have the same weights as assigned for the embodiment above. The selected task is then assigned to the eligible PE either i) having been assigned the smallest overall load (total weight) of tasks so far or ii) having the earliest available position for this task among all PEs of this type.
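One possible realization of this selection heuristic is sketched below; the function and variable names are illustrative assumptions, the "fitting" step is reduced to appending the task to the ordering of its PE type (one PE per type, as in the worked example that follows), and the more general fitting options i) and ii) above are omitted for brevity.

def heuristic_order(tasks, weight):
    # Greedy list scheduling: repeatedly pick a ready task by
    # (fewest parents of a different PE type, largest bottom level, smallest id).
    bl = {}
    def bottom(t):
        # Bottom level: total weight of a heaviest path from t to a sink node.
        if t not in bl:
            bl[t] = weight[t] + max((bottom(c) for c in tasks[t].children), default=0)
        return bl[t]
    np_parents = {t: sum(1 for p in tasks[t].parents
                         if tasks[p].pe_type != tasks[t].pe_type) for t in tasks}
    ids = {t: i for i, t in enumerate(tasks)}     # tie-breaking id, insertion (BFS-like) order assumed
    done = set()
    order = {tasks[t].pe_type: [] for t in tasks}
    ready = {t for t in tasks if not tasks[t].parents}
    while ready:
        t = min(ready, key=lambda x: (np_parents[x], -bottom(x), ids[x]))
        order[tasks[t].pe_type].append(t)         # "fitting": next free position of the type's PE
        done.add(t)
        ready.discard(t)
        for c in tasks[t].children:
            if all(p in done for p in tasks[c].parents):
                ready.add(c)
    return order

For the exemplary task graph and weights, this sketch produces the per-type orderings (TS, T4, T5) for type B and (T0, T2, T1, T3, T6, T7, TE) for type A, matching the rounds of the worked example below.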
In the following, it is shown how such an algorithm would proceed in the case of the example task graph and two PEs (one per PE type). Let Ready denote the set of tasks available to decide on their ordering, where initially Ready = {TS}. Let bl(v) stand for the "bottom level" value of task v, np(v) for the number of parents of a different type for task v and id(v) for a unique id assigned as a last sorting criterion (e.g., it can correspond to a breadth-first-search graph traversal of the task graph). All these values can be precomputed together in a single graph traversal. These values are outlined below and they are used as priority criteria to select the next node out of the Ready set:

np(TS) = 0, bl(TS) = 140, id(TS) = 0
np(T0) = 1, bl(T0) = 120, id(T0) = 1
np(T1) = 1, bl(T1) = 120, id(T1) = 2
np(T2) = 0, bl(T2) = 100, id(T2) = 3
np(T3) = 0, bl(T3) = 100, id(T3) = 4
np(T4) = 1, bl(T4) = 90, id(T4) = 5
np(T5) = 1, bl(T5) = 90, id(T5) = 6
np(T6) = 1, bl(T6) = 60, id(T6) = 7
np(T7) = 1, bl(T7) = 60, id(T7) = 8
np(TE) = 0, bl(TE) = 20, id(TE) = 9

The ordering computed is exactly the (optimal) one given in figure 9.

Ready = {TS}
Round 1: Select TS and assign it in order 0 in PE type B. Set Ready = {T0, T1}.
Round 2: Select T0 and assign it in order 0 in PE type A. Set Ready = {T1, T2}.
Round 3: Select T2 and assign it in order 1 in PE type A. Set Ready = {T1, T4}.
Round 4: Select T1 and assign it in order 2 in PE type A. Set Ready = {T4, T3}.
Round 5: Select T3 and assign it in order 3 in PE type A. Set Ready = {T4, T5}.
Round 6: Select T4 and assign it in order 1 in PE type B. Set Ready = {T5, T6}.
Round 7: Select T5 and assign it in order 2 in PE type B. Set Ready = {T6, T7}.
Round 8: Select T6 and assign it in order 4 in PE type A. Set Ready = {T7}.
Round 9: Select T7 and assign it in order 5 in PE type A. Set Ready = {TE}.
Round 10: Select TE and assign it in order 6 in PE type A. Set Ready = { }.
End of Algorithm

In the following a further embodiment of the CPU 11 for implementing stages 400 and 401 is described. For illustrative purposes, the model examined so far is modified by introducing an additional PE of type A, with the result of having two PEs of type A and one PE of type B. No assumption is made with respect to parallelism granularity: any two tasks of type A in the task graph may execute in parallel, if the precedence constraints allow it. The example task graph and example weights remain identical as above. By way of example, stage 302 (based on the CP model in the embodiment above) returns the solution shown in figure 15. The top row refers to the type B PE, the other two rows to the type A PEs. In an embodiment, the CPU 11 may implement an algorithmic approach for handling stages 400 and 401 as a whole. The idea is to traverse the implied interval graph (per type) from left to right. When considering a task u, the move may be rightward, first ignoring any tasks v such that start(v) < end(u), and then adding a dependency to all tasks v such that end(u) ≤ start(v) < m_end, where m_end denotes the minimum end time among all tasks w such that end(u) ≤ start(w). In other words, if w stands for the minimizer among all tasks with this property, then u ≺ w and w ≺ v for any task v such that end(w) ≤ start(v). Therefore, a dependency from u to such a task v is redundant, thus will not be added, and the same functionality as for the embodiment above is obtained. Overall, the advantage is that the rightward movement is stopped once the first task v is encountered for which these relationships hold.
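A compact sketch of this left-to-right sweep is given below; it merges stages 400 and 401 as described above, and the identifiers are again illustrative assumptions rather than part of the disclosure.

def sweep_dependencies(tasks, start, end):
    # Per PE type, for each task u: skip tasks overlapping u, then add a dependency to
    # every task that starts before the minimum end among the non-overlapping tasks,
    # and stop at the first task starting at or after that minimum end.
    deps = []
    for pe_type in {tsk.pe_type for tsk in tasks.values()}:
        group = sorted((t for t in tasks if tasks[t].pe_type == pe_type),
                       key=lambda t: (start[t], end[t]))        # sortstart, ties broken by end
        for i, u in enumerate(group):
            min_end = None
            for v in group[i + 1:]:
                if start[v] < end[u]:
                    continue                                    # v overlaps u: ignore
                if min_end is not None and start[v] >= min_end:
                    break                                       # first task beyond the minimum end: stop
                deps.append((u, v))
                min_end = end[v] if min_end is None else min(min_end, end[v])
    return deps

Applied to the start and end values of figure 15, this sketch produces the dependency sets listed after the walkthrough below and illustrated in figure 16.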
Hence, this embodiment does not need to consider the whole partial order like the embodiment described above, and for this embodiment stage 400 is essentially implied in the implementation of stage 401. This idea is illustrated in more detail in the following in the context of figure 15. In the case of type B, the tasks are sorted by increasing start as sortstart ≔ (TS, T4, T5) and by increasing end as sortend ≔ (TS, T4, T5). The tasks are iteratively considered in sortstart.

Step 1: Consider TS. Update sortend = (T4, T5).
Step 1.1: Potential dependency TS to T4: start(T4) ≥ end(TS) and start(T4) < end(T4). Add dependency from TS to T4.
Step 1.2: Potential dependency TS to T5: start(T5) ≥ end(TS), but start(T5) ≥ end(T4). Do not add dependency. Stop Step 1.
Step 2: Consider T4. Update sortend = (T5).
Step 2.1: Potential dependency T4 to T5: start(T5) ≥ end(T4) and start(T5) < end(T5). Add dependency from T4 to T5.

In the case of type A, one obtains sortstart ≔ (T0, T1, T2, T3, T6, T7, TE) and sortend ≔ (T0, T1, T2, T3, T6, T7, TE). Ties are broken by increasing end, and start, respectively. Notice there may arise cases where sortstart is not identical to sortend, depending on the assigned task weights.

Step 1: Consider T0. Update sortend = (T1, T2, T3, T6, T7, TE).
Step 1.1: Potential dependency T0 to T1: start(T1) < end(T0). Ignore.
Step 1.2: Potential dependency T0 to T2: start(T2) ≥ end(T0), and start(T2) < end(T2). Add dependency from T0 to T2.
Step 1.3: Potential dependency T0 to T3: start(T3) ≥ end(T0), and start(T3) < end(T2). Add dependency from T0 to T3.
Step 1.4: Potential dependency T0 to T6: start(T6) ≥ end(T0), but start(T6) ≥ end(T2). Do not add dependency. Stop Step 1.

The process continues in the same manner. To recap, for each task, the movement is rightwards, first ignoring the tasks overlapping it, then adding dependencies, until the first task which starts after the minimum end of a rightward task is encountered. Overall, the produced minimal set of dependencies in this case, illustrated in figure 16, is:

PE Type A: {(T0, T2), (T0, T3), (T1, T2), (T1, T3), (T2, T6), (T2, T7), (T3, T6), (T3, T7), (T6, TE), (T7, TE)}
PE Type B: {(TS, T4), (T4, T5)}

Figure 17 shows a flow diagram illustrating processing steps of a data processing method 1700 according to an embodiment for enhancing, in particular optimizing a machine learning, ML, model for parallelized operation on the plurality of processing devices 20. As already described above, each processing device 20 comprises a plurality of processing elements, PEs, such as the PEs 21a-d illustrated in figure 2, which are configured to perform one or more of a plurality of ML model tasks. The data processing method 1700 comprises a step 1701 of generating, based on the ML model, a computational graph, CG, representation 30 of the ML model, wherein the CG representation 30 comprises a plurality of nodes 31 and a plurality of edges 32. As already described above, each of the plurality of nodes 31 is associated with one or more of the plurality of ML model tasks and the plurality of edges 32 define a plurality of dependencies of the plurality of nodes 31 of the CG representation 30. Moreover, the method 1700 comprises a step 1703 of generating an enhanced CG representation of the ML model by adding to the CG representation 30 of the ML model one or more further dependencies of the plurality of nodes 31 of the CG representation 30 of the ML model.
The method 1700 further comprises a step 1705 of compiling the enhanced CG representation of the ML model. The person skilled in the art will understand that the "blocks" ("units") of the various figures (method and apparatus) represent or describe functionalities of embodiments of the present disclosure (rather than necessarily individual "units" in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step). In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. The described embodiment of an apparatus is merely exemplary. For example, the unit division is merely logical function division and may be another division in an actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, functional units in the embodiments disclosed herein may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.


CLAIMS 1. A data processing apparatus (10) for enhancing a machine learning, ML, model for parallelized operation on a plurality of processing devices (20), wherein each processing device (20) comprises a plurality of processing elements, PEs, (21a-d) configured to perform one or more of a plurality of ML model tasks, wherein the data processing apparatus (10) is configured to: generate based on the ML model a computational graph, CG, representation (30) of the ML model, wherein the CG representation (30) comprises a plurality of nodes (31) and a plurality of edges (32), wherein each of the plurality of nodes (31) is associated with one or more of the plurality of ML model tasks and wherein the plurality of edges (32) define a plurality of dependencies of the plurality of nodes (31) of the CG representation (30); generate an enhanced CG representation of the ML model by adding to the CG representation (30) of the ML model one or more further dependencies of the plurality of nodes (31) of the CG representation (30) of the ML model; and compile the enhanced CG representation of the ML model.
2. The data processing apparatus (10) of claim 1, wherein the data processing apparatus (10) is configured to add the one or more further dependencies to the CG representation (30) of the ML model by adding one or more further edges to the CG representation (30) of the ML model.
3. The data processing apparatus (10) of claim 1 or 2, wherein for adding the one or more further dependencies to the CG representation (30) of the ML model the data processing apparatus (10) is configured to: assign each of the plurality of ML model tasks associated with the plurality of nodes (31) of the CG representation (30) of the ML model to a respective PE (21a-d) of the plurality of PEs (21a-d), wherein the respective ML model task and the respective PE (21a-d) are of the same type; determine for each PE (21a-d) of the plurality of PEs (21a-d) an order relation of the one or more ML model tasks assigned to the PE (21a-d); determine a plurality of task dependencies based on the plurality of order relations; and determine the one or more further dependencies of the plurality of nodes (31) of the CG representation (30) of the ML model based on the plurality of task dependencies.
4. The data processing apparatus (10) of claim 3, wherein, for assigning each of the plurality of ML model tasks associated with the plurality of nodes (31) of the CG representation of the ML model to the respective PE (21a-d) of the plurality of PEs (21a-d), the data processing apparatus (10) is configured to: determine a plurality of precedence constraints among the plurality of ML model tasks associated with the plurality of nodes (31) of the CG representation (30) of the ML model; determine a weight for each ML model task of the plurality of ML model tasks associated with the plurality of nodes (31) of the CG representation (30) of the ML model; and determine for each PE (21a-d) of the plurality of PEs (21a-d) the one or more ML model tasks assigned to the PE (21a-d) and the order relation of the one or more ML model tasks assigned to the PE (21a-d) based on the plurality of precedence constraints among the plurality of ML model tasks and based on the weight for each ML model task of the plurality of ML model tasks.
5. The data processing apparatus (10) of claim 4, wherein the data processing apparatus (10) is configured to determine the weight for each ML model task of the plurality of ML model tasks based on a cost function, based on ML model operator profiling information, and/or based on information about a set of input tensors and/or output tensors of the respective node (31) of the plurality of nodes (31) of the CG representation (30) of the ML model.
6. The data processing apparatus (10) of any one of claims 3 to 5, wherein the data processing apparatus (10) is configured to determine the assignment of ML model tasks to the plurality of PEs (21a-d) and the plurality of order relations for the plurality of ML model tasks based on integer linear programming, ILP, constraint programming, CP, and/or heuristic algorithm design such that the maximum makespan among the plurality of PEs (21a-d) is minimized.
7. The data processing apparatus (10) of any one of claims 3 to 6, wherein each of the plurality of PEs (21a-d) belongs to one of a plurality of PE types and wherein for determining the plurality of task dependencies based on the plurality of order relations the data processing apparatus (10) is configured to: obtain for each of the plurality of PE types an order relation for the tasks assigned to the PEs (21a-d) of the respective PE type; and determine for each PE type a plurality of task dependencies based on the order relation obtained for the PE type.
8. The data processing apparatus (10) of claim 7, wherein the data processing apparatus (10) is configured to determine the plurality of task dependencies based on the plurality of order relations corresponding to PE types as the set of task dependencies having the smallest number of task dependencies among a plurality of candidate sets of task dependencies.
9. The data processing apparatus (10) of any one of claims 3 to 8, wherein for determining the one or more further dependencies of the plurality of nodes (31) of the CG representation (30) of the ML model based on the plurality of task dependencies the data processing apparatus (10) is configured to: remove redundant task dependencies of the plurality of task dependencies for obtaining a reduced plurality of task dependencies; determine a plurality of node dependencies of the plurality of nodes (31) of the CG representation (30) of the ML model based on the reduced plurality of task dependencies; and determine the one or more further dependencies of the plurality of nodes (31) of the CG representation (30) of the ML model based on the plurality of node dependencies.
10. The data processing apparatus (10) of any one of the preceding claims, wherein for compiling the enhanced CG representation of the ML model the data processing apparatus (10) is configured to partition the enhanced CG representation of the ML model into a plurality of subgraphs of the enhanced CG representation of the ML model for each of the plurality of processing devices (20) and to compile the respective subgraph of the enhanced CG representation of the ML model for each of the plurality of processing devices (20).
11. A data processing method (1700) for enhancing a machine learning, ML, model for parallelized operation on a plurality of processing devices (20), wherein each processing device (20) comprises a plurality of processing elements, PEs, (21a-d) configured to perform one or more of a plurality of ML model tasks, wherein the data processing method (1700) comprises: generating (1701) based on the ML model a computational graph, CG, representation (30) of the ML model, wherein the CG representation (30) comprises a plurality of nodes (31) and a plurality of edges (32), wherein each of the plurality of nodes (31) is associated with one or more of the plurality of ML model tasks and wherein the plurality of edges (32) define a plurality of dependencies of the plurality of nodes (31) of the CG representation (30); generating (1703) an enhanced CG representation of the ML model by adding to the CG representation (30) of the ML model one or more further dependencies of the plurality of nodes (31) of the CG representation (30) of the ML model; and compiling (1705) the enhanced CG representation of the ML model.
12. A computer program product comprising a computer-readable storage medium for storing program code which causes a computer or a processor to perform the method (1700) of claim 11, when the program code is executed by the computer or the processor.
PCT/EP2023/065907 2023-06-14 2023-06-14 Devices and methods for optimizing parallel execution of a machine learning model Pending WO2024255997A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2023/065907 WO2024255997A1 (en) 2023-06-14 2023-06-14 Devices and methods for optimizing parallel execution of a machine learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2023/065907 WO2024255997A1 (en) 2023-06-14 2023-06-14 Devices and methods for optimizing parallel execution of a machine learning model

Publications (1)

Publication Number Publication Date
WO2024255997A1 true WO2024255997A1 (en) 2024-12-19

Family

ID=87001895

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/065907 Pending WO2024255997A1 (en) 2023-06-14 2023-06-14 Devices and methods for optimizing parallel execution of a machine learning model

Country Status (1)

Country Link
WO (1) WO2024255997A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11010313B2 (en) 2018-08-29 2021-05-18 Qualcomm Incorporated Method, apparatus, and system for an architecture for machine learning acceleration
US20190391796A1 (en) * 2019-06-28 2019-12-26 Intel Corporation Control of scheduling dependencies by a neural network compiler
US20210255896A1 (en) 2020-02-14 2021-08-19 Beijing Baidu Netcom Science And Technology Co., Ltd. Method for processing tasks in parallel, device and storage medium
CN112948079A (en) 2021-02-18 2021-06-11 北京百度网讯科技有限公司 Task scheduling method, device, equipment and computer storage medium

Similar Documents

Publication Publication Date Title
US8893080B2 (en) Parallelization of dataflow actors with local state
US8239847B2 (en) General distributed reduction for data parallel computing
US11366649B2 (en) Compilation method
US9165035B2 (en) Differential dataflow
Hoang et al. A round-efficient distributed betweenness centrality algorithm
Shen et al. Accelerate FPGA routing with parallel recursive partitioning
Duarte et al. Parallel variable neighbourhood search strategies for the cutwidth minimization problem
WO2018054221A1 (en) Pipeline dependent tree query optimizer and scheduler
US11630986B2 (en) Graph conversion method
Li et al. Amp: Automatically finding model parallel strategies with heterogeneity awareness
JP2014525640A (en) Expansion of parallel processing development environment
Kusum et al. Efficient processing of large graphs via input reduction
Mencagli et al. Spinstreams: a static optimization tool for data stream processing applications
CN117573316A (en) Optimization methods, processing methods, systems and storage media for business calculation graphs
CN118245181A (en) Self-adaptive program parallel dividing and scheduling method for large-scale super-computing system
Khatami et al. A massively parallel distributed n-body application implemented with hpx
Shen et al. Coarse-grained parallel routing with recursive partitioning for FPGAs
Gupta et al. Map-based graph analysis on MapReduce
WO2024255997A1 (en) Devices and methods for optimizing parallel execution of a machine learning model
Wang et al. A divide-and-conquer algorithm for irregular redistribution in parallelizing compilers
Wang et al. An efficient algorithm for irregular redistributions in parallelizing compilers
Prihozhy et al. Techniques for optimization of net algorithms
CN121195252A (en) Devices and methods for optimizing the parallel execution of machine learning models
Higashino et al. Attributed graph rewriting for complex event processing self-management
Ritter et al. Catalog of optimization strategies and realizations for composed integration patterns

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23733885

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: CN2023800983749

Country of ref document: CN