WO2019095873A1 - Task parallel processing method, apparatus, system, storage medium and computer device - Google Patents
Task parallel processing method, apparatus, system, storage medium and computer device
- Publication number
- WO2019095873A1 WO2019095873A1 PCT/CN2018/108298 CN2018108298W WO2019095873A1 WO 2019095873 A1 WO2019095873 A1 WO 2019095873A1 CN 2018108298 W CN2018108298 W CN 2018108298W WO 2019095873 A1 WO2019095873 A1 WO 2019095873A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- task
- processor
- data
- executed
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/485—Task life-cycle, e.g. stopping, restarting, resuming execution
- G06F9/4856—Task life-cycle, e.g. stopping, restarting, resuming execution resumption being on a different machine, e.g. task migration, virtual machine migration
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/547—Remote procedure calls [RPC]; Web services
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/10—Interfaces, programming languages or software development kits, e.g. for simulating neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
Definitions
- the present application relates to the field of computer technology, and in particular, to a task parallel processing method, apparatus, system, storage medium, and computer device.
- CUDA: Compute Unified Device Architecture, the computing platform of the graphics hardware manufacturer NVIDIA
- Cudnn: CUDA Deep Neural Network library, NVIDIA's deep neural network acceleration library
- Cublas: CUDA Basic Linear Algebra Subprograms, NVIDIA's acceleration library for matrix operations
- The program instructions of a convolutional neural network are implemented by programming against such acceleration libraries (for example, the matrix operation acceleration library).
- When programming through the API interfaces of CUDA accelerator libraries such as Cudnn and Cublas, no interdependence between the instructions of the convolutional neural network is expressed, and the programmed instructions can only be executed sequentially.
- a neural network is in fact a series of queued functions, which forms a graph structure.
- among the program instructions of a convolutional neural network there will therefore be task branches.
- tensorflow: Google's second-generation artificial intelligence learning system, developed on the basis of DistBelief
- Caffe: Convolutional Architecture for Fast Feature Embedding, a convolutional neural network framework
- applying the above framework programs to achieve task parallelism not only requires additional software installation, but also involves program interface incompatibility, which is inconvenient to use.
- to this end, the present application proposes a task parallel processing method, including: constructing a task directed acyclic graph DAG according to the dependency relationships between tasks to be executed; distributing each of the tasks to be executed to a plurality of work queues of the processor according to the task directed acyclic graph DAG; and, according to the dependencies of the tasks to be executed in the directed acyclic graph DAG, controlling the tasks that can be executed in parallel in each of the work queues to start running.
- the step of constructing the task directed acyclic graph DAG includes:
- the program is split according to the operation node and/or the data node in the program, and the task to be executed is obtained.
- the step of splitting the program according to the operation nodes in the program to acquire the tasks to be performed includes:
- if the program includes an operation request with a model, splitting the model of that operation request and/or splitting the input data of the model to obtain the tasks to be executed.
- splitting the model of the operation request with the model to obtain the tasks to be performed includes:
- setting in advance the weights corresponding to the tasks obtained by splitting the model, and using each of the weights to set the correspondence between the input data and the output data of the tasks to be executed.
- splitting the model of the operation request with the model to obtain the tasks to be performed includes:
- splitting the model in the window direction and/or the channel direction of the model according to a preset rule to obtain the tasks to be performed.
- the step of splitting the input data of the operation request with the model and obtaining the task to be performed includes:
- the input data of the operation with the model is split in the window direction of the data according to a preset rule, and the task to be executed is obtained.
- the step of splitting the program according to the operation nodes in the program to acquire the tasks to be performed includes:
- if the program includes an operation request without a model, splitting the input data and/or output data of the operation request without the model to obtain the tasks to be executed.
- the step of splitting the input data and/or the output data of the operation request without the model to obtain the task to be performed includes:
- the input data and/or the output data are split in the window direction of the data according to a preset rule to obtain a task to be executed.
- the step of constructing the task directed acyclic graph DAG according to the dependencies between the tasks to be performed includes:
- determining the parallel nodes and the sequential nodes according to the dependencies between the tasks to be executed, and constructing the task directed acyclic graph DAG according to the parallel nodes and the sequential nodes.
- the step of distributing each of the required tasks to be distributed to the plurality of work queues of the processor according to the task directed acyclic graph DAG comprises:
- the step of controlling the parallel execution of the tasks to be executed in each of the work queues according to the dependencies of the tasks to be executed in the acyclic graph DAG includes:
- the present application proposes a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps mentioned in the above method.
- the present application proposes a task parallel processing system including a memory, a multi-core processor, and a computer program stored on the memory and operable on the processor, the multi-core processor being capable of running a splitting algorithm; the steps mentioned in the above method are implemented when the multi-core processor executes the computer program.
- the present application also proposes a task parallel processing system, including a memory, a first processor and a second processor, the first processor being capable of running a splitting algorithm and the second processor being a multi-core processor; the steps mentioned in the above method are implemented when the first processor and the second processor execute the computer program.
- the present application also provides a task parallel processing apparatus, including: a DAG graph construction module, a task distribution module, and a scheduling control module.
- the DAG graph construction module is configured to construct a task directed acyclic graph DAG according to a dependency relationship between tasks to be executed;
- the task distribution module is configured to distribute each of the tasks to be executed to a plurality of work queues of the processor according to the task directed acyclic graph DAG;
- the scheduling control module is configured to control, according to the dependencies of the tasks to be executed in the directed acyclic graph DAG, the tasks that can be executed in parallel in each of the work queues to start running.
- Compared with the prior art, the task parallel processing method, storage medium, computer device, apparatus and system provided by the present application have the following beneficial effects:
- the task parallel processing method, storage medium, computer device, apparatus and system proposed by the present application construct a task directed acyclic graph DAG according to the dependency relationships between tasks, and then perform task distribution and control according to the task directed acyclic graph DAG;
- through the rescheduling of tasks in the work queues, parallel execution of tasks on a multi-core processor is realized, thereby improving data processing efficiency.
- the implementation of the task parallel processing method proposed in this embodiment does not depend on a framework program such as tensorflow or Caffe, so there is no need to consider interface compatibility issues when designing the program.
- the present application further provides an instruction list scheduling method, including: acquiring the to-be-scheduled instruction set in a to-be-scheduled instruction list, and performing data dependency analysis on the to-be-scheduled instruction set to obtain the data dependencies between the instructions in the to-be-scheduled instruction set;
- determining the instructions of each order in the post-scheduling instruction list according to the selection nodes of the corresponding order.
- the step of determining, according to a preset rule, the instructions of each order in the post-scheduling instruction list from the selection nodes of the corresponding order comprises:
- the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- the method comprises:
- the initial execution time is updated to the longest execution time corresponding to the currently accessed selection node.
- the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
- the ordered instruction corresponding to the current access node is determined as the instruction of the corresponding order in the scheduled instruction list
- the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- the instruction sequence in the instruction list to be scheduled is used as the instruction sequence in the post-scheduling instruction table.
- the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
- the selection node is selected according to a random priority rule for access, and the longest execution time corresponding to the selected node currently selected for access is obtained.
- the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
- the selection node is selected for access according to the breadth-first rule, and the longest execution time corresponding to the selected node currently selected for access is obtained.
- the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
- the selection node is selected according to the depth-first rule for access, and the longest execution time corresponding to the selected node currently selected for access is obtained.
- the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
- the selection node that is not less than the preset order is selected according to the depth-first rule to obtain the longest execution time corresponding to the selected node currently selected for access.
- the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
- the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- the step of determining the instructions of each order in the post-scheduled instruction list in the selection node according to the corresponding order according to the preset rule comprises:
- All the selected nodes corresponding to the current order are evaluated according to the preset priority of the instruction, the evaluation results of the selected nodes of the current order are obtained, and the instruction corresponding to the current order is determined according to the evaluation result.
- the method includes setting a priority of each instruction according to a specific content and/or type of the currently selected node.
- the step of determining the instructions of each order in the post-scheduled instruction list in the selection node according to the corresponding order according to the preset rule comprises:
- the instruction corresponding to the current order is determined according to the length of the shortest execution time corresponding to all the selected nodes in the current order.
- An instruction scheduling device includes: an acquisition module, a data dependency analysis module, and an evaluation module,
- the obtaining module is configured to obtain a to-be-scheduled instruction set in the to-be-scheduled instruction list, and obtain, according to a data dependency relationship between the instructions, all the selected nodes corresponding to each instruction selection in the instruction scheduling process;
- the data dependency analysis module is configured to perform data dependency analysis on the instruction set to be processed, and obtain a data dependency relationship between the instructions;
- the evaluation module is configured to determine, according to a preset rule, an instruction of each order in the scheduled instruction list according to a selection node in a corresponding order.
- a computer device comprising a memory, a processor, and a computer program stored on the memory and operative on the processor, the processor performing the steps recited in the method described above.
- a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps recited in the method described above.
- the instruction list scheduling method, device, computer device and storage medium provided by the present application have the following beneficial effects:
- all the selected nodes corresponding to each instruction selection in the scheduling process are obtained, and then the instructions of each order in the scheduled instruction list are determined according to the evaluation results of the selected nodes corresponding to the respective orders.
- the method can ensure that, each time an instruction is selected, the selected instruction is the optimal result for the current state; in the post-scheduling instruction list obtained from these optimal results, the arrangement of the instructions is more compact, which shortens the execution time of the instruction sequence in the original instruction list.
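The patent leaves the concrete evaluation rule open (longest execution time, shortest execution time, depth-first or breadth-first traversal of the selection nodes are all claimed variants). Purely as an illustration, the sketch below uses one possible preset rule, a critical-path priority, to pick the best candidate among the ready instructions at each order; all instruction names and latencies are invented.

```python
# Minimal greedy list-scheduling sketch (illustrative only; `deps`, `latency`
# and the critical-path priority are assumptions, not the patent's exact rule).
def list_schedule(instructions, deps, latency):
    # deps[i] = set of instructions that must precede instruction i
    # latency[i] = execution time of instruction i
    succ = {i: set() for i in instructions}
    for i, ds in deps.items():
        for d in ds:
            succ[d].add(i)

    prio = {}
    def critical_path(i):
        # priority = length of the longest dependent chain starting at i
        if i not in prio:
            prio[i] = latency[i] + max((critical_path(s) for s in succ[i]), default=0)
        return prio[i]

    scheduled, remaining = [], set(instructions)
    while remaining:
        ready = [i for i in remaining if deps.get(i, set()) <= set(scheduled)]
        ready.sort(key=critical_path, reverse=True)  # evaluate the selection nodes
        best = ready[0]                              # keep the best-evaluated one
        scheduled.append(best)
        remaining.remove(best)
    return scheduled

print(list_schedule(
    ["load_a", "load_b", "mul", "store"],
    {"mul": {"load_a", "load_b"}, "store": {"mul"}},
    {"load_a": 2, "load_b": 2, "mul": 4, "store": 1}))
```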
- the present application also provides a computer device, including a first processor, a second processor, and a memory, wherein the memory stores offline models and input data corresponding to a plurality of original networks, as well as a runtime system capable of running on the first processor; the runtime system includes:
- a data processing device configured to acquire the offline model and input data corresponding to the current original network from the memory, where the offline model corresponding to the current original network includes the model parameters and instructions corresponding to each computing node in the original network, and the interface data of each computing node in the original network;
- a device management device configured to control the second processor to be turned on or off
- a task execution device configured to control the second processor to run an offline model of the current original network and input data.
- the data processing apparatus includes an offline model loading module and an input data loading module;
- the offline model loading module is configured to obtain an offline model corresponding to each current original network from the memory, and parse the offline model corresponding to the current original network;
- the input data loading module is configured to obtain input data corresponding to the current original network from the memory.
- the data processing apparatus further includes an input data pre-processing module, where the input data pre-processing module is configured to pre-process the input data corresponding to the current original network acquired by the input data loading module, so as to enable the second processor to run on the input data corresponding to the current original network, and to store the output data obtained by the second processor to the memory.
- the computer device further includes application software capable of running on the runtime system
- the data processing device is capable of providing an offline model API and an input data API
- the device management device is capable of providing a second processor driver API
- the task execution device is capable of providing a second processor running API
- the application software is capable of invoking the offline model API and input data API, the second processor driver API, and the second processor running API.
- the number of the second processors is multiple, or the second processor includes multiple processing modules;
- the task execution apparatus is further capable of providing a task assignment API
- the application software is further capable of invoking the task assignment API to control a plurality of the second processors or a plurality of processing modules of the second processor.
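As a rough illustration of how application software might drive these interfaces in sequence (load the offline model and input data, power on the second processor, run, then power off), the sketch below uses invented class and method names and file names; it is not the patent's actual API.

```python
# Hypothetical call sequence for the runtime-system APIs described above.
# All names (RuntimeSystem, load_offline_model, ...) are illustrative placeholders.
class RuntimeSystem:
    def load_offline_model(self, path):            # offline model API
        print(f"parse offline model from {path}")
    def load_input_data(self, path):               # input data API
        print(f"read input data from {path}")
        return "input"
    def open_device(self):                         # second processor driver API
        print("power on second processor")
    def run(self, data, device_id=0):              # running / task assignment API
        print(f"run offline model with {data} on device {device_id}")
        return "output"
    def close_device(self):                        # second processor driver API
        print("power off second processor")

rt = RuntimeSystem()
rt.load_offline_model("offline_model.bin")
x = rt.load_input_data("input.bin")
rt.open_device()
y = rt.run(x, device_id=0)
rt.close_device()
```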
- the present application also provides a data processing method for the computer device, the method comprising the following steps:
- controlling the data processing device to obtain, from the memory, the offline model and input data corresponding to the current original network, where the offline model corresponding to the current original network includes the model parameters and instructions corresponding to the respective computing nodes in the current original network, and the interface data of each computing node in the current original network;
- the method further includes the following steps:
- before the step of acquiring the offline model and input data corresponding to the current original network from the memory, the method further includes the following steps:
- the present application also provides a data processing method for the computer device, the method comprising the following steps:
- the second processor driver API is called to control the second processor to shut down.
- the present application also provides a computer readable storage medium having stored thereon a computer program that, when executed by one or more processors, implements the steps of any of the methods described above.
- the computer device, the data processing method and the storage medium can directly obtain the offline model and input data corresponding to the current original network from the memory, so that the second processor of the computer device can run the current original network according to the offline model and input data of the original network and obtain the output data of the current original network. Since the offline model corresponding to each original network only includes the model parameters and instructions corresponding to each computing node in the original network and the interface data of each computing node in the original network, the data size of the offline model of the original network is much smaller than the data size of the original network, so that by running the (lightweight) offline model corresponding to the current original network on the computer device, the processing of heavyweight neural network data by the computer device can be realized. At the same time, by directly running the offline model corresponding to the current original network on the computer device, the processing speed and efficiency of the computer device can be improved without performing processing operations such as compiling each computing node of the current original network.
- the present application further provides a computer device, including a first processor, a second processor, a first memory, and a second memory, wherein the first memory stores offline models and input data corresponding to a plurality of original networks, and a runtime system capable of running on the first processor, and the second memory stores an operating system capable of running on the first processor or the second processor;
- the runtime system is a secure runtime system established based on a trusted operating environment, and the first memory is a secure storage medium; when the runtime system is running on the first processor, the runtime system can obtain the offline model and the input data corresponding to the current original network from the first memory, and control the second processor to run the offline model corresponding to the current original network;
- the offline model corresponding to the current original network includes model parameters, instructions corresponding to each computing node in the original network, and interface data of each computing node in the original network.
- the runtime system comprises:
- the data processing device capable of providing an offline model API and an input data API, configured to acquire an offline model and input data corresponding to the current original network from the first memory;
- the device management device capable of providing a second processor driving API, configured to control the second processor to be turned on or off;
- a task execution device capable of providing a second processor running API for controlling the second processor to run an offline model of the current original network and input data.
- the data processing apparatus includes an offline model loading module and an input data loading module;
- the offline model loading module is configured to provide an offline model API, configured to obtain an offline model corresponding to each current original network from the first memory, and parse the offline model corresponding to the current original network;
- the input data loading module is capable of providing an input data API for obtaining input data corresponding to the current original network from the first memory.
- the data processing apparatus further includes an input data pre-processing module, the input data pre-processing module capable of providing a data pre-processing API for pre-processing the input data of the current original network, so that The second processor is capable of running input data of the current original network and for storing output data obtained by the second processor to the first memory.
- the number of the second processors is multiple, or the second processor includes multiple processing modules;
- the task execution apparatus is further capable of providing a task assignment API for controlling a plurality of the second processors or controlling a plurality of processing modules of the second processor.
- the computer device further includes secure application software capable of running on the runtime system, and the application software is capable of invoking the offline model API and input data API, the second processor driver API, and the second processor running API.
- the first memory and the second memory are physically disposed independently of each other;
- or the first memory and the second memory are integrated, with the first memory and the second memory logically disposed independently of each other.
- the present application also provides a data processing method for the computer device, the method comprising the following steps:
- the second processor that controls the computer device runs the current original network according to the offline model and the input data corresponding to the current original network, and obtains output data of the current original network;
- the output data of the current original network is stored into the first memory.
- the application also provides a data processing method for the computer device, the method comprising the following steps:
- the second processor driver API is called to control the second processor to shut down.
- the method further includes the following steps:
- the method further includes the following steps:
- the present application also provides a computer readable storage medium having stored thereon a computer program that, when executed by one or more processors, implements the steps of the method described in any of the above.
- the computer device, the data processing method, and the storage medium can directly obtain the offline model and the input data corresponding to the current original network from the first memory, so that the second processor of the computer device can run the current original network according to the obtained offline model and input data of the original network. Since the offline model of the current original network only stores necessary network structure information, such as the model parameters and instructions corresponding to each computing node in the current original network and the interface data of each computing node in the current original network, the data size of the offline model of the current original network is much smaller than the data size of the current original network. Thus, by running the offline model of the current original network, the secure runtime system established on the basis of a trusted execution environment such as a TEE can extend its application range to the processing of heavyweight data such as neural networks.
- moreover, the processing speed and efficiency of the computer device can be improved without performing processing operations such as compiling each computing node of the original network.
- FIG. 1 is a schematic structural diagram of a task parallel processing system proposed in an embodiment
- FIG. 2 is a schematic structural diagram of a task parallel processing system proposed in an embodiment
- FIG. 3 is a flow chart showing the steps of a task parallel processing method proposed in an embodiment
- FIG. 4 is a schematic diagram of splitting input data and output data of an operation request without a model proposed in an embodiment
- FIG. 5 is a schematic diagram of input and output of a convolution operation (conv) of a neural network model proposed in an embodiment
- FIG. 6 is a schematic diagram of splitting a conv model proposed in an embodiment
- FIG. 7 is a flow chart showing the steps of a task parallel processing method proposed in an embodiment
- Figure 8 is a task directed acyclic graph DAG constructed in an embodiment
- FIG. 9 is a schematic diagram of a result of task assignment performed in an embodiment
- FIG. 10 is a flow chart showing the steps of a task parallel processing method proposed in an embodiment
- Figure 11 is a task directed acyclic graph DAG constructed in an embodiment
- FIG. 12 is a schematic diagram of a result of task assignment performed in an embodiment
- FIG. 13 is a schematic structural diagram of a task parallel processing apparatus according to an embodiment
- FIG. 14 is a schematic structural diagram of a computer system in an embodiment
- FIG. 15 is a flow chart showing the steps of an instruction list scheduling method in an embodiment
- FIG. 16 is a data dependency diagram of instructions to be scheduled obtained in an embodiment
- FIG. 17 is an association diagram of selection nodes obtained in one embodiment
- FIG. 18 is a schematic structural diagram of an instruction list scheduling apparatus according to an embodiment
- FIG. 20 is a structural block diagram of a computer device in an embodiment
- FIG. 21 is a structural block diagram of an embodiment of the first processor of FIG. 20;
- FIG. 22 is a block diagram showing the structure of an embodiment of the runtime system of FIG. 20;
- FIG. 23 is a structural block diagram of another embodiment of the runtime system of FIG. 20;
- FIG. 24 is a flowchart of a data processing method of an embodiment of the computer device of FIG. 20;
- FIG. 25 is a flowchart of a data processing method of another embodiment of the computer device of FIG. 20;
- FIG. 26 is a flowchart of an offline model generation method according to an embodiment
- FIG. 27 is a flowchart of a method for generating an offline model according to another embodiment
- FIG. 28 is a network structure diagram of a neural network according to an embodiment
- FIG. 29 is a schematic diagram of an offline model generation process of the neural network in FIG. 28;
- Figure 30 is a block diagram showing the structure of a computer device in another embodiment
- FIG. 31 is a flowchart of a data processing method of an embodiment of the computer device of FIG. 30;
- FIG. 32 is a flow chart of a data processing method of another embodiment of the computer device of FIG. 30.
- FIG. 1 is a schematic structural diagram of a task parallel processing system 600 (hereinafter referred to as a first task parallel processing system for convenience of distinction) according to an embodiment of the present application.
- the task parallel processing system includes a processor 620 and a memory 610.
- the memory 610 stores instructions executable by the processor 620.
- the processor 620 includes a plurality of processor cores, and each processor core can communicate with the others through an internal bus and execute different tasks.
- the processor core of processor 620 can run a split algorithm.
- FIG. 2 is a schematic structural diagram of another task parallel processing system 700 (hereinafter referred to as a second task parallel processing system for convenience of distinction) according to an embodiment of the present application.
- the task parallel processing system includes a first processor 710, a second processor 720, and a memory 730. Instructions executable by the first processor 710 and/or the second processor 720 are stored on the memory 730.
- the processor core of the first processor 710 is required to have the ability to run a split algorithm; the second processor 720 may not have the ability to run a split algorithm.
- the respective processor cores of the first processor 710 and the second processor 720 communicate via the internal bus to perform different tasks.
- the first processor 710 and the second processor 720 communicate via a bus to work together.
- the first processor 710 may be a multi-core processor or a single-core processor.
- the second processor 720 can be a multi-core processor.
- FIG. 3 is a flow chart of steps of a task parallel processing method proposed by the present application.
- the method can be applied to the task parallel processing system shown in FIG. 1 or FIG. 2, and the following steps may be stored in the memory of the task parallel processing system in the form of instructions.
- the task parallel processing method may include:
- Step S301 Construct a task directed acyclic graph DAG according to the dependency relationship between the tasks to be executed.
- the directed acyclic graph DAG in this embodiment is for indicating the drive dependency between tasks to be executed.
- DAG: Directed Acyclic Graph
- DAG is a kind of directed graph. It is often used to represent the driving dependencies between events and to manage the scheduling between tasks. Based on these characteristics of DAG, DAG can be used to describe the logical relationship between acquired tasks to be executed.
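Purely as an illustration (the patent does not prescribe a concrete data structure), a task DAG can be held as an adjacency list in which an edge u -> v means that task v depends on the output of task u; the task names below are invented.

```python
# Illustrative only: one simple way to hold a task DAG as an adjacency list.
task_dag = {
    "A": ["B", "C"],   # B and C both consume A's output and may run in parallel
    "B": ["D"],
    "C": ["D"],
    "D": [],           # D must wait for both B and C
}

def dependencies(dag):
    """Invert the edges to get, for each task, the tasks it waits on."""
    deps = {t: set() for t in dag}
    for u, outs in dag.items():
        for v in outs:
            deps[v].add(u)
    return deps

print(dependencies(task_dag))  # {'A': set(), 'B': {'A'}, 'C': {'A'}, 'D': {'B', 'C'}}
```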
- the tasks to be executed may be obtained by the processor core of the processor 620 in the first task parallel processing system 600 executing a preset splitting algorithm and splitting the program to be executed.
- alternatively, the tasks to be executed may be obtained by the processor core of the first processor 710 in the second task parallel processing system 700 executing a preset splitting algorithm and splitting the program to be executed.
- This implementation step S301 can be performed by the processor core of the processor 620 in the first task parallel processing system 600, or by the processor core of the first processor in the second task parallel processing system 700.
- Step S302 Distribute each of the tasks to be executed to a plurality of work queues of the processor according to the task directed acyclic graph DAG.
- the processor core of the processor in the first task parallel processing system 600, or the processor core in the second task parallel processing system 700 may include one or more work queues.
- a work queue is a mechanism for deferring task execution; it runs the pending tasks to be executed in order. The running of each task in a work queue is controlled by a kernel thread, so the control thread of the work queue can be adjusted by the interrupt control mechanism of the processor system to achieve task rescheduling or even sleep.
- the downstream tasks hanging from parallel nodes in the directed acyclic graph DAG are generally tasks that can be executed in parallel. Therefore, the tasks to be executed can be distributed according to the constructed task directed acyclic graph DAG.
- implementation step S302 may be performed by any processor core in the first task parallel processing system 600, or may be performed by any processor core in the second task parallel processing system 700.
- Step S303 According to the dependency relationship of each task to be executed in the acyclic graph DAG, the parallel execution tasks in each of the work queues are controlled to start running.
- each work queue runs independently; when a task in one work queue depends on the output result of tasks to be executed in other work queues, an execution error will occur if the tasks to be executed are not scheduled. Therefore, in order to ensure that the program outputs the correct result, each task to be executed in each work queue is scheduled according to the dependency relationships of the tasks in the task directed acyclic graph DAG, and the running of each task to be executed is controlled.
- this implementation step may be performed by any of the processor cores in the first task parallel processing system 600, or may be performed by any of the processor cores in the second task parallel processing system 700.
- the task parallel processing method proposed in this embodiment constructs a task directed acyclic graph DAG according to the dependency relationship between the tasks to be executed, and then performs task distribution and control according to the task directed acyclic graph DAG.
- the rescheduling of the work queue realizes the parallelism of the tasks of the multi-core processor, improving the data processing efficiency.
- the implementation of the task parallel processing method proposed in this embodiment does not depend on a framework program such as tensorflow or Caffe, so there is no need to consider interface compatibility issues when designing the program.
- in one embodiment, the step of constructing the task directed acyclic graph DAG according to the dependencies between the tasks to be performed includes:
- the program is split according to the operation node and/or the data node in the program, and the task to be executed is obtained.
- the execution program contains multiple operation requests (such as conv, pool, active, add, etc.), and there are operation nodes between the operation requests. Therefore, the tasks to be executed can be obtained by splitting the program according to the operation nodes.
- some operation requests may only be executed sequentially. In this case, the execution program can be considered at the data level (code level) and split according to the data nodes in the program to increase the parallelism of the tasks.
- this implementation step requires the processor core of the processor 620 in the first task parallel processing system 600, or the processor core of the first processor 710 in the second task parallel processing system 700, to execute a preset splitting algorithm and split the program to be executed according to the operation nodes and/or data nodes in the program to obtain the tasks to be performed.
- when the execution program is split, it may be split only according to the operation nodes, or split according to the data nodes directly at the data level, or the two may be combined.
- the split mode is selected according to actual needs, which is not limited in this application.
- when the processor core of the processor 620 in the first task parallel processing system 600, or the processor core of the first processor 710 in the second task parallel processing system 700, splits the program according to the operation nodes in the program, there are two situations: 1) the program includes an operation request without a model; 2) the program includes an operation request with a model.
- Case 1: when the program includes an operation request without a model (such as pool, batchnorm, Lrn, active, add, etc.), the step of splitting the program according to the operation nodes in the program to obtain the tasks to be executed includes:
- the input data and/or the output data of the operation request without the model are split to obtain the tasks to be executed.
- specifically, the input data and/or the output data of the operation request without the model may be split in the window direction (height-width direction, hw direction) of the data according to a preset rule to obtain the tasks to be executed.
- FIG. 4 is a schematic diagram of splitting the input data and output data of an operation request without a model in the window direction of the data.
- the default rule for this split is to divide the input data and output data equally on the plane where the window is located.
- equally dividing the input data and the output data on the plane where the window is located is only one specific form of splitting the input data and the output data in the window direction of the data proposed by this embodiment.
- the data may also be split in the window direction in a non-uniform manner, or split in the window direction in different equal-division manners; as long as the input data and the output data can be split according to certain rules, the purpose of this step can be achieved, and how to split is not limited in this application.
- the present application proposes splitting the input data and the output data in the window direction of the data in order to obtain a plurality of tasks to be performed, and the purpose of this step can be achieved as long as the input data and the output data are split. Therefore, when splitting an operation request without a model to obtain the tasks to be executed, only the input data may be split, only the output data may be split, or both the input data and the output data may be split.
- all of the above situations can achieve the purpose of this step; specifically, how to split can be flexibly selected according to the specific operation and actual needs.
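The following is a minimal sketch of this kind of split for a model-free elementwise operation (add): the input and output data are cut equally along the height of the hw plane, each chunk becomes an independent task, and the partial results are stitched back together. NumPy, the NCHW layout, and the two-way equal split are assumptions made only for illustration.

```python
import numpy as np

# Split the data of a model-free elementwise op (add) equally along the height
# (window) direction, producing independent subtasks that can run in parallel.
def split_hw(x, y, parts=2):
    xs = np.array_split(x, parts, axis=2)   # axis 2 = height in the assumed NCHW layout
    ys = np.array_split(y, parts, axis=2)
    return list(zip(xs, ys))                # each pair is one task to be executed

x = np.ones((1, 3, 8, 8))
y = np.ones((1, 3, 8, 8))
subtasks = split_hw(x, y)
out = np.concatenate([a + b for a, b in subtasks], axis=2)  # run subtasks, then stitch
assert np.array_equal(out, x + y)
```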
- Case 2: when the program includes an operation request with a model (such as conv, mlp, etc.), the step of splitting the program according to the operation nodes in the program to obtain the tasks to be executed includes:
- the weights corresponding to the tasks to be executed obtained by splitting the model are set in advance, and each of the weights is used to set the correspondence between the input data and the output data of the tasks to be executed.
- specifically, the model of the operation with a model may be split in the window direction (height-width direction, hw direction) of the model according to a preset rule to obtain the tasks to be executed; it is also possible to split the model in the channel direction (C direction) of the model to obtain the tasks to be executed, or to combine the two.
- the input data of the operation with the model can also be split on the hw plane to obtain the task to be executed.
- Fig. 5 is a schematic diagram showing the input and output of a convolution operation (conv) of a neural network model.
- Figure 6 shows a schematic diagram of splitting the conv model in the channel direction.
- the mlp (Multi-Layer Perceptron) task is divided into three subtasks in the C direction of the model.
- the input data X is split into x1, x2, x3, and the corresponding output data is y1, y2, y3.
- the output data Y can be obtained by arithmetic processing y1, y2, y3.
- the method of splitting the input data of an operation with a model on the hw plane is similar to splitting the input data of an operation without a model on the hw plane, and will not be described in detail here.
- when splitting an operation request with a model, the split may be made only in the C direction of the model, only on the hw plane of the model, or in both the C direction and the hw plane of the model. Although combining multiple splitting methods can increase the parallelism of the tasks and theoretically reduce the running time of the program, the difficulty of implementation increases accordingly; moreover, in practical applications, the actual running time of the split tasks is slightly longer than the theoretical running time. Therefore, how to split an operation request with a model needs to be chosen according to the actual scenario, which is not limited in this application.
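As an illustration of the channel-direction split described for the mlp example above (Y being recovered by arithmetic processing of the partial outputs y1, y2, y3), the sketch below cuts the weights of a fully-connected operation and its input data into three slices along the channel direction; the shapes and the three-way split are invented for the example.

```python
import numpy as np

# Sketch of splitting an mlp operation in the C direction of the model:
# W and X are cut into three slices, the three partial products run as
# independent tasks, and Y is recovered by summing the partial outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 9))     # model weights, 9 input channels (assumed shape)
X = rng.standard_normal(9)          # input data

W1, W2, W3 = np.split(W, 3, axis=1) # split the model in the channel direction
x1, x2, x3 = np.split(X, 3)         # split the input data accordingly
y1, y2, y3 = W1 @ x1, W2 @ x2, W3 @ x3   # three tasks that can run in parallel

Y = y1 + y2 + y3                    # arithmetic processing of the partial outputs
assert np.allclose(Y, W @ X)
```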
- the tasks to be executed obtained by the methods of the above two cases have high parallelism, and the parallel nodes of the task directed acyclic graph DAG are more abundant, which makes the execution of the program more efficient.
- the processor core of the first task parallel processing system 600 or the second task parallel processing system 700 constructs a task directed acyclic graph DAG according to the obtained dependency relationship between the tasks to be executed.
- a task directed acyclic graph DAG is constructed according to the parallel node and the sequential node.
- when there is no dependency between two tasks to be executed, the two tasks are generally parallel tasks; when there is a dependency between two tasks to be executed, the two tasks are generally serial tasks. Therefore, the parallel nodes and the sequential nodes in the task directed acyclic graph DAG can be determined according to the dependencies between the tasks to be executed, each task is filled into the corresponding position of the task directed acyclic graph DAG according to the determined node types, and the construction of the task directed acyclic graph DAG is completed.
- the task parallel processing system includes at least one processor that can run the splitting algorithm, and is used for splitting the program to obtain a task to be executed.
- the processor core of the first task parallel processing system 600 or the second task parallel processing system 700 distributing each of the tasks to be executed to the plurality of work queues of the processor according to the task directed acyclic graph DAG includes:
- Step S2021 Perform topological sorting on the task directed acyclic graph DAG, and obtain task topological sorting sequences.
- Step S2022 Sort the obtained topological sorting sequences according to the preset execution time of each task to be executed, and obtain the longest topological sorting sequence.
- Step S2023 Distribute each of the tasks to be executed to the work queue according to the longest topology sorting sequence and the dependencies between the tasks to be executed.
- when the processor core performs task distribution, a task may be distributed to the work queue of a processor core that runs the splitting algorithm, for example, to the work queue of the processor core of the processor 620 in the first task parallel processing system 600; a task may also be distributed to the work queue of a processor core that does not have the ability to run the splitting algorithm, such as the work queue of the processor core of the second processor 720 in the second task parallel processing system 700.
- as long as the processor core can perform the distributed tasks, it can be guaranteed that the program to be executed can be executed in parallel; whether the processor core that executes the tasks can run the splitting algorithm does not affect the execution of the program. Therefore, this application does not limit this.
- by performing task distribution according to the longest path of the task topological sorting sequences, the execution time of the program can be optimized; that is, the time for executing the tasks in the longest topological sorting sequence is theoretically the program execution time, so that the program to be executed is executed in the shortest time.
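The following is a rough sketch of steps S2021-S2023 under stated assumptions: it topologically sorts the task DAG, finds the longest chain by preset execution time, pins that chain to one work queue, and spreads the remaining tasks over the other queues. The round-robin filling of the remaining queues is an assumption; the patent does not fix that policy.

```python
from graphlib import TopologicalSorter

def distribute(deps, exec_time, num_queues):
    # deps maps each task to the set of tasks it depends on
    order = list(TopologicalSorter(deps).static_order())      # S2021: topological sort
    dist = {t: exec_time[t] for t in order}
    prev = {t: None for t in order}
    for t in order:                                            # S2022: longest chain by time
        for d in deps.get(t, ()):
            if dist[d] + exec_time[t] > dist[t]:
                dist[t], prev[t] = dist[d] + exec_time[t], d
    tail, critical = max(dist, key=dist.get), []
    while tail is not None:
        critical.append(tail)
        tail = prev[tail]
    queues = [[] for _ in range(num_queues)]
    queues[0] = list(reversed(critical))                       # S2023: pin the longest sequence
    rest = [t for t in order if t not in critical]
    for i, t in enumerate(rest):                               # spread the remaining tasks
        queues[1 + i % (num_queues - 1)].append(t)
    return queues

deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(distribute(deps, {"A": 3, "B": 2, "C": 5, "D": 4}, 2))   # [['A','C','D'], ['B']]
```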
- the processor core of the first task parallel processing system 600 or the second task parallel processing system 700 controlling, according to the dependencies of the tasks to be executed in the directed acyclic graph DAG, the tasks that can be executed in parallel in each of the work queues to start running includes:
- Step S3031 Set a reference count for each of the required tasks to be executed according to the task directed acyclic graph DAG.
- Step S3032 If a task that a to-be-executed task depends on has been executed, modify the reference count of that to-be-executed task;
- Step S3033 When the reference count of a task to be executed reaches the preset value, control the task whose reference count has reached the preset value in each of the work queues to start running.
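A minimal single-threaded sketch of this reference-counting control is shown below; the task names are invented, the preset value is taken to be 0, and a real implementation would signal the kernel threads of the work queues instead of running tasks inline.

```python
from collections import deque

def run_with_ref_counts(deps):
    # deps maps each task to the set of tasks it depends on
    ref = {t: len(d) for t, d in deps.items()}          # S3031: set reference counts
    dependents = {t: [] for t in deps}
    for t, ds in deps.items():
        for d in ds:
            dependents[d].append(t)
    ready = deque(t for t, c in ref.items() if c == 0)
    while ready:
        task = ready.popleft()
        print("run", task)                              # S3033: start the released task
        for nxt in dependents[task]:                    # S3032: a dependency finished,
            ref[nxt] -= 1                               #         so lower the count
            if ref[nxt] == 0:
                ready.append(nxt)

run_with_ref_counts({"T1": set(), "T2": {"T1"}, "T3": {"T1"}, "T4": {"T2", "T3"}})
```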
- Figure 7 is a flow chart showing the steps of a task parallel processing method. The method includes:
- Step S701 Split the execution program according to the operation nodes in the execution program, obtain the tasks to be executed A3, B2, C2, D4, E5, and F1, and construct the task directed acyclic graph DAG shown in Figure 8 according to the dependency relationships between the tasks to be executed.
- Step S702 According to the task directed acyclic graph DAG shown in FIG. 8, the tasks A3, B2, C2, D4, E5, and F1 are to be distributed to the work queue 1, the work queue 2, and the work queue 3. The distribution results are shown in Figure 9.
- Step S703 Set reference counts for the tasks to be executed A3, B2, C2, D4, E5, and F1 according to the task directed acyclic graph DAG, and control the running of A3, B2, C2, D4, E5, and F1 according to the set reference counts.
- a task in a work queue starts running only when its reference count reaches the preset value. The reference count of task A3 is 0, so task A3 can be executed directly after being put into the work queue; task E5 needs the execution results of task B2 and task C2, so the reference count of task E5 is set to 2.
- after one of tasks B2 and C2 has been executed, the reference count of task E5 is adjusted to 1; after both have been executed, the reference count of task E5 is adjusted to 0. When its reference count is 0, task E5 can start running; the running of task F1 is then controlled, and finally the execution of the program is completed.
- Figure 10 is a flow chart showing the steps of a task parallel processing method. The method includes:
- Step S6001 Split the program to be executed according to the data nodes in the following execution program, obtain the tasks to be executed, and construct the task directed acyclic graph DAG shown in Figure 11 according to the dependency relationships between the tasks to be executed.
- A, B, C, D, E are data nodes, conv, pool, active, add are operation nodes.
- in the task directed acyclic graph DAG of this embodiment, obtaining the result of data E depends on the processing results of data C and data D, obtaining data C and data D depends on the processing result of data B, and obtaining data B depends on the processing result of data A.
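A toy program with exactly this dependency structure is sketched below (which of pool and active produces C versus D is an assumption, since Figure 11 is not reproduced here, and the operation bodies are stand-in stubs, not real kernels). Splitting at data nodes C and D lets pool(B) and active(B) be placed in different work queues and run in parallel.

```python
import numpy as np

def conv(x):   return x * 2.0            # stand-in for a convolution
def pool(x):   return x / 2.0            # stand-in for pooling
def active(x): return np.maximum(x, 0.0) # stand-in for an activation

A = np.arange(-4.0, 4.0)   # data node A (invented values)
B = conv(A)                # B depends only on A
C = pool(B)                # C depends on B
D = active(B)              # D depends on B and is independent of C -> parallel with C
E = C + D                  # E = add(C, D) depends on both C and D
print(E)
```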
- Step S6002 According to the task directed acyclic graph DAG described in FIG. 11, each task to be executed is distributed to the work queue 1' and the work queue 2'. The distribution results are shown in Figure 12.
- Step S6003 Set a reference count according to the task-oriented acyclic graph DAG for the task to be executed, and control the running of each task to be executed according to the set reference count.
- when the reference count of a task to be executed in a work queue reaches the preset value, the task starts running; otherwise it does not run.
- each time a task that it depends on finishes execution, a task's reference count is decremented by one, until it is reduced to zero and the task can be executed.
- when the reference count of task E, add(C, D), becomes 0, task E is executed; after task E is executed, the execution of the program is completed.
- the present application proposes a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the method referred to in the above embodiments.
- the present application proposes a task parallel processing device, which is shown in FIG. 13 and includes a DAG map construction module 410, a task distribution module 420, and a schedule control module 430.
- the DAG map construction module 410 is configured to construct a task directed acyclic graph DAG according to the dependency relationships between the tasks to be executed;
- the task distribution module 420 is configured to distribute each of the tasks to be executed to a plurality of work queues of the processor according to the task directed acyclic graph DAG;
- the scheduling control module 430 is configured to control, according to the dependencies of the tasks to be executed in the directed acyclic graph DAG, the tasks that can be executed in parallel in each of the work queues to start running.
- the DAG map construction module 410 is configured to split the program according to the operation node and/or the data node in the program to obtain the task to be executed.
- the DAG map construction module 410 is configured to split the model of the operation request with the model and/or input data to the model if the program includes an operation request with a model. Perform a split to get the task to be performed.
- the DAG map construction module 410 is configured to split the input data and/or output data of the operation request without the model if the program includes an operation request without a model, and obtain the required data. Perform the task.
- the DAG map construction module 410 is configured to determine parallel nodes and sequential nodes in the directed directed acyclic graph DAG according to the obtained dependencies between the tasks to be executed; The parallel node and the sequential node construction task directed acyclic graph DAG.
- the task distribution module 420 is configured to perform topology sorting on the task-oriented acyclic graph DAG, and obtain a task topology sorting sequence; according to the preset execution time of each task to be executed, the obtained location
- the topological sorting sequence is sorted to obtain a longest topological sorting sequence; and each of the required tasks to be executed is distributed to the working queue according to the longest topological sorting sequence and the dependencies between the respective tasks to be executed.
- the scheduling control module 430 is configured to set a reference count for each task to be executed according to the task directed acyclic graph DAG; if a task that another task depends on has finished executing, modify the reference count of that dependent task; and when the reference count of a task to be executed reaches the preset value, control that task to be executed in its work queue to start running.
- the present application can be implemented by hardware, or by software plus a necessary general hardware platform.
- the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a mobile hard disk, etc.) and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to run the methods of the various implementation scenarios of the present application.
- the different instructions can be processed in parallel according to the corresponding instruction list, thereby improving the processing efficiency of the computer system.
- the order of the instructions in the instruction list corresponding to each processor core in the above computer system may be unreasonable; for example, the instructions in the instruction list may not be parallelized as much as possible, so the processing efficiency of the processing system is not improved, or is improved only slightly. Therefore, how to provide an instruction list scheduling method, device, computer device and storage medium that adjust the instruction order in the instruction list, so that the arrangement of the instructions in the instruction list is more compact and the execution time of the instruction list is shortened, has become an urgent technical problem to be solved.
- the computer system 300 of one embodiment may be a computing system including multiple processors, such as a multi-core processor computing system or a heterogeneous computing system.
- the computer system may specifically include an instruction list scheduling device 310, a plurality of first processors 320, and a memory 330.
- the plurality of first processors 320 may be simultaneously connected to the instruction list scheduling device 310, and the instruction list scheduling device 310 may reschedule the instruction lists of the plurality of first processors 320.
- the instruction list scheduling device 310 may also include a second processor.
- the second processor may include an acquisition module, a data dependency analysis module, an evaluation module, an operation module, a control module, and the like, wherein the acquisition module may be a hardware module such as an IO (Input/Output) interface.
- the arithmetic module and the control module are hardware modules.
- the plurality of first processors 320 can process different instructions in parallel according to the instruction list to improve the processing efficiency of the computer system.
- the instruction list may include one or more instructions, and each instruction includes a set of reference operations on the resource, and the resource referenced by the instruction may be obtained by reading or running the instruction. That is, when the first processor or the like executes the instruction, the resource referenced by the instruction can be called to implement a specific operation.
- the instruction may be a load instruction, a calculation instruction, a store instruction, or the like.
- the instruction may also be an N-layer calculation of a neural network, where N > 0; N may be an integer or a non-integer.
- each instruction in the instruction list is arranged in an execution order, and the resource referenced by each instruction in the instruction list may be a virtual memory object or a physical memory object.
- the virtual memory object can be a virtual storage space in a software logic of a memory block, a register, or other storage device capable of storing data.
- the instruction scheduling process in this embodiment is a process of reordering the instructions in the instruction list while preserving the semantics of the original instruction list, which can make the arrangement of the instructions in the instruction list more compact and thereby improve the processing efficiency of the system by shortening the execution time of the instruction list.
- the instruction list includes N instructions, where N ≥ 1 and N is a positive integer, and the N instructions are marked as the first instruction, the second instruction, ..., the Nth instruction according to the execution order.
- the scheduling process of the instruction list is a process of reordering the above N instructions.
- the instruction list scheduling apparatus 310 may first obtain the data dependency of each instruction in the instruction list to be scheduled.
- the form of the data dependency may include RAW (Read After Write) / WAR (Write After Read) / WAW (Write After Write).
- the data dependency relationship may be described by a Data Dependence Graph (DDG).
- the second processor of the instruction list scheduling apparatus 310 may obtain the list of instructions to be scheduled by using the acquiring module, and perform data dependency analysis on the instructions in the instruction list to be scheduled by the data dependency analysis module to obtain the data dependency relationships between the instructions.
- specifically, the data dependency analysis module may perform resource scan tracking on each instruction in the instruction list to be scheduled, and then analyze the data dependencies between the instructions.
- the data dependency between the instructions in this embodiment refers to whether the execution of the current instruction needs to depend on the execution results of other instructions. For example, if an instruction A reads the data written by an instruction B, then the instruction A depends on the execution result of the instruction B. Afterwards, the acquiring module can obtain all the selection nodes for each instruction selection in the instruction scheduling process according to the data dependency relationships between the instructions.
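- By way of a hedged sketch of the resource scan tracking idea (not the patented implementation), the snippet below assumes each instruction exposes the set of memory objects it reads and writes, tracks the last writer and the readers of each object, and emits RAW/WAR/WAW edges; the object names m1, m2, o1, o2 are hypothetical:

```python
def build_ddg(instructions):
    """instructions: list of (name, reads, writes) with reads/writes as sets of resources."""
    last_writer, last_readers = {}, {}
    edges = []                                        # (src, dst, kind): dst depends on src
    for name, reads, writes in instructions:
        for r in reads:
            if r in last_writer:                      # read after write
                edges.append((last_writer[r], name, "RAW"))
        for w in writes:
            if w in last_writer:                      # write after write
                edges.append((last_writer[w], name, "WAW"))
            for reader in last_readers.get(w, []):    # write after read
                edges.append((reader, name, "WAR"))
        for r in reads:
            last_readers.setdefault(r, []).append(name)
        for w in writes:
            last_writer[w] = name
            last_readers[w] = []
    return edges

# the example of the text: L1->C1->S1 and L2->C2->S2, via shared memory objects m1/m2/o1/o2
insts = [
    ("L1", set(),   {"m1"}),
    ("L2", set(),   {"m2"}),
    ("C1", {"m1"},  {"o1"}),
    ("C2", {"m2"},  {"o2"}),
    ("S1", {"o1"},  set()),
    ("S2", {"o2"},  set()),
]
for src, dst, kind in build_ddg(insts):
    print(f"{dst} depends on {src} ({kind})")
```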
- the instruction list scheduling apparatus may determine, by the evaluation module, the instructions of each order in the scheduled instruction list from all the selected nodes of the corresponding order according to the preset rule.
- the second processor may use the evaluation module to evaluate the selection node corresponding to the current order, obtain an evaluation result of each selection node in the current order, and determine an instruction corresponding to the current order according to the evaluation result.
- Each selection node records the ordered instruction and the instruction set to be scheduled corresponding to the selected node.
- the evaluation module evaluates the selected node corresponding to the current order according to the priority of each instruction.
- the second processor may further set a priority for each instruction according to the specific content and/or type of the currently selected node.
- the instruction list scheduling apparatus 310 may adjust the first processor corresponding to the instruction in the instruction list to be scheduled.
- the first processor corresponding to the to-be-scheduled instruction may be determined according to the type of the instruction, or according to the specific content of the to-be-scheduled instruction.
- FIG. 15 is a flowchart of steps of an instruction list scheduling method according to an embodiment of the present application.
- the instruction list scheduling method can be applied to the computer system shown in FIG. 14.
- the above computer system can include a memory 330 and a plurality of first processors 320.
- the instruction list scheduling method is used to implement rescheduling of instructions in the instruction list corresponding to the plurality of first processors in the computer system to improve processing efficiency of the computer.
- the above method may include the following steps:
- Step S100: Acquire the to-be-scheduled instruction set in the to-be-scheduled instruction list, and perform data dependency analysis on the to-be-scheduled instruction set to obtain the data dependency relationships between the instructions in the to-be-scheduled instruction set.
- the second processor may obtain a to-be-scheduled instruction set of the to-be-scheduled instruction list through the acquiring module, and obtain a data dependency relationship of the foregoing instruction by using the data dependency analysis module.
- the to-be-scheduled instruction set in this embodiment is composed of multiple to-be-scheduled instructions in the to-be-scheduled instruction list.
- the to-be-scheduled instruction set does not include a non-semantic instruction (such as a synchronization instruction, etc.) in the to-be-scheduled instruction list.
- the step of acquiring the to-be-scheduled instruction set of the to-be-scheduled instruction list includes: obtaining a to-be-scheduled instruction list, deleting a non-semantic instruction in the to-be-scheduled instruction list, and obtaining a to-be-scheduled instruction set.
- the instruction set to be scheduled acquired by the acquisition module includes six instructions {L1, L2, C1, C2, S1, S2}.
- L1, C1, and S1 need to be executed sequentially
- L2, C2, and S2 need to be executed sequentially
- the rest of the instructions have no data dependency
- L1, L2, S1, and S2 are I/O instructions
- C1 and C2 are calculation instructions.
- the data dependency analysis module performs data dependency analysis on the to-be-scheduled instructions and obtains the data dependency relationships between the instructions in the instruction set to be scheduled; the DDG (Data Dependence Graph) shown in FIG. 16 is used to describe these data dependency relationships.
- the resource referenced by each to-be-scheduled instruction in the to-be-scheduled instruction list may be a virtual memory object or a physical memory object.
- the virtual memory object can be a virtual storage space in a software logic of a memory block, a register, or other storage device capable of storing data.
- Step S200 According to the data dependency relationship between the instructions, all the selected nodes that perform instruction selection in the instruction scheduling process are obtained.
- Each selection node records the ordered instruction and the set of instructions to be scheduled corresponding to the selected node.
- the process of obtaining all the selection nodes may be as follows: the second processor obtains, by using the acquiring module, all the first selection nodes for the first instruction selection; specifically, it obtains the sorted instruction and the to-be-scheduled instruction set corresponding to each first selection node. It should be clear that the instructions in these to-be-scheduled instruction sets have data dependencies. Then, the second processor acquires, by the acquiring module, all the second selection nodes associated with each first selection node according to the data dependency relationship of that first selection node, the second selection nodes corresponding to the second instruction selection; and so on for
- the third selection nodes, ..., the Nth selection nodes, where N ≥ 3 and N is a positive integer.
- the sum of the first selection node, ..., the Nth selection node acquired in the above steps constitutes all the selection nodes that are selected each time the instruction is selected.
- the acquired instruction set in the to-be-scheduled instruction list includes a total of six instructions {L1, L2, C1, C2, S1, S2}, and the data dependency relationships between the six instructions are represented by FIG. 16.
- it can be clearly seen from FIG. 16 that, among the six instructions, L1 and L2 can be executed independently of the other instructions. Therefore, when performing the first instruction selection, the selection must be made from L1 and L2; that is, the first selection nodes acquired
- correspond to the two cases of selecting the instruction L1 or the instruction L2. When L1 is selected at the first instruction selection, L1 is the sorted instruction.
- one first selection node records the sorted instruction L1 and the to-be-scheduled instruction set {L2, C1, C2, S1, S2} with the instruction L1 deleted.
- the other first selection node records the sorted instruction L2 and the to-be-scheduled instruction set {L1, C1, C2, S1, S2} with the instruction L2 deleted.
- the above process can be cycled to obtain the second selection node when the second instruction is selected, ..., the sixth selection node when the sixth instruction is selected.
- each later instruction selection is made from the to-be-scheduled instruction set recorded by the previously selected node. For example, for the to-be-scheduled instruction set corresponding to FIG. 16, if the instruction selected at the first instruction selection is L1 (corresponding to one of the first selection nodes), then in the to-be-scheduled instruction set {L2, C1, C2, S1, S2} of that first selection node, the instructions L2 and C1 do not depend on the execution of other instructions; at this time,
- the second instruction selection is made from L2 and C1 (two second selection nodes).
- if the instruction selected at the first instruction selection is L2 (corresponding to the other first selection node),
- then in the to-be-scheduled instruction set {L1, C1, C2, S1, S2} of that first selection node, the instructions L1 and C2 do not depend on the execution of other instructions; at this time, the second instruction selection is made from L1 and C2 (there are likewise two second selection nodes). It can be seen that there are associations between all the selection nodes obtained in this embodiment.
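- A minimal sketch of this selection node expansion, assuming the dependencies of the example (C1 after L1, S1 after C1, C2 after L2, S2 after C2): each node records the already-sorted instructions and the instructions still to be scheduled, and its children are obtained by choosing any remaining instruction whose dependencies are already sorted:

```python
from collections import namedtuple

# a selection node records the already-sorted instructions and the instructions still to schedule
Node = namedtuple("Node", ["sorted", "remaining"])

def ready(remaining, sorted_, deps):
    """Instructions in `remaining` whose dependencies are all already sorted."""
    done = set(sorted_)
    return [i for i in remaining if deps[i] <= done]

def expand(node, deps):
    """All selection nodes of the next order associated with `node`."""
    children = []
    for inst in ready(node.remaining, node.sorted, deps):
        children.append(Node(node.sorted + (inst,),
                             tuple(i for i in node.remaining if i != inst)))
    return children

# dependencies of the example: C1 after L1, S1 after C1, C2 after L2, S2 after C2
deps = {"L1": set(), "L2": set(),
        "C1": {"L1"}, "C2": {"L2"},
        "S1": {"C1"}, "S2": {"C2"}}
root = Node(tuple(), ("L1", "L2", "C1", "C2", "S1", "S2"))

first = expand(root, deps)          # two first selection nodes: choose L1 or L2
for n in first:
    print(n.sorted, "->", [c.sorted for c in expand(n, deps)])
# ('L1',) -> [('L1', 'L2'), ('L1', 'C1')]   second selection from L2 or C1
# ('L2',) -> [('L2', 'L1'), ('L2', 'C2')]   second selection from L1 or C2
```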
- Step S300 Determine, according to a preset rule, an instruction of each order in the list of instructions after scheduling according to a selection node of the corresponding order.
- the second processor may evaluate, by using the evaluation module, the selected node corresponding to the current order, obtain the evaluation result of each selected node in the current order, and determine an instruction corresponding to the current order according to the evaluation result.
- the current order is the second instruction.
- the four second selection nodes in FIG. 17 are evaluated according to a preset rule, and the second instruction in the scheduled instruction list is obtained according to the evaluation result.
- the evaluation module evaluates the selection nodes corresponding to the current order according to the preset priority of each instruction (for example, L2 has the highest priority, C1 the second, and so on), and the evaluation result is obtained.
- the second processor sets the priority of each instruction according to the specific content and/or type of the currently selected node.
- the evaluation module may determine the instruction corresponding to the current order according to the shortest execution times corresponding to all the selection nodes of the current order. For example, in FIG. 17, the shortest execution time of the instruction sequence corresponding to the first selection node for L1 is t1, the shortest execution time of the instruction sequence corresponding to the first selection node for L2 is t2, and t1 > t2; then L2 is determined as the first instruction in the scheduled instruction list. Similarly, the second instruction, ..., the sixth instruction of the scheduled instruction list are determined.
- the instruction list scheduling method in this embodiment determines all the selection nodes for each instruction selection in the instruction scheduling process by analyzing the data dependency relationships of the instructions to be scheduled, and then determines
- the instruction of each order in the scheduled instruction list according to the evaluation results of the selection nodes of the corresponding order. The method can ensure that the instruction selected at each instruction selection is the optimal result for the current state; in the scheduled instruction list obtained from these optimal results, the arrangement of the instructions is more compact, which shortens the execution time of the instruction sequence in the original instruction list.
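- The sketch below illustrates one possible form of this greedy selection, under assumptions the patent leaves open: a machine model with one IO unit and one compute unit, unit execution times, and a simple optimistic estimate of the shortest execution time of each candidate selection node; the instruction names and dependencies are those of the running example:

```python
UNIT = {"L1": "io", "L2": "io", "S1": "io", "S2": "io", "C1": "compute", "C2": "compute"}
DUR  = {i: 1 for i in UNIT}                      # assumed unit execution times
DEPS = {"L1": set(), "L2": set(), "C1": {"L1"}, "C2": {"L2"}, "S1": {"C1"}, "S2": {"C2"}}

def prefix_finish(sorted_insts):
    """Simulate the sorted prefix on one IO unit and one compute unit, in list order."""
    unit_free = {"io": 0, "compute": 0}
    finish = {}
    for i in sorted_insts:
        start = max([unit_free[UNIT[i]]] + [finish[d] for d in DEPS[i] if d in finish])
        finish[i] = start + DUR[i]
        unit_free[UNIT[i]] = finish[i]
    return finish, unit_free

def shortest_time(sorted_insts, remaining):
    """Optimistic bound: the remaining work of each unit runs back-to-back on that unit."""
    _, unit_free = prefix_finish(sorted_insts)
    for u in unit_free:
        unit_free[u] += sum(DUR[i] for i in remaining if UNIT[i] == u)
    return max(unit_free.values())

def greedy_schedule(all_insts):
    sorted_insts, remaining = [], list(all_insts)
    while remaining:
        done = set(sorted_insts)
        candidates = [i for i in remaining if DEPS[i] <= done]
        # pick the candidate whose selection node has the smallest shortest execution time
        best = min(candidates,
                   key=lambda i: shortest_time(sorted_insts + [i],
                                               [r for r in remaining if r != i]))
        sorted_insts.append(best)
        remaining.remove(best)
    return sorted_insts

print(greedy_schedule(["L1", "L2", "C1", "C2", "S1", "S2"]))
# ['L1', 'L2', 'C1', 'C2', 'S1', 'S2'] -- the two chains end up interleaved across the units
```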
- the step in which the evaluation module determines, according to the preset rule, the instruction of each order in the scheduled instruction list from the selection nodes of the corresponding order includes:
- Step S210 The evaluation module accesses the selection node, and acquires the longest execution time corresponding to the currently accessed selection node.
- the selection node accessed by the evaluation module may be a first selection node, a second selection node, ..., an Nth selection node.
- Step S220: If the longest execution time corresponding to the currently accessed selection node is less than the initial execution time T0, the sorted instruction of the currently accessed node is determined as the instruction of the corresponding order in the scheduled instruction list.
- the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- the maximum execution time corresponding to the currently selected selection node in this implementation step refers to the execution time when the arrangement of the instruction sequence corresponding to the current access node is the most unreasonable.
- because the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, the execution time of the instruction sequence obtained by the instruction list scheduling method in this embodiment is not greater than that of the instruction sequence in the instruction list to be scheduled.
- even if the instructions in the instruction list are finally not scheduled according to the selection of the current order, an adverse influence of the determined current-order instruction on the subsequent instruction selections can be avoided. The method is particularly suitable for scheduling an instruction list containing computationally intensive instructions, optionally an instruction list containing neural network operation instructions.
- for example, the instruction list contains N instructions, among which are a weight loading instruction A and a neural network convolutional layer calculation instruction B. With the conventional method, the instruction A and the instruction B may not be executed in parallel, so the highest processing efficiency of the system cannot be achieved; the instruction list scheduling scheme of this embodiment can make the instruction A and the instruction B parallel in the scheduled instruction list.
- the method may further include: if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, updating the initial execution time to the longest execution time corresponding to the currently accessed selection node. For example, in the above embodiment, when T1 < T0, L1 and L2 are respectively used as the first instruction and the second instruction in the scheduled instruction list, and the initial execution time is updated to T1.
- the sorted instruction corresponding to the currently accessed node is determined as the instruction of the corresponding order in the scheduled instruction list, which can guarantee
- that the execution time of the instruction sequence in the scheduled instruction list is shorter.
- the above scheme for updating the initial execution time is to further optimize the ordering of instructions and improve the processing efficiency of the system.
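- The acceptance-and-update rule of steps S210/S220 can be sketched as follows; the pessimistic bound used here (the not-yet-ordered instructions run strictly serially after the ordered prefix) and the numeric example are assumptions made purely for illustration:

```python
# Worst-case (longest) execution time of a selection node: assume the instructions that are
# not yet ordered end up completely serial after the already-ordered prefix.
def longest_time(prefix_time, remaining_durations):
    return prefix_time + sum(remaining_durations)

def visit(prefix_time, remaining_durations, node_sorted, t0, best):
    """Steps S210/S220 plus the update rule: accept the node if even its worst case beats T0."""
    t_worst = longest_time(prefix_time, remaining_durations)
    if t_worst < t0:
        best = node_sorted        # the node's sorted instructions become the scheduled prefix
        t0 = t_worst              # the initial execution time is updated to the new bound
    return best, t0

# hypothetical numbers: the prefix [L1, L2] occupies 2 time units, four unit-length instructions remain
best, t0 = visit(2, [1, 1, 1, 1], ["L1", "L2"], t0=8, best=None)
print(best, t0)   # ['L1', 'L2'] 6  -- accepted, and T0 is tightened from 8 to 6
```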
- the step of the evaluation module accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
- the selected node is accessed within a preset access time period to obtain the longest execution time corresponding to each selected node in the preset access time period.
- This embodiment needs to determine the instructions of each order of the post-schedule instruction list in combination with the method proposed in the above embodiment.
- the instruction list scheduling method proposed by the present application is intended to further shorten the execution time of the instruction list by rearranging the instructions in the instruction list. Based on this, the purpose of the present application is achieved as long as the new instruction list obtained by the instruction list scheduling method proposed by the present application shortens the execution time. Therefore, when the instruction list scheduling method proposed by the present application is actually used to perform instruction reordering, the access time period and the scheduling time of the control instructions are generally set according to actual needs.
- optionally, if the longest execution time corresponding to the currently accessed selection node is not less than the initial execution time, the instruction sequence in the instruction list to be scheduled is used as the instruction sequence in the scheduled instruction list.
- using the instruction sequence in the instruction list to be scheduled as the instruction sequence in the scheduled instruction list when the longest execution time corresponding to the currently accessed selection node is not less than the initial execution time is an optimization of the instruction list scheduling method proposed in the foregoing embodiment; it can guarantee that the instruction sequence obtained in the scheduled instruction list is the optimal result obtained within the preset time period.
- the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node is as follows:
- Step S230 The evaluation module acquires the shortest execution time corresponding to the currently accessed selected node.
- Step S240: If the shortest execution time corresponding to the currently accessed selection node is greater than the initial execution time T0, access to the selection nodes associated with the currently accessed selection node is terminated.
- for example, the shortest execution time of the second selection node corresponding to the instruction L2 is T2;
- T2 corresponds to the case where the unsorted instructions C1, C2, S1, and S2 corresponding to the selection node are perfectly parallel and the ordering is most reasonable. If T2 > T0, then there is no need to access the third selection nodes associated with this second selection node, nor the fourth selection nodes, ..., the sixth selection nodes associated with those third selection nodes.
- the technical solution of the embodiment can eliminate invalid access to the selected node and improve the scheduling efficiency of the instruction list.
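- A complementary sketch of the pruning rule of steps S230/S240; the optimistic bound here (the remaining instructions spread perfectly over a fixed number of execution units) and the numbers are again assumptions for illustration only:

```python
import math

def shortest_time(prefix_time, remaining_durations, num_units=2):
    """Best case: the remaining instructions are spread perfectly over the available units."""
    return prefix_time + math.ceil(sum(remaining_durations) / num_units)

def should_prune(prefix_time, remaining_durations, t0):
    """Steps S230/S240: if even the best case cannot beat T0, skip the subtree of this node."""
    return shortest_time(prefix_time, remaining_durations) > t0

# hypothetical numbers: the prefix already takes 5 time units, four unit-length instructions remain, T0 = 6
print(should_prune(5, [1, 1, 1, 1], t0=6))
# True -> do not visit the associated third, ..., sixth selection nodes
```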
- the step of the evaluation module accessing the selection node and obtaining the longest execution time corresponding to the selection node currently selected for access includes: the evaluation module selects the selection node to access according to a random-priority rule (e.g., Monte Carlo Tree Search, MCTS) and obtains the longest execution time corresponding to the selection node currently selected for access.
- the step of the evaluation module accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes: the evaluation module selects the
- selection node to access according to a breadth-first (BFS, Breadth First Search) rule and obtains the longest execution time corresponding to the selection node currently selected for access.
- the breadth priority in the embodiment refers to preferentially selecting a selection node in the same order as the currently accessed selection node for access. For example, if the second selection node is currently accessed, the next selected selection node preferentially selects other second selection nodes.
- the step of the evaluation module accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes: the evaluation module selects the
- selection node to access according to a depth-first (DFS, Depth First Search) rule and obtains the longest execution time corresponding to the selection node currently selected for access.
- the depth priority in this embodiment refers to preferentially selecting a selection node in the next order associated with the currently accessed selection node for access. For example, if the second selection node is currently accessed, the next visited selection node preferentially selects the third selection node associated with the second selection node.
- the evaluation module may further select the selected node to access by using a random preference combined with a depth-first rule, or select the selected node to access by using a breadth-first priority combined with a depth-first rule.
- specifically, the selection nodes whose order is smaller than a preset order are selected for access according to the breadth-first or random-priority rule, to obtain the longest execution time corresponding to the selection node currently selected for access; and the selection nodes whose order is not smaller than the preset order are
- selected for access according to the depth-first rule, to obtain the longest execution time corresponding to the selection node currently selected for access.
- the preset values of the corresponding order are determined according to empirical values, or determined according to pre-experiment results.
- in practice, the evaluation module of the instruction list scheduling apparatus usually does not have enough time to traverse all the selection nodes.
- if the selection node to access is selected only by the depth-first or breadth-first principle,
- the range of the selection nodes finally accessed may be rather one-sided (for example, only the selection nodes associated with a certain selection node, or only the selection nodes of the earlier orders, are accessed); if the selection node to access is selected only by random priority,
- the randomness of the selection nodes ultimately accessed is too strong. Therefore, it is preferable to select the selection nodes for access by combining the random-priority rule with the depth-first rule, or by combining the breadth-first rule with the depth-first rule.
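- The patent only names these access rules and their combination; one way they might be combined, sketched below with hypothetical node names and a preset order threshold, is to pick among same-order nodes (breadth with random tie-breaking) before the preset order and to go to the deepest frontier node from the preset order on:

```python
import random

def pick_next(frontier, current_order, preset_order, rng=random.Random(0)):
    """frontier: list of (order, node) candidates that can be visited next.

    Below the preset order, prefer nodes of the same order as the current one
    (breadth / random priority); from the preset order on, go to the deepest node
    (depth priority).
    """
    if current_order < preset_order:
        same_order = [c for c in frontier if c[0] == current_order]
        pool = same_order or frontier
        return rng.choice(pool)                 # breadth-first with random tie-breaking
    return max(frontier, key=lambda c: c[0])    # depth-first: visit the largest order

# hypothetical frontier: two second-order nodes and one third-order node
frontier = [(2, "node_a"), (2, "node_b"), (3, "node_c")]
print(pick_next(frontier, current_order=2, preset_order=4))   # one of the second-order nodes
print(pick_next(frontier, current_order=4, preset_order=4))   # (3, 'node_c')
```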
- FIG. 18 is a schematic structural diagram of an instruction list scheduling apparatus proposed in one embodiment; the apparatus includes an obtaining module 510, a data dependency analysis module 520, and an evaluation module 530, wherein the obtaining module 510 is configured to acquire the to-be-scheduled instruction set in the to-be-scheduled instruction list and, according to the data dependency relationships between the instructions, obtain all the selection nodes for each instruction selection in the instruction scheduling process.
- the data dependency analysis module 520 is configured to perform data dependency analysis on the instruction set to be processed, and obtain a data dependency relationship between the instructions in the instruction set to be scheduled.
- the evaluation module 530 is configured to determine, according to a preset rule, instructions in each order in the scheduled instruction list according to the selected nodes in the corresponding order.
- the evaluation module 530 accesses the selection node and obtains the longest execution time corresponding to the currently accessed selection node; if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, the sorted instruction of the currently accessed selection node is determined as the instruction of the corresponding order in the scheduled instruction list; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- the instruction scheduling apparatus further includes an update module configured to, when the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, update the initial execution time to the longest execution time corresponding to the currently accessed selection node.
- the evaluation module 530 is configured to access the selection node within a preset access time period and obtain the longest execution time corresponding to the currently accessed selection node; if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, the sorted instruction corresponding to the currently accessed node is determined as the instruction of the corresponding order in the scheduled instruction list; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- the evaluation module 530 is configured to use the instruction sequence in the instruction list to be scheduled as the instruction sequence in the scheduling instruction table when the maximum execution time corresponding to the currently accessed selection node is not less than the initial execution time.
- the evaluation module 530 is configured to select the selected node for access according to a random priority rule, and obtain a maximum execution time corresponding to the selected node currently selected for access.
- the evaluation module 530 is configured to select the selected node for access according to the breadth-first rule, and obtain the longest execution time corresponding to the selected node currently selected for access.
- the evaluation module 530 is configured to select the selected node for access according to a depth-first rule, and obtain a maximum execution time corresponding to the selected node currently selected for access.
- the evaluation module 530 is configured to select the selected node that is smaller than the preset order according to the breadth or random priority rule to obtain the longest execution time corresponding to the selected node currently selected for access; according to the depth The priority rule selects the selected node that is not less than the preset order to access, and obtains the longest execution time corresponding to the selected node currently selected for access.
- the evaluation module 530 is configured to obtain the shortest execution time corresponding to the currently accessed selection node; if the shortest execution time corresponding to the currently accessed selection node is greater than the initial execution time, access to the selection
- nodes associated with the currently accessed selection node is terminated; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- the evaluation module 530 is configured to evaluate all the selected nodes corresponding to the current order according to the preset priority of the instruction, obtain the evaluation results of the selection nodes of the current order, and determine the current order according to the evaluation result. Corresponding instructions.
- the evaluation module 530 is configured to set the priority of each instruction according to the specific content and/or type of the currently selected node.
- the evaluation module 530 is configured to determine an instruction corresponding to the current order according to the length of the shortest execution time corresponding to all the selected nodes in the current order.
- Each of the above-described instruction list scheduling devices may be implemented in whole or in part by software, hardware, and combinations thereof.
- Each of the above modules may be embedded in or independent of the processor in the computer device, or may be stored in a memory in the computer device in a software form, so that the processor invokes the operations corresponding to the above modules.
- a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in FIG. 19.
- the computer device includes a processor, memory, network interface, display screen, and input device connected by a system bus.
- the processor of the computer device is used to provide computing and control capabilities.
- the memory of the computer device includes a non-volatile storage medium, an internal memory.
- the non-volatile storage medium stores an operating system and a computer program.
- the internal memory provides an environment for operation of an operating system and computer programs in a non-volatile storage medium.
- the network interface of the computer device is used to communicate with an external terminal via a network connection.
- the computer program is executed by the processor to implement the instruction list scheduling method mentioned in the above embodiments.
- the display screen of the computer device may be a liquid crystal display or an electronic ink display screen
- the input device of the computer device may be a touch layer covered on the display screen, or may be a button, a trackball or a touchpad provided on the computer device casing. It can also be an external keyboard, trackpad or mouse.
- FIG. 19 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation of the computer device to which the solution of the present application is applied.
- the specific computer device may include more or fewer components than those shown in the figures, or combine some components, or have a different arrangement of components.
- a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the following steps are implemented: acquiring the to-be-scheduled instruction set in the to-be-scheduled instruction list, and performing data dependency analysis on the to-be-scheduled instruction set to obtain the data dependency relationships between the instructions; according to the data dependency relationships between the instructions, obtaining all the selection nodes for each instruction selection in the instruction scheduling process; and determining, according to a preset rule, the instruction of each order in the scheduled instruction list from the selection nodes of the corresponding order.
- the following steps are further performed: accessing the selection node, and obtaining the longest execution time corresponding to the currently accessed selection node; if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, determining the sorted instruction of the currently accessed node as the instruction of the corresponding order in the scheduled instruction list; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- when the processor executes the computer program, the following steps are further implemented: if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, updating the initial execution time to the longest execution time corresponding to the currently accessed selection node.
- when the processor executes the computer program,
- the following steps are further performed: if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, an instruction sequence is randomly generated based on the sorted instructions corresponding to the currently accessed node, and the instruction sequence of the instruction list to be scheduled is updated using the randomly generated instruction sequence.
- the following steps are further performed: accessing the selected node within a preset access time period, and obtaining a longest execution time corresponding to the currently accessed selected node; if the currently accessed selected node corresponds to The maximum execution time is less than the initial execution time, and the ordered instruction corresponding to the current access node is determined as the instruction of the corresponding order in the scheduled instruction list; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- the processor when executing the computer program, further implements the step of selecting the selected node for access in accordance with the breadth-first rule and obtaining the longest execution time corresponding to the selected node currently selected for access.
- the processor when executing the computer program, further implements the step of selecting the selected node for access according to a random-first rule and obtaining the longest execution time corresponding to the selected node currently selected for access.
- the processor, when executing the computer program, further implements the step of selecting the selection node for access in accordance with the depth-first rule and obtaining the longest execution time corresponding to the selected node currently selected for access.
- the following steps are further performed: selecting the selected node that is smaller than the preset order according to the breadth or random priority rule to obtain the longest execution time corresponding to the selected node currently selected for access. Selecting the selected node that is not less than the preset order according to the depth-first rule to obtain the longest execution time corresponding to the selected node currently selected for access.
- when the processor executes the computer program,
- the following steps are further performed: obtaining the shortest execution time corresponding to the currently accessed selection node; and if the shortest execution time corresponding to the currently accessed selection node is greater than the initial execution time, terminating access to the selection nodes associated with the currently accessed selection node; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- the processor executes the computer program, the following steps are further implemented: evaluating all the selected nodes corresponding to the current order according to the preset priority of the instruction, obtaining the evaluation results of the selected nodes of the current order, and according to the evaluation result Determine the instruction corresponding to the current order.
- the processor also implements the step of setting the priority of each instruction based on the specific content and/or type of the currently selected node when executing the computer program.
- the processor when executing the computer program, further implements the step of determining an instruction corresponding to the current order based on the length of the shortest execution time corresponding to all of the selected nodes in the current order.
- a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the following steps: acquiring the to-be-scheduled instruction set in the to-be-scheduled instruction list, and performing
- data dependency analysis on the to-be-scheduled instructions to obtain the data dependency relationships between the instructions; according to the data dependency relationships between the instructions, obtaining all the selection nodes for each instruction selection in the instruction scheduling process; and determining, according to a preset rule, from
- the selection nodes of the corresponding order, the instructions of each order in the scheduled instruction list.
- the following steps are further performed: accessing the selection node, and obtaining the longest execution time corresponding to the currently accessed selection node; if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, determining the sorted instruction of the currently accessed node as the instruction of the corresponding order in the scheduled instruction list; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- the following steps are further implemented: if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, the initial execution time is updated to be the longest corresponding to the currently accessed selection node. execution time.
- the following steps are further performed: accessing the selected node within a preset access time period, and obtaining a longest execution time corresponding to the currently accessed selected node; if the currently accessed selected node corresponds to The longest execution time is less than the initial execution time, and the ordered instruction corresponding to the current access node is determined as the instruction of the corresponding order in the scheduled instruction list; wherein, the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- the following steps are further implemented: if the longest execution time corresponding to the currently accessed selection node is not less than the initial execution time, the instruction sequence in the instruction list to be scheduled is used as the post-scheduling instruction list. The sequence of instructions in .
- the computer program when executed by the processor, further implements the step of selecting the selection node for access according to a random-first rule and obtaining the longest execution time corresponding to the selected node currently selected for access.
- the computer program when executed by the processor, further implements the step of selecting the selected node for access in accordance with a depth-first rule and obtaining the longest execution time corresponding to the selected node currently selected for access.
- the computer program when executed by the processor, further implements the step of selecting the selected node for access in accordance with the breadth-first rule and obtaining the longest execution time corresponding to the selected node currently selected for access.
- the following steps are further performed: selecting the selected node that is smaller than the preset order to access according to the breadth or random priority rule, and obtaining the longest execution corresponding to the selected node currently selected for access Time; selecting the selected node that is not less than the preset order according to the depth-first rule to obtain the longest execution time corresponding to the selected node currently selected for access.
- the following steps are further performed: obtaining the shortest execution time corresponding to the currently accessed selection node; and if the shortest execution time corresponding to the currently accessed selection node is greater than the initial execution time, terminating access to the selection nodes associated with the currently accessed selection node; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
- the following steps are further performed: evaluating all the selected nodes corresponding to the current order according to the preset priority of the instruction, obtaining the evaluation results of the selected nodes of the current order, and according to the evaluation The result determines the instruction corresponding to the current order.
- the computer program when executed by the processor, also implements the step of setting the priority of each instruction based on the specific content and/or type of the currently selected node.
- the computer program when executed by the processor, further implements the step of determining an instruction corresponding to the current order based on the length of the shortest execution time corresponding to all of the selected nodes in the current order.
- each computing node in the neural network model needs to be compiled and parsed separately, and then
- each computing node is executed in a certain order according to the structural form of the neural network model.
- the neural network model and the network structure may be artificial neural network model data that has been trained or not trained. The above processing method for the neural network affects the processing speed of the processor, and the processing efficiency is low.
- the embodiment of the present application further provides a method for generating an offline model, where the offline model generation method can be run on a cloud server or a neural network dedicated processor, and the obtained offline model of the original network is stored in the memory 130.
- the cloud server or neural network dedicated processor is a processor capable of executing heavyweight data such as a neural network, which may not be included in the above computer device.
- the foregoing method includes the following steps:
- specifically, the model data set and the model structure parameters of the original network may be obtained by using the acquisition module of the cloud server or the neural network dedicated processor, and through the model data set and the
- model structure parameters, the network structure diagram of the original network can be obtained.
- the model data set includes data such as model parameters corresponding to each computing node in the original network, and W1 to W6 in the neural network shown in FIG. 28 are used to represent model parameters of the computing node.
- the model structure parameters include the connection relationships of the plurality of computing nodes in the original network and the calculation attribute of each computing node, wherein the connection relationships between the computing nodes are used to indicate whether there is data transmission between the computing nodes; for example, when there is data flow between multiple computing nodes, it can be said that these computing nodes have connection relationships.
- the connection relationship of the computing nodes may include an input relationship, an output relationship, and the like.
- for example, if the output of the computing node F1 is used as an input of the computing nodes F4 and F5, it can be said that there is a connection relationship between the computing node F1 and the computing node F4, and a connection relationship between the computing node F1 and the computing node F5.
- for another example, if there is no data transmission between the computing node F1 and the computing node F2, it can be said that there is no connection relationship between the computing node F1 and the computing node F2.
- the calculation attribute of each computing node may include a calculation type and a calculation parameter of the corresponding calculation node, wherein the calculation type of the calculation node refers to what calculation is used by the calculation node, for example, the calculation type of the calculation node may include addition, subtraction, and Convolution operations and the like, correspondingly, the compute node may be a compute node for performing an add operation, a compute node for implementing a subtraction operation, a compute node for implementing a convolution operation, and the like.
- the calculation parameter of a computing node may be a necessary parameter required to complete the calculation type corresponding to that computing node.
- for example, when the computing node is a computing node for implementing an addition operation,
- the calculation parameter of the computing node may be an addend in the addition operation; the addend in the addition operation may be obtained as input data by the acquisition
- module, or the addend in the addition operation may be the output data of the previous computing node of this computing node, and the like.
- the original network may be an artificial neural network established for a general purpose processor such as a CPU, GPU or DSP based on a deep learning system such as TensorFlow, MXNet, Caffe, and PyTorch.
- the original network may also be an artificial neural network established for an intelligent processor such as an IPU.
- the model data set (caffemodel) and model structure parameters (prototxt) of the Caffe network can be obtained.
- the model data set (caffemodel) includes data such as model parameters of the Caffe network
- the model structure parameter (prototxt) includes calculation attributes of each computing node of the Caffe network and a connection relationship between a plurality of computing nodes.
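- Purely as an illustration of what these two inputs carry, the sketch below shows one hypothetical in-memory form of a model data set (weights per computing node, analogous to a caffemodel) and of model structure parameters (inputs and calculation attributes per node, analogous to a prototxt) for the network of FIG. 28; only the edges F1→F4, F1→F5 and F4, F5→F6 come from the text, while the remaining topology and the operation types are assumptions:

```python
# Hypothetical model data set: model parameters (weights) per computing node.
model_dataset = {"F1": "W1", "F2": "W2", "F3": "W3", "F4": "W4", "F5": "W5", "F6": "W6"}

# Hypothetical model structure parameters: inputs (connection relationships) and
# calculation attributes (calculation type and calculation parameters) per computing node.
model_structure = {
    "F1": {"inputs": ["X1", "X2"], "type": "conv", "params": {"kernel": 3}},
    "F2": {"inputs": ["X1", "X2"], "type": "conv", "params": {"kernel": 3}},
    "F3": {"inputs": ["X1", "X2"], "type": "conv", "params": {"kernel": 3}},
    "F4": {"inputs": ["F1", "F2"], "type": "add",  "params": {}},
    "F5": {"inputs": ["F1", "F3"], "type": "add",  "params": {}},
    "F6": {"inputs": ["F4", "F5"], "type": "add",  "params": {}},
}

def connected(a, b):
    """Two computing nodes have a connection relationship if one consumes the other's output."""
    return a in model_structure[b]["inputs"] or b in model_structure[a]["inputs"]

print(connected("F1", "F4"))   # True  -- the output of F1 is an input of F4
print(connected("F1", "F2"))   # False -- no data transmission between F1 and F2
```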
- the computing module of the cloud server or the neural network dedicated processor may run the original network according to the model data set of the original network and the model structure parameters, and obtain instructions corresponding to the respective computing nodes in the original network.
- the acquisition module of the cloud server or the neural network dedicated processor can also obtain the input data of the original network, and the operation module of the cloud server or the neural network dedicated processor can run the original network based on the input data of the original network, the network model data set, and the model structure
- parameters, and obtain the instructions corresponding to each computing node in the original network.
- the process of running the original network to obtain the instructions of the respective computing nodes is essentially a compilation process, and the compilation can be performed by a cloud server, a neural network dedicated processor, or a virtual device. That is, the cloud server, the neural network dedicated processor, or the virtual device runs the original network according to the model data set of the original network and the model structure parameters.
- the virtual device refers to a virtual processor running space in the memory space of the memory.
- running the original network in this embodiment means that the cloud server or the neural network dedicated processor runs a machine learning algorithm (such as a neural network algorithm) using the artificial neural network model data, and implements the target application of the algorithm
- (such as an artificial intelligence application like speech recognition) by performing a forward operation.
- the control module of the cloud server or the neural network dedicated processor may generate an offline model corresponding to the original network according to the model parameters and instructions corresponding to the respective computing nodes of the original network; for example, the control
- module of the cloud server or the neural network dedicated processor may store the model parameters and instructions corresponding to the respective computing nodes of the original network in the non-volatile second memory to implement the generation and storage of the offline model.
- the model parameters and instructions of the computing node are stored in one-to-one correspondence.
- the offline model corresponding to the original network can be directly obtained from the non-volatile memory, and the original network is run according to the offline model corresponding thereto, without performing online calculation on each computing node of the original network. Compile and get instructions to improve the speed and efficiency of the system.
- directly running the offline model corresponding to the original network means using the offline model to run the machine learning algorithm (such as a neural network algorithm) corresponding to the original network, and implementing the target application of the algorithm
- (such as an artificial intelligence application like speech recognition) by performing a forward operation.
- step S102 may include:
- the computing module of the cloud server or the neural network dedicated processor can obtain the execution order of each computing node in the original network according to the model structure parameters of the original network; more specifically, it can obtain the execution order of each computing node in the original network according to the connection relationships of the computing nodes in the original network.
- the input data of the calculation node F4 is the output data of the calculation node F1 and the output data of the calculation node F2
- the input data of the calculation node F6 is the output data of the calculation node F4 and the output data of the calculation node F5.
- the execution order of the computing nodes in the neural network shown in FIG. 28 may be F1-F2-F3-F4-F5-F6 or F1-F3-F2-F5-F4-F6, and the like.
- the computing nodes F1, F2, and F3 can be executed in parallel, and the computing nodes F4 and F5 can also be executed in parallel, and the execution order is not specifically limited herein.
- the computing module of the cloud server or the neural network dedicated processor may run the original network according to the execution order of the computing nodes in the original network to obtain the instructions corresponding to each computing node in the original network; that is, the cloud server or the neural network dedicated
- processor can compile the data of the model data set of the original network to obtain the instruction corresponding to each computing node. From the instruction corresponding to a computing node, it can be known which computing function that computing node is used to implement, that is, calculation attributes such as the calculation type and calculation parameters of the computing node can be obtained.
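- As a hedged sketch of deriving one valid execution order from the connection relationships (the inputs of F4 and F6 follow the text; the remaining edges reuse the assumed topology above), a node is scheduled only after every node it takes input from:

```python
from collections import deque

# Connection relationships of the computing nodes (inputs of each node). The inputs of F4 and
# F6 follow the text; the remaining edges are assumptions for illustration.
inputs = {
    "F1": [], "F2": [], "F3": [],
    "F4": ["F1", "F2"],
    "F5": ["F1", "F3"],
    "F6": ["F4", "F5"],
}

def execution_order(inputs):
    """One valid execution order: a node runs only after all nodes it takes input from."""
    pending = {n: len(deps) for n, deps in inputs.items()}
    consumers = {n: [] for n in inputs}
    for n, deps in inputs.items():
        for d in deps:
            consumers[d].append(n)
    ready = deque(n for n, c in pending.items() if c == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for c in consumers[n]:
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)
    return order

print(execution_order(inputs))   # ['F1', 'F2', 'F3', 'F4', 'F5', 'F6']
```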
- step S103 further includes:
- the computing module of the cloud server or the neural network dedicated processor can obtain the memory allocation manner of the original network according to the model data set of the original network and the model structure parameters. Further, the cloud server or the neural network dedicated processor may obtain the execution order of each computing node in the original network according to the model structure parameters of the original network, and determine the memory allocation mode of the current network according to the execution order of each computing node in the original network. For example, related data of each computing node during operation is saved to a stack in the execution order of each computing node.
- the memory allocation mode refers to determining a storage location of data (including input data, output data, model parameters, intermediate result data, and the like) related to each computing node in the original network in a memory space (such as the first memory).
- a data table may be used to store a mapping relationship between data (input data, output data, model parameters, intermediate result data, and the like) associated with each computing node and a memory space.
- S107 Store, according to a memory allocation manner of the original network, related data in the running process of the original network to the first memory, where the related data in the original network running process includes model parameters and instructions corresponding to the respective computing nodes of the original network.
- X1 and X2 represent input data of the neural network
- Y represents output data of the neural network
- a cloud server or a neural network dedicated processor can convert the output data of the neural network into control commands for controlling a robot or for different digital interfaces.
- W1 to W6 are used to represent model parameters corresponding to the calculation nodes F1, F2, and F3, and the output data of the calculation nodes F1 to F5 can be used as an intermediate calculation result.
- the cloud server or the neural network dedicated processor can store the related data in the running process of the original network to the first memory, such as an internal memory or a cache, according to the determined memory allocation manner; the specific storage manner can be seen in FIG. 29.
- the second memory may be a non-volatile memory such as an external memory.
- the offline model stored in the storage space in the right half of FIG. 29 is the offline model corresponding to the original network.
- a cloud server or a neural network dedicated processor can obtain the model data set, model structure parameters, and input data of the original network, so that a network structure diagram of the original network can be obtained according to the model data set and model structure parameters of the original network, as shown in Figure 9.
- the cloud server or the neural network dedicated processor can obtain the connection relationship of each computing node of the original network according to the model structure parameters of the original network, obtain the execution order of each computing node in the original network according to the connection relationship of each computing node, and obtain the memory allocation mode of the original network during the running process, so that the storage location of the relevant data of the original network during running can be obtained.
- the relevant data of the original network during operation can be stored in a stack in the order in which the respective compute nodes are executed.
- the cloud server or the neural network dedicated processor may store the model parameters and instructions corresponding to the respective computing nodes of the original network in the non-volatile second memory to generate an offline model; the storage manner of the offline model is shown in the right half of the storage space in FIG. 29.
- the offline model only includes data such as the model parameters and instructions necessary for running the original network, and does not need to store the input data, output data, or intermediate calculation results produced during the operation of the original network, thereby reducing the consumption of storage space in the second memory.
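- As an illustration only, exporting such an offline model might be sketched as follows; the field names, the JSON format, and the helper function are assumptions, not the patent's format:

```python
# Hypothetical sketch: persist only what the offline model needs (per-node
# instructions, model parameters, node interface data) and drop runtime-only
# data (inputs, outputs, intermediate results). Field names are illustrative.
import json

def export_offline_model(path, nodes):
    """nodes: {name: {"instruction": ..., "parameters": ..., "inputs": [...], "outputs": [...]}}"""
    offline_model = {
        name: {
            "instruction": spec["instruction"],
            "parameters": spec["parameters"],
            "interface": {"inputs": spec["inputs"], "outputs": spec["outputs"]},
        }
        for name, spec in nodes.items()
    }
    with open(path, "w") as f:   # the second memory: non-volatile storage
        json.dump(offline_model, f)

# Intermediate results such as the outputs of F1..F5 are intentionally not
# serialized, which keeps the offline model far smaller than the original network.
```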
- an artificial neural network is a kind of heavyweight data, which is composed of a large number of nodes (or neurons) connected to each other.
- the traditional computer device directly reads the neural network, and sequentially executes the computing nodes of the neural network in a certain manner according to the structural form of the neural network, and obtains the calculation result of the neural network. That is, the traditional computing device directly performs data processing on the heavyweight neural network, which will affect the data processing speed and efficiency of the computer device.
- if the memory or processing capability of a computer device is limited, it may not be able to run such heavyweight artificial neural network data at all, which will limit the application range of the neural network.
- an embodiment of the present application provides a computer device, which may include a hardware system and a software system, where the hardware system may include a first processor 110, a second processor 120, and a memory 130.
- the first processor 110 is configured to provide a computing and control capability, which may include a first obtaining module 111, a first computing module 113, a first control module 112, and the like.
- the first obtaining module 111 may be a hardware module such as an I/O (Input/Output) interface, and the first operation module 113 and the first control module 112 are also hardware modules.
- the first operation module 113 and the first control module 112 may be digital circuits or analog circuits or the like.
- the second processor 120 can also be used to provide calculation and control capabilities, and may include a second obtaining module, a second operation module, and a second control module. The second obtaining module may be a hardware module such as an I/O (Input/Output) interface, and the second operation module and the second control module are also hardware modules.
- the connection relationship and the configuration of the respective structures of the second processor 120 may be the same as the connection relationship and the configuration of the respective structures in the first processor. For details, refer to the description above, and details are not described herein again.
- the first processor or the second processor may be a general-purpose processor such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or a DSP (Digital Signal Processor), or a neural network dedicated processor such as an IPU (Intelligence Processing Unit).
- the memory 130 is configured to store a plurality of offline models and input data corresponding to the original network and a software system of the computer device.
- the software system of the computer device can include software such as an operating system, a computer program, application software, and a runtime system 131 that can run on the first processor 110 or the second processor 120.
- the memory 130 can also be used to store output data of each original network (ie, calculation results of respective original networks).
- the memory 130 may include a first storage module for storing the offline model, a second storage module for storing the input data, a third storage module for storing the output data, and a fourth storage module for storing the runtime system. Alternatively, the number of the memories 130 may be two or more.
- the number of the memory 130 may be two, which are respectively labeled as a first memory and a second memory, wherein the first memory is used to store an offline model and input corresponding to the original network. Data, the second memory is used to store the runtime system.
- the memory 130 may be a non-volatile memory such as a read only memory (ROM), a programmable ROM (PROM), an electrically programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory.
- runtime refers to the state in which a program is running (or being executed); that is, it indicates that a program is in a running state during a certain period of time.
- a runtime system is a process-level virtual machine that is used to represent the operating environment of a program.
- the runtime system may be a software system established by computer software, and the software system may run on a processor such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), or an IPU (Intelligence Processing Unit) to implement specific data processing functions.
- the runtime system in the embodiment of the present application is different from the operating system of the computer device, and the software system of the computer device can include the above-mentioned runtime system and operating system.
- the runtime system 131 in the embodiment of the present application can be run on the first processor 110.
- the runtime system 131 can include a data processing device 1310, a device management device 1314, and a task execution device 1315; both the data processing device 1310 and the device management device 1314 can be connected to the task execution device 1315.
- the runtime system 131 can control the second processor 120 to run heavyweight data such as a neural network; that is, the runtime system 131 can control the second processor 120 to perform computation according to the offline model and input data of the neural network to obtain the output data of the neural network.
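- A minimal sketch of how the three devices of such a runtime system could cooperate; all class and method names below are assumptions for illustration and do not reflect a concrete implementation:

```python
# Hypothetical sketch of the runtime system's parts: a data processing device
# that loads the offline model and input data from memory, a device management
# device that starts/stops the second processor, and a task execution device
# that runs the offline model on it.

class DataProcessingDevice:
    def __init__(self, memory):
        self.memory = memory

    def load(self, network_id):
        offline_model = self.memory.read(f"{network_id}/offline_model")
        input_data = self.memory.read(f"{network_id}/input_data")
        return offline_model, input_data

class DeviceManagementDevice:
    def __init__(self, second_processor):
        self.proc = second_processor

    def start(self):
        self.proc.power_on()

    def stop(self):
        self.proc.power_off()

class TaskExecutionDevice:
    def __init__(self, second_processor):
        self.proc = second_processor

    def run(self, offline_model, input_data):
        return self.proc.execute(offline_model, input_data)

class RuntimeSystem:
    def __init__(self, memory, second_processor):
        self.data = DataProcessingDevice(memory)
        self.devices = DeviceManagementDevice(second_processor)
        self.tasks = TaskExecutionDevice(second_processor)

    def run_network(self, network_id):
        model, inputs = self.data.load(network_id)  # 1. load from memory
        self.devices.start()                        # 2. start the second processor
        try:
            return self.tasks.run(model, inputs)    # 3. run the offline model
        finally:
            self.devices.stop()                     # 4. shut it down again
```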
- the data processing device 1310 is configured to obtain, from the memory 130, the offline model corresponding to the current original network and its input data; the offline model of the current original network is stored in correspondence with the input data of the current original network.
- the offline model corresponding to the current original network includes necessary network structure information such as the model parameters and instructions corresponding to each computing node in the current original network, and the interface data of each computing node in the current original network. Since the offline model of the current original network does not include the intermediate calculation results, input data, or output data of each computing node in the current original network, the data size of the offline model of the current original network is much smaller than the data magnitude of the current original network; that is, the offline model of the current original network can be considered lightweight data.
- the instruction corresponding to each computing node may be used to indicate which computing function is used by the computing node, and may specifically include computing attributes of respective computing nodes in the original network.
- the node interface data of the current original network is used to represent the connection relationship of each computing node of the current original network.
- the node interface data of the current original network may include an input data source and an output data source of each computing node. For example, as shown in FIG. 28, X1 and X2 are input data corresponding to the current original network, Y is output data corresponding to the current original network, and W1 to W6 are respectively model parameters corresponding to the computing nodes F1 to F3 in the current original network.
- the node interface data of the current original network may include an indication that the computing nodes F1, F2, and F3 are the starting computing nodes whose inputs are the preset input data, that the output data of the computing node F1 serves as the input data of the computing nodes F4 and F5, and so on. In this way, when the original network is run again, only the offline model and the input data of the current original network need to be obtained, and the running process of the current original network can be implemented by running the offline model corresponding to the current original network.
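- As a rough illustration, the node interface data could be represented as below; the edges beyond those explicitly named in the text (F1 feeding F4 and F5) are assumed, as is the dictionary layout:

```python
# Hypothetical sketch of node interface data for the network of FIG. 28: for
# each computing node, record where its inputs come from and where its outputs
# go. Edges other than F1 -> F4/F5 are assumptions for illustration only.
interface_data = {
    "F1": {"inputs": ["X1", "X2"], "outputs": ["F4", "F5"]},
    "F2": {"inputs": ["X1", "X2"], "outputs": ["F4"]},
    "F3": {"inputs": ["X1", "X2"], "outputs": ["F5"]},
    "F4": {"inputs": ["F1", "F2"], "outputs": ["F6"]},
    "F5": {"inputs": ["F1", "F3"], "outputs": ["F6"]},
    "F6": {"inputs": ["F4", "F5"], "outputs": ["Y"]},
}

# The starting computing nodes are those whose inputs are all external data.
start_nodes = [n for n, io in interface_data.items()
               if all(src.startswith("X") for src in io["inputs"])]
```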
- the device management device 1314 functions as a driving device of the second processor 120, which can be used to control the second processor 120 to be turned on or off. Wherein, when the second processor 120 is turned off, the second processor 120 does not perform any task, and when the second processor 120 is started, the second processor 120 can perform tasks such as calculation or control.
- the second processor 120 may be a neural network accelerator for executing an offline model of the current original network.
- the task execution device 1315 is configured to control the second processor 120 to run the offline model and input data of the current original network acquired by the data processing device 1310 to obtain output data of the current original network (ie, a calculation result of the neural network).
- running the offline model corresponding to the original network means using the offline model to run the machine learning algorithm (such as a neural network algorithm) corresponding to the original network, and implementing the target application of the algorithm (such as an intelligent application like speech recognition) by performing a forward operation.
- the runtime system 131 described above may run on the first processor 110, and the second processor 120 is controlled through the runtime system 131 to run heavyweight data such as a neural network. That is, when heavyweight data such as a neural network needs to be run on the computer device 100, the offline model corresponding to the current original network and its input data may first be acquired from the memory 130 by the data processing device 1310. After the loading of the offline model and input data corresponding to the current original network is completed, the device management device 1314 may control the second processor 120 to start. Thereafter, the task execution device 1315 can control the second processor 120 to run the offline model and input data of the current original network, thereby implementing the running process of the current original network and obtaining the calculation result of the current original network.
- since the offline model of the current original network only stores necessary network data such as the model parameters and instructions corresponding to each computing node in the current original network and the interface data of each computing node in the current original network, the data size of the offline model of the current original network is much smaller than the data magnitude of the current original network. Thus, by running the offline model of the current original network, the computer device can process heavyweight data such as a neural network, which expands the application range of neural networks. At the same time, by directly running the offline model corresponding to the original network on the computer device, there is no need to perform processing operations such as compiling each computing node in the original network, so the processing speed and efficiency of the computer device can be improved.
- the data processing device 1310 includes an offline model loading module 1311 and an input data loading module 1312.
- the offline model loading module 1311 is configured to obtain the offline model corresponding to the current original network from the memory 130, and parse the obtained offline model of the current original network to obtain the model parameters and instructions corresponding to each computing node in the current original network, as well as the interface data of each computing node in the current original network.
- the process of parsing the offline model of the current original network by the offline model loading module 1311 may further include a process of performing data preprocessing (such as data format conversion, normalization, etc.) on the offline model corresponding to the current original network.
- the input data loading module 1312 is configured to retrieve input data from the memory 130, which may be input data corresponding to the starting computing node of the original network. As shown in Figure 28, X1 and X2 serve as input data for the starting compute node of the original network. Further, the input data can be obtained by application software and stored in the memory 130.
- the application software can run on the first processor or the second processor. For example, the user can set the input data of the current original network through the interaction interface of the application software, and the runtime system can store the acquired input data of the current original network in the memory 130.
- the offline model loading module 1311 can also be used to obtain the loading progress of the offline model in real time
- the input data loading module 1312 can also be used to obtain the loading progress of the input data in real time.
- when the offline model loading module 1311 completes the loading of the offline model corresponding to the current original network (for example, the data loading ratio of the offline model reaches 100%), and the input data loading module 1312 completes the loading of the input data corresponding to the current original network (for example, the loading ratio of the input data reaches 100%), the offline model loading module 1311 and the input data loading module 1312 may send a data loading completion signal to the device management device 1314, so that the device management device 1314 can control the second processor 120 to start according to the data loading completion signal it receives. After the second processor 120 is started, the device management device 1314 may send a startup completion signal to the task execution device 1315, and the task execution device 1315 may control the second processor 120 to run the offline model of the current original network according to the startup completion signal it receives.
- in this way, the second processor can be started in advance to further increase the data processing speed and efficiency of the computer device.
- since the data magnitude of the offline model is generally greater than the data magnitude of the input data, the loading time required for the offline model may be longer than the loading time of the input data. Therefore, if the data loading ratio completed by the offline model loading module 1311 is greater than or equal to a first preset ratio (for example, 80%), a loading completion signal may be sent to the device management device 1314 to start the second processor 120 in advance.
- alternatively, after both the offline model loading module 1311 and the input data loading module 1312 complete loading, they can send a data loading completion signal to the device management device 1314, so that the device management device 1314 can control the second processor 120 to start according to the data loading completion signals it receives.
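- A minimal sketch of this early-start idea, assuming a callback that reports the offline model's loading ratio; the threshold and all names are illustrative:

```python
# Hypothetical sketch: signal the device management device as soon as the
# offline model's loading ratio reaches a first preset ratio (e.g. 80%),
# instead of waiting for 100% of both the offline model and the input data.
FIRST_PRESET_RATIO = 0.8

class LoadMonitor:
    def __init__(self, device_manager):
        self.device_manager = device_manager
        self.started = False

    def on_offline_model_progress(self, ratio):
        # ratio: fraction of offline model data already loaded (0.0 - 1.0)
        if not self.started and ratio >= FIRST_PRESET_RATIO:
            self.device_manager.start()  # start the second processor in advance
            self.started = True
```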
- the data processing device 1310 may further include an input data pre-processing module 1313 for pre-processing the input data (such as data format conversion, normalization, etc.) so that the second processor 120 can run the input data. Specifically, the input data loading module 1312 may send an input data loading completion signal to the input data pre-processing module 1313, and the input data pre-processing module 1313 may, according to the input data loading completion signal, perform data pre-processing operations such as normalization and format conversion on the input data corresponding to the current original network.
- the device management device 1314 can control the second processor 120 to start according to the offline model loading completion signal sent by the offline model loading module 1311 and the pre-processing completion signal sent by the input data pre-processing module 1313.
- the input data pre-processing module 1313 is further configured to store the output data obtained by the second processor 120 to the memory 130. Specifically, after the second processor 120 completes the execution of the offline model and input data of the current original network, the second processor 120 can transmit the output data (i.e., the calculation result) of the current original network to the input data pre-processing module 1313; the input data pre-processing module 1313 can perform pre-processing such as data format conversion on the output data of the current original network, and then store the output data of the current original network into the memory 130.
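- A minimal sketch of such pre- and post-processing, using numpy purely for illustration; the actual data format expected by the second processor is not specified here:

```python
# Hypothetical sketch of the input data pre-processing module: normalize the
# input data and convert it to the format assumed for the second processor,
# and convert output data back before storing it to memory.
import numpy as np

def preprocess_input(raw, dtype=np.float16):
    data = np.asarray(raw, dtype=np.float32)
    data = (data - data.mean()) / (data.std() + 1e-8)  # normalization
    return data.astype(dtype)                          # format conversion

def postprocess_output(result):
    # convert back to a host-friendly format before storing to the memory 130
    return np.asarray(result, dtype=np.float32).tolist()
```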
- the software system of the computer device 100 further includes application software and an operating system (such as an Android operating system, a Microsoft operating system, a Linux operating system, etc.), and the application software can run on the operating system or the above-mentioned runtime system.
- the operating system and the runtime system described above provide an executable environment for various applications.
- the operating system and application software may also be stored in memory 130, which may be run on first processor 110 or second processor 120.
- each device of the runtime system 131 can provide a secure API (Application Programming Interface) that can be invoked by the application software, so that the application software can obtain the offline model and input data of the current original network through the runtime system 131, and control the second processor 120 to run the offline model of the current original network to obtain the output data of the current original network.
- the data processing device 1310 can provide an offline model API and an input data API.
- the offline model loading module 1311 can provide an offline model API
- the input data loading module 1312 can provide an input data API.
- the application software can invoke the offline model API of the data processing device 1310, so that the offline model loading module 1311 can obtain the offline model corresponding to the current original network from the memory 130.
- the application software may invoke the input data API of the data processing device 1310, so that the input data loading module 1312 can obtain the input data corresponding to the current original network from the memory 130.
- the input data of the current original network can be obtained by application software. For example, the user can manually set the input data corresponding to the current original network through the interactive display interface of the application software.
- the application software can also invoke the offline model API and the input data API simultaneously, so that the offline model and the input data of the current original network can be loaded at the same time; this is for illustration only and does not limit the specific order of execution.
- the input data pre-processing module 1313 of the data processing device 1310 is also capable of providing a data pre-processing API. After the loading of the input data of the current original network is completed, the application software may invoke the data pre-processing API, so that the input data pre-processing module 1313 can pre-process the input data of the current original network and the second processor can run the input data of the current original network as described above.
- the device management device 1314 can provide a second processor driver API
- the task execution device 1315 can provide a second processor runtime API.
- the application software may start the second processor 120 by calling the second processor driver API provided by the device management device 1314.
- the application software may invoke the second processor running API provided by the task execution device 1315 to control the second processor 120 to execute the offline model and input data corresponding to the current original network, so as to obtain the output data of the current original network. After the execution of the offline model corresponding to the current original network is completed, the application software may close the second processor 120 by calling the second processor driver API.
- the application software may also invoke the data pre-processing API, so that the input data pre-processing module 1313 can perform data pre-processing on the output data of the current original network and store the output data of the current original network in the memory 130.
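- Taken together, the call sequence an application might follow could look like the sketch below; `runtime` and every API name and signature are assumptions, since the text names the APIs but not their signatures:

```python
# Hypothetical sketch of application software driving the runtime system's
# secure APIs: load the offline model and input data, pre-process, start the
# second processor, run the model, store the result, and shut the processor down.
def run_current_network(runtime, network_id, raw_input):
    model = runtime.offline_model_api(network_id)         # offline model API
    data = runtime.input_data_api(network_id, raw_input)  # input data API
    data = runtime.data_preprocess_api(data)              # data pre-processing API
    runtime.processor_driver_api(action="start")          # second processor driver API
    try:
        output = runtime.processor_run_api(model, data)   # second processor running API
        runtime.data_preprocess_api(output, store=True)   # post-process and store output
        return output
    finally:
        runtime.processor_driver_api(action="stop")       # close the second processor
```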
- the number of the second processors 120 may be more than one; the task execution device 1315 may also provide a task allocation API and may be configured to control the plurality of second processors 120 to implement task assignment and scheduling among the plurality of second processors 120.
- the application software may select a target second processor that executes the current task from the plurality of second processors 120 by calling a task assignment API provided by the task execution device 1315. After the offline model of the current original network and the loading of the input data are completed, the application software may start the target second processor by calling a second processor driver API corresponding to the target second processor.
- the application software may invoke the second processor running API corresponding to the target second processor provided by the task executing device 1315 to control the target second processor to execute the offline model corresponding to the current original network. And input data.
- the target second processor may be shut down by calling a second processor driver API corresponding to the target second processor.
- the second processor 120 may be a multi-core processor, that is, the second processor 120 may include multiple processing modules.
- the task execution device 1315 can be configured to control a plurality of processing modules of the plurality of second processors 120 to implement task allocation and scheduling between the plurality of processing modules of the plurality of second processors 120.
- the application software may select a target processing module that executes the current task from among the plurality of processing modules in the second processor 120 by calling the task assignment API provided by the task execution device 1315. After the offline model of the current original network and the loading of the input data are completed, the application software may start the target processing module by calling a second processor driver API corresponding to the target processing module.
- the application software may invoke the second processor running API corresponding to the target processing module to control the target processing module to execute the offline model and the input data corresponding to the current original network.
- the target processing module may be closed by calling a second processor driver API corresponding to the target processing module.
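- A minimal sketch of selecting and using a target among several second processors (or among the processing modules of one multi-core second processor); the least-loaded selection policy and all names are assumptions:

```python
# Hypothetical sketch: the task allocation API picks a target from the
# candidates, which is then started, used and closed through its own driver
# and running APIs.
def run_on_target(task_executor, device_manager, candidates, model, data):
    # pick the least-loaded candidate as the target for the current task
    target = min(candidates, key=lambda p: p.pending_tasks)
    device_manager.start(target)                       # driver API for the target
    try:
        return task_executor.run(target, model, data)  # running API for the target
    finally:
        device_manager.stop(target)                    # close the target afterwards
```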
- the runtime system 131 can be a secure runtime system built on a trusted operating environment.
- the runtime system 131 can be a runtime system built on a TEE (Trusted Execution Environment).
- TEE can construct a runtime system that is isolated from non-secure software systems such as the operating system, thereby implementing software isolation and ensuring the security of the offline model of the original network as well as its input data and output data.
- the above application software may be a secure application such as TA, and the secure application software such as the TA may run on a runtime system based on TEE.
- the storage space of the memory 130 can be divided into a secure storage space and a non-secure storage space.
- the storage space for storing the offline model and the input data of the current original network is a secure storage space
- the storage space for storing the software system such as the operating system and the application software is a non-secure storage space
- the runtime system can be stored in the secure storage space.
- the memory 130 can also be a secure memory.
- the above runtime system, TA and secure storage space constitute a complete TEE operating environment.
- the number of the memories 130 may be two or more; one of the memories 130 may serve as a secure storage space for storing the offline model and input data of the current original network, and another memory 130 can be used as a non-secure storage space for storing software systems such as the operating system and application software. Further, the operating system, application software, and the like can also be stored in the secure storage space.
- the secure storage space in the embodiment of the present application refers to a trusted storage space, which may be an encrypted storage space; specifically, a symmetric encryption algorithm, an asymmetric encryption algorithm, or a random encryption algorithm (such as using a random password generator to obtain a password) may be adopted. The secure storage space may also be a storage space encrypted by a fingerprint or the like.
- the above secure runtime system 131 and application software can also be obtained by an encryption algorithm.
- the secure storage space may be a secure storage space obtained by a trusted metric method, and the secure runtime system 131 and application software may also be obtained by a trusted metric method.
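- Purely as an illustration of a symmetric-key encrypted storage space, the sketch below uses the third-party `cryptography` package; the text does not prescribe any particular algorithm or library:

```python
# Hypothetical sketch: encrypt the offline model before writing it into the
# secure storage space, and decrypt it when loading. The key would in practice
# be held by the trusted environment rather than kept in plain Python code.
from cryptography.fernet import Fernet

key = Fernet.generate_key()
cipher = Fernet(key)

def secure_store(path, offline_model_bytes):
    with open(path, "wb") as f:
        f.write(cipher.encrypt(offline_model_bytes))

def secure_load(path):
    with open(path, "rb") as f:
        return cipher.decrypt(f.read())
```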
- the first processor 110 can also be a security chip, such as a TPM (Trusted Platform Module), a TCM (Trusted Cryptography Module), or a TPCM (Trusted Platform Control Module).
- the second processor 120 may also be a security chip such as TPM, TCM or TPCM.
- the computer device of the embodiment of the present application may further include only a processor and a memory, where the processor is a multi-core processor.
- the processor can include a plurality of processing modules.
- the processor includes a first processing module and a second processing module, wherein the runtime system can run on the first processing module.
- the runtime system may include a data processing device, a device management device, and a task execution device, where the data processing device is configured to acquire, from the memory, the offline model and input data corresponding to the current original network; the offline model corresponding to the current original network includes the model parameters and instructions corresponding to each computing node in the original network, as well as the interface data of each computing node in the original network.
- the device management device is configured to control the second processing module to be started or shut down
- the task execution device is configured to control the second processing module to run the offline model of the current original network and input data.
- the embodiment of the present application further provides a data processing method, which is used in the computer device shown in FIG. 20, to implement processing of heavyweight data such as a neural network by using an offline model and to improve the data processing speed and efficiency of the computer device. The method includes the following steps:
- control the data processing device to acquire the offline model and input data corresponding to the current original network from the memory; the offline model corresponding to the current original network includes the model parameters and instructions corresponding to each computing node in the original network.
- the offline model corresponding to the current original network and the input data can be read from the memory by the data processing device 1310 of the runtime system 131.
- the offline model corresponding to the current original network may be obtained from the memory 130 by the offline model loading module 1311 of the data processing device 1310.
- the input data is retrieved from the memory 130 by the input data loading module 1312, which may be the input data corresponding to the starting computing node of the original network.
- control, by the device management device, the second processor of the computer device to start. Specifically, the second processor can be controlled to be turned on or off by the device management device 1314 of the runtime system 131. That is, after the offline model loading module 1311 completes the loading of the offline model corresponding to the current original network, and the input data loading module 1312 completes the loading of the input data corresponding to the current original network, the offline model loading module 1311 and the input data loading module 1312 may send a data loading completion signal to the device management device 1314, so that the device management device 1314 can control the second processor 120 to start according to the data loading completion signal it receives.
- control, by the task execution device, the second processor of the computer device to run the current original network according to the offline model and input data corresponding to the current original network, and obtain the output data of the current original network.
- the second processor 120 can be controlled by the task execution device 1315 of the runtime system 131 to run an offline model of the current original network.
- running the offline model corresponding to the original network means using the offline model to run the machine learning algorithm (such as a neural network algorithm) corresponding to the original network, and implementing the target application of the algorithm (such as an intelligent application like speech recognition) by performing a forward operation.
- store, by the data processing device, the output data of the current original network into the memory.
- the output data of the current original network may be stored into the memory 130 by the data processing device 1310.
- specifically, the input data pre-processing module 1313 of the data processing device 1310 can perform pre-processing operations such as data format conversion on the output data of the current original network, and then store it into the memory 130.
- step S110 may further include the following steps:
- the offline model of the current original network may be parsed by the offline model loading module 1311 to obtain model parameters and instructions corresponding to each computing node in the current original network, and interfaces of each computing node in the current original network. data. Further, the offline model loading module 1311 may perform preprocessing operations such as data format conversion, normalization, and the like on the parsed data.
- S112 Perform pre-processing on the obtained input data of the current original network, such as data format conversion and normalization of the input data.
- the input data may be pre-processed (such as data format conversion, normalization, etc.) by the input data pre-processing module 1313 to enable the second processor 120 to run the input data.
- the above method may further include the following steps:
- the loading progress of the offline model corresponding to the current original network is obtained in real time; specifically, the offline model loading module 1311 can obtain the loading progress of the offline model corresponding to the current network in real time, and the loading progress of the offline model can be represented by using a data ratio or a remaining duration. .
- if the loading progress of the offline model is greater than or equal to a first preset ratio, the step of controlling the second processor of the computer device to start is performed.
- the first preset ratio may be 80% to 100%.
- the offline model loading module 1311 may send a data loading completion signal to the device management device 1314, so that the device management device 1314 can control the second processor 120 to start according to the data loading completion signal it receives.
- when the data loading ratio completed by the offline model loading module 1311 is greater than or equal to the first preset ratio (for example, 80%), a loading completion signal may be sent to the device management device 1314 to start the second processor 120 in advance.
- the required loading time of the offline model may be greater than the loading time of the input data. Therefore, whether to activate the second processor 120 may be determined based only on the loading progress of the offline model. Further, the input data loading module 1312 can also obtain the loading progress of the input data in real time.
- after both the offline model loading module 1311 and the input data loading module 1312 complete loading, they may send a data loading completion signal to the device management device 1314, so that the device management device 1314 can control the second processor 120 to start according to the data loading completion signal it receives.
- the embodiment of the present application further provides another data processing method, which is used in the computer device shown in FIG. 20, to implement processing of heavyweight data such as a neural network by using an offline model and to improve the data processing speed and efficiency of the computer device. The method includes the following steps:
- S210 Call the offline model API to obtain the offline model corresponding to the current original network.
- specifically, the application software may invoke the offline model API provided by the offline model loading module 1311, so that the offline model loading module 1311 can read the offline model corresponding to the current original network from the memory 130; the offline model corresponding to the current original network includes the model parameters and instructions corresponding to each computing node in the current original network, and the interface data of each computing node in the current original network. The offline model generation process can be referred to the description above.
- the application software may call the input data API provided by the input data loading module 1312 and obtain the input data of the current original network from the memory 130 through the input data loading module 1312. Further, the application software may also invoke the data pre-processing API provided by the input data pre-processing module 1313 and perform pre-processing operations such as data format conversion and normalization on the input data obtained by the input data loading module 1312 through the input data pre-processing module 1313, so that the second processor 120 can run the input data of the current original network.
- S220 Call the second processor driver API to control the second processor in the computer device to start.
- the application software can invoke the second processor driver API provided by the device management module 1314, and the second processor 120 is controlled to be started by the device management module 1314.
- S230 Call the second processor running API, and control the second processor to obtain the output data of the current original network according to the offline model and input data corresponding to the current original network.
- the application software can invoke the second processor running API provided by the task executing device 1315, and the task executing device 1315 controls the second processor 120 to obtain the output data of the current original network according to the offline model and the input data corresponding to the current original network. .
- S240 Call a second processor driver API to control the second processor to be turned off.
- the application software can invoke the second processor driver API provided by the device management module 1314, and the second processor 120 is controlled to be turned off by the device management module 1314.
- Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory can include random access memory (RAM) or external cache memory.
- RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
- the computer device, the data processing method, and the storage medium described above can directly obtain the offline model and input data corresponding to the current original network from the memory, so that the second processor of the computer device can run the current original network according to the offline model and input data of the original network to obtain the output data of the current original network. Since the offline model corresponding to each original network only includes the model parameters and instructions corresponding to each computing node in the original network and the interface data of each computing node in the original network, the data size of the offline model of the original network is much smaller than the data magnitude of the original network, so that the computer device can process heavyweight neural network data by running the offline model corresponding to the current original network. At the same time, by directly running the offline model corresponding to the current original network on the computer device, there is no need to perform processing operations such as compiling each computing node in the current original network, so the processing speed and efficiency of the computer device can be improved.
- the computer device 200 may include a first processor 210, a second processor 220, a first memory 230, and a second memory 240, wherein the first memory 230 stores offline models and input data corresponding to a plurality of original networks as well as a runtime system capable of running on the first processor 210, and the second memory 240 stores an operating system capable of running on the first processor or the second processor.
- the first memory 230 and the second memory 240 described above may be two memories that are physically independent of each other.
- the first memory 230 and the second memory 240 may be integrated as a whole, and the first memory 230 and the second memory 240 are two storage spaces that are logically independent of each other.
- the number of the first processors 210 may be two or more.
- the number of the first processors 210 is two, one of the first processors 210 is used to run the above-described secure runtime system 231, and the other first processor 210 is used to run the operating system.
- the foregoing first processor 210 may be a multi-core processor, which may include two or more processing modules; one of the processing modules may be used to run the above runtime system 231, and another processing module is used to run the above operating system. In this way, the computer device can be divided into a secure operating environment and a non-secure operating environment by hardware isolation.
- the first processor 210 may be implemented by using a security chip such as TCM, TPM or TPCM.
- the above-mentioned runtime system is a secure runtime system established based on a trusted operating environment.
- the runtime system 231 may be a runtime system established based on a TEE (Trusted Execution Environment).
- the TEE can construct a runtime system that is isolated from non-secure software systems such as the operating system, thereby implementing software isolation and ensuring the security of the offline model of the original network as well as its input data and output data.
- the secure runtime system 231 can be obtained by an encryption algorithm or by a trusted metric.
- the first memory 230 is a secure storage medium.
- when the runtime system 231 is running on the first processor 210, it can obtain the offline model and input data corresponding to the current original network from the first memory 230, and control the second processor 220 to run the offline model corresponding to the current original network.
- the security in the embodiment of the present application refers to Trusted, which can be implemented by using a preset encryption algorithm.
- specifically, a symmetric encryption algorithm, an asymmetric encryption algorithm, or a random encryption algorithm (such as using a random password generator to obtain a password) may be used; encryption by fingerprint or the like is also possible.
- security can also be achieved through trusted metrics.
- the runtime system 231 can provide a security API (Application Programming Interface) that can be invoked by the application software.
- the API mainly includes key management, cryptographic algorithms, and secure storage.
- the above-described runtime system 231 may include a data processing device, a device management device, and a task execution device, the structure of which is similar to that of the above-described runtime system 131, as shown in Figs. 22 and 23.
- the data processing device can provide an offline model API and an input data API, and is used to obtain an offline model and input data corresponding to the current original network from the first memory 230.
- the offline model corresponding to the current original network includes the model parameters and instructions corresponding to each computing node in the original network, as well as the interface data of each computing node.
- the device management device can provide a second processor driver API for controlling the second processor 220 to be turned on or off.
- the task execution device can provide a second processor execution API for controlling the second processor 220 to run an offline model of the current original network and input data.
- the data processing apparatus includes an offline model loading module and an input data loading module.
- the offline model loading module is configured to provide an offline model API for obtaining an offline model corresponding to each current original network from the first memory 230, and parsing an offline model corresponding to the current original network.
- the input data loading module is capable of providing an input data API for obtaining input data corresponding to the current original network from the first memory 230.
- the data processing apparatus further includes an input data pre-processing module capable of providing a data pre-processing API for pre-processing the input data acquired by the input data loading module, so that the second processor 220 can run the input data of the current original network; the input data pre-processing module is also used to store the output data obtained by the second processor 220 to the first memory 230.
- the number of the second processors 220 is multiple, or the second processor 220 includes multiple processing modules; the task execution device can also provide a task allocation API for controlling the plurality of second processors 220, or controlling A plurality of processing modules of the second processor 220.
- the computer device further includes secure application software (TA, Trusted Application) that can run on the runtime system 231, and the application software can invoke the offline model API, the input data API, the second processor driver API, and the second processor running API.
- the secure application software can be implemented by an encryption algorithm or by a trusted metric.
- the embodiment of the present application further provides a data processing method, which is used in the computer device shown in FIG. 30, and the method includes the following steps:
- S310 Obtain the offline model and input data corresponding to the current original network from the first memory, where the offline model corresponding to the current original network includes the model parameters and instructions corresponding to each computing node in the current original network, and the interface data of each computing node in the current original network.
- the secure runtime system 231 can obtain the offline model and input data corresponding to the current original network from the secure first memory 230.
- the offline model corresponding to the current original network and the input data may be read from the first memory 230 by the data processing device of the runtime system 231.
- the offline model corresponding to the current original network may be acquired from the first memory 230 by the offline model loading module of the data processing device.
- the input data is obtained from the first memory 230 by the input data loading module, and the input data may be input data corresponding to the starting computing node of the original network.
- control the second processor of the computer device to start.
- the secure runtime system 231 described above can control the second processor 220 of the computer device to boot.
- the device management device of the runtime system 231 can control the second processor to be turned on or off.
- the offline model loading module may send a data loading completion signal to the device management device, so that the device management device can control the second processor 220 to start according to the data loading completion signal it receives.
- control the second processor of the computer device to run the current original network according to the offline model and input data corresponding to the current original network, and obtain the output data of the current original network.
- the runtime system 231 can control the second processor 220 of the computer device to run the offline model and its corresponding input data to obtain output data of the current original network.
- the second processor 220 can be controlled by the task execution device of the runtime system 231 to run an offline model of the current original network.
- running the offline model corresponding to the original network means using the offline model to run the machine learning algorithm (such as a neural network algorithm) corresponding to the original network, and implementing the target application of the algorithm (such as an intelligent application like speech recognition) by performing a forward operation.
- the runtime system 231 can store the output data of the current original network into the secure first memory 230.
- the output data of the current original network may be stored into the first memory 230 by the data processing device of the runtime system 231.
- specifically, the input data pre-processing module of the data processing device can perform pre-processing operations such as data format conversion on the output data of the current original network, and then store it into the first memory 230.
- the embodiment of the present application further provides a data processing method, which is used in the computer device shown in FIG. 32, and the method includes the following steps:
- S410 Call the offline model API to obtain the offline model corresponding to the current original network; the offline model corresponding to the current original network includes the model parameters and instructions corresponding to each computing node in the current original network, and the interface data of each computing node in the current original network.
- S420 Call the input data API to obtain the input data of the current original network; specifically, the secure application software may invoke the input data API and obtain the input data of the current original network from the first memory 230 through the input data loading module.
- S430 Call the second processor driver API to control the second processor in the computer device to start.
- S440 Call the second processor running API, and control the second processor to obtain the output data of the current original network according to the offline model and input data corresponding to the current original network.
- the secure application software can invoke the second processor to run the API to control, by the task execution device, the second processor 220 to obtain the output data of the current original network according to the offline model and the input data corresponding to the current original network.
- S450 Call a second processor driver API to control the second processor to be turned off.
- the secure application software can invoke the second processor driver API to control the second processor 220 to be turned off by the device management module.
- the above method further includes the following steps:
- the data pre-processing API is called to store the output data of the current original network into the first memory.
- specifically, the secure application software can invoke the data pre-processing API provided by the runtime system 231 to perform data format conversion, normalization, and the like on the output data through the input data pre-processing module of the data processing device, and store the output data of the current original network in the first memory 230.
- the method further includes the following steps:
- the data preprocessing API is invoked to preprocess the acquired input data of the current original network, so that the second processor can run the input data.
- the secure application software may also invoke the data pre-processing API provided by the input data pre-processing module to perform data format conversion, normalization, and the like on the input data through the input data pre-processing module, so that the second processor 220 can run the input data of the current original network.
- the embodiment of the present application may further include an offline model generation process, where the offline model generation process may run on a cloud server or a neural network dedicated processor, and the obtained offline model of the original network is stored in the first memory 230.
- the cloud server or neural network dedicated processor is a processor capable of executing heavyweight data such as a neural network, which may not be included in the above computer device.
- Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory can include random access memory (RAM) or external cache memory.
- RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
- the data size of the offline model of the current original network is much smaller than the data magnitude of the current original network, so that by running the offline model of the current original network, a secure runtime system established based on a trusted execution environment such as TEE can also process heavyweight data such as a neural network, which expands the application range of neural networks. At the same time, since there is no need to perform processing operations such as compiling each computing node in the original network, the processing speed and efficiency of the computer device can be improved.
Claims (15)
- 1. A task parallel processing method, comprising: constructing a task directed acyclic graph (DAG) according to dependencies between tasks to be executed; distributing each of the tasks to be executed to a plurality of work queues of a processor according to the task DAG; and controlling, according to the dependencies of the tasks to be executed in the task DAG, the parallel tasks to be executed in each of the work queues to start running.
- 2. The method according to claim 1, wherein before the step of constructing the task DAG according to the dependencies between the tasks to be executed, the method comprises: splitting a program according to operation nodes and/or data nodes in the program to obtain the tasks to be executed.
- 3. The method according to claim 2, wherein the step of splitting the program according to the operation nodes in the program to obtain the tasks to be executed comprises: if the program includes an operation request with a model, splitting the model of the operation request with a model and/or splitting the input data of the model to obtain the tasks to be executed.
- 4. The method according to claim 3, wherein the step of splitting the model of the operation request with a model to obtain the tasks to be executed comprises: setting a weight corresponding to each of the tasks to be executed obtained by splitting the model; and using each of the weights to set a correspondence between the input data and the output data of the tasks to be executed.
- 5. The method according to claim 3, wherein the step of splitting the model of the operation request with a model to obtain the tasks to be executed comprises: splitting the model of the operation with a model in the window direction and/or the channel direction of the model according to a preset rule to obtain the tasks to be executed.
- 6. The method according to claim 3, wherein the step of splitting the input data of the operation request with a model to obtain the tasks to be executed comprises: splitting the input data of the operation with a model in the window direction of the data according to a preset rule to obtain the tasks to be executed.
- 7. The method according to claim 2, wherein the step of splitting the program according to the operation nodes in the program to obtain the tasks to be executed comprises: if the program includes an operation request without a model, splitting the input data and/or output data of the operation request without a model to obtain the tasks to be executed.
- 8. The method according to claim 7, wherein the step of splitting the input data and/or output data of the operation request without a model to obtain the tasks to be executed comprises: splitting the input data and/or output data in the window direction of the data according to a preset rule to obtain the tasks to be executed.
- 9. The method according to claim 1, wherein the step of constructing the task DAG according to the dependencies between the tasks to be executed comprises: determining parallel nodes and sequential nodes in the task DAG according to the obtained dependencies between the tasks to be executed; and constructing the task DAG according to the parallel nodes and the sequential nodes.
- 10. The method according to any one of claims 1-9, wherein the step of distributing each of the tasks to be executed to the plurality of work queues of the processor according to the task DAG comprises: performing topological sorting on the task DAG to obtain a task topological sorting sequence; sorting the obtained topological sorting sequence according to a preset execution time of each of the tasks to be executed to obtain a longest topological sorting sequence; and distributing each of the tasks to be executed to the work queues according to the longest topological sorting sequence and the dependencies between the tasks to be executed.
- 11. The method according to any one of claims 1-9, wherein the step of controlling the parallel tasks to be executed in each of the work queues to start running according to the dependencies of the tasks to be executed in the task DAG comprises: setting a reference count for each of the tasks to be executed according to the task DAG; if a depended-on task to be executed has been executed, modifying the reference count of the task to be executed that depends on it; and when the reference count of a task to be executed reaches a preset value, controlling the tasks to be executed in each of the work queues whose reference counts have reached the preset value to start running.
- 12. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-11.
- 13. A task parallel processing system, comprising a memory, a multi-core processor, and a computer program stored in the memory and executable on the processor, wherein the multi-core processor is capable of running a splitting algorithm, and the multi-core processor, when executing the computer program, implements the steps of the method according to any one of claims 1-11.
- 14. A task parallel processing system, comprising a memory, a first processor, and a second processor, wherein the first processor is capable of running a splitting algorithm and the second processor is a multi-core processor, and the first processor and the second processor, when executing the computer program, implement the steps of the method according to any one of claims 1-11.
- 15. A task parallel processing apparatus, comprising a DAG construction module, a task distribution module, and a scheduling control module, wherein the DAG construction module is configured to construct a task directed acyclic graph (DAG) according to dependencies between tasks to be executed; the task distribution module is configured to distribute each of the tasks to be executed to a plurality of work queues of a processor according to the task DAG; and the scheduling control module is configured to control, according to the dependencies of the tasks to be executed in the task DAG, the parallel tasks to be executed in each of the work queues to start running.
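- Claims 1, 10 and 11 together describe building a DAG from task dependencies, distributing the tasks to several work queues, and using reference counts to decide when a queued task may start. The sketch below illustrates that idea only; the round-robin queue assignment, data structures, and names are assumptions, not the claimed implementation:

```python
# Hypothetical sketch: give every task a reference count equal to the number of
# tasks it waits on, distribute tasks to work queues, and let a queue head start
# only once its count reaches the preset value (here 0).
from collections import defaultdict, deque

def schedule(tasks, deps, num_queues=2):
    """tasks: list of task ids; deps: {task: [tasks it depends on]}."""
    tasks = list(tasks)
    refcount = {t: len(deps.get(t, [])) for t in tasks}  # set a reference count per task
    children = defaultdict(list)
    for t, parents in deps.items():
        for p in parents:
            children[p].append(t)

    queues = [deque() for _ in range(num_queues)]
    for i, t in enumerate(tasks):                        # distribute tasks to work queues
        queues[i % num_queues].append(t)

    order, finished = [], 0
    while finished < len(tasks):
        progressed = False
        for q in queues:
            if q and refcount[q[0]] == 0:                # queue head has no unmet dependency
                t = q.popleft()
                order.append(t)
                finished += 1
                for c in children[t]:                    # a dependency completed: lower counts
                    refcount[c] -= 1
                progressed = True
        if not progressed:
            # a real scheduler would scan past blocked queue heads; this sketch does not
            raise RuntimeError("cyclic dependency or blocked queue head")
    return order

# Example: F4 depends on F1 and F2, F5 on F1 and F3, F6 on F4 and F5.
print(schedule(["F1", "F2", "F3", "F4", "F5", "F6"],
               {"F4": ["F1", "F2"], "F5": ["F1", "F3"], "F6": ["F4", "F5"]}))
```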
Priority Applications (8)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020197037907A KR102569086B1 (ko) | 2017-11-20 | 2018-09-28 | 태스크 병렬 처리 방법, 장치, 시스템, 기억 매체 및 컴퓨터 기기 |
| EP19210491.7A EP3651020A1 (en) | 2017-11-20 | 2018-09-28 | Computer equipment, data processing method, and storage medium |
| EP18878728.7A EP3614260A4 (en) | 2017-11-20 | 2018-09-28 | PROCESS, APPARATUS AND SYSTEM FOR PARALLEL TASK PROCESSING, INFORMATION MEDIA AND COMPUTER DEVICE |
| JP2019568198A JP7074777B2 (ja) | 2017-11-20 | 2018-09-28 | タスク並列処理方法、装置、システム、記憶媒体およびコンピュータ機器 |
| US16/575,344 US11221877B2 (en) | 2017-11-20 | 2019-09-18 | Task parallel processing method, apparatus and system, storage medium and computer device |
| US16/702,502 US11113103B2 (en) | 2017-11-20 | 2019-12-03 | Task parallel processing method, apparatus and system, storage medium and computer device |
| US16/702,491 US11360811B2 (en) | 2017-11-20 | 2019-12-03 | Task parallel processing method, apparatus and system, storage medium and computer device |
| US16/705,190 US11113104B2 (en) | 2017-11-20 | 2019-12-05 | Task parallel processing method, apparatus and system, storage medium and computer device |
Applications Claiming Priority (8)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201711157341.XA CN109814986B (zh) | 2017-11-20 | 2017-11-20 | 任务并行处理方法、存储介质、计算机设备、装置和系统 |
| CN201711157341.X | 2017-11-20 | ||
| CN201711484410.8 | 2017-12-29 | ||
| CN201711484410.8A CN109992307B (zh) | 2017-12-29 | 2017-12-29 | 指令列表调度方法、装置、计算机设备及存储介质 |
| CN201810083577.1 | 2018-01-29 | ||
| CN201810084077.X | 2018-01-29 | ||
| CN201810083577.1A CN110097179B (zh) | 2018-01-29 | 2018-01-29 | 计算机设备、数据处理方法及存储介质 |
| CN201810084077.XA CN110097180B (zh) | 2018-01-29 | 2018-01-29 | 计算机设备、数据处理方法及存储介质 |
Related Child Applications (4)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/575,344 Continuation US11221877B2 (en) | 2017-11-20 | 2019-09-18 | Task parallel processing method, apparatus and system, storage medium and computer device |
| US16/702,491 Continuation US11360811B2 (en) | 2017-11-20 | 2019-12-03 | Task parallel processing method, apparatus and system, storage medium and computer device |
| US16/702,502 Continuation US11113103B2 (en) | 2017-11-20 | 2019-12-03 | Task parallel processing method, apparatus and system, storage medium and computer device |
| US16/705,190 Continuation US11113104B2 (en) | 2017-11-20 | 2019-12-05 | Task parallel processing method, apparatus and system, storage medium and computer device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019095873A1 true WO2019095873A1 (zh) | 2019-05-23 |
Family
ID=66540014
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2018/108298 Ceased WO2019095873A1 (zh) | 2017-11-20 | 2018-09-28 | 任务并行处理方法、装置、系统、存储介质及计算机设备 |
Country Status (5)
| Country | Link |
|---|---|
| US (4) | US11221877B2 (zh) |
| EP (2) | EP3651020A1 (zh) |
| JP (1) | JP7074777B2 (zh) |
| KR (1) | KR102569086B1 (zh) |
| WO (1) | WO2019095873A1 (zh) |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111782426A (zh) * | 2020-07-10 | 2020-10-16 | 上海淇毓信息科技有限公司 | 一种处理客户端任务的方法、装置和电子设备 |
| WO2020263587A1 (en) * | 2019-06-26 | 2020-12-30 | Amazon Technologies, Inc. | Neural network operation reordering for parallel execution |
| CN114185673A (zh) * | 2021-12-10 | 2022-03-15 | 阿波罗智能技术(北京)有限公司 | 分布式资源调度方法、装置及系统 |
| CN114499958A (zh) * | 2021-12-24 | 2022-05-13 | 东软睿驰汽车技术(沈阳)有限公司 | 控制方法及装置、车辆及存储介质 |
| CN115309521A (zh) * | 2022-07-25 | 2022-11-08 | 哈尔滨工业大学(深圳) | 面向海上无人设备的深度强化学习任务调度方法及装置 |
| US20230153565A1 (en) * | 2020-08-13 | 2023-05-18 | Samsung Electronics Co., Ltd. | Method and system of dnn modularization for optimal loading |
| EP4088259A4 (en) * | 2020-01-07 | 2023-05-24 | Argo AI, LLC | METHOD AND SYSTEM FOR CONSTRUCTING STATIC ORIENTED ACYCLIC GRAPHS |
| US20240161125A1 (en) * | 2022-10-31 | 2024-05-16 | Tata Consultancy Services Limited | Method and system for data regulations-aware cloud storage and processing service allocation |
Families Citing this family (27)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10853079B2 (en) * | 2018-09-26 | 2020-12-01 | Side Effects Software Inc. | Dependency-based streamlined processing |
| CN112016666B (zh) * | 2019-05-31 | 2025-12-12 | 微软技术许可有限责任公司 | 深度学习模型的执行 |
| US11175898B2 (en) * | 2019-05-31 | 2021-11-16 | Apple Inc. | Compiling code for a machine learning model for execution on a specialized processor |
| KR102147912B1 (ko) | 2019-08-13 | 2020-08-25 | 삼성전자주식회사 | 프로세서 칩 및 그 제어 방법들 |
| CN112465129B (zh) * | 2019-09-09 | 2024-01-09 | 上海登临科技有限公司 | 片内异构人工智能处理器 |
| CN112463709B (zh) * | 2019-09-09 | 2025-01-10 | 苏州登临科技有限公司 | 可配置的异构人工智能处理器 |
| WO2021150952A1 (en) * | 2020-01-23 | 2021-07-29 | Spero Devices, Inc. | Data flow architecture for processing with memory computation modules |
| US20220036158A1 (en) * | 2020-07-29 | 2022-02-03 | Apple Inc. | Task skew management for neural processor circuit |
| US11561826B1 (en) * | 2020-11-12 | 2023-01-24 | Xilinx, Inc. | Scheduling processing of machine learning tasks on heterogeneous compute circuits |
| US11847490B2 (en) * | 2021-02-18 | 2023-12-19 | Dell Products L.P. | Intelligent workload scheduling using a ranking of sequences of tasks of a workload |
| CN113283742A (zh) * | 2021-05-21 | 2021-08-20 | 建信金融科技有限责任公司 | 一种任务分配方法和装置 |
| US11537374B1 (en) * | 2021-06-03 | 2022-12-27 | Oracle International Corporation | System and method for hot method call graph analysis |
| CN115686766A (zh) * | 2021-07-28 | 2023-02-03 | 深圳富联富桂精密工业有限公司 | 自动化任务排程方法、电子设备及存储介质 |
| CN113703775B (zh) * | 2021-08-31 | 2023-11-28 | 上海阵量智能科技有限公司 | 一种编译方法、装置、设备及存储介质 |
| CN114168275B (zh) * | 2021-10-28 | 2022-10-18 | 厦门国际银行股份有限公司 | 任务调度方法、系统、终端设备及存储介质 |
| US12333340B1 (en) | 2021-10-29 | 2025-06-17 | Zoox, Inc. | Data processing pipeline horizontal scaling |
| CN114169427B (zh) * | 2021-12-06 | 2022-10-04 | 北京百度网讯科技有限公司 | 基于端到端自适应的分布式训练方法、装置、设备 |
| US20230205592A1 (en) * | 2021-12-23 | 2023-06-29 | Intel Corporation | Asymmetric tuning |
| JP7621557B2 (ja) * | 2022-04-26 | 2025-01-24 | 三菱電機株式会社 | 情報処理装置、推論システム、及び制御方法 |
| CN115114028B (zh) * | 2022-07-05 | 2023-04-28 | 南方电网科学研究院有限责任公司 | 一种电力仿真二次控制的任务分配方法及装置 |
| CN118154077A (zh) * | 2022-12-06 | 2024-06-07 | 鼎捷软件股份有限公司 | 基于数据驱动的执行系统及其执行方法 |
| US20240311195A1 (en) * | 2023-03-16 | 2024-09-19 | Salesforce, Inc. | Parallelism with task dependencies in a curated experience |
| KR102715702B1 (ko) * | 2023-03-30 | 2024-10-11 | 주식회사 딥이티 | 인공지능 모델에 대한 연산 및 메모리 최적화 장치 및 방법 |
| CN116243984A (zh) * | 2023-03-31 | 2023-06-09 | 昆仑芯(北京)科技有限公司 | 数据处理装置、方法、电子设备和存储介质 |
| CN116339958B (zh) * | 2023-05-30 | 2023-09-08 | 支付宝(杭州)信息技术有限公司 | 一种任务执行方法、装置以及设备 |
| CN119294315B (zh) * | 2024-12-10 | 2025-02-18 | 奕行智能科技(广州)有限公司 | 一种并行队列调度电路的验证方法 |
| CN120407124B (zh) * | 2025-06-27 | 2025-10-10 | 北京稀宇极智科技有限公司 | 一种数据链路优化处理方法及数据链路优化处理装置 |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020144101A1 (en) * | 2001-03-30 | 2002-10-03 | Hong Wang | Caching DAG traces |
| CN102012844A (zh) * | 2010-11-29 | 2011-04-13 | 上海大学 | 一种面向cmp系统的线程调度方法 |
| CN103077006A (zh) * | 2012-12-27 | 2013-05-01 | 浙江工业大学 | 一种基于多线程的长事务并行执行方法 |
Family Cites Families (60)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6243696B1 (en) * | 1992-11-24 | 2001-06-05 | Pavilion Technologies, Inc. | Automated method for building a model |
| US5768594A (en) * | 1995-07-14 | 1998-06-16 | Lucent Technologies Inc. | Methods and means for scheduling parallel processors |
| US5937037A (en) * | 1998-01-28 | 1999-08-10 | Broadpoint Communications, Inc. | Communications system for delivering promotional messages |
| US7903806B1 (en) * | 2000-01-05 | 2011-03-08 | Canoga Perkins Corp. | Expert call analyzer and next generation telephony network configuration system |
| US7117045B2 (en) * | 2001-09-08 | 2006-10-03 | Colorado State University Research Foundation | Combined proportional plus integral (PI) and neural network (nN) controller |
| US7958507B2 (en) * | 2005-06-16 | 2011-06-07 | Hewlett-Packard Development Company, L.P. | Job scheduling system and method |
| US8010954B2 (en) * | 2007-02-14 | 2011-08-30 | The Mathworks, Inc. | Parallel programming interface to dynamically allocate program portions |
| US8239844B2 (en) * | 2007-02-14 | 2012-08-07 | The Mathworks, Inc. | Method of using parallel processing constructs and dynamically allocating program portions |
| US8255889B2 (en) * | 2007-02-14 | 2012-08-28 | The Mathworks, Inc. | Method of using parallel processing constructs and dynamically allocating program portions |
| JP5545288B2 (ja) * | 2009-02-18 | 2014-07-09 | 日本電気株式会社 | タスク割当装置、タスク割当方法、及び、タスク割当プログラム |
| US8250576B2 (en) * | 2009-09-30 | 2012-08-21 | Microsoft Corporation | Structured task hierarchy for a parallel runtime |
| US9262228B2 (en) * | 2010-09-23 | 2016-02-16 | Microsoft Technology Licensing, Llc | Distributed workflow in loosely coupled computing |
| US9760348B2 (en) * | 2010-11-29 | 2017-09-12 | Microsoft Technology Licensing, Llc | Verification of a dataflow representation of a program through static type-checking |
| US8792939B2 (en) * | 2011-01-03 | 2014-07-29 | Michelle Fisher | Non-wireless bidirectional communication between a mobile device and associated secure element using an audio port |
| US9135065B1 (en) * | 2011-08-31 | 2015-09-15 | The Mathworks, Inc. | Parallel processing of multidimensional arrays |
| US8966457B2 (en) * | 2011-11-15 | 2015-02-24 | Global Supercomputing Corporation | Method and system for converting a single-threaded software program into an application-specific supercomputer |
| WO2013101246A1 (en) * | 2011-12-31 | 2013-07-04 | Intel Corporation | Processor that detects when system management mode attempts to reach program code outside of protected space |
| US9122523B2 (en) * | 2012-05-03 | 2015-09-01 | Nec Laboratories America, Inc. | Automatic pipelining framework for heterogeneous parallel computing systems |
| US9275355B2 (en) * | 2012-09-24 | 2016-03-01 | International Business Machines Corporation | Business process model analyzer and runtime selector |
| US9332083B2 (en) * | 2012-11-21 | 2016-05-03 | International Business Machines Corporation | High performance, distributed, shared, data grid for distributed Java virtual machine runtime artifacts |
| US20140282572A1 (en) * | 2013-03-14 | 2014-09-18 | Samsung Electronics Co., Ltd. | Task scheduling with precedence relationships in multicore systems |
| US9934043B2 (en) * | 2013-08-08 | 2018-04-03 | Linear Algebra Technologies Limited | Apparatus, systems, and methods for providing computational imaging pipeline |
| US10089142B2 (en) * | 2013-08-21 | 2018-10-02 | Hasso-Plattner-Institut Fur Softwaresystemtechnik Gmbh | Dynamic task prioritization for in-memory databases |
| US9304749B2 (en) * | 2013-09-12 | 2016-04-05 | Marvell World Trade Ltd. | Method and system for instruction scheduling |
| US9747547B2 (en) * | 2013-10-22 | 2017-08-29 | In2H2 | Hardware enhancements to radial basis function with restricted coulomb energy learning and/or k-Nearest Neighbor based neural network classifiers |
| US9576072B2 (en) * | 2014-02-13 | 2017-02-21 | Sap Se | Database calculation using parallel-computation in a directed acyclic graph |
| US20150242741A1 (en) * | 2014-02-21 | 2015-08-27 | Qualcomm Incorporated | In situ neural network co-processing |
| US9652286B2 (en) * | 2014-03-21 | 2017-05-16 | Oracle International Corporation | Runtime handling of task dependencies using dependence graphs |
| US9799088B2 (en) * | 2014-08-21 | 2017-10-24 | Qualcomm Incorporated | Render target command reordering in graphics processing |
| CN104239137B (zh) | 2014-08-21 | 2017-12-08 | 东软集团股份有限公司 | 基于dag节点最优路径的多模型并行调度方法及装置 |
| SG11201701588QA (en) * | 2014-09-02 | 2017-03-30 | Ab Initio Technology Llc | Executing graph-based program specifications |
| US9442760B2 (en) * | 2014-10-03 | 2016-09-13 | Microsoft Technology Licensing, Llc | Job scheduling using expected server performance information |
| US10163420B2 (en) * | 2014-10-10 | 2018-12-25 | DimensionalMechanics, Inc. | System, apparatus and methods for adaptive data transport and optimization of application execution |
| US10061577B2 (en) * | 2014-10-14 | 2018-08-28 | Electric Cloud, Inc. | System and method for optimizing job scheduling within program builds |
| AU2015363241A1 (en) * | 2014-12-18 | 2017-06-29 | Exxonmobil Upstream Research Company | Scalable scheduling of parallel iterative seismic jobs |
| CN106156810B (zh) | 2015-04-26 | 2019-12-03 | 阿里巴巴集团控股有限公司 | 通用机器学习算法模型训练方法、系统和计算节点 |
| US20160335119A1 (en) * | 2015-05-12 | 2016-11-17 | minds.ai inc | Batch-based neural network system |
| US9690555B2 (en) * | 2015-06-29 | 2017-06-27 | International Business Machines Corporation | Optimization of application workflow in mobile embedded devices |
| US10102391B2 (en) * | 2015-08-07 | 2018-10-16 | Qualcomm Incorporated | Hardware enforced content protection for graphics processing units |
| JP6636630B2 (ja) * | 2015-10-28 | 2020-01-29 | グーグル エルエルシー | 計算グラフの修正 |
| US10268461B2 (en) * | 2015-11-23 | 2019-04-23 | International Business Machines Corporation | Global data flow optimization for machine learning programs |
| US10331495B2 (en) * | 2016-02-05 | 2019-06-25 | Sas Institute Inc. | Generation of directed acyclic graphs from task routines |
| US11144587B2 (en) * | 2016-03-08 | 2021-10-12 | Shutterstock, Inc. | User drawing based image search |
| US10795725B2 (en) * | 2016-03-24 | 2020-10-06 | Fuji Xerox Co., Ltd. | Image processing device, image processing method, and non-transitory computer readable medium for image processing |
| US20180018610A1 (en) * | 2016-07-14 | 2018-01-18 | Lendinghome Corp. | Systems and methods for optimizing parallel task completion |
| AU2017321776A1 (en) * | 2016-08-31 | 2019-03-07 | Apple Inc. | Systems and methods of swimming analysis |
| US10152349B1 (en) * | 2016-09-27 | 2018-12-11 | Juniper Networks, Inc. | Kernel scheduling based on precedence constraints and/or artificial intelligence techniques |
| US11157814B2 (en) * | 2016-11-15 | 2021-10-26 | Google Llc | Efficient convolutional neural networks and techniques to reduce associated computational costs |
| US10089567B2 (en) * | 2016-12-15 | 2018-10-02 | At&T Intellectual Property I, L.P. | Method and apparatus for providing a communications service using a low powered radio tag |
| US10503775B1 (en) * | 2016-12-28 | 2019-12-10 | Shutterstock, Inc. | Composition aware image querying |
| US11748625B2 (en) * | 2016-12-30 | 2023-09-05 | Intel Corporation | Distributed convolution for neural networks |
| CN107103113B (zh) | 2017-03-23 | 2019-01-11 | 中国科学院计算技术研究所 | 面向神经网络处理器的自动化设计方法、装置及优化方法 |
| US10719760B2 (en) * | 2017-04-09 | 2020-07-21 | Intel Corporation | Neural network scheduling mechanism |
| US10795836B2 (en) * | 2017-04-17 | 2020-10-06 | Microsoft Technology Licensing, Llc | Data processing performance enhancement for neural networks using a virtualized data iterator |
| US10643297B2 (en) * | 2017-05-05 | 2020-05-05 | Intel Corporation | Dynamic precision management for integer deep learning primitives |
| CN107341127B (zh) | 2017-07-05 | 2020-04-14 | 西安电子科技大学 | 基于OpenCL标准的卷积神经网络加速方法 |
| US10817310B2 (en) * | 2017-09-01 | 2020-10-27 | Ab Initio Technology Llc | Executing graph-based program specifications |
| US10586052B1 (en) * | 2017-10-04 | 2020-03-10 | EMC IP Holding Company LLC | Input/output (I/O) inspection methods and systems to detect and defend against cybersecurity threats |
| US11227214B2 (en) * | 2017-11-14 | 2022-01-18 | Advanced Micro Devices, Inc. | Memory bandwidth reduction techniques for low power convolutional neural network inference applications |
| US10452843B2 (en) * | 2018-01-11 | 2019-10-22 | ArecaBay, Inc. | Self-adaptive application programming interface level security monitoring |
2018
- 2018-09-28 EP EP19210491.7A patent/EP3651020A1/en not_active Ceased
- 2018-09-28 WO PCT/CN2018/108298 patent/WO2019095873A1/zh not_active Ceased
- 2018-09-28 KR KR1020197037907A patent/KR102569086B1/ko active Active
- 2018-09-28 JP JP2019568198A patent/JP7074777B2/ja active Active
- 2018-09-28 EP EP18878728.7A patent/EP3614260A4/en not_active Ceased
2019
- 2019-09-18 US US16/575,344 patent/US11221877B2/en active Active
- 2019-12-03 US US16/702,502 patent/US11113103B2/en active Active
- 2019-12-03 US US16/702,491 patent/US11360811B2/en active Active
- 2019-12-05 US US16/705,190 patent/US11113104B2/en active Active
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020144101A1 (en) * | 2001-03-30 | 2002-10-03 | Hong Wang | Caching DAG traces |
| CN102012844A (zh) * | 2010-11-29 | 2011-04-13 | 上海大学 | 一种面向cmp系统的线程调度方法 |
| CN103077006A (zh) * | 2012-12-27 | 2013-05-01 | 浙江工业大学 | 一种基于多线程的长事务并行执行方法 |
Cited By (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2020263587A1 (en) * | 2019-06-26 | 2020-12-30 | Amazon Technologies, Inc. | Neural network operation reordering for parallel execution |
| US11016775B2 (en) | 2019-06-26 | 2021-05-25 | Amazon Technologies, Inc. | Neural network operation reordering for parallel execution |
| CN114026571A (zh) * | 2019-06-26 | 2022-02-08 | 亚马逊技术股份有限公司 | 用于并行执行的神经网络操作重新排序 |
| US11567778B2 (en) | 2019-06-26 | 2023-01-31 | Amazon Technologies, Inc. | Neural network operation reordering for parallel execution |
| US12013898B2 (en) | 2020-01-07 | 2024-06-18 | Ford Global Technologies, Llc | Method and system for constructing static directed acyclic graphs |
| EP4088259A4 (en) * | 2020-01-07 | 2023-05-24 | Argo AI, LLC | METHOD AND SYSTEM FOR CONSTRUCTING STATIC ORIENTED ACYCLIC GRAPHS |
| CN111782426B (zh) * | 2020-07-10 | 2023-09-22 | 上海淇毓信息科技有限公司 | 一种处理客户端任务的方法、装置和电子设备 |
| CN111782426A (zh) * | 2020-07-10 | 2020-10-16 | 上海淇毓信息科技有限公司 | 一种处理客户端任务的方法、装置和电子设备 |
| US12236331B2 (en) * | 2020-08-13 | 2025-02-25 | Samsung Electronics Co., Ltd. | Method and system of DNN modularization for optimal loading |
| US20230153565A1 (en) * | 2020-08-13 | 2023-05-18 | Samsung Electronics Co., Ltd. | Method and system of dnn modularization for optimal loading |
| CN114185673A (zh) * | 2021-12-10 | 2022-03-15 | 阿波罗智能技术(北京)有限公司 | 分布式资源调度方法、装置及系统 |
| CN114499958B (zh) * | 2021-12-24 | 2024-02-09 | 东软睿驰汽车技术(沈阳)有限公司 | 控制方法及装置、车辆及存储介质 |
| CN114499958A (zh) * | 2021-12-24 | 2022-05-13 | 东软睿驰汽车技术(沈阳)有限公司 | 控制方法及装置、车辆及存储介质 |
| CN115309521A (zh) * | 2022-07-25 | 2022-11-08 | 哈尔滨工业大学(深圳) | 面向海上无人设备的深度强化学习任务调度方法及装置 |
| US20240161125A1 (en) * | 2022-10-31 | 2024-05-16 | Tata Consultancy Services Limited | Method and system for data regulations-aware cloud storage and processing service allocation |
| US12423712B2 (en) * | 2022-10-31 | 2025-09-23 | Tata Consultancy Services Limited | Method and system for data regulations-aware cloud storage and processing service allocation |
Also Published As
| Publication number | Publication date |
|---|---|
| EP3614260A4 (en) | 2020-10-21 |
| US20200104722A1 (en) | 2020-04-02 |
| US20200104162A1 (en) | 2020-04-02 |
| JP2020522824A (ja) | 2020-07-30 |
| US20200125406A1 (en) | 2020-04-23 |
| KR102569086B1 (ko) | 2023-08-22 |
| KR20200087078A (ko) | 2020-07-20 |
| US11113103B2 (en) | 2021-09-07 |
| EP3651020A1 (en) | 2020-05-13 |
| EP3614260A1 (en) | 2020-02-26 |
| US11113104B2 (en) | 2021-09-07 |
| US20200012521A1 (en) | 2020-01-09 |
| JP7074777B2 (ja) | 2022-05-24 |
| US11360811B2 (en) | 2022-06-14 |
| US11221877B2 (en) | 2022-01-11 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2019095873A1 (zh) | 任务并行处理方法、装置、系统、存储介质及计算机设备 | |
| Huang et al. | Taskflow: A lightweight parallel and heterogeneous task graph computing system | |
| KR102860886B1 (ko) | 스케줄러, 스케줄러의 동작 방법 및 이를 포함한 가속기 시스템 | |
| CN109814986B (zh) | 任务并行处理方法、存储介质、计算机设备、装置和系统 | |
| Warneke et al. | Exploiting dynamic resource allocation for efficient parallel data processing in the cloud | |
| US20200249998A1 (en) | Scheduling computation graph heterogeneous computer system | |
| JP5705338B2 (ja) | Mapreduce環境で機械学習アルゴリズムを処理するためのシステムおよび方法 | |
| US20200301739A1 (en) | Maximizing resource utilization of neural network computing system | |
| CN111522640B (zh) | 计算图的并行执行方法和设备 | |
| TW202333052A (zh) | 用於深度學習工作負載之運算密集內核產生器、微內核代碼快取、融合式內核產生器及無循環依賴圖形分割 | |
| Sun et al. | Edge generation scheduling for DAG tasks using deep reinforcement learning | |
| US10838768B2 (en) | Method for optimizing memory access in a microprocessor including several logic cores upon resumption of executing an application, and computer implementing such a method | |
| KR20220049294A (ko) | 스케줄러, 스케줄러의 동작 방법 및 이를 포함한 전자 장치 | |
| Yang et al. | Aero: Design space exploration framework for resource-constrained cnn mapping on tile-based accelerators | |
| CN109213587B (zh) | GPU平台下的多Stream并行DAG图任务映射策略 | |
| CN119783812B (zh) | 面向新一代异构超算大模型并行训练与推理适配优化方法 | |
| CN119597414A (zh) | 算子的调度方法、装置、设备、存储介质及程序产品 | |
| Ghose et al. | A framework for OpenCL task scheduling on heterogeneous multicores | |
| Walter et al. | Real-time Scheduling of I/O Transfers for Massively Parallel Processor Arrays | |
| Kumar | Scheduling of dense linear algebra kernels on heterogeneous resources | |
| Zhang et al. | Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference | |
| Gorlatch et al. | USING THE SPIN MODEL CHECKER FOR AUTO-TUNING HIGH-PERFORMANCE PROGRAMS | |
| Rodríguez et al. | Dynamic management of multikernel multithread accelerators using dynamic partial reconfiguration | |
| Garanina et al. | Auto-tuning high-performance programs using model checking in Promela | |
| Lucas | On the use of hierarchical tasks for heterogeneous architectures |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18878728; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2018878728; Country of ref document: EP; Effective date: 20191119 |
| | ENP | Entry into the national phase | Ref document number: 2019568198; Country of ref document: JP; Kind code of ref document: A |
| | WWW | Wipo information: withdrawn in national office | Ref document number: 2018878728; Country of ref document: EP |