WO2025227845A1 - Training task processing method and related device - Google Patents
Training task processing method and related device
- Publication number
- WO2025227845A1 (PCT/CN2025/072421)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- training
- training task
- task
- model
- priority
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/4401—Bootstrapping
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Definitions
- This application relates to the field of artificial intelligence, and in particular to a method for processing training tasks and a related device.
- AI: artificial intelligence; ML: machine learning.
- the requesting device of the machine learning model can send a request message to the training device, which requests the training device to perform the task of training the machine learning model.
- the training device can perform the training task according to the received request message.
- a single training device can receive requests from multiple requesting devices.
- the training tasks currently being executed by the same training device may therefore include multiple training tasks.
- because the computing resources in the training device are limited, if the training device executes multiple training tasks simultaneously, the training device may become overloaded, resulting in interruption of the processes used to execute the training tasks or in data loss.
- This application provides a method for processing training tasks and related equipment.
- the training device can use the priority of training tasks to decide to pause, delay or refuse to execute certain training tasks, which helps to avoid overloading the training device, thereby helping to avoid interruption of the process used to execute training tasks or to avoid data loss, and improving the stability of the training device in the process of executing training tasks.
- embodiments of this application provide a method for processing training tasks.
- This method can be applied to the training phase of a model.
- the method includes: after a training device receives request information from a first device (hereinafter referred to as "first request information" for ease of distinction), the first request information being used to request the training device to execute a training task of a first model (hereinafter referred to as the first training task; that is, "the training task of the first model" and "the first training task" can be used interchangeably), if the priority of the training task of the first model is higher than the priority of a second training task, the training device executes the training task of the first model and suspends the execution of the second training task; wherein the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
- the priority of the first training task is higher than the priority of the second training task
- the priority of the first training task being higher than the priority of each training task included in the second training task.
- "the training device starting to execute the first training task" means that the first training task has been added to the training tasks currently being executed by the training device; "the training device pausing the execution of the second training task" can be referred to as the training device suspending the second training task, so that the second training task is no longer included in the training tasks currently being executed by the training device.
- after receiving the first request information from the first device, if the priority of the training task of the first model is lower than or equal to the priority of the second training task, the training device sends response information to the first device.
- the response information is used to notify the first device to delay or refuse to execute the training task of the first model, or the response information includes information about the second training task; wherein, the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
- the training device can also determine whether to execute the training task of the first model based on the priority of the training task of the first model. If the priority of the training task of the first model is higher than the priority of the second training task, the training task of the first model can be executed and the execution of the second training task can be suspended. If the priority of the training task of the first model is lower than or equal to the priority of the second training task, the execution of the training task of the first model can be delayed or rejected.
- the second training task is the training task currently being executed by the training device.
- the training device can use the priority of the training task to decide to pause, delay or reject the execution of certain training tasks, which helps to avoid the overload of the training device, thereby helping to avoid the interruption of the process used to execute the training task or the loss of data, and improving the stability of the training device in the process of executing the training task.
- the second training task may include one or more training tasks with the lowest priority among all currently executed training tasks, and the sum of the idle resources of the training device and the occupied resources of the second training task is greater than or equal to the required resources of the training task of the first model.
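- As a non-limiting illustration of the selection described above, the following Python sketch picks the lowest-priority running tasks to suspend so that the freed resources, together with the idle resources, cover the requirement of the first training task; all names, types, and resource units are hypothetical and not defined by this application.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrainingTask:
    task_id: str
    priority: int            # higher value = higher priority
    occupied_resources: int  # e.g. memory units currently held by the task

def select_tasks_to_suspend(running: List[TrainingTask],
                            idle_resources: int,
                            new_priority: int,
                            required_resources: int) -> Optional[List[TrainingTask]]:
    """Return the lowest-priority running tasks whose release would free enough
    resources for the new task, or None if the new task should be delayed or rejected."""
    # Only tasks with strictly lower priority than the new task may be suspended.
    candidates = sorted((t for t in running if t.priority < new_priority),
                        key=lambda t: t.priority)
    freed, to_suspend = idle_resources, []
    for task in candidates:
        if freed >= required_resources:
            break
        to_suspend.append(task)
        freed += task.occupied_resources
    return to_suspend if freed >= required_resources else None
```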
- the first request information includes the priority of the training task of the first model, which can also be understood as the first request information including the priority of the first training task; for example, the first device determines the priority of the first training task based on a first factor, which may include the inference type of the first model; optionally, the first factor may also include: the resource requirements of the first training task and the immediacy requirements of the first model.
- the training device can directly obtain the priority of the first training task from the first request information. This allows for a faster comparison between the priority of the first training task and the priorities of other training tasks. Since the decision on whether to execute the first training task can only be made after determining the comparison result between the priority of the first training task and the priority of the second training task, this also facilitates a faster determination of whether to execute the first training task. Furthermore, obtaining the priority of the first training task directly from the first request information reduces the resources consumed in the process of "the training device determining the priority of the first training task," allowing the training device to allocate more resources to executing the training task and obtain a model that has completed the training task as quickly as possible.
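- The following sketch illustrates, under assumed categories and weights, how a first device might map the first factor (inference type, resource requirement, immediacy requirement) to a numeric priority carried in the first request information; the concrete categories and scores are purely illustrative.

```python
def determine_priority(inference_type: str,
                       required_resources: int,
                       immediacy: str) -> int:
    """Toy mapping from the first factor to a numeric priority (higher = more urgent)."""
    # Illustrative weights only; a real deployment would define its own policy.
    type_score = {"beamforming": 3, "csi_compression": 2, "load_balancing": 1}.get(inference_type, 1)
    immediacy_score = {"real_time": 3, "near_real_time": 2, "offline": 1}.get(immediacy, 1)
    resource_score = 1 if required_resources > 1000 else 2   # smaller tasks slightly favoured
    return type_score * 4 + immediacy_score * 2 + resource_score
```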
- the information of the second training task includes the priority of each training task included in the second training task; optionally, the information of the second training task may also include identification information of each training task in the second training task; optionally, it may also include the inference type corresponding to each training task in the second training task.
- the second request information corresponding to the second training task may include the priority of the second training task, and optionally, the training device may also adjust the priority of the second training task.
- the training device will delay or refuse to execute the first training task.
- the response information sent by the training device to the first device includes the priority of the second training task.
- the first device can not only know that the training device has decided to delay or refuse to execute the first training task, but also know that the training device is executing the second training task with a higher priority, so it decides to delay or refuse to execute the first training task.
- the first device can know the current status of the training device, which makes it easier for the first device to determine a more suitable processing method by combining the current status of the training device and the needs of the first model. It also helps to make the use of resources in the training device better meet the needs of the current scenario.
- the method further includes: the training device notifying the second device corresponding to the second training task to suspend the execution of the second training task; or, the training device notifying the second device corresponding to the second training task of the training device's idle resources.
- the idle resources of the training device may include idle storage resources in the training device; optionally, it may also include idle processor resources in the training device.
- after suspending the execution of the second training task, the training device will promptly notify the second device corresponding to the second training task that the execution of the second training task has been suspended, thereby enabling the second device to quickly learn that the training device has suspended the execution of the second training task and making the execution process of the second training task more controllable.
- the training device notifies the second device corresponding to the second training task to suspend the execution of the second training task, including: the training device sending a reason value to the second device, the reason value indicating the reason for suspending the execution of the second training task.
- the training device sends a reason value to the second device, which indicates the reason for suspending the execution of the second training task.
- the second device can not only promptly know that the second training task has been suspended, but also know the reason for the suspension. This facilitates the second device in promptly determining the handling method for the second training task after confirming that it has been suspended, and also allows the second device to determine a more suitable handling method based on the reason for suspending the second training task.
- the reason is that the second training task is the lowest priority training task on the training device.
- the second training task whose execution the training device pauses is the lowest-priority training task on the training device. That is, the resources in the training device can be allocated to higher-priority training tasks as much as possible, so that the resources in the training device can be used more efficiently.
- the reason value includes the priority of each of at least one third training task currently being executed by the training device, wherein the priority of each third training task is higher than the priority of the second training task.
- "at least one third training task" can include all training tasks currently being executed by the training device. It should be noted that, since the training device may pause old training tasks or begin executing new training tasks, the training tasks included in "the training tasks currently being executed by the training device" in this application can vary. If, when the training device sends the reason value to the second device, it has paused the execution of the second training task and begun executing the first training task, then the training tasks currently being executed by the training device can include the first training task; that is, the at least one third training task can include the first training task. Alternatively, "at least one third training task" can include only the first training task. Or, "at least one third training task" can include a preset number of training tasks selected from all training tasks currently being executed by the training device, etc.
- the above reason value may also include identification information of each third training task; alternatively, the above reason value may also include the inference type of each third training task.
- the reason value sent by the training device to the second device includes the priority of each of the at least one third training task currently being executed by the training device.
- the priority of each training task is higher than that of the second training task.
- the second device can not only know that the execution of the second training task is suspended due to the low priority, but also know the priority of the third training task that is currently occupying the resources of the training device. That is, the second device can know a more detailed resource usage of the training device. This makes it easier for the second device to determine a more suitable processing method by combining the resource usage of the training device with the needs of the second model. It also helps to make the resource usage of the training device better meet the needs of the current scenario.
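- A minimal sketch of what the notification carrying the reason value might contain, assuming Python dataclasses as the message representation; the field names are hypothetical, and the actual message encoding is not limited by this application.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ThirdTaskInfo:
    task_id: str                          # optional identification information
    priority: int                         # priority of a higher-priority (third) task
    inference_type: Optional[str] = None  # optional inference type

@dataclass
class SuspendNotification:
    suspended_task_id: str
    # Reason value: the second task was the lowest-priority task, plus the
    # priorities of the higher-priority tasks still occupying resources.
    reason: str = "lowest_priority_task"
    higher_priority_tasks: List[ThirdTaskInfo] = field(default_factory=list)
```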
- the method further includes: the training device receiving a processing method for the second training task from the second device, wherein the processing method for the second training task is: termination, waiting, or adjustment of resource usage.
- the second device can send a processing method for the second training task to the training device. Since the second device has a clearer understanding of its needs for the second model, it can determine whether to terminate, wait, or adjust resource usage based on its needs for the second model. This helps improve the adaptability of the processing method for the second training task to specific application scenarios.
- the method further includes: when the processing method of the second training task is termination, the training device deletes the second request information corresponding to the second training task.
- when the processing method of the second training task is waiting, the training device waits until its idle resources are greater than or equal to the resources required by the second training task before continuing to execute the second training task.
- when the processing method of the second training task is adjusting resource usage, the training device reduces the number of parameters of the model corresponding to the second training task, or reduces the accuracy when executing the second training task, or reduces the batch training size when executing the second training task.
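- The three processing methods can be dispatched roughly as in the sketch below; the scheduler object and its methods are hypothetical placeholders for the training device's internal bookkeeping, not an interface defined by this application.

```python
def handle_processing_method(method: str, task, scheduler) -> None:
    """Dispatch the processing method fed back by the second device (illustrative only)."""
    if method == "terminate":
        scheduler.delete_request(task.request_id)    # release everything tied to the request
    elif method == "wait":
        scheduler.enqueue_until_resources(task)      # resume once idle resources >= required resources
    elif method == "adjust":
        scheduler.adjust_resource_usage(task)        # prune the model, lower precision, or shrink the batch
    else:
        raise ValueError(f"unknown processing method: {method}")
```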
- a training task involves multiple training processes on a model.
- Batch training refers to dividing the multiple training processes of a training task into multiple batches.
- in each batch of training, n training samples are obtained from the training dataset, where n is an integer greater than or equal to 1.
- the batch size represents the number of training samples used in a single batch of training, that is, the value of n.
- the "batch size" affects the memory resources required by the model in the training process of a single batch.
- "reducing the number of parameters in the model corresponding to the second training task” can be achieved by pruning the second model.
- the specific pruning algorithm can be flexibly determined based on the actual application scenario, thereby reducing not only the storage resources used but also the processor resources used.
- "Reducing the precision when performing the second training task” can be achieved by replacing the high-precision data format with a low-precision data format when performing the second training task, such as reducing from FP32 to FP16, or from FP64 to mixed precision, etc. This example is only for the purpose of understanding the solution, thereby reducing not only the storage resources used but also the processor resources used.
- "Reducing the batch training size when performing the second training task” can be achieved by reducing the number of training samples used in a single batch of training, thereby reducing the storage resources required for a single batch of training.
- for the different processing methods of the second training task fed back by the second device, corresponding processing schemes of the training device are provided.
- whichever processing method the second device feeds back, the training device has a corresponding processing scheme, which helps to improve the smoothness and stability of this solution during execution.
- the processing method of the second training task is termination, the second request information is deleted, thereby releasing all resources related to the second request information in the training device in a timely manner, which helps to avoid wasting resources in the training device.
- the method is applied to the following scenarios: the resource requirements of the first training task exceed the idle resources of the training device; or, the number of training tasks currently being executed by the training device exceeds a preset threshold.
- the resource requirements of the first training task may include the storage resources required by the first training task, and optionally, may also include the processor resources required by the first training task.
- the storage resources include video memory resources and/or system memory resources.
- the training task processing method provided in this application is started when the required resources of the first training task exceed the idle resources of the training device, or when the number of training tasks currently being executed by the training device exceeds a preset threshold. That is, in the aforementioned scenarios, the training device will pause, delay, or refuse to execute certain training tasks, providing two application scenarios for this application and improving the implementation flexibility of this solution. In addition, in the aforementioned two application scenarios, the load limit of the training device is almost reached, which is conducive to avoiding overload of the training device while making the maximum use of the resources in the training device.
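- The two trigger conditions described above can be checked as in the following sketch; the threshold value and the resource accounting are assumptions introduced only for illustration.

```python
def should_arbitrate(required_resources: int,
                     idle_resources: int,
                     running_task_count: int,
                     max_concurrent_tasks: int) -> bool:
    """Start the priority-based handling when either trigger condition holds."""
    return (required_resources > idle_resources
            or running_task_count > max_concurrent_tasks)
```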
- adjusting resource usage includes: reducing the number of parameters in the model corresponding to the second training task, reducing the accuracy when performing the second training task, or reducing the batch training scale when performing the second training task.
- the second device can further indicate how to adjust the resources used when performing the second training task. This helps to make the final second model more compatible with the application scenario of the second model, that is, it helps to obtain a more satisfactory second model under the premise of limited resources.
- each training task currently performed by the training device is a task of training a model in the communication domain.
- This implementation provides a specific application domain for the method of this application, increasing the degree of integration between this application and a specific application domain.
- this application provides a method for processing training tasks, which can be applied to the training phase of a model.
- the method includes: a first device sending request information to a training device, the request information being used to request the execution of a training task of a first model; the first device receiving response information from the training device, the response information being used to notify the first device to delay or refuse to execute the training task of the first model, or the response information including information of a second training task; wherein the priority of the training task of the first model is lower than or equal to the priority of the second training task, the second training task being one or more training tasks currently being executed by the training device, or the second training task being the lowest priority training task among the training tasks currently being executed by the training device.
- the first device sends the processing method of the training task of the first model to the training device.
- the processing method of the training task of the first model is: termination, waiting, or adjusting the occupied resources.
- adjusting resource usage includes: reducing the number of parameters of the model corresponding to the training task of the first model, reducing the accuracy when performing the first training task, or reducing the batch training size when performing the first training task.
- the determining factors for the processing method of the first model's training task include: the immediacy requirement of the first model, the accuracy requirement of the first model, and/or, the degree of reduction in the accuracy of the first model caused by adjusting resource usage.
- the "immediacy requirement of the first model” can be understood as the time requirement for the first model after completing the first training task, or as how quickly the first model needs to be deployed. The shorter the deployment time, the higher the immediacy requirement of the first model.
- the "degree of reduction in the accuracy of the first model caused by adjusting resource usage” can be determined by at least one of the following parameters: the first accuracy range of the first model obtained after adjusting the resources used during the execution of the first training task, and/or, the second accuracy range, where the second accuracy range represents the decrease in accuracy of the final obtained first model due to the adjustment of resource usage by a second accuracy range.
- the processing method for the first training task is determined from two dimensions: time requirements and accuracy requirements. This approach is beneficial for obtaining a better processing method under the premise of limited resources.
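- A toy decision rule over the immediacy and accuracy dimensions might look as follows; the inputs, thresholds, and return values are illustrative assumptions, not the decision logic of this application.

```python
def choose_processing_method(immediacy_high: bool,
                             required_accuracy: float,
                             accuracy_after_adjustment: float) -> str:
    """Purely illustrative mapping from the determining factors to a processing method."""
    if not immediacy_high:
        return "wait"        # no urgency: wait until resources free up
    if accuracy_after_adjustment >= required_accuracy:
        return "adjust"      # urgent, and the reduced-accuracy model still suffices
    return "terminate"       # urgent, but an adjusted model would be too inaccurate
```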
- the first device may also perform the steps performed by the first device in the first aspect and various implementations of the first aspect.
- the specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the second aspect can all be found in the first aspect, and will not be repeated here.
- this application provides a method for processing training tasks, which can be applied to the training phase of a model.
- the method includes: a second device sending a request message to a training device, the request message being used to request the training device to execute a training task of a second model (hereinafter referred to as the second training task); the second device receiving a notification message from the training device, the notification message indicating to suspend the execution of the training task of the second model, or the notification message indicating the idle resources of the training device; wherein, the priority of the second training task is lower than the priority of the first training task, the first training task being a training task added by the training device.
- the notification information includes a reason value, which indicates the reason for pausing the training task of the second model.
- the method further includes: the second device sending the processing method of the training task of the second model to the training device, wherein the processing method of the training task of the second model is: termination, waiting, or adjusting the occupied resources.
- the determining factors for how the training task of the second model is handled include: the immediacy requirements of the second model, the accuracy requirements of the second model, and/or, the degree of reduction in the accuracy of the second model caused by adjusting resource usage.
- the second device may also perform the steps performed by the second device in the first aspect and various implementations of the first aspect.
- the specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the third aspect can all be found in the first aspect, and will not be repeated here.
- the training task processing apparatus includes: a receiving module, configured to receive request information from a first device, the request information being used to request the execution of a training task of a first model; an execution module, configured to execute the training task of the first model and suspend the execution of the second training task if the priority of the training task of the first model is higher than the priority of the second training task; and/or, a sending module, configured to send response information to the first device if the priority of the training task of the first model is lower than or equal to the priority of the second training task, the response information being used to notify the first device to delay or refuse the execution of the training task of the first model, or the response information including information of the second training task; wherein the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
- the request information includes the priority of the training task of the first model.
- the information for the second training task includes the priority of the second training task.
- the training task processing device further includes: a notification module for notifying the second device corresponding to the second training task to suspend the execution of the second training task; or, a notification module for notifying the second device corresponding to the second training task of the idle resources of the training device.
- the notification module is specifically used to send a reason value to the second device, the reason value indicating the reason for suspending the execution of the second training task.
- the reason is that the second training task is the lowest priority training task on the training device.
- the reason value includes the priority of each of at least one third training task currently being executed by the training device, wherein the priority of each third training task is higher than the priority of the second training task.
- the receiving module is further configured to receive the processing method of the second training task from the second device, wherein the processing method of the second training task is: termination, waiting, or adjustment of occupied resources.
- the training task processing apparatus further includes: a deletion module, used to delete the request information corresponding to the second training task when the processing mode of the second training task is termination; or, an execution module, used to wait until the idle resources are greater than or equal to the resources required by the second training task when the processing mode of the second training task is waiting, and then continue to execute the second training task; or, an adjustment module, used to reduce the number of parameters of the model corresponding to the second training task, or reduce the accuracy when executing the second training task, or reduce the batch training scale when executing the second training task when the processing mode of the second training task is adjusting the occupied resources.
- the method is applied to the following scenarios: the resource requirements of the first training task are greater than the available resources of the training device; or, the number of training tasks currently being executed by the training device is greater than a preset threshold.
- adjusting resource usage includes: reducing the number of parameters in the model corresponding to the second training task, reducing the accuracy when performing the second training task, or reducing the batch training size when performing the second training task.
- each training task currently performed by the training device is a task of training a model in the field of communications.
- the training task processing device can also execute the steps performed by the training device in the first aspect and various implementations of the first aspect.
- the specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the fourth aspect can all be found in the first aspect, and will not be repeated here.
- the training task processing apparatus includes: a sending module, configured to send request information to the training device, the request information being used to request the execution of a training task for a first model; and a receiving module, configured to receive response information from the training device, the response information being used to notify the first device to delay or refuse to execute the training task for the first model, or the response information including information about a second training task; wherein the priority of the first training task is lower than or equal to the priority of the second training task, the second training task is one or more training tasks currently being executed by the training device, or the second training task is the lowest priority training task among the training tasks currently being executed by the training device.
- the sending module is further configured to send the processing method of the first training task to the training device, wherein the processing method of the first training task is: termination, waiting, or adjustment of occupied resources.
- adjusting resource usage includes: reducing the number of parameters in the model corresponding to the first training task, reducing the accuracy when performing the first training task, or reducing the batch training size when performing the first training task.
- the determining factors for how the first training task is handled include: the immediacy requirements of the first model, the accuracy requirements of the first model, and/or, the degree of reduction in the accuracy of the first model caused by adjusting resource usage.
- the training task processing device can also execute the steps executed by the first device in the second aspect and various implementations of the second aspect.
- the specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the fifth aspect can all be found in the second aspect, and will not be repeated here.
- embodiments of this application provide a training task processing apparatus, which can be applied to the training phase of a model and can be used in a second device.
- the training task processing apparatus includes: a sending module, used to send request information to the training device, the request information being used to request the training device to execute a training task of a second model; and a receiving module, used to receive notification information from the training device, the notification information indicating to suspend the execution of the training task of the second model, or the notification information indicating the idle resources of the training device; wherein, the priority of the second training task is lower than the priority of the first training task, and the first training task is a training task added by the training device.
- the notification information includes a reason value, which indicates the reason for pausing the training task of the second model.
- the sending module is further configured to send the processing method of the training task of the second model to the training device, wherein the processing method of the training task of the second model is: termination, waiting, or adjustment of resource usage.
- the training task processing device can also execute the steps performed by the second device in the third aspect and various implementations of the third aspect.
- the specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the sixth aspect can all be found in the third aspect, and will not be repeated here.
- embodiments of this application provide an apparatus including a processor and a memory, the processor being coupled to the memory, the memory being used to store a program; and the processor being used to execute the program in the memory, causing the apparatus to perform the methods described in the first, second, or third aspects above.
- embodiments of this application provide a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the methods described in the first, second, or third aspects above.
- embodiments of this application provide a computer program product, which includes a program that, when run on a computer, causes the computer to perform the methods described in the first, second, or third aspects above.
- this application provides a chip system including a processor for supporting the implementation of the functions involved in the foregoing aspects, such as transmitting or processing data and/or information involved in the foregoing methods.
- the chip system further includes a memory for storing program instructions and data necessary for a terminal device or communication device.
- This chip system may be composed of chips or may include chips and other discrete devices.
- Figure 1 is a schematic structural diagram of the artificial intelligence main framework
- Figure 2 is a schematic diagram of the training and application phases of the model
- Figure 3a is a schematic diagram of an architecture for a training task processing system
- Figure 3b is a schematic diagram of another architecture for the training task processing system
- Figure 3c is a schematic diagram of another architecture for the training task processing system
- Figure 4 is a schematic diagram of a training task processing method provided in an embodiment of this application.
- FIG. 5 is another schematic diagram of the training task processing method provided in the embodiments of this application.
- Figure 6 is another schematic diagram of the training task processing method provided in the embodiments of this application.
- Figure 7 is a schematic diagram of a training task processing device provided in an embodiment of this application.
- Figure 8 is a schematic diagram of another structure of the training task processing device provided in the embodiment of this application.
- Figure 9 is a schematic diagram of another structure of the training task processing device provided in the embodiment of this application.
- Figure 10 is a schematic diagram of the structure of a device provided in an embodiment of this application.
- “send” and “receive” refer to the direction of signal transmission.
- “send information to device XX” can be understood as the destination of the information being device XX, which may include direct transmission via the air interface or indirect transmission by other units or modules via the air interface.
- “Receive information from device YY” can be understood as the source of the information being device YY, which may include direct reception from device YY via the air interface or indirect reception from device YY via other units or modules via the air interface.
- "send" can also be understood as the "output" of a chip interface, and "receive" can also be understood as the "input" of a chip interface.
- sending and receiving can occur between devices or within devices, for example, through buses, traces, or interfaces between components, modules, chips, software modules, or hardware modules within a device. It is understood that information may undergo necessary processing, such as encoding and modulation, between the source and destination of information transmission, but the destination can understand the valid information from the source. Similar expressions in this application can be understood in a similar way and will not be elaborated further.
- instruction can include direct and indirect instructions, as well as explicit and implicit instructions.
- the information indicated by a certain piece of information (hereinafter referred to as instruction information) is called the information to be instructed.
- there are many ways to indicate the information to be instructed such as, but not limited to, directly indicating the information to be instructed, such as the information to be instructed itself or its index. It can also indirectly indicate the information to be instructed by indicating other information, where there is an association between the other information and the information to be instructed; or it can indicate only a part of the information to be instructed, while the other parts are known or pre-agreed upon.
- the instruction can be implemented by using a pre-agreed (e.g., protocol predefined) arrangement of various information, thereby reducing the instruction overhead to a certain extent.
- This application does not limit the specific method of instruction. It is understood that for the sender of the instruction information, the instruction information can be used to indicate the information to be instructed; for the receiver of the instruction information, the instruction information can be used to determine the information to be instructed.
- Figure 1 is a schematic structural diagram of the artificial intelligence main framework.
- the aforementioned artificial intelligence main framework is elaborated below from two dimensions: the "Intelligent Information Chain" (horizontal axis) and the "IT Value Chain" (vertical axis).
- the "Intelligent Information Chain” reflects a series of processes from data acquisition to processing. For example, it could be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output.
- the "IT Value Chain” reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of human intelligence and information (provided and processed by technology) to the industrial ecosystem of the system.
- the infrastructure provides computing power to support artificial intelligence systems, enabling communication with the external world and providing support through a basic platform. Communication with the outside world is achieved through sensors; computing power is provided by intelligent chips, which can specifically employ hardware acceleration chips such as central processing units (CPUs), neural network processing units (NPUs), graphics processing units (GPUs), application-specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs).
- the basic platform includes distributed computing frameworks and related platform guarantees and support, which may include cloud storage and computing, interconnected networks, etc. For example, sensors communicate with the outside world to acquire data, and this data is provided to intelligent chips in the distributed computing system provided by the basic platform for computation.
- the data at the layer above the infrastructure is used to represent the data sources in the field of artificial intelligence.
- the data involves graphics, images, voice, text, and IoT data from traditional devices, including business data from existing systems and sensor data such as force, displacement, liquid level, temperature, and humidity.
- Data processing typically includes methods such as data training, machine learning, deep learning, search, reasoning, and decision-making.
- machine learning and deep learning can perform intelligent information modeling, extraction, preprocessing, and training on data, including symbolization and formalization.
- Reasoning refers to the process in which, in a computer or intelligent system, the machine thinks and solves problems by simulating human intelligent reasoning, based on reasoning control strategies and using formalized information. Typical functions include search and matching.
- Decision-making refers to the process of making decisions based on intelligent information after reasoning, and it typically provides functions such as classification, sorting, and prediction.
- the results of the data processing can be used to form some general capabilities, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
- Intelligent products and industry applications refer to products and applications of artificial intelligence systems in various fields. They encapsulate overall artificial intelligence solutions, productize intelligent information decision-making, and realize practical applications. Their application areas mainly include: communications, intelligent driving, intelligent terminals, intelligent transportation, smart homes, intelligent healthcare, intelligent security, intelligent manufacturing, and smart cities.
- the method provided in this application can be applied to various application fields of artificial intelligence technology, specifically to the training stage of models in various application fields.
- the method provided in this application can be applied to application scenarios where multiple training tasks are performed by the same training device, with each training task being a task of training a model.
- the "model” in this application can also be called a "machine learning model.”
- the machine learning model in this application can specifically be represented as a neural network, or as a non-neural network model, etc., which can be determined based on the actual application scenario.
- models can be deployed in core network equipment, network equipment, and/or terminal equipment.
- the functions implemented by the models may include, but are not limited to: predicting the movement trajectory of terminal equipment, compressing the codebook in the channel state information reference signal (CSI-RS), decompressing the codebook in the CSI-RS, beamforming, load balancing of network equipment, or other functions.
- the specific devices in which models are deployed to implement which functions can be determined based on the actual application scenario.
- core network equipment refers to cloud servers that carry various network functions.
- the functions implemented through core network equipment include, but are not limited to: network data analytics function (NWDAF), access and mobility management function (AMF), session management function (SMF), authentication server function (AUSF), network exposure function (NEF), network repository function (NRF), network slice selection function (NSSF), unified data management (UDM), user plane function (UPF), etc., which will not be listed exhaustively here.
- Network equipment can refer to devices that provide wireless access services in a wireless network.
- a network device can be a device that connects terminal devices to a wireless network, and can also be called a base station; the aforementioned base station can be various forms of macro base stations, micro base stations, relay stations, or access points, etc.
- the names of network devices with base station functions may differ.
- a base station can be called an evolved Node B (eNB), a Node B (NB), the next-generation Node B (gNB) in a 5th generation (5G) communication system, a home base station (e.g., home evolved Node B, or home Node B, HNB), a base band unit (BBU), a wireless fidelity (Wi-Fi) access point (AP), a transmission reception point (TRP), or a radio network controller (RNC), etc.
- the terminal device can achieve wireless access through the cooperation of multiple network nodes, with each network node performing a portion of the base station's functions.
- network nodes can be central units (CU), distributed units (DU), CU-control plane (CP), CU-user plane (UP), or radio units (RU), etc.
- CU and DU can be set up separately or included in the same network element, such as a baseband unit (BBU).
- RU can be included in radio frequency equipment or radio frequency units, such as remote radio units (RRU), active antenna units (AAU), or remote radio heads (RRH).
- the CU (or CU-CP and CU-UP), DU, and RU may have different names, but their meanings will be understood by those skilled in the art.
- a CU can also be called an open CU (O-CU), a DU can also be called an open DU (O-DU), a CU-CP can also be called an open CU-CP (O-CU-CP), a CU-UP can also be called an open CU-UP (O-CU-UP), and an RU can also be called an open RU (O-RU).
- Any of the CU (or CU-CP, CU-UP), DU, and RU units can be implemented through software modules, hardware modules, or a combination of software and hardware modules. This application does not limit the specific device form of the network equipment.
- a terminal device refers to a wireless terminal device capable of receiving scheduling and instruction information sent by network devices.
- Terminal devices can be handheld devices, vehicle-mounted devices, wearable devices, computing devices, or other processing devices with wireless communication capabilities, etc., without exhaustive list.
- Terminal devices can communicate with one or more core network devices or the Internet via a wireless access network (RAN).
- terminal devices can be portable, pocket-sized, handheld, computer-embedded, or vehicle-mounted mobile devices that exchange voice and/or data with the RAN.
- a terminal device can be a user agent, a cellular phone, a smartphone, a personal digital assistant (PDA), a tablet PC, a modem, a handset, a laptop computer, a personal communication service (PCS) phone, a remote station, an access point (AP), a remote terminal, an access terminal, a customer premises equipment (CPE), a terminal, a user equipment (UE), or a mobile terminal (MT), etc.
- terminal devices can also be wearable devices, such as glasses, gloves, watches, clothing, and shoes.
- terminal devices can also be drones, robots, terminal devices in device-to-device (D2D) communication, vehicle-to-everything (V2X) communication, virtual reality (VR) devices, augmented reality (AR) devices, wireless terminals in industrial control, terminal devices in self-driving, remote medical care, smart grids, smart cities, and smart homes.
- the terminal device can also be a terminal device in a communication system after 5G (such as a sixth-generation (6G) communication system) or a terminal device in a future evolved public land mobile network (PLMN), etc.
- in the field of intelligent driving, for example, functions implemented using models include: predicting the trajectory of obstacles around the vehicle, determining the vehicle's lateral and longitudinal directions, predicting the location areas the vehicle may reach in the future, performing trajectory planning, and other functions.
- as another example, functions implemented using models include: image style transfer, image inpainting, predicting the category of objects in an image, speech recognition, text translation, and other functions.
- Figure 2 is a schematic diagram of the training and application phases of the model. As shown in Figure 2, the entire system may include a request device 200, a training device 210, a database 220, an execution device 230, and a data storage system 240.
- the execution device 230 includes a computing module 231.
- the requesting device 200 and the training device 210 can be communicatively connected.
- the requesting device 200 is a device that sends request information to the training device 210, which is used to request the execution of a training task for the model.
- the requesting device 200 can be a terminal device, a network device, or a core network device.
- the requesting device 200 can be a terminal device responsible for the operation, management, and maintenance of the network. Technicians can directly interact with the aforementioned terminal device to achieve network management and maintenance.
- Database 220 stores a training dataset.
- training device 210 acquires model 201 before the training operation is performed, and uses the training dataset to iteratively train model 201 until the preset convergence condition is met, thus obtaining model 201 after the training operation is performed.
- Model 201 after the training operation is performed can also be called model 201 after training.
- iteratively training model 201 can be understood as updating the weight parameters in model 201 multiple times.
- the trained model 201 can be deployed to the computing module 231 of the execution device 230.
- the execution device 230 can access data, code, etc., from the data storage system 240, and can also store data, instructions, etc., in the data storage system 240.
- the data storage system 240 can be located within the execution device 230, or it can be an external storage device relative to the execution device 230.
- the execution device 230 can input the data to be processed into the model 201 to obtain the prediction information generated by the model 201, thereby realizing the function of the model 201.
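- A rough sketch of the iterative training performed by training device 210 on model 201, assuming a PyTorch-style loop and an arbitrary loss-based convergence condition; the real convergence criterion, optimizer, and loss are not limited by this application.

```python
import torch

def train_until_converged(model, loader, max_epochs=100, loss_threshold=1e-3):
    """Iteratively update the model's weight parameters until a preset convergence
    condition is met, then return the trained model (illustrative sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / len(loader) < loss_threshold:  # preset convergence condition
            break
    return model  # the trained model can then be deployed to the execution device
```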
- Figure 3a is a schematic diagram of an architecture of a training task processing system.
- the training device is taken as a core network device in the communication field
- the execution device is taken as a base station in the communication field.
- the requesting device can be a terminal device that communicates with the core network device.
- the requesting device can send request information to the core network device. This request information is used to request the execution of the training task.
- when the core network device completes the training task based on the request information, it can obtain a model that has undergone training operations.
- the model that has undergone training operations is deployed to multiple base stations. It should be understood that the example in Figure 3a is only for the convenience of understanding this scheme and is not intended to limit this scheme.
- Figure 3a is merely a schematic diagram of one architecture for the training task processing system, and the positional relationships between the devices, components, modules, etc., shown in the figure do not constitute any limitation.
- the training device 210 and the execution device 230 can also be integrated into the same device.
- Figure 3b is a schematic diagram of another architecture for the training task processing system.
- the training device and the execution device are both the same core network device as an example.
- the requesting device can be a terminal device that communicates with the core network device. As shown in Figure 3b, the requesting device can send request information to the core network device.
- when the core network device completes the training task based on the request information, it can obtain the model that has undergone training operations.
- the model that has undergone training operations is deployed to the core network device. It should be understood that the example in Figure 3b is only for the convenience of understanding this solution and is not intended to limit this solution.
- Figure 3c is a schematic diagram of another architecture of the training task processing system.
- the training device and the execution device are both the same base station as an example.
- the requesting device can be a terminal device that communicates with the core network device.
- the requesting device can send request information to the core network device, and the core network device will then forward the request information to each base station.
- after receiving the request information, if the base station completes the training task based on the request information, it can obtain the model that has been trained.
- the model that has been trained is deployed to the base station.
- the requesting device can also be integrated with the training device in the same device.
- when the core network device determines to perform a training operation on a certain model, it can also generate the request information itself and then perform the training task based on the request information. That is, the requesting device and the training device can be integrated in the same core network device.
- when the base station determines to perform a training operation on a certain model, it can also generate the request information itself and then perform the training task based on the request information. That is, the requesting device and the training device can be integrated in the same base station, etc.
- the specific product forms of the "requesting device", "training device” and “execution device” can be determined in combination with the actual application scenario.
- Figures 3a, 3b, and 3c above are only examples of applications in the field of communication.
- the method provided by this application can also be applied to other fields.
- the requesting device and the training device are integrated into the same device, both of which are cloud servers, and the executing device is a vehicle, etc.
- the situations in each application field will not be listed here one by one.
- this application discloses that, after receiving request information for executing a training task of a first model (hereinafter referred to as "first request information" for ease of distinction), the training device can determine whether to execute the first training task based on its priority.
- the first training task includes the training task of the first model. If the priority of the first training task is higher than the priority of the second training task, the first training task can be executed, and the execution of the second training task can be suspended.
- otherwise, the execution of the first training task can be delayed or rejected.
- the second training task is the training task currently being executed by the training device.
- the training device can use the priority of training tasks to decide whether to pause, delay, or reject the execution of certain training tasks, which helps to avoid overloading the training device, thereby preventing process interruption or data loss and improving the stability of the training device during the execution of training tasks.
- Figure 4 is a schematic diagram of a training task processing method provided in an embodiment of this application.
- the training task processing method provided in this application may include:
- Step 401: the training device receives a first request message from the first device, the first request message being used to request the execution of the training task of the first model.
- the first request information can be a single message or information within a message, without restriction.
- the terms "training task of the first model" and "first training task" in this application can be interchanged without restriction.
- when the first device determines to train the first model, it can send the first request information to the training device.
- the training device receives the first request information and, based on it, determines that it requests to execute the training task of the first model. Therefore, it can create a first training task, which includes the training task of the first model. It should be noted that the creation of the first training task by the training device does not mean that the training device immediately begins executing the first training task.
- the training device can determine whether to begin executing the first training task in subsequent steps; the specific determination process will be described in subsequent steps.
- the training device creating a first training task may include: the training device generating identification information for the first training task.
- the training device can then place the first training task into a waiting queue; for example, the training device can place the identification information of the first training task into the waiting queue, and optionally, the training device can place the identification of the first training task and first request information into the waiting queue.
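- As an illustration of the bookkeeping described above, the following Python sketch creates a training task record and places it into a waiting queue; the class name, field names, and the use of a UUID as identification information are assumptions made for the example, not part of this application.

```python
# Hypothetical sketch: creating a training task record and putting it into a
# waiting queue. All names (TrainingTask, waiting_queue, ...) are illustrative.
import uuid
from collections import deque
from dataclasses import dataclass

@dataclass
class TrainingTask:
    task_id: str        # identification information of the training task
    request_info: dict  # the request information carried by the request message
    priority: int = 0   # larger number = higher priority (one convention mentioned)

waiting_queue: deque = deque()

def create_training_task(request_info: dict) -> TrainingTask:
    """Create the first training task without starting to execute it."""
    task = TrainingTask(task_id=str(uuid.uuid4()),
                        request_info=request_info,
                        priority=request_info.get("priority", 0))
    # Optionally keep both the identifier and the request information in the queue.
    waiting_queue.append(task)
    return task

first_task = create_training_task({"model": "first_model", "priority": 3})
print(any(t.task_id == first_task.task_id for t in waiting_queue))  # True
```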
- the first device can be a requesting device.
- the first device, i.e., the requesting device, can specifically be a terminal device, a network device, a core network device, or another type of device, depending on the actual application scenario.
- the training device can be a network device, a core network device, a cloud server, or other types of devices.
- the first training task is the training task of the first model, which can also be called the first machine learning model.
- the first model can be a neural network or a non-neural network model.
- the first request information may include information indicating the inference type of the first model; the training device can determine what kind of model the first training task is based on the inference type of the first model.
- the inference type of the first model can also be referred to as the type of task performed by the first model, or other names, etc., which will not be exhaustively listed here.
- the inference type of the first model can indicate the function implemented by the first model.
- for example, if the inference type of the first model is predicting the movement trajectory of the terminal device, the function implemented by the first model is to predict the movement trajectory of the terminal device;
- if the inference type of the first model is compressing the codebook in CSI-RS, the function implemented by the first model is to compress the codebook in CSI-RS;
- if the inference type of the first model is image classification, the function implemented by the first model is to predict the category of objects in the image, and so on.
- the first request information indicates the inference type of the first model by including at least one of the following: a description of the inference type of the first model, an identifier of the inference type of the first model, a first model that has not yet been trained, an identifier of the first model, or other information.
- upon acquiring the untrained first model, the training device can determine what kind of information the first model outputs, that is, determine the inference type of the first model. For example, if the first model outputs the movement trajectory of the terminal device over a future period, the inference type of the first model is predicting the movement trajectory of the terminal device; as another example, if the first model outputs compressed information of the codebook in CSI-RS, the inference type of the first model is compressing the codebook in CSI-RS; as yet another example, if the first model outputs the category of an object in an image, the inference type of the first model is image classification, etc. These examples are only provided to facilitate understanding of how to determine the inference type of the first model.
- if the training device stores a correspondence between model identifiers and models, it can determine the first model corresponding to the identification information of the first model according to that correspondence. After obtaining the first model, it can determine what kind of information the first model outputs, that is, it can determine the inference type of the first model.
- the first request information may further include: the accuracy requirement of the first training task, and/or, the batch training size requirement of the first training task.
- the "accuracy requirement of the first training task" can be used to indicate the precision to be used when performing the first training task.
- the precision used when performing a training task can be single-precision floating-point (FP32), half-precision floating-point (FP16), double-precision floating-point (FP64), mixed precision, or other precision, etc., which are not limited here.
- a training task can include multiple training processes for a model.
- Batch training refers to dividing the multiple training processes of a training task into multiple batches.
- in the training process of a single batch, n training samples are obtained from the training dataset, where n is an integer greater than or equal to 1.
- the batch size represents the number of training samples used in the training process of a single batch, that is, the value of n. For example, the batch size affects the GPU memory resources required by the model in the training process of a single batch.
- the training device can determine the required precision when performing the first training task based on the precision requirements of the first training task, and/or, the training device can determine the number of training samples used in the training process of a single batch based on the batch training scale requirements of the first training task.
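- The following sketch shows one possible way a training device could derive the training precision and batch size from fields of the first request information; the field names, the precision table, and the per-batch cost estimate are illustrative assumptions, not part of this application.

```python
# Illustrative sketch: reading the accuracy requirement and batch-size requirement
# out of the first request information. Field names and values are assumptions.
PRECISION_BYTES = {"FP64": 8, "FP32": 4, "FP16": 2, "mixed": 3}  # rough per-value cost

def training_config_from_request(request_info: dict) -> dict:
    precision = request_info.get("accuracy_requirement", "FP32")
    batch_size = request_info.get("batch_size_requirement", 32)  # n samples per batch
    # A crude estimate of per-batch memory pressure: more samples and wider
    # data types mean more GPU memory is needed for a single batch.
    per_batch_cost = batch_size * PRECISION_BYTES[precision]
    return {"precision": precision, "batch_size": batch_size,
            "per_batch_cost": per_batch_cost}

print(training_config_from_request({"accuracy_requirement": "FP16",
                                    "batch_size_requirement": 64}))
```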
- the first request information may also include the priority of the training task of the first model, which can also be understood as the first request information including the priority of the first training task; for example, the priority of the first training task may be expressed as a number, such as 1, 2, 3, 4, 5 or other numbers, with the larger the number, the higher the priority; or, the priority of the first training task may be expressed as text, such as first level, second level, third level, etc.
- the first device determines the priority of the first training task before sending the first request information to the training device.
- the determination of this priority can refer to the following two cases.
- Scenario 1: the first device determines the priority of the first training task based on the first factor.
- the first factor may include at least one of the following: the inference type of the first model, the resource requirements of the first training task, or the immediacy requirements of the first model.
- the impact of the inference type of the first model on the priority of the first training task can be found in the later description. For example, the fewer resources the first training task requires, the higher its priority can be; the more resources the first training task requires, the lower its priority can be. The higher the immediacy requirement of the first model, the higher its priority can be; the lower the immediacy requirement of the first model, the lower its priority can be.
- the resource requirements for the first training task may include storage resources needed to execute the first training task, which may include video memory and/or system memory resources, without limitation.
- the resource requirements for the first training task may also include processor resources or other resources needed to execute the first training task.
- the immediacy requirement of the first model can be understood as the time requirement for the first model after completing the first training task, or as the time within which the first model needs to be deployed.
- the shorter the deployment time, the higher the immediacy requirement of the first model.
- for example, if Model 1 is used to compress or decompress the codebook in CSI-RS, then Model 1 has a high immediacy requirement; as another example, if Model 2 is used to implement load balancing of base stations, since Model 2 is updated periodically, Model 2 has a lower immediacy requirement, etc.
- the examples here are only for the convenience of understanding this scheme.
- the first factor includes the inference type of the first model; assuming the first device stores a correspondence between inference types and priorities, the first device can determine the priority corresponding to the inference type of the first model, i.e., the priority of the first training task, based on the correspondence.
- the first factor includes not only the inference type of the first model, but also: the resource requirements of the first training task, and/or the immediacy requirements of the first model.
- the first device acquires a first score corresponding to the inference type of the first model, and acquires a second score corresponding to the resource requirements of the first training task, and/or acquires a third score corresponding to the immediacy requirements of the first model.
- the first device can perform a weighted summation of all the acquired scores to obtain the total score of the first training task, thereby determining the priority of the first training task. All acquired scores may include the first score, as well as the second and/or third scores. The higher the total score of the first training task, the higher its priority; the lower the total score of the first training task, the lower its priority.
- the first device can determine the first score corresponding to the inference type of the first model based on the correspondence.
- the first device can store a correspondence between required resources and scores, and then the first device can determine the second score corresponding to the required resources of the first training task based on this correspondence.
- alternatively, the first device inputs the required resources of the first training task into a first preset algorithm to obtain the second score corresponding to the required resources of the first training task, where the first preset algorithm indicates the mapping relationship between required resources and scores.
- the first device can store the correspondence between immediacy requirements and scores, and then the first device can determine the third score corresponding to the immediacy requirement of the first model based on the correspondence.
- the first device inputs the immediacy requirement of the first model into a second preset algorithm to obtain the third score corresponding to the immediacy requirement of the first model.
- the second preset algorithm indicates the mapping relationship between immediacy requirements and scores.
- the first factor may include more or fewer elements.
- the examples above are only for the convenience of understanding this solution and are not intended to limit this solution.
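- A minimal sketch of the weighted-summation idea in Scenario 1 is shown below; the score table, weights, and the mapping from total score to priority level are invented for illustration and are not prescribed by this application.

```python
# Minimal sketch of Scenario 1: weighted summation of a first score (inference
# type), a second score (required resources) and a third score (immediacy).
INFERENCE_TYPE_SCORE = {"csi_rs_codebook_compression": 9,
                        "trajectory_prediction": 6,
                        "image_classification": 3}

def priority_from_factors(inference_type, required_resource_gb=None,
                          immediacy_score=None, weights=(0.5, 0.3, 0.2)):
    scores = [INFERENCE_TYPE_SCORE.get(inference_type, 1)]       # first score
    if required_resource_gb is not None:
        scores.append(max(0.0, 10.0 - required_resource_gb))     # fewer resources -> higher score
    if immediacy_score is not None:
        scores.append(immediacy_score)                           # third score
    total = sum(w * s for w, s in zip(weights, scores))
    # Map the total score onto a small set of priority levels (larger = higher).
    return 1 + int(total // 2)

print(priority_from_factors("csi_rs_codebook_compression", 4.0, 8.0))
```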
- Scenario 2: the first device determines the priority of the first training task based on the first operation.
- the first device can receive a first operation input by the user, and then determine the priority of the first training task based on the first operation.
- the first operation can be a selection operation for the priority of the first training task, or it can be the priority of the first training task input through a text box, or it can be the priority of the first training task input by voice, etc.
- the specific operation can be determined according to the actual product form.
- the first request information may also include other information, such as the training data set of the first model, or the performance requirements of the first model, etc., which will not be listed here.
- the training device can create a first training task based on the first request information, and then determine whether to execute the first training task.
- the training device can determine whether to execute the first training task based on the priority of the first training task; when the training device is not in a first scenario, the training device can start executing the first training task; for example, the first scenario includes: the resource requirement of the first training task is greater than the idle resources of the training device, and/or, the number of training tasks currently being executed by the training device is greater than a preset threshold, which will be described below.
- Case 1: the first scenario includes the resource requirement of the first training task being greater than the idle resources of the training device.
- the training device can determine whether the resource requirement of the first training task is greater than the idle resources of the training device. If the resource requirement of the first training task is greater than the idle resources of the training device, the training device can determine the second training task from all currently executing training tasks, and then determine whether the priority of the first training task is higher than the priority of the second training task. If the priority of the first training task is higher than the priority of the second training task, then proceed to step 402; if the priority of the first training task is lower than or equal to the priority of the second training task, then proceed to step 403. If the resource requirement of the first training task is less than or equal to the idle resources of the training device, then the first training task can be started.
- the training device can determine the resource requirements of the first training task in the following three ways.
- the first request information may further include information indicating the required resources for the first training task, so that the training device can determine the required resources for the first training task based on the first request information.
- the training device can determine the resource requirements of the first training task based on the inference type of the first model, the accuracy requirements of the first training task, and the batch training scale requirements of the first training task.
- the inference type of the first model indicates which type of first model to train; the more parameters the first model has, the more resources the first training task requires; the higher the accuracy requirements of the first training task, the more resources it requires; and the larger the batch training scale requirements of the first training task, the more GPU memory resources are required during the training of a single batch.
- the training device can store a correspondence between inference types and required resources, and then the training device can determine the required resources of the first training task corresponding to the inference type of the first model based on the correspondence.
- the idle resources of the training device may include idle storage resources in the training device; optionally, it may also include idle processor resources in the training device, or it may include other types of idle resources, etc., which can be determined according to the actual application scenario.
- the training device's determination of whether the resources required by the first training task are greater than the idle resources of the training device may include: the training device determining whether the storage resources required by the first training task are greater than the idle storage resources in the training device, for example, the aforementioned storage resources may include video memory resources and/or system memory resources.
- it may also include: the training device determining whether the processor resources required by the first training task are greater than the idle processor resources in the training device, etc., which can be set according to the actual application scenario.
- in one implementation, the second training task is the lowest-priority training task among all training tasks currently being executed by the training device. Therefore, the first training task having a higher priority than the second training task can also be understood as the first training task having a higher priority than the lowest-priority training task among all currently executed training tasks.
- the second training task comprises S training tasks currently being executed by the training device, where S is an integer greater than or equal to 1. That is, the second training task includes one or more training tasks currently being executed by the training device.
- the priority of the first training task being higher than that of the second training task indicates that the first training task has a higher priority than any of the S training tasks.
- the priority of the first training task being lower than or equal to that of the second training task indicates that the first training task has a lower or equal priority compared with any of the S training tasks.
- the training device can determine a first difference between the resource requirement of the first training task and the idle resources of the training device. Based on the first difference, the training device determines S training tasks from all currently executed training tasks, and these S training tasks constitute the second training task.
- the S training tasks can be the S lowest priority training tasks among all currently executed training tasks, and the sum of the resources occupied by the S training tasks is greater than or equal to the first difference, that is, the sum of the idle resources of the training device and the resources occupied by all the training tasks among the S training tasks is greater than or equal to the resource requirement of the first training task.
- the resources occupied by the training task may include the storage resources occupied by the training task, and optionally, the processor resources occupied by the training task, or other types of resources, which can be determined according to the actual application scenario.
- S can be a preset value, and the training device can determine the S lowest priority training tasks from all currently executed training tasks.
- in another implementation, S can be a preset value, and the training device can determine the S most resource-intensive training tasks from all currently executed training tasks.
- Another implementation can also involve the training device randomly determining S training tasks from all currently executed training tasks, etc. The specific implementation method for the training device to determine the second training task can be determined based on the actual application scenario.
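- The following sketch shows one possible realization of selecting the second training task: running tasks are taken in ascending order of priority until the resources they occupy cover the first difference; all names and data structures are assumptions made for the example.

```python
# Sketch of one way to pick the second training task: take the lowest-priority
# running tasks until the resources they occupy cover the shortfall
# (the "first difference").
from dataclasses import dataclass

@dataclass
class RunningTask:
    task_id: str
    priority: int            # larger = higher priority
    occupied_resource: float

def select_second_training_task(running, required_resource, idle_resource):
    shortfall = required_resource - idle_resource           # the first difference
    if shortfall <= 0:
        return []                                           # nothing needs to be suspended
    selected, freed = [], 0.0
    for task in sorted(running, key=lambda t: t.priority):  # lowest priority first
        selected.append(task)
        freed += task.occupied_resource
        if freed >= shortfall:
            break
    return selected                                         # the S training tasks

running = [RunningTask("a", 5, 2.0), RunningTask("b", 2, 1.5), RunningTask("c", 3, 1.0)]
print([t.task_id for t in select_second_training_task(running, 4.0, 2.0)])  # ['b', 'c']
```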
- Case 2: the first scenario includes the number of training tasks currently being executed by the training device being greater than a preset threshold.
- the training device can determine whether the number of training tasks currently being executed by the training device is greater than a preset threshold. If the number of training tasks currently being executed by the training device is greater than the preset threshold, the training device determines whether the priority of the first training task is higher than the priority of the second training task. If the priority of the first training task is higher than the priority of the second training task, then proceed to step 402; if the priority of the first training task is lower than or equal to the priority of the second training task, then proceed to step 403. If the number of training tasks currently being executed by the training device is less than or equal to the preset threshold, then the first training task can be started.
- the preset threshold can be an integer greater than or equal to 1, for example, the preset threshold can be 4, 5, 6 or other values, which can be determined according to the actual application scenario.
- Case 3: the first scenario includes the resource requirement of the first training task being greater than the idle resources of the training device, and the number of training tasks currently being executed by the training device being greater than a preset threshold.
- the training device can determine whether the resource requirement of the first training task is greater than the idle resources of the training device, and whether the number of training tasks currently being executed by the training device is greater than a preset threshold. If the resource requirement of the first training task is greater than the idle resources of the training device, or the number of training tasks currently being executed by the training device is greater than the preset threshold, the training device will determine whether the priority of the first training task is higher than the priority of the second training task. If the resource requirement of the first training task is less than or equal to the idle resources of the training device, and the number of training tasks currently being executed by the training device is less than or equal to the preset threshold, then the first training task can be started.
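- The admission check described in Case 3 could look like the following sketch, in which either an insufficient-idle-resource condition or an excessive task count triggers the priority comparison; the helper names and the threshold value are hypothetical.

```python
# Hedged sketch of the Case 3 admission check: both the idle-resource check and
# the task-count threshold can trigger the priority comparison.
MAX_CONCURRENT_TASKS = 5   # the preset threshold; 4, 5, 6 or other values are possible

def decide(first_task_priority, first_task_resource,
           idle_resource, running_priorities):
    overloaded = (first_task_resource > idle_resource or
                  len(running_priorities) > MAX_CONCURRENT_TASKS)
    if not overloaded:
        return "execute"                       # start the first training task directly
    lowest_running = min(running_priorities)   # priority of the second training task
    if first_task_priority > lowest_running:
        return "execute_and_suspend_second"    # corresponds to step 402
    return "delay_or_reject"                   # corresponds to step 403

print(decide(first_task_priority=4, first_task_resource=3.0,
             idle_resource=1.0, running_priorities=[2, 5, 3]))
```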
- the training task processing method provided in this application is started when the required resources of the first training task are greater than the idle resources of the training device, or when the number of training tasks currently being executed by the training device is greater than a preset threshold. That is, in the aforementioned scenarios, the training device will pause, delay, or refuse to execute certain training tasks, providing two application scenarios for this application and improving the implementation flexibility of this solution. In addition, in the aforementioned two application scenarios, the load limit of the training device is almost reached, which is conducive to avoiding overload of the training device while making the maximum use of the resources in the training device.
- the training device can obtain the priority of the first training task from the first request information. In another implementation, the training device can determine the priority of the first training task based on a first factor. The specific implementation methods of the aforementioned steps can be found in the above description of the first device determining the priority of the first training task based on the first factor, and will not be repeated here.
- the training device can also adjust the priority of the first training task.
- the priority in the first request information and the priority of the first training task determined by the training device based on the first factor can both be understood as the initial priority of the first training task.
- the training device can increase the initial priority of the first training task according to the first duration to obtain the current priority of the first training task.
- the first duration is the duration between the time when the training device receives the first request information and the current time. For example, the longer the first duration, the higher the current priority of the first training task, and the shorter the first duration, the lower the current priority of the first training task.
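- A small sketch of this optional priority adjustment is given below; the aging rule (one priority level per minute of waiting, with a cap) is an invented example rather than a requirement of this application.

```python
# Sketch of priority aging: the longer the first request information has been
# waiting, the more the initial priority is raised. The rate and cap are invented.
import time

def current_priority(initial_priority, received_at, now=None,
                     levels_per_minute=1.0, max_boost=3):
    now = time.time() if now is None else now
    first_duration_min = max(0.0, now - received_at) / 60.0   # the first duration
    boost = min(max_boost, int(first_duration_min * levels_per_minute))
    return initial_priority + boost

received = time.time() - 150          # request received 2.5 minutes ago
print(current_priority(2, received))  # 4 under the assumed aging rule
```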
- the specific implementation method for determining the priority of each training task included in the second training task by the training device is similar to the specific implementation method for determining the priority of the first training task by the training device, and will not be repeated here.
- the training device can directly obtain the priority of the first training task from the first request information. This allows for a faster comparison between the priority of the first training task and the priorities of other training tasks. Since the decision on whether to execute the first training task can only be made after determining the comparison result between the priority of the first training task and the priority of the second training task, this also facilitates a faster determination of whether to execute the first training task. Furthermore, obtaining the priority of the first training task directly from the first request information reduces the resources consumed by the training device in determining the priority of the first training task. This allows the training device to allocate more resources to executing the training task, which is beneficial for obtaining a model that has completed the training task as quickly as possible.
- each training task currently performed by the training device is a task of training a model in the communications domain.
- models in the communications domain can be found in the above description and will not be repeated here. This provides a specific application domain for the method in this application, improving the degree of integration between this application and specific application domains.
- Step 402: the training device executes the training task of the first model and suspends the execution of the second training task.
- the second training task can be one or more training tasks currently being executed by the training device, or the second training task can be the training task with the lowest priority among the training tasks currently being executed by the training device.
- the execution of the first training task by the training device means that the first training task has been added to the training tasks that the training device is currently executing;
- the suspension of the execution of the second training task by the training device can be referred to as the suspension of the second training task by the training device, so that the second training task is no longer included in the training tasks currently being executed by the training device.
- the training device may also maintain a waiting queue.
- when the training device pauses the execution of the second training task, the second request information corresponding to the second training task can be put into the waiting queue.
- the training device can also store information about the first training task.
- This information may include the identification information of the first training task, and optionally, it may also include the priority of the first training task. If the training device can adjust the priority of the first training task, then the information about the first training task includes its current priority.
- the information about the first training task may also include other types of information, such as the inference type of the first model corresponding to the first training task.
- the specific information that the first training task's information may include can be determined based on the actual application scenario.
- Step 403: the training device sends a response message to the first device.
- the response information can be used to notify the first device to delay or refuse to execute the training task of the first model, or the response information can include information about the second training task.
- the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
- the first device can receive response information from the training device, which indicates that the training device has not directly executed the first training task.
- the response information may include an indication value corresponding to delay or rejection.
- the indication value corresponding to delay or rejection could be 1111, representing delay or rejection of executing the first training task; or, the response information may include delay or rejection, thereby informing the first device to delay or reject the execution of the first training task, etc.
- the information of the second training task may include the information of each of the S training tasks, and the information of each of the S training tasks may include the identification information of each of the S training tasks; optionally, it may also include the priority of each of the S training tasks, that is, the information of the second training task may include the priority of the second training task; optionally, it may also include the inference type corresponding to each of the S training tasks, etc., all of which can be determined in combination with the actual application scenario.
- the response information may include information about all training tasks currently being executed by the training device that have a higher priority than the first training task, wherein all training tasks currently being executed by the training device that have a higher priority than the first training task include the second training task.
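- One illustrative encoding of such response information is sketched below; the field names follow the examples in the text (e.g. the indication value 1111), but the concrete structure is an assumption made only for the example.

```python
# Illustrative encoding of the step 403 response: an indication value for
# delay/rejection plus information about running tasks with higher priority
# than the first training task.
def build_response(first_task_priority, running_tasks):
    higher = [t for t in running_tasks if t["priority"] > first_task_priority]
    return {
        "indication": 1111,  # delay or refuse to execute the first training task
        "second_training_task_info": [
            {"task_id": t["task_id"], "priority": t["priority"],
             "inference_type": t.get("inference_type")}
            for t in higher
        ],
    }

running = [{"task_id": "t1", "priority": 5, "inference_type": "trajectory_prediction"},
           {"task_id": "t2", "priority": 1}]
print(build_response(first_task_priority=3, running_tasks=running))
```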
- the training device can use the priority of training tasks to decide to pause, delay, or refuse to execute certain training tasks, which helps to avoid overloading of the training device, thereby helping to avoid interruption of the process used to execute training tasks or to avoid data loss, and improving the stability of the training device in the process of executing training tasks.
- the training device will delay or refuse to execute the first training task.
- the response information sent by the training device to the first device includes the priority of the second training task.
- the first device can not only know that the training device has decided to delay or refuse to execute the first training task, but also know that the training device is executing the second training task with a higher priority.
- the first device can know the current status of the training device, which makes it easier for the first device to determine a more suitable processing method by combining the current status of the training device and the needs of the first model. This also helps to make the use of resources in the training device better meet the needs of the current scenario.
- Figure 5 is another schematic diagram of the training task processing method provided by the embodiments of this application.
- the training task processing method provided by this application may include:
- Step 501: the second device sends a second request message to the training device.
- the second device and the first device are different requesting devices.
- when the second device determines to train the second model, it can send a second request message to the training device.
- the second request information is used to request the execution of the training task for the second model.
- the training task for the second model can be called the second training task, and the two can be interchanged.
- the training device can receive the second request information from the second device, and then determine the second training task based on the second request information; the meaning of the terms in step 501 and the specific implementation of the steps can be referred to the description of step 401 in the embodiment corresponding to Figure 4 above, the difference being that "first device" is replaced with "second device", "first model" is replaced with "second model", and "first request information" is replaced with "second request information", which will not be repeated here.
- the second training task may include one or more training tasks.
- the second training task includes one training task, in which case the second request information may include one request message, and the second device may include one device.
- the second training task includes at least two training tasks, in which case the second request information may also include at least two request messages corresponding one-to-one with the at least two training tasks.
- Each of the at least two request messages is used to request the execution of one of the training tasks in the second training task, and the second device may include all devices that sent the aforementioned at least two request messages.
- the training device can determine whether to start executing the second training task, and then start executing the second training task. For example, after receiving each request information included in the second request information, the training device can determine whether to start executing the training task corresponding to each request information, and then start executing the training task corresponding to each request information.
- the process by which the training device determines whether to start executing the training task can be referred to the description of the training device determining whether to start executing the first training task in the embodiment corresponding to Figure 4 above, and will not be repeated here.
- Step 502: the training device receives the first request information from the first device.
- the first request information can be used to request the execution of the training task of the first model.
- Step 503: the training device executes the training task of the first model and suspends the execution of the second training task.
- the training task of the first model can be referred to as the first training task.
- Step 504: the training device sends a notification message to the second device corresponding to the second training task.
- the notification information can be used to notify the second device to suspend the training task of the second model, or it can be used to notify the second device of the idle resources of the training device.
- step 504 is an optional step.
- the training device can also send a notification message to the second device corresponding to the second training task, and the second device can receive the notification message from the training device.
- the notification information is used to instruct the second device to suspend the execution of the second training task, that is, in step 504, the training device notifies the second device corresponding to the second training task to suspend the execution of the second training task.
- the training device notifying the second device to suspend the execution of the second training task may include: the training device sending a reason value to the second device, that is, the notification information includes a reason value, which indicates the reason for suspending the execution of the second training task.
- the second device may receive the reason value from the training device.
- the sending of a reason value by the training device to the second device can be understood as the training device sending the reason value to each device included in the second device.
- the receiving of a reason value from the training device by the second device can be understood as each device included in the second device receiving the reason value from the training device.
- this reason includes the second training task being the lowest priority training task on the training device.
- the second training task whose execution the training device suspends is the lowest-priority training task on the training device, meaning that the resources in the training device can be allocated to higher-priority training tasks as much as possible, so that the resources in the training device can be used more efficiently.
- the aforementioned reason value may include the priority of each of at least one third training task currently being performed by the training device, where the priority of each third training task is higher than the priority of the second training task.
- At least one third training task may include all training tasks currently being executed by the training device. It should be noted that since the training device may pause the execution of old training tasks or begin executing new training tasks, the training tasks currently being executed by the training device in this application can vary. When the training device sends a reason value to the second device, the training device has paused the execution of the second training task and begun executing the first training task; therefore, the training tasks currently being executed by the training device may include the first training task, that is, at least one third training task may include the first training task. Alternatively, at least one third training task may include only the first training task. Or, at least one third training task may include a preset number of training tasks selected from all training tasks currently being executed by the training device, etc., which can be determined based on the actual application scenario.
- the above-mentioned reason value may also include the identification information of each third training task; alternatively, the above-mentioned reason value may also include the inference type of each third training task, etc., which can be determined in combination with the actual application scenario.
- the reason value sent by the training device to the second device includes the priority of each of the at least one third training tasks currently being executed by the training device.
- the priority of each training task is higher than that of the second training task.
- the second device can not only know that the execution of the second training task is suspended due to the low priority, but also know the priority of the third training task that is occupying the resources of the training device. That is, the second device can know a more detailed resource usage of the training device, which makes it easier for the second device to determine a more suitable processing method by combining the resource usage of the training device and the needs of the second model. It also helps to make the resource usage of the training device better meet the needs of the current scenario.
- the aforementioned reason value can also be represented by a letter, such as "LL", which means that the execution of the second training task is paused because the second training task is the lowest priority training task on the training device.
- the aforementioned reason value can also be expressed as a number.
- the reason value can be expressed as "000000", which means that the second training task is paused because it is the lowest priority training task on the training device.
- the aforementioned reason value can also be expressed in other forms, which can be set according to the actual application scenario.
- after determining that the second training task has been paused, the training device sends a reason value to the second device.
- This reason value indicates the reason for pausing the second training task. This allows the second device to promptly know that the second training task has been paused and, more importantly, the reason for the pause. This facilitates the second device in determining the appropriate handling method for the second training task after confirming its pause, and also allows it to determine a more suitable handling method based on the reason for the pause.
- the notification information may include an indication value corresponding to the pause.
- the indication value corresponding to the pause may be 2222, representing the pause of the second training task.
- the notification information may include "pause execution," thereby informing the second device to pause the execution of the second training task. The specific details can be determined based on the actual application scenario.
- after the training device suspends the execution of the second training task, it will promptly notify the second device corresponding to the second training task that the training device has suspended the execution of the second training task. This allows the second device to quickly know that the training device has suspended the execution of the second training task, making the execution process of the second training task more controllable.
- the notification information is used to inform the second device of the idle resources of the training device. That is, in step 504, the training device notifies the second device of the idle resources of the training device.
- the second device can determine that the training device has paused the execution of the second training task, and the second device can know how many idle resources are in the training device.
- the concept of the idle resources of the training device can be referred to the description in Figure 4 above, and will not be repeated here.
- the notification information may include the number of idle storage resources in the training device; optionally, the notification information may also include the number of idle processor resources in the training device, or the notification information may also include the number of other types of idle resources in the training device, which can be determined in combination with the actual application scenario.
- after the training device suspends the execution of the second training task, it can notify the second device of the idle resources of the training device.
- the second device can not only know that the training device has suspended the execution of the second training task, but also determine how to handle the situation of the training device suspending the execution of the second training task based on the idle resources of the training device, which is conducive to obtaining a processing method that is more suitable for the current state of the training device.
- Step 505: the second device sends the processing method of the second training task to the training device; the manner in which the processing method is sent is not limited.
- the second training task can be handled by terminating, waiting, or adjusting the resources used.
- step 505 is an optional step. After determining that the training device has paused the execution of the second training task, the second device can determine the processing method for the second training task and then send the processing method for the second training task to the training device.
- the second device sending the processing method of the second training task to the training device can be understood as each device included in the second device sending the processing method of any one of the training tasks in the second training task to the training device.
- the processing method of the second training task can be understood as the processing method of any one of the training tasks included in the second training task.
- the second device can send an indication value corresponding to termination to the training device.
- the indication value corresponding to termination can be 000000, DDDDD, or other types of indication values.
- the second device can send feedback information to the training device to indicate "cancel the execution of this training task".
- the specific implementation method can be determined according to the actual application scenario.
- the second device can send an indication value corresponding to waiting to the training device; or, the second device can send feedback information to the training device to indicate "waiting", etc., without limitation.
- this resource adjustment can further include: reducing the number of parameters in the model corresponding to the second training task, reducing the accuracy when executing the second training task, or reducing the batch training scale when executing the second training task.
- reducing the number of parameters in the model corresponding to the second training task can be achieved by pruning the second model.
- the specific pruning algorithm can be flexibly determined based on the actual application scenario; pruning reduces not only the storage resources occupied but also the processor resources occupied. Reducing the accuracy when executing the second training task can be achieved by replacing a high-precision data format with a low-precision data format, such as reducing from FP32 to FP16, or from FP64 to mixed precision.
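- The sketch below shows the three adjustment methods using PyTorch-style idioms; the pruning ratio, the specific calls, and the halving of the batch size are assumptions about one possible realization, not the method itself.

```python
# Illustrative sketch of the three ways of adjusting the occupied resources.
import torch
import torch.nn.utils.prune as prune

def adjust_occupied_resources(model, mode: str, batch_size: int):
    if mode == "prune":
        # Reduce the number of parameters of the model corresponding to the task.
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                prune.l1_unstructured(module, name="weight", amount=0.3)
        return model, batch_size
    if mode == "lower_precision":
        # Replace a high-precision data format with a lower-precision one
        # (e.g. FP32 -> FP16); mixed precision would instead wrap the training
        # step with torch.autocast.
        return model.half(), batch_size
    if mode == "smaller_batch":
        # Reduce the batch training scale used for a single batch.
        return model, max(1, batch_size // 2)
    return model, batch_size

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
model, bs = adjust_occupied_resources(model, "smaller_batch", 64)
print(bs)  # 32
```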
- the second device can send first feedback information to the training device.
- the first feedback information instructs the training device to process the second training task in a first manner.
- the first manner is to reduce the number of parameters of the model corresponding to the second training task, reduce the accuracy when performing the second training task, or reduce the batch training scale when performing the second training task. That is, the second device not only instructs the processing method of the second training task to adjust the occupied resources, but also instructs what method to use to adjust the occupied resources when performing the second training task.
- the second device can send the processing method of the second training task to the training device. Since the second device has a clearer understanding of the requirements of the second model, it can determine whether to terminate, wait, or adjust the resource usage of the second training task based on the requirements of the second model. This is beneficial to improving the adaptability between the processing method of the second training task and the specific application scenario.
- the second device can further indicate how to adjust the resources occupied when executing the second training task. This is beneficial to make the final second model more compatible with the application scenario of the second model, that is, to obtain a more satisfactory second model under the premise of limited resources.
- the second device can send an indication value to the training device corresponding to adjusting the occupied resources; or, the second device can send feedback information to the training device to indicate "adjusting the occupied resources," so that the training device can know that the processing method of the second training task is to adjust the occupied resources, and then the training device can determine how to adjust the occupied resources when performing the second training task.
- the factors determining the processing method of the second training task include: the immediacy requirements of the second model, the accuracy requirements of the second model, and/or, the degree of reduction in the accuracy of the second model caused by adjusting the resources used.
- if the processing method is termination, the training device terminates the second training task, allowing the second device to promptly request other devices to execute the second training task;
- if the processing method is waiting, the training device waits before continuing the second training task;
- if the processing method is adjusting the occupied resources, the training device adjusts the resources used when executing the second training task.
- the immediacy requirement of the second model can be referred to the description of the "immediacy requirement of the first model” in the corresponding embodiment of Figure 4 above, except that "first model” is replaced with “second model”, which will not be repeated here.
- the "accuracy requirement of the second model” can be understood as the accuracy requirement of the second model after performing the second training task, or it can also be understood as the performance requirement of the second model after performing the second training task, etc.
- Model accuracy” or “model performance” can be understood as the accuracy of the prediction information generated by the model.
- the degree of reduction in the accuracy of the second model caused by adjusting the occupied resources can be determined by at least one of the following parameters: the first accuracy range of the second model obtained after adjusting the resources occupied when performing the second training task, and/or the second accuracy range, where the second accuracy range represents the decrease in the accuracy of the final second model caused by adjusting the occupied resources, or may include other parameters, etc., which are not exhaustively listed in this application embodiment.
- the immediacy requirement of the second model is that the second model, after completing the second training task, needs to be online within a second duration.
- the accuracy requirement of the second model is that the accuracy of the second model after completing the second training task needs to be within a first preset accuracy range.
- the second device can determine whether the second duration is greater than the preset duration. If the second duration is greater than the preset duration, the second device can determine that the processing method for the second training task is to wait. If the second duration is less than the preset duration, in one implementation, the second device can determine whether the first accuracy range is within the first preset accuracy range. If the first accuracy range is within the first preset accuracy range, the second device can determine that the processing method for the second training task is to adjust the occupied resources.
- if the first accuracy range is not within the first preset accuracy range, the second device can determine that the processing method for the second training task is to terminate, and then the second device can request other devices to execute the second training task.
- in another implementation, the second device can determine whether the second accuracy range is within a second preset accuracy range. If the second accuracy range is within the second preset accuracy range, the second device can determine that the processing method for the second training task is to adjust the occupied resources. If the second accuracy range is not within the second preset accuracy range, that is, if there is an accuracy value outside the second preset accuracy range, the second device can determine that the processing method for the second training task is to terminate, and then the second device can request other devices to execute the second training task.
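- The decision logic described above can be summarized by the following sketch; the helper range_within() and the example thresholds are illustrative assumptions rather than values taken from this application.

```python
# Sketch of the second device's choice of processing method, using the second
# duration and the accuracy ranges described above.
def range_within(inner, outer):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def choose_processing_method(second_duration_s, preset_duration_s,
                             first_accuracy_range, first_preset_accuracy_range):
    if second_duration_s > preset_duration_s:
        return "wait"              # the model does not need to be online soon
    if range_within(first_accuracy_range, first_preset_accuracy_range):
        return "adjust_resources"  # accuracy after adjustment is still acceptable
    return "terminate"             # then request another device to run the task

print(choose_processing_method(second_duration_s=600, preset_duration_s=1800,
                               first_accuracy_range=(0.90, 0.95),
                               first_preset_accuracy_range=(0.88, 1.00)))
# 'adjust_resources' under these example numbers
```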
- the second device can determine the processing method of the second training task based on the immediacy requirements of the second model, the accuracy requirements of the second model, and/or the degree of reduction in the accuracy of the second model caused by the use of resources. That is, the processing method of the second training task is determined from the two dimensions of time requirements and accuracy requirements, which is beneficial to obtain a better processing method under the premise of limited resources.
- Step 506: the training device performs processing based on the processing method of the second training task.
- step 506 is an optional step.
- Step 506 may include: when the processing mode of the second training task is termination, the training device deletes the second request information corresponding to the second training task; for example, the training device may delete the second request information corresponding to the second training task from the waiting queue.
- the training device waits until the idle resources are greater than or equal to the resources required by the second training task before continuing to execute the second training task.
- the second request information corresponding to the second training task can be placed in a waiting queue.
- the head of the waiting queue is the second request information, and the idle resources of the training device are greater than or equal to the resources required by the second training task, the second training task can continue to be executed.
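- A small sketch of this waiting behavior is given below: the suspended request stays in the waiting queue and execution resumes only when it reaches the head of the queue and the idle resources are sufficient; the names used are illustrative assumptions.

```python
# Sketch of the "waiting" case: resume the suspended task when its request is at
# the head of the waiting queue and enough resources are idle.
from collections import deque

def try_resume(waiting_queue: deque, idle_resource: float, start_fn):
    if not waiting_queue:
        return False
    head = waiting_queue[0]                       # e.g. the second request information
    if idle_resource >= head["required_resource"]:
        waiting_queue.popleft()
        start_fn(head)                            # continue executing the suspended task
        return True
    return False

q = deque([{"task_id": "second", "required_resource": 2.0}])
print(try_resume(q, idle_resource=3.0, start_fn=lambda t: None))  # True
```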
- Step 506 may include: when the processing method of the second training task is to adjust the occupied resources, regardless of whether the second device or the training device determines the method to adjust the occupied resources when executing the second training task, the training device can know the method to adjust the occupied resources when executing the second training task. In this way, the training device can reduce the number of parameters of the model corresponding to the second training task, or reduce the accuracy when executing the second training task, or reduce the batch training scale when executing the second training task.
- the specific implementation methods of the above three methods can be referred to the description in step 505 above, which will not be repeated here.
- different processing methods for the second training task fed back by the second device are provided, and specific processing schemes for the training device are provided.
- the training device has a corresponding processing scheme, which helps to improve the smoothness and stability of the execution process of this application.
- when the processing method of the second training task is termination, the second request information is deleted, thereby releasing all resources related to the second request information in the training device in a timely manner, which helps to avoid wasting resources in the training device.
- the training device can also adjust the priority of the second training task, for example, by increasing the priority of the second training task.
- Figure 6 is another schematic diagram of the training task processing method provided by the embodiments of this application.
- the training task processing method provided by this application may include:
- Step 601: the second device sends a second request message to the training device.
- the meaning of the terms in step 601 and the specific implementation of the step can be found in the description of step 501 in the embodiment corresponding to Figure 5 above, and will not be repeated here.
- the training device receives a first request message from the first device, the first request message being used to request the execution of the training task of the first model.
- the training device sends a response message to the first device.
- the training task of the first model can be replaced by a first training task.
- the response information can be used to notify the first device to delay or refuse to execute the training task of the first model, or the response information may include information about the second training task.
- the training task of the first model can be handled by terminating, waiting, or adjusting the resources used.
- the training device performs processing based on the processing method of the training task of the first model.
- Figure 7 is a schematic diagram of a training task processing device provided in an embodiment of this application.
- the training task processing device 700 can be applied in a training device.
- the training task processing device 700 includes: a receiving module 701, used to receive request information from a first device, the request information being used to request the execution of a training task of a first model; an execution module 702, used to execute the training task of the first model and suspend the execution of the second training task if the priority of the training task of the first model is higher than the priority of the second training task; and/or, a sending module 703, used to send response information to the first device if the priority of the training task of the first model is lower than or equal to the priority of the second training task, the response information being used to notify the first device to delay or refuse the execution of the training task of the first model, or the response information including information about the second training task; wherein, the second training task is one or more training tasks currently being executed by the training device, or, the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
- the request information may include the priority of the training task of the first model.
- the information for the second training task includes the priority of the second training task.
- the training task processing device 700 further includes: a notification module 704, used to notify the second device corresponding to the second training task to suspend the execution of the second training task; or, the notification module 704, used to notify the second device corresponding to the second training task of the idle resources of the training device.
- the notification module 704 is specifically used to send a reason value to the second device, the reason value being used to indicate the reason for suspending the execution of the second training task.
- the reason may include the second training task being the lowest priority training task on the training device.
- the reason value includes the priority of each of the at least one third training task currently being performed by the training device, wherein the priority of each third training task is higher than the priority of the second training task.
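One possible shape for the suspension notification and its reason value, carrying the priorities of the higher-priority third training tasks, is sketched below. The message fields are assumptions made for illustration; this application does not prescribe a concrete message format.

```python
# Hypothetical layout of the notification sent to the second device;
# field names are illustrative, the application does not fix an encoding.
import json
from typing import List

def build_suspend_notification(second_task_id: str,
                               third_tasks: List[dict]) -> str:
    """third_tasks: [{'task_id': ..., 'priority': ...}, ...], each with a
    priority higher than that of the suspended second training task."""
    message = {
        "type": "suspend_training_task",
        "task_id": second_task_id,
        "reason": "lowest_priority_on_training_device",
        "reason_value": {
            "third_task_priorities": third_tasks,
        },
    }
    return json.dumps(message)

# Usage sketch
print(build_suspend_notification(
    "task-2",
    [{"task_id": "task-1", "priority": 7}, {"task_id": "task-3", "priority": 5}]))
```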
- the receiving module 701 is also used to receive the processing mode of the second training task from the second device, wherein the processing mode of the second training task is: termination, waiting, or adjusting the occupied resources.
- the training task processing device 700 further includes: a deletion module 705, used to delete the request information corresponding to the second training task when the processing mode of the second training task is termination; or, an execution module 702, used to wait until the idle resources are greater than or equal to the resources required by the second training task when the processing mode of the second training task is waiting, and then continue to execute the second training task; or, an adjustment module 706, used to reduce the number of parameters of the model corresponding to the second training task, or reduce the accuracy when executing the second training task, or reduce the batch training scale when executing the second training task when the processing mode of the second training task is adjusting the occupied resources.
- the method can be applied to the following scenarios: the resource requirements of the first training task are greater than the available resources of the training device; or, the number of training tasks currently being executed by the training device is greater than a preset threshold.
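These two trigger scenarios can be expressed as a simple check, sketched below; the function name and the preset threshold value are assumptions used only for illustration.

```python
# Illustrative check for when the priority-based handling is triggered.
# The function name and the preset threshold are assumptions for this sketch.
def should_apply_priority_handling(required_resources: int,
                                   idle_resources: int,
                                   running_task_count: int,
                                   max_tasks: int = 8) -> bool:
    """True if the new task's demand exceeds the idle resources, or the number
    of currently executed training tasks exceeds a preset threshold."""
    return required_resources > idle_resources or running_task_count > max_tasks

print(should_apply_priority_handling(required_resources=12, idle_resources=8,
                                     running_task_count=3))  # True
```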
- adjusting the resources used may include: reducing the number of parameters of the model corresponding to the second training task, reducing the accuracy when performing the second training task, or reducing the batch training size when performing the second training task.
- each training task currently being performed by the training device is a task of training a model in the field of communications.
- Figure 8 is a schematic diagram of another structure of the training task processing device provided in an embodiment of this application.
- the training task processing device 800 can be applied in a first device.
- the training task processing device 800 includes: a sending module 801, used to send request information to the training device, the request information being used to request the execution of a training task of a first model; and a receiving module 802, used to receive response information from the training device, the response information being used to notify the first device to delay or refuse to execute the training task of the first model, or the response information including information of a second training task; wherein, the priority of the training task of the first model is lower than or equal to the priority of the second training task, the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
- the sending module 801 is also used to send the processing method of the training task of the first model to the training device.
- the processing method of the training task of the first model is: termination, waiting, or adjusting the occupied resources.
- adjusting the resources used may include: reducing the number of parameters of the model corresponding to the training task of the first model, reducing the accuracy when performing the training task of the first model, or reducing the batch training scale when performing the training task of the first model.
- the determining factors for how the training task of the first model is handled include: the immediacy requirements of the first model, the accuracy requirements of the first model, and/or, the degree of reduction in the accuracy of the first model caused by adjusting the resources used.
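A toy policy combining these determining factors is sketched below; the thresholds and the decision order are invented for illustration and are not defined by this application.

```python
# Illustrative policy for choosing a processing method from the factors above.
# Thresholds and the decision order are assumptions, not part of this application.
def choose_processing_method(immediacy_high: bool,
                             accuracy_critical: bool,
                             accuracy_drop_if_adjusted: float) -> str:
    """Return 'terminate', 'wait', or 'adjust_resources'."""
    if not immediacy_high:
        # The model is not needed soon, so simply waiting is acceptable.
        return "wait"
    if accuracy_critical or accuracy_drop_if_adjusted > 0.05:
        # Adjusting the occupied resources would hurt accuracy too much.
        return "terminate"
    # Needed soon and a small accuracy loss is tolerable: shrink its footprint.
    return "adjust_resources"

print(choose_processing_method(True, False, 0.02))   # adjust_resources
print(choose_processing_method(False, True, 0.10))   # wait
```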
- Figure 9 is another schematic diagram of the structure of the training task processing device provided in an embodiment of this application.
- the training task processing device 900 can be applied to a second device.
- the training task processing device 900 includes: a sending module 901, used to send request information to the training device, the request information being used to request the training device to execute a training task of a second model; and a receiving module 902, used to receive notification information from the training device, the notification information indicating suspension of the execution of the training task of the second model, or the notification information indicating the idle resources of the training device; the second training task includes the training task of the second model; wherein the priority of the second training task is lower than the priority of the first training task, the first training task being a training task newly added on the training device.
- the notification information includes a reason value, which indicates the reason for pausing the training task of the second model.
- the sending module 901 is also used to send the processing method of the training task of the second model to the training device.
- the processing method of the training task of the second model is: termination, waiting, or adjusting the occupied resources.
- Figure 10 is a schematic diagram of the structure of a device provided in an embodiment of this application.
- the device 1000 can specifically be the training device, the first device, or the second device in the above embodiments.
- the device 1000 includes at least one processor 1001 and at least one memory 1002.
- the processor 1001 and the memory 1002 are connected, for example, through a bus.
- the memory 1002 is primarily used to store software programs.
- the memory 1002 can exist independently and be connected to the processor 1001.
- the memory 1002 can be integrated with the processor 1001, for example, integrated within a single chip.
- the memory 1002 can store program code that executes the technical solutions of the embodiments of this application, and its execution is controlled by the processor 1001.
- the various types of computer program code being executed can also be considered as drivers for the processor 1001.
- the processor 1001 mainly executes the software program stored in the memory 1002 to implement the functions corresponding to the training device, the first device, or the second device in any of the embodiments shown in Figures 4 to 6.
- Figure 10 shows only one memory and one processor. In actual devices, there can be multiple processors and multiple memories. Memory can also be called storage medium or storage device, etc. Memory can be a storage element on the same chip as the processor, i.e., an on-chip storage element, or it can be a separate storage element; this application does not limit this.
- This application also provides a computer-readable storage medium storing a program that, when run on a computer, causes the computer to perform the steps executed by the training device in the methods described in the embodiments shown in Figures 4 to 6, or causes the computer to perform the steps executed by the first device in the methods described in the embodiments shown in Figures 5 to 6, or causes the computer to perform the steps executed by the second device in the methods described in the embodiments shown in Figures 5 to 6.
- This application also provides a computer program product, which includes a program that, when run on a computer, causes the computer to perform the steps performed by the training device in the methods described in the embodiments shown in Figures 4 to 6, or causes the computer to perform the steps performed by the first device in the methods described in the embodiments shown in Figures 5 to 6, or causes the computer to perform the steps performed by the second device in the methods described in the embodiments shown in Figures 5 to 6.
- This application also provides a circuit system including a processing circuit configured to perform the method described in the embodiments shown in Figures 4 to 6 above.
- This application also provides a training processing system, which includes a training device, a first device, and a second device.
- the training device is used to execute the steps performed by the training device in the methods described in the embodiments shown in Figures 4 to 6.
- the first device is used to execute the steps performed by the first device in the methods described in the embodiments shown in Figures 5 to 6.
- the second device is used to execute the steps performed by the second device in the methods described in the embodiments shown in Figures 5 to 6.
- the training device, first device, second device, or training task processing apparatus provided in this application embodiment can specifically be a chip.
- the chip includes a processing unit and a communication unit.
- the processing unit can be, for example, a processor, and the communication unit can be, for example, an input/output interface, pins, or circuits.
- the processing unit can execute computer-executable instructions stored in the storage unit to cause the chip to execute the methods described in the embodiments shown in Figures 1 to 6.
- the storage unit is a storage unit within the chip, such as a register or cache.
- the storage unit may also be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).
- the processor mentioned above can be a general-purpose central processing unit, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program of the method described in the first aspect.
- the device embodiments described above are merely illustrative.
- the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units.
- Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
- the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.
- Such a computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disk, and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the various embodiments of this application.
- implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof.
- When implemented in software, it can be implemented, in whole or in part, as a computer program product.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
- the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
- the computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media.
- the available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)).
Description
本申请要求于2024年04月30日提交国家知识产权局、申请号为202410547036.5、申请名称为“一种训练任务的处理方法以及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims priority to Chinese Patent Application No. 202410547036.5, filed on April 30, 2024, entitled "A Method for Processing Training Tasks and Related Equipment", the entire contents of which are incorporated herein by reference.
本申请涉及人工智能领域,尤其涉及一种训练任务的处理方法以及相关设备。This application relates to the field of artificial intelligence, and in particular to a method for processing training tasks and related equipment.
随着智能化水平的不断发展,人工智能(Artificial Intelligence,AI)技术正在被越来越多的领域应用,也即机器学习(Machine Learning,ML)模型在越来越多的领域中被应用,在机器学习模型被应用之前,需要对机器学习模型进行训练。With the continuous development of intelligence, artificial intelligence (AI) technology is being applied in more and more fields, that is, machine learning (ML) models are being applied in more and more fields. Before a machine learning model is applied, it needs to be trained.
机器学习模型的请求设备可以向训练设备发送请求信息,该请求信息用于请求训练设备执行对机器学习模型进行训练的任务,训练设备可以根据接收到的请求信息执行训练任务。The requesting device of the machine learning model can send a request message to the training device, which requests the training device to perform the task of training the machine learning model. The training device can perform the training task according to the received request message.
随着AI的广泛应用,同一个训练设备可以接收来自多个请求设备的请求信息,换言之,同一个训练设备当前执行的训练任务可能包括多个训练任务。但,由于训练设备中的计算机资源是有限的,若训练设备同时执行多个训练任务,可能会导致该训练设备过载,从而导致用于执行训练任务的进程中断或数据的丢失。With the widespread application of AI, a single training device can receive requests from multiple requesting devices. In other words, the training task currently being executed by the same training device may include multiple training tasks. However, since the computer resources in the training device are limited, if the training device executes multiple training tasks simultaneously, it may overload the training device, resulting in the interruption of the process used to execute the training tasks or the loss of data.
本申请实施例提供了一种训练任务的处理方法以及相关设备,训练设备可以利用训练任务的优先级来决定暂停、延迟或拒绝执行某些训练任务,有利于避免发生训练设备过载的情况,进而有利于避免用于执行训练任务的进程中断或者避免数据的丢失,提高了训练设备在执行训练任务过程的稳定性。This application provides a method for processing training tasks and related equipment. The training equipment can use the priority of training tasks to decide to pause, delay or refuse to execute certain training tasks, which helps to avoid overloading of the training equipment, thereby helping to avoid interruption of the process used to execute training tasks or to avoid data loss, and improving the stability of the training equipment in the process of executing training tasks.
本申请实施例提供以下技术方案:The embodiments of this application provide the following technical solutions:
第一方面,本申请实施例提供一种训练任务的处理方法,该方法可应用于模型的训练阶段,方法包括:训练设备接收来自第一设备的请求信息(为方便进行区分,后续称为“第一请求信息”)后,该第一请求信息用于请求训练设备执行第一模型的训练任务(后续可以简称为第一训练任务,即第一模型的训练任务与第一训练任务之间可以相关替换);若第一模型的训练任务的优先级高于第二训练任务的优先级,则训练设备执行第一模型的训练任务,暂停执行第二训练任务;其中,第二训练任务为训练设备当前执行的一个或多个训练任务,或,第二训练任务为训练设备当前执行的训练任务中优先级最低的训练任务。In a first aspect, embodiments of this application provide a method for processing training tasks. This method can be applied to the training phase of a model. The method includes: after a training device receives request information from a first device (hereinafter referred to as "first request information" for easy distinction), the first request information is used to request the training device to execute a training task of a first model (hereinafter referred to as the first training task, i.e., the training task of the first model and the first training task can be relatedly replaced); if the priority of the training task of the first model is higher than the priority of the second training task, the training device executes the training task of the first model and suspends the execution of the second training task; wherein, the second training task is one or more training tasks currently being executed by the training device, or, the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
示例性地,若第二训练任务为训练设备当前执行的多个训练任务,则“第一训练任务的优先级高于第二训练任务的优先级”可以理解为第一训练任务的优先级高于第二训练任务包括的每个训练任务的优先级。训练设备开始执行第一训练任务代表:训练设备正在执行的训练任务中增加了第一训练任务;“训练设备暂停执行第二训练任务”可以称为训练设备挂起(suspend)第二训练任务,从而训练设备当前执行的训练任务中不再包括第二训练任务。For example, if the second training task is one of multiple training tasks currently being executed by the training device, then "the priority of the first training task is higher than the priority of the second training task" can be understood as the priority of the first training task being higher than the priority of each training task included in the second training task. The training device starting to execute the first training task means that the first training task has been added to the training tasks currently being executed by the training device; "the training device pausing to execute the second training task" can be referred to as the training device suspending the second training task, so that the second training task is no longer included in the training tasks currently being executed by the training device.
和/或,And/or,
训练设备接收来自第一设备的第一请求信息后,若第一模型的训练任务的优先级低于或等于第二训练任务的优先级,则训练设备向第一设备发送响应信息,响应信息用于通知第一设备延迟或拒绝执行第一模型的训练任务,或,响应信息包括第二训练任务的信息;其中,第二训练任务为训练设备当前执行的一个或多个训练任务,或,第二训练任务为训练设备当前执行的训练任务中优先级最低的训练任务。After receiving the first request information from the first device, if the priority of the training task of the first model is lower than or equal to the priority of the second training task, the training device sends a response information to the first device. The response information is used to notify the first device to delay or refuse to execute the training task of the first model, or the response information includes information about the second training task; wherein, the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
本实现方式中,训练设备在接收到第一请求信息之后,还可以根据第一模型的训练任务的优先级来确定是否执行第一模型的训练任务,若第一模型的训练任务的优先级高于第二训练任务的优先级,则可以执行第一模型的训练任务,暂停执行第二训练任务;若第一模型的训练任务的优先级低于或等于第二训练任务的优先级,则可以延迟或拒绝执行第一模型的训练任务,第二训练任务是训练设备当前执行的训练任务,也即训练设备可以利用训练任务的优先级来决定暂停、延迟或拒绝执行某些训练任务,有利于避免发生训练设备过载的情况,进而有利于避免用于执行训练任务的进程中断或者避免数据的丢失,提高了训练设备在执行训练任务过程的稳定性。In this implementation, after receiving the first request information, the training device can also determine whether to execute the training task of the first model based on the priority of the training task of the first model. If the priority of the training task of the first model is higher than the priority of the second training task, the training task of the first model can be executed and the execution of the second training task can be suspended. If the priority of the training task of the first model is lower than or equal to the priority of the second training task, the execution of the training task of the first model can be delayed or rejected. The second training task is the training task currently being executed by the training device. That is, the training device can use the priority of the training task to decide to pause, delay or reject the execution of certain training tasks, which helps to avoid the overload of the training device, thereby helping to avoid the interruption of the process used to execute the training task or the loss of data, and improving the stability of the training device in the process of executing the training task.
在一种可能实现方式中,第二训练任务可以包括当前执行的所有训练任务优先级最低的一个或多个训练任务,且训练设备的空闲资源与第二训练任务的占用资源之和大于或等于第一模型的训练任务的需求资源。In one possible implementation, the second training task may include one or more training tasks with the lowest priority among all currently executed training tasks, and the sum of the idle resources of the training device and the occupied resources of the second training task is greater than or equal to the required resources of the training task of the first model.
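As an informal illustration of this implementation, the sketch below greedily picks the lowest-priority running tasks until their occupied resources, together with the training device's idle resources, cover the demand of the first training task. The function name and the greedy strategy itself are assumptions made only for illustration.

```python
# Illustrative greedy selection of the second training task(s): take the
# lowest-priority running tasks until idle + freed resources cover the demand.
# All names and the greedy strategy itself are assumptions for this sketch.
from typing import List, Optional, Tuple

def select_tasks_to_suspend(running: List[Tuple[str, int, int]],
                            idle_resources: int,
                            demand: int) -> Optional[List[str]]:
    """running: list of (task_id, priority, occupied_resources).
    Returns the ids of tasks to suspend, or None if the demand cannot be met."""
    freed = 0
    selected = []
    for task_id, _priority, occupied in sorted(running, key=lambda t: t[1]):
        if idle_resources + freed >= demand:
            break
        selected.append(task_id)
        freed += occupied
    return selected if idle_resources + freed >= demand else None

# Usage sketch
running = [("a", 5, 6), ("b", 1, 4), ("c", 3, 3)]
print(select_tasks_to_suspend(running, idle_resources=2, demand=8))  # ['b', 'c']
```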
在一种可能实现方式中,第一请求信息包括第一模型的训练任务的优先级,也可以理解为第一请求信息包括第一训练任务的优先级;示例性地,第一设备基于第一因素确定第一训练任务的优先级,第一因素可以包括第一模型的推断类型;可选地,第一因素还可以包括:第一训练任务的需求资源、第一模型的即时性要求。In one possible implementation, the first request information includes the priority of the training task of the first model, which can also be understood as the first request information including the priority of the first training task; for example, the first device determines the priority of the first training task based on a first factor, which may include the inference type of the first model; optionally, the first factor may also include: the resource requirements of the first training task and the immediacy requirements of the first model.
本实现方式中,若第一请求信息包括第一训练任务的优先级,则训练设备可以从直接从第一请求信息中获取到第一训练任务的优先级,进而能够更加快速的实现将第一训练任务的优先级与其他训练任务的优先级之间的对比,由于在确定了第一训练任务的优先级与第二训任务的优先级之间的对比结果之后,才可以确定是否执行第一训练任务,从而也有利于更加快速的确定是否执行第一训练任务;此外,训练设备直接从第一请求信息中获取第一训练任务的优先级,也有利于减少“训练设备确定第一训练任务的优先级”这一过程所占用的资源,从而有利于使得训练设备能够将更多的资源应用于执行训练任务上,有利于尽快得到执行完训练任务的模型。In this implementation, if the first request information includes the priority of the first training task, the training device can directly obtain the priority of the first training task from the first request information. This allows for a faster comparison between the priority of the first training task and the priorities of other training tasks. Since the decision on whether to execute the first training task can only be made after determining the comparison result between the priority of the first training task and the priority of the second training task, this also facilitates a faster determination of whether to execute the first training task. Furthermore, obtaining the priority of the first training task directly from the first request information reduces the resources consumed in the process of "the training device determining the priority of the first training task," allowing the training device to allocate more resources to executing the training task and obtain a model that has completed the training task as quickly as possible.
在一种可能实现方式中,第二训练任务的信息包括第二训练任务包括的每个训练任务的优先级;可选地,第二训练任务的信息还可以包括第二训练任务中每个训练任务的标识信息;可选地,还可以包括与第二训练任务中每个训练任务对应的推断类型。与第二训练任务对应的第二请求信息中可以包括第二训练任务的优先级,可选地,训练设备还可以调整第二训练任务的优先级。In one possible implementation, the information of the second training task includes the priority of each training task included in the second training task; optionally, the information of the second training task may also include identification information of each training task in the second training task; optionally, it may also include the inference type corresponding to each training task in the second training task. The second request information corresponding to the second training task may include the priority of the second training task, and optionally, the training device may also adjust the priority of the second training task.
本实现方式中,在第一训练任务的优先级低于或等于第二训练任务的优先级的情况下,训练设备会延迟或拒绝执行第一训练任务,而训练设备向第一设备发送的响应信息中包括第二训练任务的优先级,则第一设备不仅能够得知训练设备决定延迟或拒绝执行第一训练任务,而且还能够得知训练设备是因为正在执行优先级更高的第二训练任务,所以决定延迟或拒绝执行第一训练任务,也即第一设备能够得知训练设备的当前情况,进而便于第一设备能够结合训练设备的当前情况以及第一模型的需求情况,确定出更加适配的处理方式,也有利于使得训练设备中的资源的使用情况更能满足当前场景的需求。In this implementation, if the priority of the first training task is lower than or equal to the priority of the second training task, the training device will delay or refuse to execute the first training task. The response information sent by the training device to the first device includes the priority of the second training task. Thus, the first device can not only know that the training device has decided to delay or refuse to execute the first training task, but also know that the training device is executing the second training task with a higher priority, so it decides to delay or refuse to execute the first training task. In other words, the first device can know the current status of the training device, which makes it easier for the first device to determine a more suitable processing method by combining the current status of the training device and the needs of the first model. It also helps to make the use of resources in the training device better meet the needs of the current scenario.
在一种可能实现方式中,训练设备在暂停执行第二训练任务之后,方法还包括:训练设备通知第二训练任务对应的第二设备暂停执行第二训练任务;或者,训练设备通知第二训练任务对应的第二设备训练设备的空闲资源。示例性地,训练设备的空闲资源可以包括训练设备中空闲的存储资源;可选地,还可以包括训练设备中空闲的处理器资源。本实现方式中,训练设备在暂停执行第二训练任务之后,会及时通知第二训练任务所对应的第二设备该训练设备已经暂停执行第二训练任务,从而便于第二设备能够快速的得知训练设备已经暂停执行第二训练任务,使得第二训练任务的执行过程更加具有可控性。In one possible implementation, after the training device suspends the execution of the second training task, the method further includes: the training device notifying the second device corresponding to the second training task to suspend the execution of the second training task; or, the training device notifying the second device corresponding to the second training task of the training device's idle resources. For example, the idle resources of the training device may include idle storage resources in the training device; optionally, it may also include idle processor resources in the training device. In this implementation, after suspending the execution of the second training task, the training device will promptly notify the second device corresponding to the second training task that the training device has suspended the execution of the second training task, thereby enabling the second device to quickly know that the training device has suspended the execution of the second training task, making the execution process of the second training task more controllable.
在一种可能实现方式中,训练设备通知与第二训练任务对应的第二设备暂停执行第二训练任务,包括:训练设备向第二设备发送原因值,原因值用于指示暂停执行第二训练任务的原因。本实现方式中,训练设备在确定暂停执行第二训练任务之后,会向第二设备发送原因值,该原因值用于指示暂停执行第二训练任务的原因,从而第二设备不仅能够及时得知训练设备中已经暂停执行第二训练任务,而且能够得知训练设备暂停执行第二训练任务的原因,不仅便于第二设备在确定了训练设备已经暂停执行第二训练任务后,及时确定第二训练任务的处理方式,而且第二设备还能根据暂停执行第二训练任务的原因,来确定更加适配的处理方式。In one possible implementation, the training device notifies the second device corresponding to the second training task to suspend the execution of the second training task, including: the training device sending a reason value to the second device, the reason value indicating the reason for suspending the execution of the second training task. In this implementation, after determining that the second training task has been suspended, the training device sends a reason value to the second device, which indicates the reason for suspending the execution of the second training task. Thus, the second device can not only promptly know that the second training task has been suspended, but also know the reason for the suspension. This facilitates the second device in promptly determining the handling method for the second training task after confirming that it has been suspended, and also allows the second device to determine a more suitable handling method based on the reason for suspending the second training task.
在一种可能实现方式中,原因包括第二训练任务为训练设备上优先级最低的训练任务。本实现方式中,训练设备暂停执行的第二训练任务时训练设备上优先级最低的训练任务,也即训练设备中的资源能够尽量先分配给优先级更高的训练任务,使得训练设备中的资源能够被更加高效的使用。In one possible implementation, the reason is that the second training task is the lowest priority training task on the training device. In this implementation, the second training task that the training device pauses execution is the lowest priority training task on the training device. That is, the resources in the training device can be allocated to higher priority training tasks as much as possible, so that the resources in the training device can be used more efficiently.
在一种可能实现方式中,原因值包括训练设备当前执行的至少一个第三训练任务中每个第三训练任务的优先级,每个第三训练任务的优先级均高于第二训练任务的优先级。In one possible implementation, the cause value includes the priority of each of at least one third training task currently being performed by the training device, wherein the priority of each third training task is higher than the priority of the second training task.
示例性地,“至少一个第三训练任务”可以包括训练设备当前执行的所有训练任务,需要说明的是,由于训练设备可能会暂停执行旧的训练任务,也有可能会开始执行新的训练任务,所以本申请中“训练设备当前执行的训练任务”中包括的训练任务是可以变化的,在训练设备向第二设备发送原因值时,训练设备已经暂停执行第二训练任务且开始执行第一训练任务,则训练设备当前执行的训练任务可以包括第一训练任务,也即至少一个第三训练任务中可以包括第一训练任务。或者,“至少一个第三训练任务”可以仅包括第一训练任务。或者,“至少一个第三训练任务”可以包括从训练设备当前执行的所有训练任务中选取的预设个数的训练任务等。For example, "at least one third training task" can include all training tasks currently being executed by the training device. It should be noted that since the training device may pause executing old training tasks or begin executing new training tasks, the training tasks included in "the training tasks currently being executed by the training device" in this application can vary. If, when the training device sends a cause value to the second device, it has paused executing the second training task and begun executing the first training task, then the training tasks currently being executed by the training device can include the first training task; that is, at least one third training task can include the first training task. Alternatively, "at least one third training task" can include only the first training task. Or, "at least one third training task" can include a preset number of training tasks selected from all training tasks currently being executed by the training device, etc.
可选地,上述原因值还包括每个第三训练任务的标识信息;可选地,上述原因值还包括每个第三训练任务的推断类型。Optionally, the above cause value may also include identification information for each third training task; alternatively, the above cause value may also include the inference type for each third training task.
本实现方式中,训练设备向第二设备发送的原因值中包括训练设备当前执行的至少一个第三训练任务中每个第三训练任务的优先级,每个训练任务的优先级均高于第二训练任务的优先级,则第二设备在接收到该原因值之后,不仅能够得知因为优先级低导致暂停执行第二训练任务,而且能够得知正在占用训练设备的资源的第三训练任务的优先级,也即第二设备能够得知训练设备的更加详细的资源使用情况,进而便于第二设备能够结合训练设备的资源使用情况以及第二模型的需求情况,确定出更加适配的处理方式,也有利于使得训练设备中的资源的使用情况更能满足当前场景的需求。In this implementation, the reason value sent by the training device to the second device includes the priority of each of the at least one third training task currently being executed by the training device. The priority of each training task is higher than that of the second training task. After receiving the reason value, the second device can not only know that the execution of the second training task is suspended due to the low priority, but also know the priority of the third training task that is currently occupying the resources of the training device. That is, the second device can know a more detailed resource usage of the training device. This makes it easier for the second device to determine a more suitable processing method by combining the resource usage of the training device with the needs of the second model. It also helps to make the resource usage of the training device better meet the needs of the current scenario.
在一种可能实现方式中,方法还包括:训练设备接收来自第二设备的第二训练任务的处理方式,第二训练任务的处理方式为:终止,等待,或者调整占用资源。本实现方式中,第二设备在得知暂停执行第二训练任务之后,可以向训练设备发送第二训练任务的处理方式,由于第二设备更加明确对第二模型的需求情况,从而第二设备能够根据对第二模型的需求情况,来确定第二训练任务的处理方式是终止、等待还是调整占用资源,有利于提高第二训练任务的处理方式与具体的应用场景之间的适配度。In one possible implementation, the method further includes: the training device receiving a processing method for the second training task from the second device, wherein the processing method for the second training task is: termination, waiting, or adjustment of resource usage. In this implementation, after learning that the execution of the second training task is paused, the second device can send a processing method for the second training task to the training device. Since the second device has a clearer understanding of its needs for the second model, it can determine whether to terminate, wait, or adjust resource usage based on its needs for the second model. This helps improve the adaptability of the processing method for the second training task to specific application scenarios.
在一种可能实现方式中,方法还包括:当第二训练任务的处理方式为终止时,训练设备删除与第二训练任务对应的第二请求信息;示例性地,训练设备可以删除与第二训练任务对应的第二请求信息。或者,当第二训练任务的处理方式为等待时,训练设备等待至空闲资源大于或等于第二训练任务所需资源后,继续执行第二训练任务。或者,当第二训练任务的处理方式为调整占用资源时,训练设备减少与第二训练任务对应的模型的参数量,或降低执行第二训练任务时的精度,或降低执行第二训练任务时的批训练规模。In one possible implementation, the method further includes: when the processing mode of the second training task is termination, the training device deletes the second request information corresponding to the second training task; for example, the training device may delete the second request information corresponding to the second training task. Alternatively, when the processing mode of the second training task is waiting, the training device waits until the idle resources are greater than or equal to the resources required by the second training task before continuing to execute the second training task. Alternatively, when the processing mode of the second training task is adjusting resource usage, the training device reduces the number of parameters of the model corresponding to the second training task, or reduces the accuracy when executing the second training task, or reduces the batch training scale when executing the second training task.
一个训练任务包括对一个模型的多次训练过程,批训练(batch training)指的是将一个训练任务包括的多次训练过程分成多个批次(batch)的训练过程,在单个批次的训练过程中会从训练数据集合中获取n个训练样本,n为大于或等于1的整数,批训练规模能够代表单个批次的训练过程中采用的训练样本的数量,也即代表n的数量;例如,“批训练规模”会影响模型的单个批次的训练过程中需求的显存资源。A training task involves multiple training processes on a model. Batch training refers to dividing the multiple training processes of a training task into multiple batches. In the training process of a single batch, n training samples are obtained from the training dataset, where n is an integer greater than or equal to 1. The batch size represents the number of training samples used in the training process of a single batch, which is also the number of n. For example, the "batch size" affects the memory resources required by the model in the training process of a single batch.
示例性地,“减少与第二训练任务对应的模型的参数量”可以通过对第二模型进行剪枝的方式实现,具体采用的剪枝算法可以结合实际应用场景灵活确定,从而不仅能够减少占用的存储资源,还可以减少占用的处理器资源。“降低执行第二训练任务时的精度”可以为在执行第二训练任务时可以从高精度的数据格式替换为低精度的数据格式,例如,从FP32降低为FP16,从FP64降低为混合精度等,此处示例仅为方便理解本方案,从而不仅能够减少占用的存储资源,还可以减少占用的处理器资源。“降低执行第二训练任务时的批训练规模”可以为减少单个批次的训练过程中所采用的训练样本的数量,从而能够减少单个批次的训练过程中所需要的存储资源。For example, "reducing the number of parameters in the model corresponding to the second training task" can be achieved by pruning the second model. The specific pruning algorithm can be flexibly determined based on the actual application scenario, thereby reducing not only the storage resources used but also the processor resources used. "Reducing the precision when performing the second training task" can be achieved by replacing the high-precision data format with a low-precision data format when performing the second training task, such as reducing from FP32 to FP16, or from FP64 to mixed precision, etc. This example is only for the purpose of understanding the solution, thereby reducing not only the storage resources used but also the processor resources used. "Reducing the batch training size when performing the second training task" can be achieved by reducing the number of training samples used in a single batch of training, thereby reducing the storage resources required for a single batch of training.
本实现方式中,提供了对于第二设备反馈的第二训练任务的不同的处理方式,训练设备具体的处理方案,从而无论第二设备反馈什么样的处理方式,训练设备都存在对应的处理方案,有利于提高本申请在执行过程中的顺畅性和稳定性;进一步地,当第二训练任务的处理方式为终止时,则删除第二请求信息,从而及时释放训练设备中与第二请求信息相关的所有资源,有利于避免训练设备中的资源被浪费。In this implementation, different processing methods are provided for the second training task fed back by the second device, and specific processing schemes of the training device are provided. Thus, no matter what processing method the second device feeds back, the training device has a corresponding processing scheme, which helps to improve the smoothness and stability of the application during execution. Furthermore, when the processing method of the second training task is termination, the second request information is deleted, thereby releasing all resources related to the second request information in the training device in a timely manner, which helps to avoid wasting resources in the training device.
在一种可能实现方式中,方法应用于以下场景:第一训练任务的需求资源大于训练设备的空闲资源;或者,训练设备当前执行的训练任务的个数大于预设阈值。可选地,第一训练任务的需求资源可以包括第一训练任务需求的存储资源,可选地,还可以包括第一训练任务需求的处理器资源,存储资源包括显存资源和/或内存资源。In one possible implementation, the method is applied to the following scenarios: the resource requirements of the first training task exceed the idle resources of the training device; or, the number of training tasks currently being executed by the training device exceeds a preset threshold. Optionally, the resource requirements of the first training task may include the storage resources required by the first training task, and optionally, may also include the processor resources required by the first training task. The storage resources include video memory resources and/or system memory resources.
本实现方式中,在第一训练任务的需求资源大于训练设备的空闲资源时,或者,在训练设备当前执行的训练任务的个数大于预设阈值时,开始执行本申请提供的训练任务的处理方法,也即在前述场景中,训练设备才会暂停、延迟或拒绝执行某些训练任务,提供了本申请的两种应用场景,提高了本方案的实现灵活性;此外,前述两种应用场景下都快到达了训练设备的负载极限,有利于实现在最大程度上的使用训练设备中的资源的前提下,避免发生训练设备过载的情况。In this implementation, the training task processing method provided in this application is started when the required resources of the first training task exceed the idle resources of the training device, or when the number of training tasks currently being executed by the training device exceeds a preset threshold. That is, in the aforementioned scenarios, the training device will pause, delay, or refuse to execute certain training tasks, providing two application scenarios for this application and improving the implementation flexibility of this solution. In addition, in the aforementioned two application scenarios, the load limit of the training device is almost reached, which is conducive to avoiding overload of the training device while making the maximum use of the resources in the training device.
在一种可能实现方式中,调整占用资源包括:减少与第二训练任务对应的模型的参数量,降低执行第二训练任务时的精度,或降低执行第二训练任务时的批训练规模。本实现方式中,当第二训练任务的处理方式为调整占用资源时,第二设备还可以进一步指示采用什么方式来调整执行第二训练任务时占用的资源,有利于使得最终得到的第二模型与第二模型的应用场景之间更加适配,也即有利于在有限资源的前提下,得到更满意的第二模型。In one possible implementation, adjusting resource usage includes: reducing the number of parameters in the model corresponding to the second training task, reducing the accuracy when performing the second training task, or reducing the batch training scale when performing the second training task. In this implementation, when the processing method for the second training task is to adjust resource usage, the second device can further indicate how to adjust the resources used when performing the second training task. This helps to make the final second model more compatible with the application scenario of the second model, that is, it helps to obtain a more satisfactory second model under the premise of limited resources.
在一种可能实现方式中,训练设备当前执行的每个训练任务均为对通信领域的模型进行训练的任务。本实现方式中,提供了本申请中方法的一个具体的应用领域,提高了本申请与具体应用领域的结合程度。In one possible implementation, each training task currently performed by the training device is a task of training a model in the communication domain. This implementation provides a specific application domain for the method of this application, increasing the degree of integration between this application and a specific application domain.
第二方面,本申请提供了一种训练任务的处理方法,该方法可应用于模型的训练阶段,方法包括:第一设备向训练设备发送请求信息,请求信息用于请求执行第一模型的训练任务;第一设备接收来自训练设备的响应信息,响应信息用于通知第一设备延迟或拒绝执行所述第一模型的训练任务,或,响应信息包括第二训练任务的信息;其中,第一模型的训练任务的优先级低于或等于第二训练任务的优先级,第二训练任务为训练设备当前执行的一个或多个训练任务,或,第二训练任务为训练设备当前执行的训练任务中优先级最低的训练任务。Secondly, this application provides a method for processing training tasks, which can be applied to the training phase of a model. The method includes: a first device sending request information to a training device, the request information being used to request the execution of a training task of a first model; the first device receiving response information from the training device, the response information being used to notify the first device to delay or refuse to execute the training task of the first model, or the response information including information of a second training task; wherein the priority of the training task of the first model is lower than or equal to the priority of the second training task, the second training task being one or more training tasks currently being executed by the training device, or the second training task being the lowest priority training task among the training tasks currently being executed by the training device.
在一种可能实现方式中,第一设备向训练设备发送第一模型的训练任务的处理方式,第一模型的训练任务的处理方式为:终止,等待,或者调整占用资源。In one possible implementation, the first device sends the processing method of the training task of the first model to the training device. The processing method of the training task of the first model is: termination, waiting, or adjusting the occupied resources.
在一种可能实现方式中,调整占用资源包括:减少与第一模型的训练任务对应的模型的参数量,降低执行第一训练任务时的精度,或降低执行第一训练任务时的批训练规模。In one possible implementation, adjusting resource usage includes: reducing the number of parameters of the model corresponding to the training task of the first model, reducing the accuracy when performing the first training task, or reducing the batch training size when performing the first training task.
在一种可能实现方式中,第一模型的训练任务的处理方式的确定因素包括:第一模型的即时性要求,第一模型的精度要求,和/或,调整占用资源导致的第一模型的精度的降低程度。“第一模型的即时性要求”可以理解为对完成第一训练任务后的第一模型在时间维度上的要求,也可以理解为第一模型需要在多久内上线,第一模型需要在越短的时间内上线,则代表第一模型的即时性要求越高。“调整占用资源导致的第一模型的精度的降低程度”可以通过如下至少一个参数确定:调整执行第一训练任务时占用的资源后得到的第一模型的第一精度范围,和/或,第二精度范围,第二精度范围代表调整占用资源导致最终得到的第一模型的精度下降了第二精度范围的精度。In one possible implementation, the determining factors for the processing method of the first model's training task include: the immediacy requirement of the first model, the accuracy requirement of the first model, and/or, the degree of reduction in the accuracy of the first model caused by adjusting resource usage. The "immediacy requirement of the first model" can be understood as the time requirement for the first model after completing the first training task, or as how quickly the first model needs to be deployed. The shorter the deployment time, the higher the immediacy requirement of the first model. The "degree of reduction in the accuracy of the first model caused by adjusting resource usage" can be determined by at least one of the following parameters: the first accuracy range of the first model obtained after adjusting the resources used during the execution of the first training task, and/or, the second accuracy range, where the second accuracy range represents the decrease in accuracy of the final obtained first model due to the adjustment of resource usage by a second accuracy range.
本实现方式中,从时间要求和精度要求两个维度上来确定第二训练任务的处理方式,有利于在资源有限的前提下能够得到更优的处理方式。In this implementation, the processing method for the second training task is determined from two dimensions: time requirements and accuracy requirements. This approach is beneficial for obtaining a better processing method under the premise of limited resources.
本申请第二方面中,第一设备还可以执行第一方面以及第一方面的各种实现方式中第一设备执行的步骤,第二方面中步骤的具体实现方式、名词的含义以及所带来的有益效果,均可以参阅第一方面,此处不再赘述。In the second aspect of this application, the first device may also perform the steps performed by the first device in the first aspect and various implementations of the first aspect. The specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the second aspect can all be found in the first aspect, and will not be repeated here.
第三方面,本申请提供了一种训练任务的处理方法,该方法可应用于模型的训练阶段,方法包括:第二设备向训练设备发送请求信息,请求信息用于请求训练设备执行第二模型的训练任务(后续可以简称为第二训练任务);第二设备接收来自于训练设备的通知信息,通知信息指示暂停执行第二模型的训练任务,或者,通知信息指示训练设备的空闲资源;其中,第二训练任务的优先级低于第一训练任务的优先级,第一训练任务为训练设备新增的训练任务。Thirdly, this application provides a method for processing training tasks, which can be applied to the training phase of a model. The method includes: a second device sending a request message to a training device, the request message being used to request the training device to execute a training task of a second model (hereinafter referred to as the second training task); the second device receiving a notification message from the training device, the notification message indicating to suspend the execution of the training task of the second model, or the notification message indicating the idle resources of the training device; wherein, the priority of the second training task is lower than the priority of the first training task, the first training task being a training task added by the training device.
在一种可能实现方式中,通知信息包括原因值,原因值用于指示暂停执行第二模型的训练任务的原因。In one possible implementation, the notification information includes a reason value, which indicates the reason for pausing the training task of the second model.
在一种可能实现方式中,方法还包括:第一设备向训练设备发送第二模型的训练任务的处理方式,第二模型的训练任务的处理方式为:终止,等待,或者调整占用资源。In one possible implementation, the method further includes: the first device sending the processing method of the training task of the second model to the training device, wherein the processing method of the training task of the second model is: termination, waiting, or adjusting the occupied resources.
在一种可能实现方式中,第二模型的训练任务的处理方式的确定因素包括:第二模型的即时性要求,第二模型的精度要求,和/或,调整占用资源导致的第二模型的精度的降低程度。In one possible implementation, the determining factors for how the training task of the second model is handled include: the immediacy requirements of the second model, the accuracy requirements of the second model, and/or, the degree of reduction in the accuracy of the second model caused by adjusting resource usage.
本申请第三方面中,第二设备还可以执行第一方面以及第一方面的各种实现方式中第二设备执行的步骤,第三方面中步骤的具体实现方式、名词的含义以及所带来的有益效果,均可以参阅第一方面,此处不再赘述。In the third aspect of this application, the second device may also perform the steps performed by the second device in the first aspect and various implementations of the first aspect. The specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the third aspect can all be found in the first aspect, and will not be repeated here.
第四方面,本申请实施例提供一种训练任务的处理装置,该装置可应用于模型的训练阶段,该装置可以应用于训练设备中,训练任务的处理装置包括:接收模块,用于接收来自第一设备的请求信息,请求信息用于请求执行第一模型的训练任务;执行模块,用于若第一模型的训练任务的优先级高于第二训练任务的优先级,则执行第一模型的训练任务,暂停执行第二训练任务;和/或,发送模块,用于若第一模型的训练任务的优先级低于或等于第二训练任务的优先级,则向第一设备发送响应信息,响应信息用于通知第一设备延迟或拒绝执行第一模型的训练任务,或,响应信息包括第二训练任务的信息;其中,第二训练任务为训练设备当前执行的一个或多个训练任务,或,第二训练任务为训练设备当前执行的训练任务中优先级最低的训练任务。Fourthly, embodiments of this application provide a training task processing apparatus, which can be applied to the training phase of a model and can be used in a training device. The training task processing apparatus includes: a receiving module, configured to receive request information from a first device, the request information being used to request the execution of a training task of a first model; an execution module, configured to execute the training task of the first model and suspend the execution of the second training task if the priority of the training task of the first model is higher than the priority of the second training task; and/or, a sending module, configured to send response information to the first device if the priority of the training task of the first model is lower than or equal to the priority of the second training task, the response information being used to notify the first device to delay or refuse the execution of the training task of the first model, or the response information including information of the second training task; wherein the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
在一种可能实现方式中,请求信息包括第一模型的训练任务的优先级。In one possible implementation, the requested information includes the priority of the training task for the first model.
在一种可能实现方式中,第二训练任务的信息包括第二训练任务的优先级。In one possible implementation, the information for the second training task includes the priority of the second training task.
在一种可能实现方式中,训练任务的处理装置还包括:通知模块,用于通知第二训练任务对应的第二设备暂停执行第二训练任务;或者,通知模块,用于通知第二训练任务对应的第二设备训练设备的空闲资源。In one possible implementation, the training task processing device further includes: a notification module for notifying the second device corresponding to the second training task to suspend the execution of the second training task; or, a notification module for notifying the second device corresponding to the second training task of the idle resources of the training device.
在一种可能实现方式中,通知模块,具体用于向第二设备发送原因值,原因值用于指示暂停执行第二训练任务的原因。In one possible implementation, the notification module is specifically used to send a reason value to the second device, the reason value indicating the reason for suspending the execution of the second training task.
在一种可能实现方式中,原因包括第二训练任务为训练设备上优先级最低的训练任务。In one possible implementation, the reason is that the second training task is the lowest priority training task on the training device.
在一种可能实现方式中,原因值包括训练设备当前执行的至少一个第三训练任务中每个第三训练任务的优先级,每个第三训练任务的优先级均高于第二训练任务的优先级。In one possible implementation, the cause value includes the priority of each of at least one third training task currently being performed by the training device, wherein the priority of each third training task is higher than the priority of the second training task.
在一种可能实现方式中,接收模块,还用于接收来自第二设备的第二训练任务的处理方式,第二训练任务的处理方式为:终止,等待,或者调整占用资源。In one possible implementation, the receiving module is further configured to receive the processing method of the second training task from the second device, wherein the processing method of the second training task is: termination, waiting, or adjustment of occupied resources.
在一种可能实现方式中,训练任务的处理装置还包括:删除模块,用于当第二训练任务的处理方式为终止时,删除与第二训练任务对应的请求信息;或者,执行模块,还用于当第二训练任务的处理方式为等待时,等待至空闲资源大于或等于第二训练任务所需资源后,继续执行第二训练任务;或者,调整模块,用于当第二训练任务的处理方式为调整占用资源时,减少与第二训练任务对应的模型的参数量,或降低执行第二训练任务时的精度,或降低执行第二训练任务时的批训练规模。In one possible implementation, the training task processing apparatus further includes: a deletion module, used to delete the request information corresponding to the second training task when the processing mode of the second training task is termination; or, an execution module, used to wait until the idle resources are greater than or equal to the resources required by the second training task when the processing mode of the second training task is waiting, and then continue to execute the second training task; or, an adjustment module, used to reduce the number of parameters of the model corresponding to the second training task, or reduce the accuracy when executing the second training task, or reduce the batch training scale when executing the second training task when the processing mode of the second training task is adjusting the occupied resources.
在一种可能实现方式中,方法应用于以下场景:第一训练任务的需求资源大于训练设备的空闲资源;或者,训练设备当前执行的训练任务的个数大于预设阈值。In one possible implementation, the method is applied to the following scenarios: the resource requirements of the first training task are greater than the available resources of the training device; or, the number of training tasks currently being executed by the training device is greater than a preset threshold.
在一种可能实现方式中,调整占用资源包括:减少与第二训练任务对应的模型的参数量,降低执行第二训练任务时的精度,或降低执行第二训练任务时的批训练规模。In one possible implementation, adjusting resource usage includes: reducing the number of parameters in the model corresponding to the second training task, reducing the accuracy when performing the second training task, or reducing the batch training size when performing the second training task.
在一种可能实现方式中,训练设备当前执行的每个训练任务均为对通信领域的模型进行训练的任务。In one possible implementation, each training task currently performed by the training device is a task of training a model in the field of communications.
本申请第四方面中,训练任务的处理装置还可以执行第一方面以及第一方面的各种实现方式中训练设备执行的步骤,第四方面中步骤的具体实现方式、名词的含义以及所带来的有益效果,均可以参阅第一方面,此处不再赘述。In the fourth aspect of this application, the training task processing device can also execute the steps performed by the training device in the first aspect and various implementations of the first aspect. The specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the fourth aspect can all be found in the first aspect, and will not be repeated here.
第五方面,本申请实施例提供一种训练任务的处理装置,该装置可应用于模型的训练阶段,该装置可以应用于第一设备中,训练任务的处理装置包括:发送模块,用于向训练设备发送请求信息,请求信息用于请求执行第一模型的训练任务;接收模块,用于接收来自训练设备的响应信息,响应信息用于通知第一设备延迟或拒绝执行第一模型的训练任务或,响应信息包括第二训练任务的信息;其中,第一训练任务的优先级低于或等于第二训练任务的优先级,第二训练任务为训练设备当前执行的一个或多个训练任务,或,第二训练任务为训练设备当前执行的训练任务中优先级最低的训练任务。Fifthly, embodiments of this application provide a training task processing apparatus, which can be applied to the training phase of a model. This apparatus can be used in a first device. The training task processing apparatus includes: a sending module, configured to send request information to the training device, the request information being used to request the execution of a training task for a first model; and a receiving module, configured to receive response information from the training device, the response information being used to notify the first device to delay or refuse to execute the training task for the first model, or the response information including information about a second training task; wherein the priority of the first training task is lower than or equal to the priority of the second training task, the second training task is one or more training tasks currently being executed by the training device, or the second training task is the lowest priority training task among the training tasks currently being executed by the training device.
在一种可能实现方式中,发送模块,还用于向训练设备发送第一训练任务的处理方式,第一训练任务的处理方式为:终止,等待,或者调整占用资源。In one possible implementation, the sending module is further configured to send the processing method of the first training task to the training device, wherein the processing method of the first training task is: termination, waiting, or adjustment of occupied resources.
在一种可能实现方式中,调整占用资源包括:减少与第一训练任务对应的模型的参数量,降低执行第一训练任务时的精度,或降低执行第一训练任务时的批训练规模。In one possible implementation, adjusting resource usage includes: reducing the number of parameters in the model corresponding to the first training task, reducing the accuracy when performing the first training task, or reducing the batch training size when performing the first training task.
在一种可能实现方式中,第一训练任务的处理方式的确定因素包括:第一模型的即时性要求,第一模型的精度要求,和/或,调整占用资源导致的第一模型的精度的降低程度。In one possible implementation, the determining factors for how the first training task is handled include: the immediacy requirements of the first model, the accuracy requirements of the first model, and/or, the degree of reduction in the accuracy of the first model caused by adjusting resource usage.
本申请第五方面中,训练任务的处理装置还可以执行第二方面以及第二方面的各种实现方式中第一设备执行的步骤,第五方面中步骤的具体实现方式、名词的含义以及所带来的有益效果,均可以参阅第二方面,此处不再赘述。In the fifth aspect of this application, the training task processing device can also execute the steps executed by the first device in the second aspect and various implementations of the second aspect. The specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the fifth aspect can all be found in the second aspect, and will not be repeated here.
In a sixth aspect, embodiments of this application provide a training task processing apparatus, which can be applied in the training phase of a model and can be used in a second device. The training task processing apparatus includes: a sending module, configured to send request information to the training device, the request information being used to request the training device to execute the training task of a second model; and a receiving module, configured to receive notification information from the training device, the notification information indicating that execution of the training task of the second model is suspended, or the notification information indicating the idle resources of the training device; wherein the priority of the second training task is lower than the priority of the first training task, and the first training task is a training task newly added to the training device.
In one possible implementation, the notification information includes a cause value, the cause value indicating the reason for suspending execution of the training task of the second model.
In one possible implementation, the sending module is further configured to send a processing method of the training task of the second model to the training device, the processing method being: termination, waiting, or adjustment of occupied resources.
In the sixth aspect of this application, the training task processing apparatus can also perform the steps performed by the second device in the third aspect and its various implementations. For the specific implementations of the steps in the sixth aspect, the meanings of the terms, and the resulting beneficial effects, refer to the third aspect; they are not repeated here.
In a seventh aspect, embodiments of this application provide a device that includes a processor and a memory, the processor being coupled to the memory. The memory is used to store a program, and the processor is used to execute the program in the memory, causing the device to perform the method described in the first, second, or third aspect above.
In an eighth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the method described in the first, second, or third aspect above.
In a ninth aspect, embodiments of this application provide a computer program product that includes a program which, when run on a computer, causes the computer to perform the method described in the first, second, or third aspect above.
In a tenth aspect, this application provides a chip system that includes a processor for supporting implementation of the functions involved in the foregoing aspects, for example, sending or processing the data and/or information involved in the foregoing methods. In one possible design, the chip system further includes a memory for storing the program instructions and data necessary for a terminal device or communication device. The chip system may consist of chips, or may include chips and other discrete components.
Figure 1 is a schematic structural diagram;
Figure 2 is a schematic diagram of the training phase and the application phase of a model;
Figure 3a is a schematic diagram of an architecture of a training task processing system;
Figure 3b is a schematic diagram of another architecture of the training task processing system;
Figure 3c is a schematic diagram of another architecture of the training task processing system;
Figure 4 is a schematic diagram of a training task processing method provided in an embodiment of this application;
Figure 5 is another schematic diagram of the training task processing method provided in an embodiment of this application;
Figure 6 is another schematic diagram of the training task processing method provided in an embodiment of this application;
Figure 7 is a schematic structural diagram of a training task processing apparatus provided in an embodiment of this application;
Figure 8 is another schematic structural diagram of the training task processing apparatus provided in an embodiment of this application;
Figure 9 is another schematic structural diagram of the training task processing apparatus provided in an embodiment of this application;
Figure 10 is a schematic structural diagram of a device provided in an embodiment of this application.
The embodiments of this application are described below with reference to the accompanying drawings. Those of ordinary skill in the art will recognize that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of this application remain applicable to similar technical problems.
The terms "first", "second", and the like in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely a way of distinguishing objects with the same attributes when describing the embodiments of this application. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that comprises a series of elements is not necessarily limited to those elements, but may include other elements not explicitly listed or inherent to the process, method, product, or device.
"Send" and "receive" in the embodiments of this application refer to the direction of signal transmission. For example, "sending information to device XX" can be understood as the destination of the information being device XX, which may include sending directly over the air interface or sending indirectly via other units or modules over the air interface. "Receiving information from device YY" can be understood as the source of the information being device YY, which may include receiving directly from device YY over the air interface or receiving indirectly from device YY via other units or modules over the air interface. "Send" can also be understood as the "output" of a chip interface, and "receive" can also be understood as the "input" of a chip interface. In other words, sending and receiving can occur between devices or within a device, for example, over a bus, trace, or interface between components, modules, chips, software modules, or hardware modules within a device. It is understood that information may undergo necessary processing, such as encoding and modulation, between the source and the destination of the transmission, but the destination can still understand the valid information from the source. Similar expressions in this application can be understood in a similar way and are not elaborated further.
In the embodiments of this application, "indicate" can include direct and indirect indication, as well as explicit and implicit indication. The information indicated by a certain piece of information (the indication information described below) is called the information to be indicated. In specific implementations, there are many ways to indicate the information to be indicated, such as, but not limited to, directly indicating the information to be indicated, for example the information to be indicated itself or an index of the information to be indicated. The information to be indicated can also be indicated indirectly by indicating other information that is associated with it; or only a part of the information to be indicated may be indicated, while the other parts are known or agreed upon in advance. For example, the indication of specific information can be implemented by means of a pre-agreed (for example, protocol-predefined) arrangement order of various pieces of information, thereby reducing the indication overhead to a certain extent. This application does not limit the specific manner of indication. It is understood that, for the sender of the indication information, the indication information can be used to indicate the information to be indicated; for the receiver of the indication information, the indication information can be used to determine the information to be indicated.
First, the overall workflow of an artificial intelligence system is described. Please refer to Figure 1, which is a schematic structural diagram. The artificial intelligence framework above is elaborated below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to data processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a condensation process from "data" to "information" to "knowledge" to "wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence, through information (its provision and the technology that processes it), to the industrial ecosystem of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the external world, and is supported by a basic platform. Communication with the outside world is achieved through sensors. Computing power is provided by intelligent chips, which may specifically be hardware acceleration chips such as central processing units (CPUs), neural-network processing units (NPUs), graphics processing units (GPUs), application-specific integrated circuits (ASICs), or field programmable gate arrays (FPGAs). The basic platform includes distributed computing frameworks, networks, and related platform guarantees and support, which may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside world to acquire data, and the data is provided to the intelligent chips in the distributed computing system provided by the basic platform for computation.
(2) Data
The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence. The data involves graphics, images, speech, and text, as well as IoT data from traditional devices, including business data from existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing typically includes methods such as data training, machine learning, deep learning, search, reasoning, and decision-making.
Among them, machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on data.
Reasoning refers to the process in which a computer or intelligent system simulates human intelligent reasoning and, based on reasoning control strategies, uses formalized information to perform machine thinking and solve problems; typical functions are search and matching.
Decision-making refers to the process of making decisions on intelligent information after reasoning, and it typically provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data undergoes the data processing described above, some general capabilities can be further formed based on the results of the data processing, for example an algorithm or a general system, such as translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The main application fields include: communications, intelligent driving, intelligent terminals, intelligent transportation, smart homes, intelligent healthcare, intelligent security, intelligent manufacturing, smart cities, and so on.
The method provided in this application can be applied in various application fields of artificial intelligence technology, specifically in the training phase of models in various application fields. Optionally, the method provided in this application can be applied in application scenarios where multiple training tasks are executed by the same training device, each training task being a task of training one model. The "model" in this application can also be called a "machine learning model"; for example, the machine learning model in this application may specifically be a neural network, or may be a non-neural-network model, which can be determined based on the actual application scenario.
For example, in the field of communications, models can be deployed in core network devices, network devices, and/or terminal devices. The functions implemented by the models may include, but are not limited to: predicting the movement trajectory of a terminal device, compressing the codebook in the channel state information reference signal (CSI-RS), decompressing the codebook in the CSI-RS, beamforming, load balancing of network devices, or other functions. The specific devices in which models implementing specific functions are deployed can be determined based on the actual application scenario.
For example, a core network device refers to a cloud server that carries various network functions. The functions implemented through core network devices include, but are not limited to: the network data analytics function (NWDAF), the access and mobility management function (AMF), the session management function (SMF), the authentication server function (AUSF), the network exposure function (NEF), the network repository function (NRF), the network slice selection function (NSSF), the unified data management (UDM) function, the user plane function (UPF), and so on; these are not exhaustively listed here.
A network device may refer to a device that provides wireless access services in a wireless network. For example, a network device can be a device that connects terminal devices to a wireless network, and can also be called a base station; the aforementioned base station can take various forms, such as a macro base station, a micro base station, a relay station, or an access point. In wireless communication systems employing different radio access technologies, the names of network devices with base station functions may differ. For example, a base station can be called an evolved NodeB (eNB), a NodeB (NB), a next-generation NodeB (gNB) in a fifth generation (5G) communication system, a home base station (for example, a home evolved NodeB or home NodeB, HNB), a baseband unit (BBU), a wireless fidelity (Wi-Fi) access point (AP), a transmission reception point (TRP), a radio network controller (RNC), and so on. In another possible scenario, wireless access for a terminal device can be implemented cooperatively by multiple network nodes, with different network nodes each implementing part of the base station's functions. For example, a network node can be a central unit (CU), a distributed unit (DU), a CU-control plane (CU-CP), a CU-user plane (CU-UP), or a radio unit (RU). The CU and the DU can be set up separately, or can be included in the same network element, for example a baseband unit (BBU). The RU can be included in radio frequency equipment or a radio frequency unit, for example in a remote radio unit (RRU), an active antenna unit (AAU), or a remote radio head (RRH). In different systems, the CU (or CU-CP and CU-UP), the DU, or the RU may have different names, but those skilled in the art will understand their meanings. For example, in an ORAN system, the CU can also be called an open CU (O-CU), the DU an open DU (O-DU), the CU-CP an open CU-CP (O-CU-CP), the CU-UP an open CU-UP (O-CU-UP), and the RU an open RU (O-RU). Any of the CU (or CU-CP and CU-UP), the DU, and the RU can be implemented by a software module, a hardware module, or a combination of software and hardware modules. The embodiments of this application do not limit the specific device form of the network device.
A terminal device refers to a wireless terminal device capable of receiving scheduling information and indication information sent by a network device. A terminal device can be a handheld device, a vehicle-mounted device, a wearable device, a computing device, or another processing device with wireless communication capabilities; these are not exhaustively listed here. A terminal device can communicate with one or more core network devices or with the Internet via a radio access network (RAN). For example, a terminal device can be a portable, pocket-sized, handheld, computer-embedded, or vehicle-mounted mobile apparatus that exchanges voice and/or data with the radio access network. For example, a terminal device can be a user agent, a cellular phone, a smartphone, a personal digital assistant (PDA), a tablet personal computer (tablet PC), a wireless modem, a handset, a laptop computer, a personal communication service (PCS) phone, a remote station, an access point (AP), a remote terminal, an access terminal, customer premises equipment (CPE), a terminal, user equipment (UE), a mobile terminal (MT), and so on.
As another example, a terminal device can also be a wearable device, such as glasses, gloves, a watch, clothing, or shoes. As yet another example, a terminal device can also be a drone, a robot, a terminal device in device-to-device (D2D) communication, a terminal device in vehicle-to-everything (V2X) communication, a virtual reality (VR) device, an augmented reality (AR) device, a wireless terminal in industrial control, a terminal device in self-driving, a terminal device in remote medical care, a terminal device in a smart grid, a wireless terminal in a smart city, a terminal device in a smart home, and so on.
In addition, the terminal device can also be a terminal device in a communication system after the 5G communication system (for example, a sixth generation (6G) communication system) or a terminal device in a future evolved public land mobile network (PLMN), and so on. The embodiments of this application do not limit the device form of the terminal device.
As another example, in the field of intelligent driving, the functions implemented using models include: predicting the trajectories of obstacles around the ego vehicle, determining the ego vehicle's lateral and longitudinal decisions, predicting the location areas that the ego vehicle may reach in a future period, planning the trajectory of the ego vehicle, or other functions. As yet another example, in the field of intelligent terminals, the functions implemented using models include: image style transfer, image inpainting, predicting the category of objects in an image, speech recognition, text translation, or other functions. The functions implemented using models in the various application fields are not listed one by one here.
To facilitate understanding of this solution, please refer first to Figure 2, which is a schematic diagram of the training phase and the application phase of a model. As shown in Figure 2, the entire system may include a request device 200, a training device 210, a database 220, an execution device 230, and a data storage system 240; the execution device 230 includes a computing module 231.
The request device 200 and the training device 210 can be communicatively connected. The request device 200 is a device that sends request information to the training device 210, the request information being used to request execution of a training task for a model. For example, in the field of communications, the request device 200 can be a terminal device, a network device, or a core network device. Optionally, the request device 200 can be a terminal device responsible for the operation, administration, and maintenance of a network, and technicians can interact directly with the aforementioned terminal device to manage and maintain the network.
The database 220 stores a training data set. During the training phase of the model 201, the training device 210 obtains the model 201 before the training operation is performed and iteratively trains the model 201 using the training data set until a preset convergence condition is met, obtaining the model 201 on which the training operation has been performed; "the model 201 on which the training operation has been performed" can also be called the trained model 201. For example, "iteratively training the model 201" can be understood as updating the weight parameters in the model 201 multiple times.
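As a purely illustrative reading of the iterative-training description above (not part of the disclosed embodiments), the following Python sketch shows a generic loop that repeatedly updates a model's weight parameters until a preset convergence condition is met. The function names `compute_loss` and `update_weights`, and the use of a loss threshold as the convergence condition, are assumptions made for this example.

```python
def train_until_converged(model, training_set, compute_loss, update_weights,
                          loss_threshold=1e-3, max_iterations=10_000):
    """Iteratively train a model: update its weight parameters multiple times
    until the preset convergence condition (here, a loss threshold) is met."""
    for _ in range(max_iterations):
        loss = compute_loss(model, training_set)
        if loss < loss_threshold:            # preset convergence condition
            break
        update_weights(model, training_set)  # one more update of the weight parameters
    return model
```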
The trained model 201 can be deployed in the computing module 231 of the execution device 230. The execution device 230 can call data, code, and the like in the data storage system 240, and can also store data, instructions, and the like in the data storage system 240. The data storage system 240 can be located within the execution device 230, or it can be external memory relative to the execution device 230. During the application phase of the model 201, the execution device 230 can input data to be processed into the model 201 to obtain the prediction information generated by the model 201, thereby realizing the function of the model 201.
Referring to Figure 2, the request device 200 and the training device 210 can be separate, independent devices, and the training device 210 and the execution device 230 can be separate, independent devices. For a more intuitive understanding of this solution, please refer to Figure 3a, which is a schematic diagram of an architecture of a training task processing system. Figure 3a takes as an example the case where the training device is a core network device in the communications field and the execution device is a base station in the communications field; the request device can be a terminal device communicatively connected to the core network device. As shown in Figure 3a, the request device can send request information to the core network device, the request information being used to request execution of a training task. When the core network device has completed the training task based on the request information, a model on which the training operation has been performed can be obtained, and that model is deployed in multiple base stations. It should be understood that the example in Figure 3a is only provided to facilitate understanding of this solution and is not intended to limit this solution.
It is worth noting that Figure 3a is merely a schematic diagram of one architecture of the training task processing system, and the positional relationships between the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, the training device 210 and the execution device 230 can also be integrated in the same device. For a more intuitive understanding of this solution, please refer to Figure 3b, which is a schematic diagram of another architecture of the training task processing system. Figure 3b takes as an example the case where the training device and the execution device are the same core network device; the request device can be a terminal device communicatively connected to the core network device. As shown in Figure 3b, the request device can send request information to the core network device. When the core network device has completed the training task based on the request information, a model on which the training operation has been performed can be obtained, and that model is deployed in the core network device. It should be understood that the example in Figure 3b is only provided to facilitate understanding of this solution and is not intended to limit this solution.
Please continue to refer to Figure 3c, which is a schematic diagram of another architecture of the training task processing system. Figure 3c takes as an example the case where the training device and the execution device are the same base station; the request device can be a terminal device communicatively connected to the core network device. The request device can send request information to the core network device, and the core network device then forwards the request information to each base station. After receiving the request information, if a base station completes the training task based on the request information, it can obtain a model on which the training operation has been performed, and that model is deployed in the base station. It should be understood that the example in Figure 3c is only provided to facilitate understanding of this solution and is not intended to limit this solution.
It should be noted that, in other embodiments, the request device can also be integrated with the training device in the same device. For example, in some application scenarios in the communications field, when a core network device determines to perform a training operation on a certain model, it can also generate the request information and then execute the training task based on that request information; that is, the request device and the training device can be integrated in the same core network device. As another example, in other application scenarios in the communications field, when a base station determines to perform a training operation on a certain model, it can also generate the request information and then execute the training task based on that request information; that is, the request device and the training device can be integrated in the same base station. The specific product forms of the "request device", the "training device", and the "execution device" can all be determined based on the actual application scenario.
In addition, Figures 3a, 3b, and 3c above only use the communications field as an example of the application field of this application. The method provided in this application can also be applied in other fields. For example, in the field of intelligent driving, the request device and the training device are integrated in the same device, both being cloud servers, and the execution device is a vehicle, and so on; the situations in each application field are not listed one by one here.
Since the same training device can receive request information from multiple request devices, the training tasks currently being executed by the same training device may include multiple training tasks. However, the computing resources in a single training device are limited; executing multiple training tasks may overload the training device, leading to interruption of the processes used to execute the training tasks or to data loss. To solve the foregoing problem, this application discloses the following: after receiving request information used to request execution of the training task of a first model (hereinafter referred to as "first request information" for ease of distinction), the training device can determine, based on the priority of a first training task, whether to execute the first training task, the first training task including the training task of the first model. If the priority of the first training task is higher than the priority of a second training task, the training device can execute the first training task and suspend execution of the second training task; if the priority of the first training task is lower than or equal to the priority of the second training task, the training device can delay or refuse to execute the first training task, where the second training task is a training task currently being executed by the training device. In other words, the training device can use the priorities of training tasks to decide to suspend, delay, or refuse to execute certain training tasks, which helps to avoid overloading the training device, thereby helping to avoid interruption of the processes used to execute training tasks or loss of data, and improving the stability of the training device in the process of executing training tasks.
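As a minimal sketch of the decision logic described in the preceding paragraph (an illustrative reading, not the disclosed implementation; the class, the function name, and the use of memory as the only measured resource are assumptions), the comparison between the new task and the currently executed tasks could look like this:

```python
from dataclasses import dataclass

@dataclass
class TrainingTask:
    task_id: str
    priority: int          # larger value means higher priority (an assumption for this sketch)
    required_memory: int   # required storage resources, e.g. in MB

def handle_new_request(new_task, running_tasks, idle_memory):
    """Return "execute", "execute_and_suspend_lowest", or "delay_or_refuse"."""
    # Enough idle resources: simply start the newly requested training task.
    if new_task.required_memory <= idle_memory:
        return "execute"
    if not running_tasks:
        return "delay_or_refuse"
    # Compare against the lowest-priority task currently being executed
    # (the "second training task" in the text above).
    lowest = min(running_tasks, key=lambda t: t.priority)
    if new_task.priority > lowest.priority:
        return "execute_and_suspend_lowest"   # suspend it, then execute the new task
    return "delay_or_refuse"                  # lower or equal priority: delay or refuse
```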
Based on the above description, the detailed implementation process of the training task processing method provided in this application is introduced below. Specifically, please refer to Figure 4, which is a schematic diagram of a training task processing method provided in an embodiment of this application. As shown in Figure 4, the training task processing method provided in this application may include the following steps.
401. The training device receives first request information from a first device, the first request information being used to request execution of the training task of a first model.
The first request information may specifically be a message, or may be information carried in a message; this is not limited.
It should be noted that, in this application, "the training task of the first model" and "the first training task" are interchangeable; this is not limited.
Specifically, when the first device determines to train the first model, it can send the first request information to the training device. Correspondingly, the training device receives the first request information and can determine, based on the first request information, that the first device requests execution of the training task of the first model, and can then create a first training task, the first training task including the training task of the first model. It should be noted that the training device creating the first training task does not mean that the training device immediately starts executing the first training task; the training device can determine in subsequent steps whether to start executing the first training task, and the specific determination process is described in the subsequent steps.
For example, the training device creating the first training task may include: the training device generating identification information of the first training task. The training device can then place the first training task into a waiting queue; for example, the training device can place the identification information of the first training task into the waiting queue, and optionally, the training device can place the identification of the first training task together with the first request information into the waiting queue.
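For illustration only, the task-creation step described above could be implemented along the following lines; the use of a UUID as the task's identification information and of a simple Python `deque` as the waiting queue are assumptions, not details from the disclosure.

```python
import uuid
from collections import deque

waiting_queue = deque()  # holds tasks that have been created but not yet scheduled

def create_training_task(first_request_info):
    """Create a first training task for a received request and place it in the
    waiting queue; whether it starts executing is decided in later steps."""
    task_id = str(uuid.uuid4())  # identification information of the first training task
    waiting_queue.append({"task_id": task_id, "request": first_request_info})
    return task_id
```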
The first device can be a request device. As can be seen from the descriptions of Figure 2 to Figure 3c above, the first device (that is, the request device) may specifically be a terminal device, a network device, a core network device, or another type of device, which can be determined based on the actual application scenario. The training device can be a network device, a core network device, a cloud server, or another type of device.
The first training task is the training task of the first model. The first model can also be called the first machine learning model, and the first model can be a neural network or a non-neural-network model.
For example, the first request information may include information indicating the inference type of the first model; the training device can determine, based on the inference type of the first model, what kind of model the first training task trains.
It should be noted that the inference type of the first model can also be called the type of task performed by the first model, or another name; these are not exhaustively listed here. The inference type of the first model can indicate the function implemented by the first model. For example, if the inference type of the first model is predicting the movement trajectory of a terminal device, the function implemented by the first model is predicting the movement trajectory of the terminal device; as another example, if the inference type of the first model is compressing the codebook in the CSI-RS, the function implemented by the first model is compressing the codebook in the CSI-RS; as yet another example, if the inference type of the first model is image classification, the function implemented by the first model is predicting the category of objects in an image, and so on. It should be understood that these examples are only provided to facilitate understanding of the relationship between "the inference type of the first model" and "the function implemented by the first model".
For example, the first request information indicates the inference type of the first model by including at least one of the following: description information of the inference type of the first model, identification information of the inference type of the first model, the first model that has not yet been trained, identification information of the first model, or other information.
If the first request information includes the first model that has not yet been trained, the training device, after obtaining the untrained first model, can determine what kind of information the first model outputs, that is, it can determine the inference type of the first model. For example, if the first model outputs the movement trajectory of a terminal device over a future period, the inference type of the first model is predicting the movement trajectory of the terminal device; as another example, if the first model outputs the compressed information of the codebook in the CSI-RS, the inference type of the first model is compressing the codebook in the CSI-RS; as yet another example, if the first model outputs the category of an object in an image, the inference type of the first model is image classification, and so on. These examples are only provided to facilitate understanding of how to determine the inference type of the first model.
If the first request information includes identification information of the first model, and assuming that the training device stores a one-to-one correspondence between the identification information of multiple models and the multiple models, then after obtaining the identification information of the first model, the training device can determine, based on the correspondence, the first model corresponding to the identification information of the first model. After obtaining the first model, it can determine what kind of information the first model outputs, that is, it can determine the inference type of the first model.
Optionally, the first request information further includes: the precision requirement of the first training task, and/or the batch training size requirement of the first training task. The "precision requirement of the first training task" can be used to indicate what precision is to be used when executing the first training task; for example, the precision used when executing a training task can be single-precision floating point (FP32), half-precision floating point (FP16), double-precision floating point (FP64), mixed precision, or another precision, which is not limited here.
A training task may include multiple training passes over a model. Batch training refers to dividing the multiple training passes included in a training task into multiple batches; during the training of a single batch, n training samples are taken from the training data set, where n is an integer greater than or equal to 1. The batch training size represents the number of training samples used in the training of a single batch, that is, the value of n; for example, the batch training size affects the GPU memory resources required during the training of a single batch of the model.
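To illustrate the relationship between the batch training size and a single batch, the short sketch below splits a training data set into batches of n samples each; the function name and the example values are assumptions made for illustration.

```python
def split_into_batches(training_set, batch_size):
    """Split a training data set into consecutive batches of at most batch_size samples.
    A larger batch size means more samples (and more GPU memory) per batch."""
    return [training_set[i:i + batch_size]
            for i in range(0, len(training_set), batch_size)]

# Example: 10 samples with a batch training size of 4 give batches of 4, 4, and 2 samples.
batches = split_into_batches(list(range(10)), batch_size=4)
```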
The training device can determine, based on the precision requirement of the first training task, what precision needs to be used when executing the first training task, and/or the training device can determine, based on the batch training size requirement of the first training task, the number of training samples used in the training of a single batch.
Optionally, the first request information further includes the priority of the training task of the first model, which can also be understood as the first request information further including the priority of the first training task. For example, the priority of the first training task can be represented as a number, such as 1, 2, 3, 4, 5, or another number, with a larger number representing a higher priority; alternatively, the priority of the first training task can be represented as text, such as first level, second level, third level, and so on.
Optionally, the first device determines the priority of the first training task before sending the first request information to the training device. Specifically, the determination of this priority can refer to the following two cases.
Case 1: the first device determines the priority of the first training task based on a first factor.
The first factor may include at least one of the following: the inference type of the first model, the required resources of the first training task, or the immediacy requirement of the first model.
For the impact of the inference type of the first model on the priority of the first training task, refer to the subsequent description. For example, the fewer resources the first training task requires, the higher its priority can be, and the more resources it requires, the lower its priority can be; the higher the immediacy requirement of the first model, the higher the priority of the first training task can be, and the lower the immediacy requirement, the lower the priority can be.
Optionally, the required resources of the first training task may include the storage resources needed to execute the first training task, and the storage resources may include GPU memory resources and/or system memory resources, which is not limited. Alternatively, the required resources of the first training task may also include the processor resources needed to execute the first training task, or other resources.
For example, the immediacy requirement of the first model can be understood as a requirement, in the time dimension, on the first model after the first training task is completed, or as how soon the first model needs to go online; the shorter the time within which the first model needs to go online, the higher its immediacy requirement. For example, if model 1 is a model used to compress or decompress the codebook in the CSI-RS, the immediacy requirement of model 1 is relatively high; as another example, if model 2 is a model used to implement load balancing of base stations, since model 2 is updated periodically, the immediacy requirement of model 2 is relatively low, and so on. These examples are only provided to facilitate understanding of this solution.
In one optional implementation, the first factor includes the inference type of the first model. Assuming that the first device stores a correspondence between inference types and priorities, the first device can determine, based on the correspondence, the priority corresponding to the inference type of the first model, that is, the priority of the first training task.
In another optional implementation, assume that the first factor includes not only the inference type of the first model but also the required resources of the first training task and/or the immediacy requirement of the first model. The first device obtains a first score corresponding to the inference type of the first model, and obtains a second score corresponding to the required resources of the first training task, and/or obtains a third score corresponding to the immediacy requirement of the first model. The first device can perform a weighted summation of all the obtained scores to obtain the total score of the first training task, and can then determine the priority of the first training task. All the obtained scores include the first score, and also include the second score and/or the third score. The higher the total score of the first training task, the higher its priority can be; the lower the total score, the lower its priority can be.
For example, assuming that the first device stores a correspondence between inference types and scores, the first device can determine, based on the correspondence, the first score corresponding to the inference type of the first model.
The more resources the first training task requires, the lower the second score, and the fewer resources it requires, the higher the second score. In one optional implementation, the first device may store a correspondence between required resources and scores, in which case the first device can determine, based on the correspondence, the second score corresponding to the required resources of the first training task. In another optional implementation, the first device substitutes the required resources of the first training task into a first preset algorithm to obtain the second score corresponding to the required resources of the first training task, the first preset algorithm indicating the mapping relationship between required resources and scores.
The higher the immediacy requirement of the first model, the higher the third score of the first training task, and the lower the immediacy requirement, the lower the third score. In one optional implementation, the first device may store a correspondence between immediacy requirements and scores, in which case the first device can determine, based on the correspondence, the third score corresponding to the immediacy requirement of the first model. In another optional implementation, the first device substitutes the immediacy requirement of the first model into a second preset algorithm to obtain the third score corresponding to the immediacy requirement of the first model, the second preset algorithm indicating the mapping relationship between immediacy requirements and scores.
It should be noted that the first factor may also include more or fewer elements; the above examples are only provided to facilitate understanding of this solution and are not intended to limit this solution.
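The weighted summation described in the second optional implementation above could be sketched as follows; the weights, the score table for inference types, the score formulas, and the mapping from total score to priority level are all illustrative assumptions rather than values taken from this application.

```python
def training_task_priority(inference_type, required_memory_mb, hours_until_online,
                           weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the first, second, and third scores, mapped to a priority level."""
    # First score: from an assumed correspondence between inference types and scores.
    type_scores = {"csi_codebook_compression": 90, "trajectory_prediction": 70,
                   "load_balancing": 40}
    first_score = type_scores.get(inference_type, 50)
    # Second score: the more resources required, the lower the score.
    second_score = max(0.0, 100.0 - required_memory_mb / 100.0)
    # Third score: the sooner the model must go online (higher immediacy), the higher the score.
    third_score = max(0.0, 100.0 - hours_until_online)

    w1, w2, w3 = weights
    total = w1 * first_score + w2 * second_score + w3 * third_score
    # Higher total score means higher priority (three levels assumed for illustration).
    return 3 if total >= 70 else (2 if total >= 40 else 1)
```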
Case 2: the first device determines the priority of the first training task based on a first operation.
Specifically, the first device can receive a first operation input by a user, and can then determine the priority of the first training task based on the first operation.
The first operation can be an operation of selecting the priority of the first training task; alternatively, the first operation can be the priority of the first training task entered through a text box, or the priority of the first training task entered in the form of speech, and so on, which can be determined based on the actual product form.
It should be noted that the first request information may also include other information; for example, the first request information may also include the training data set of the first model, or the performance requirement information of the first model, and so on, which are not listed one by one here.
示例性地,训练设备在接收到第一请求信息之后,可以根据第一请求信息创建第一训练任务,进而确定是否执行第一训练任务。可选地,在训练设备处于第一场景的情况下,训练设备可以根据第一训练任务的优先级来确定是否执行第一训练任务;在训练设备未处于第一场景的情况下,训练设备可以开始执行第一训练任务;示例性地,第一场景包括:第一训练任务的需求资源大于训练设备的空闲资源,和/或,训练设备当前执行的训练任务的个数大于预设阈值,以下分别进行描述。For example, after receiving the first request information, the training device can create a first training task based on the first request information, and then determine whether to execute the first training task. Optionally, when the training device is in a first scenario, the training device can determine whether to execute the first training task based on the priority of the first training task; when the training device is not in a first scenario, the training device can start executing the first training task; for example, the first scenario includes: the resource requirement of the first training task is greater than the idle resources of the training device, and/or, the number of training tasks currently being executed by the training device is greater than a preset threshold, which will be described below.
情况1:第一场景包括第一训练任务的需求资源大于训练设备的空闲资源。Scenario 1: The first scenario involves the resource requirements of the first training task exceeding the available resources of the training equipment.
具体地,训练设备可以判断第一训练任务的需求资源是否大于训练设备的空闲资源,若第一训练任务的需求资源大于训练设备的空闲资源,则训练设备可以从当前执行的所有训练任务中确定第二训练任务,进而判断第一训练任务的优先级是否高于第二训练任务的优先级,若第一训练任务的优先级高于第二训练任务的优先级,则进入步骤402;若第一训练任务的优先级低于或等于第二训练任务的优先级,则进入步骤403。若第一训练任务的需求资源小于或等于训练设备的空闲资源,则可以开始执行第一训练任务。Specifically, the training device can determine whether the resource requirement of the first training task is greater than the available resources of the training device. If the resource requirement of the first training task is greater than the available resources of the training device, the training device can determine the second training task from all currently executing training tasks, and then determine whether the priority of the first training task is higher than the priority of the second training task. If the priority of the first training task is higher than the priority of the second training task, then proceed to step 402; if the priority of the first training task is lower than or equal to the priority of the second training task, then proceed to step 403. If the resource requirement of the first training task is less than or equal to the available resources of the training device, then the first training task can be started.
Illustratively, the training device may determine the required resources of the first training task in any of the following three implementations.
In one optional implementation, the first request information may further include information indicating the required resources of the first training task, so the training device can determine the required resources of the first training task from the first request information.
In another optional implementation, the training device may determine the required resources of the first training task based on the inference type of the first model, the accuracy requirement of the first training task, and the batch training size requirement of the first training task. The inference type of the first model indicates what kind of first model is to be trained: the more parameters the first model has, the more resources the first training task requires; the higher the accuracy requirement of the first training task, the more resources it requires; and the larger the batch training size requirement, the more device memory is needed during the training of a single batch.
In another optional implementation, the training device may store a correspondence between inference types and required resources, and can then determine, based on that correspondence, the required resources of the first training task corresponding to the inference type of the first model.
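To make the last two implementations concrete, the following is a minimal sketch, assuming invented base memory figures per inference type and simple scaling factors for precision and batch size; the inference-type names and all numbers are hypothetical and not taken from this application.

```python
# Hypothetical sketch: estimating a training task's required resources from the
# model's inference type, the task's accuracy (precision) requirement, and its
# batch training size, combining the second and third implementations above.

BASE_MEMORY_GB = {            # inference type -> baseline device-memory need (assumed values)
    "channel_estimation": 2.0,
    "beam_prediction": 4.0,
    "csi_compression": 6.0,
}

PRECISION_FACTOR = {"fp16": 0.5, "fp32": 1.0, "fp64": 2.0}

def estimate_required_memory_gb(inference_type: str,
                                precision: str = "fp32",
                                batch_size: int = 32) -> float:
    """Rough estimate: more parameters (via inference type), higher precision
    and larger batches all increase the memory the training task needs."""
    base = BASE_MEMORY_GB[inference_type]
    return base * PRECISION_FACTOR[precision] * (batch_size / 32)

print(estimate_required_memory_gb("beam_prediction", "fp16", 64))  # 4.0
```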
Optionally, the idle resources of the training device may include idle storage resources in the training device; optionally, they may also include idle processor resources in the training device, or other types of idle resources, as determined by the actual application scenario. Illustratively, the training device determining whether the required resources of the first training task are greater than its idle resources may include: determining whether the storage resources required by the first training task are greater than the idle storage resources in the training device, where the storage resources may include device (video) memory resources and/or system memory resources. Optionally, it may also include determining whether the processor resources required by the first training task are greater than the idle processor resources in the training device, and so on, as set in combination with the actual application scenario.
In one case, the second training task is the training task with the lowest priority among all training tasks currently being executed by the training device; the priority of the first training task being higher than that of the second training task can then be understood as the priority of the first training task being higher than that of the lowest-priority currently executing training task.
In another case, the second training task includes S training tasks currently being executed by the training device, where S is an integer greater than or equal to 1; that is, the second training task includes one or more currently executing training tasks. In this case, the priority of the first training task being higher than that of the second training task indicates that the first training task has a higher priority than each of the S training tasks, and the priority of the first training task being lower than or equal to that of the second training task indicates that the first training task has a priority lower than or equal to that of any one of the S training tasks.
Regarding how the training device determines the second training task: in one implementation, after determining that the required resources of the first training task are greater than its idle resources, the training device may compute a first difference between the required resources of the first training task and the idle resources of the training device; based on the first difference, it determines S training tasks from all currently executing training tasks, which constitute the second training task. The S training tasks may be the S lowest-priority tasks among all currently executing training tasks, with the sum of their occupied resources greater than or equal to the first difference; that is, the idle resources of the training device plus the resources occupied by all of the S training tasks are greater than or equal to the required resources of the first training task.
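The following is a minimal sketch of that first implementation, assuming each running task is represented by an illustrative dictionary with invented field names; it greedily picks the lowest-priority tasks whose combined occupied resources cover the shortfall.

```python
# Hypothetical sketch: choose the lowest-priority running tasks whose occupied
# resources together cover the first difference (required - idle).

def select_second_training_task(running_tasks, required, idle):
    """running_tasks: list of dicts with 'id', 'priority' (larger = higher) and
    'occupied' (resources held). Returns the S tasks forming the second training
    task, or [] if the idle resources already suffice."""
    deficit = required - idle
    if deficit <= 0:
        return []
    selected, freed = [], 0.0
    for task in sorted(running_tasks, key=lambda t: t["priority"]):
        selected.append(task)
        freed += task["occupied"]
        if freed >= deficit:          # idle + freed >= required
            return selected
    return selected                   # even suspending everything may not suffice

tasks = [{"id": "A", "priority": 3, "occupied": 4.0},
         {"id": "B", "priority": 1, "occupied": 2.0},
         {"id": "C", "priority": 2, "occupied": 3.0}]
print([t["id"] for t in select_second_training_task(tasks, required=8.0, idle=4.0)])
# ['B', 'C'] -> the deficit of 4.0 is covered by the two lowest-priority tasks
```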
Optionally, the resources occupied by a training task may include the storage resources it occupies, and optionally the processor resources it occupies, or other types of occupied resources, as determined by the actual application scenario.
In another implementation, S may be a preset value, and the training device may determine the S lowest-priority training tasks among all currently executing training tasks. In yet another implementation, S may be a preset value, and the training device may determine the S training tasks that occupy the most resources among all currently executing training tasks. In a further implementation, the training device may randomly select S training tasks from all currently executing training tasks, and so on; the specific way the training device determines the second training task can be chosen according to the actual application scenario.
Case 2: the first scenario includes the number of training tasks currently being executed by the training device being greater than a preset threshold.
Specifically, the training device may determine whether the number of training tasks it is currently executing is greater than a preset threshold. If it is, the training device determines whether the priority of the first training task is higher than that of the second training task: if higher, proceed to step 402; if lower than or equal, proceed to step 403. If the number of currently executing training tasks is less than or equal to the preset threshold, the first training task can be started. The preset threshold may be an integer greater than or equal to 1, for example 4, 5, 6 or another value, as determined by the actual application scenario.
Case 3: the first scenario includes both the required resources of the first training task being greater than the idle resources of the training device and the number of training tasks currently being executed by the training device being greater than a preset threshold.
Specifically, the training device may determine whether the required resources of the first training task are greater than its idle resources and whether the number of currently executing training tasks is greater than the preset threshold. If either condition holds, the training device determines whether the priority of the first training task is higher than that of the second training task. If the required resources of the first training task are less than or equal to the idle resources of the training device and the number of currently executing training tasks is less than or equal to the preset threshold, the first training task can be started.
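The three cases can be summarized in one admission check. The sketch below is illustrative only (the return labels and parameter names are assumptions): the device starts the first task directly when it is not in the first scenario, and otherwise compares priorities to decide between step 402 and step 403.

```python
# Hypothetical sketch combining Cases 1-3 above.

def admit_first_task(required, idle, num_running, max_tasks,
                     first_priority, second_priority):
    in_first_scenario = (required > idle) or (num_running > max_tasks)
    if not in_first_scenario:
        return "start"                       # execute the first training task directly
    if first_priority > second_priority:
        return "preempt_second_task"         # step 402: run first task, suspend second
    return "respond_delay_or_reject"         # step 403: send response information

print(admit_first_task(required=6, idle=8, num_running=3, max_tasks=5,
                       first_priority=2, second_priority=4))   # start
print(admit_first_task(required=6, idle=2, num_running=3, max_tasks=5,
                       first_priority=5, second_priority=4))   # preempt_second_task
```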
In this embodiment of the application, the training task processing method provided herein is triggered when the required resources of the first training task are greater than the idle resources of the training device, or when the number of training tasks currently being executed by the training device is greater than a preset threshold; that is, only in these scenarios does the training device pause, delay or refuse to execute certain training tasks. Two application scenarios are thus provided, improving the implementation flexibility of this solution. Moreover, in both scenarios the training device is close to its load limit, which helps avoid overloading the training device while using its resources to the greatest extent possible.
Regarding how the training device determines the priority of the first training task: in one implementation, the training device may obtain the priority of the first training task from the first request information. In another implementation, the training device may determine the priority of the first training task based on the first factor; for the specifics, refer to the earlier description of the first device determining the priority of the first training task based on the first factor, which is not repeated here.
Optionally, the training device may also adjust the priority of the first training task. In that case, the priority carried in the first request information, and the priority determined by the training device based on the first factor, can both be understood as the initial priority of the first training task. The training device may raise the initial priority of the first training task according to a first duration to obtain its current priority, where the first duration is the time between the moment the training device received the first request information and the current moment. Illustratively, the longer the first duration, the higher the current priority of the first training task; the shorter the first duration, the lower its current priority.
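A minimal sketch of this optional priority aging follows; the aging interval (one level per 60 seconds) and the convention that a larger value means a higher priority are assumptions made only for illustration.

```python
# Hypothetical sketch: raise the initial priority according to how long the
# request has been waiting (the "first duration" above).

import time
from typing import Optional

def current_priority(initial_priority: int, received_at: float,
                     seconds_per_level: float = 60.0,
                     now: Optional[float] = None) -> int:
    """Add one priority level for every full aging interval elapsed since the
    request information was received (larger value = higher priority)."""
    now = time.time() if now is None else now
    waited = max(0.0, now - received_at)
    return initial_priority + int(waited // seconds_per_level)

print(current_priority(initial_priority=2, received_at=0.0, now=150.0))  # 4
```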
The way the training device determines the priority of each training task included in the second training task is similar to the way it determines the priority of the first training task, and is not repeated here.
In this embodiment, if the first request information includes the priority of the first training task, the training device can obtain that priority directly from the first request information, which allows a faster comparison between the priority of the first training task and the priorities of other training tasks. Since whether to execute the first training task can only be decided after the comparison between the priority of the first training task and that of the second training task, this also helps decide more quickly whether to execute the first training task. In addition, obtaining the priority of the first training task directly from the first request information reduces the resources the training device spends on determining that priority, so more of the training device's resources can be applied to executing training tasks, helping produce trained models as quickly as possible.
Optionally, each training task currently executed by the training device is a task of training a model in the communications domain; examples of models in the communications domain can be found in the above description and are not repeated here. This provides a specific application domain for the method of this application and improves its integration with concrete application domains.
402. If the priority of the first training task is higher than that of the second training task, the training device executes the training task of the first model and suspends execution of the second training task.
The second training task may be one or more training tasks currently being executed by the training device, or the training task with the lowest priority among the training tasks currently being executed by the training device.
In this embodiment, the training device executing the first training task means that the first training task is added to the training tasks the training device is executing; the training device pausing the second training task can be described as the training device suspending the second training task, so that the second training task is no longer among the training tasks currently being executed by the training device.
Illustratively, the training device may also maintain a waiting queue; after suspending the second training task, it can place the second request information corresponding to the second training task into the waiting queue.
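A minimal sketch of this suspend-and-queue step follows; the scheduler class, task identifiers and request-information fields are all invented for illustration.

```python
# Hypothetical sketch: suspend the second training task and park its request
# information in a waiting queue, as described above.

from collections import deque

class Scheduler:
    def __init__(self):
        self.running = {}          # task_id -> request information of executing tasks
        self.waiting = deque()     # request information of suspended / deferred tasks

    def suspend(self, task_id: str):
        """Remove the task from the running set and queue its request for later."""
        request_info = self.running.pop(task_id)
        self.waiting.append(request_info)
        return request_info

sched = Scheduler()
sched.running["task2"] = {"task_id": "task2", "model": "second_model", "priority": 1}
sched.suspend("task2")
print(len(sched.running), len(sched.waiting))   # 0 1
```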
Optionally, the training device may also store information about the first training task, which may include the identification information of the first training task and, optionally, its priority; if the training device can adjust the priority of the first training task, the stored information includes the current priority of the first training task. The information about the first training task may also include other types of information, such as the inference type of the first model corresponding to the first training task; exactly which information is included can be determined according to the actual application scenario.
403. If the priority of the first training task is lower than or equal to that of the second training task, the training device sends response information to the first device.
The response information may be used to notify the first device to delay or refuse to execute the training task of the first model, or the response information may include information about the second training task.
The second training task is one or more training tasks currently being executed by the training device, or the training task with the lowest priority among the training tasks currently being executed by the training device.
In this embodiment, if the priority of the first training task is lower than or equal to that of the second training task, execution of the second training task is not affected, and the training device sends response information to the first device; that is, the training device does not directly execute the first training task. Correspondingly, the first device can receive the response information from the training device, which indicates that the training device has not directly executed the first training task. Illustratively, the response information may include an indication value corresponding to delay or rejection; for example, the indication value 1111 could represent delaying or refusing to execute the first training task. Alternatively, the response information may carry "delay" or "reject" explicitly, thereby informing the first device that execution of the first training task is delayed or refused; the specifics can be determined according to the actual application scenario.
Illustratively, the information about the second training task may include the information of each of the S training tasks, which may include the identification information of each of the S training tasks; optionally, it also includes the priority of each of the S training tasks, that is, the information about the second training task may include the priority of the second training task; optionally, it also includes the inference type corresponding to each of the S training tasks, and so on, all of which can be determined according to the actual application scenario.
Optionally, the response information may include information about all training tasks currently being executed by the training device whose priority is higher than that of the first training task, where those training tasks include the second training task.
本申请实施例中,训练设备可以利用训练任务的优先级来决定暂停、延迟或拒绝执行某些训练任务,有利于避免发生训练设备过载的情况,进而有利于避免用于执行训练任务的进程中断或者避免数据的丢失,提高了训练设备在执行训练任务过程的稳定性。In this embodiment, the training device can use the priority of training tasks to decide to pause, delay, or refuse to execute certain training tasks, which helps to avoid overloading of the training device, thereby helping to avoid interruption of the process used to execute training tasks or to avoid data loss, and improving the stability of the training device in the process of executing training tasks.
Optionally, when the priority of the first training task is lower than or equal to that of the second training task, the training device delays or refuses to execute the first training task, and the response information it sends to the first device includes the priority of the second training task. The first device then knows not only that the training device has decided to delay or refuse to execute the first training task, but also that the training device did so because it is executing the higher-priority second training task; in other words, the first device learns the current situation of the training device. This makes it easier for the first device to determine a more suitable processing approach by combining the current situation of the training device with the requirements of the first model, and also helps the use of resources in the training device better match the needs of the current scenario.
Optionally, on the basis of the embodiment corresponding to Figure 4 above, refer to Figure 5, which is another schematic diagram of the training task processing method provided by the embodiments of this application. As shown in Figure 5, the training task processing method provided by this application may include:
501. The second device sends second request information to the training device.
In this embodiment, the second device and the first device are different requesting devices. When the second device determines to train the second model, it can send second request information to the training device.
The second request information is used to request execution of the training task of the second model. The training task of the second model can be called the second training task, and the two terms are interchangeable.
Correspondingly, the training device can receive the second request information from the second device and then determine the second training task based on it. For the meaning of the terms in step 501 and the specific implementation of this step, refer to the description of step 401 in the embodiment corresponding to Figure 4 above, with "first device" replaced by "second device", "first model" replaced by "second model", and "first request information" replaced by "second request information"; the details are not repeated here.
As described in the embodiment corresponding to Figure 4 above, the second training task may include one or more training tasks. In one case, the second training task includes one training task, the second request information includes one request message, and the second device includes one device. In another case, the second training task includes at least two training tasks; the second request information may then include at least two request messages corresponding one-to-one with the at least two training tasks, each requesting execution of one of the training tasks in the second training task, and the second device may include all devices that sent those request messages.
After receiving the second request information, the training device can determine whether to start executing the second training task and then start executing it. Illustratively, after receiving each request message included in the second request information, the training device can determine whether to start executing the training task corresponding to that request message and then start executing it. For the process by which the training device determines whether to start executing a training task, refer to the description in the embodiment corresponding to Figure 4 above of how the training device determines whether to start executing the first training task; this is not repeated here.
502. The training device receives first request information from the first device.
The first request information may be used to request execution of the training task of the first model.
503. If the priority of the training task of the first model is higher than the priority of the second training task, the training device executes the training task of the first model and suspends execution of the second training task.
The training task of the first model may be referred to as the first training task.
In this embodiment, for the meaning of the terms in steps 502 and 503 and the specific implementation of these steps, refer to the description of steps 401 and 402 in the embodiment corresponding to Figure 4 above; this is not repeated here.
504. The training device sends notification information to the second device corresponding to the second training task.
The notification information may be used to notify the second device that execution of the training task of the second model has been suspended, or to notify the second device of the idle resources of the training device.
In this embodiment, step 504 is optional. After suspending execution of the second training task, the training device may also send notification information to the second device corresponding to the second training task; correspondingly, the second device can receive the notification information from the training device.
In one case, the notification information is used to notify the second device that execution of the second training task has been suspended; that is, in step 504 the training device notifies the second device corresponding to the second training task of the suspension of the second training task.
Optionally, the training device notifying the second device of the suspension of the second training task may include: the training device sending a cause value to the second device, that is, the notification information includes a cause value indicating the reason for suspending execution of the second training task. Correspondingly, the second device can receive the cause value from the training device.
Illustratively, since the second device may include one or more devices, the training device sending the cause value to the second device can be understood as the training device sending the cause value to each device included in the second device, and the second device receiving the cause value from the training device can be understood as each device included in the second device receiving the cause value from the training device.
Optionally, the reason includes that the second training task is the lowest-priority training task on the training device. In this embodiment, the second training task whose execution the training device suspends is the lowest-priority training task on the training device; in other words, the resources in the training device are allocated to higher-priority training tasks first wherever possible, so that they are used more efficiently.
In one case, the cause value includes the priority of each of at least one third training task currently being executed by the training device, where the priority of each third training task is higher than that of the second training task.
Illustratively, the at least one third training task may include all training tasks currently being executed by the training device. Note that because the training device may suspend old training tasks and start new ones, the set of training tasks it is currently executing can change. When the training device sends the cause value to the second device, it has already suspended the second training task and started the first training task, so the training tasks it is currently executing may include the first training task; that is, the at least one third training task may include the first training task. Alternatively, the at least one third training task may include only the first training task, or a preset number of training tasks selected from all currently executing training tasks, and so on, as determined by the actual application scenario.
Optionally, the cause value also includes the identification information of each third training task; optionally, it also includes the inference type of each third training task, and so on, as determined by the actual application scenario.
In this embodiment, the cause value sent by the training device to the second device includes the priority of each of the at least one third training task currently being executed by the training device, each of which is higher than the priority of the second training task. After receiving the cause value, the second device knows not only that the second training task was suspended because of its low priority, but also the priorities of the third training tasks that are occupying the training device's resources; that is, the second device learns more detailed resource usage of the training device. This makes it easier for the second device to determine a more suitable processing approach by combining the training device's resource usage with the requirements of the second model, and also helps the use of resources in the training device better match the needs of the current scenario.
In another case, the cause value may be expressed as letters; for example, it may be "LL", meaning that execution of the second training task was suspended because it is the lowest-priority training task on the training device.
In another case, the cause value may be expressed as digits; for example, it may be "000000", meaning that execution of the second training task was suspended because it is the lowest-priority training task on the training device. The cause value may also take other forms, as set according to the actual application scenario.
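The sketch below illustrates the two forms of cause value just described: a detailed form carrying the priorities of the higher-priority third training tasks, and a compact coded form. The field names, the dictionary layout and the reuse of the "LL" code are assumptions made only for illustration.

```python
# Hypothetical sketch: building the cause value sent with the suspension notice.

def build_cause_value(third_tasks=None, use_code=False):
    if use_code or not third_tasks:
        return {"cause": "LL"}                      # compact form (letter or digit code)
    return {                                        # detailed form with task priorities
        "cause": "lowest_priority_suspended",
        "third_tasks": [{"id": t["id"], "priority": t["priority"]} for t in third_tasks],
    }

print(build_cause_value([{"id": "task1", "priority": 5}]))
print(build_cause_value(use_code=True))
```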
In this embodiment, after deciding to suspend execution of the second training task, the training device sends a cause value to the second device. The cause value indicates the reason for the suspension, so the second device not only learns in time that execution of the second training task has been suspended on the training device, but also learns why. This makes it easier for the second device to promptly decide how to handle the second training task once it knows the task has been suspended, and to determine a more suitable processing approach based on the reason for the suspension.
Specifically, the notification information may include an indication value corresponding to suspension; for example, the indication value 2222 could represent suspending execution of the second training task. Alternatively, the notification information may carry "suspend execution", thereby informing the second device of the suspension of the second training task; the specifics can be determined according to the actual application scenario.
In this embodiment, after suspending execution of the second training task, the training device promptly notifies the second device corresponding to the second training task that it has suspended execution of that task, so the second device can quickly learn of the suspension, making the execution process of the second training task more controllable.
In the other case, the notification information is used to notify the second device of the idle resources of the training device; that is, in step 504 the training device notifies the second device of its idle resources. Correspondingly, the second device can determine that the training device has suspended execution of the second training task and can learn how many idle resources the training device has. For the concept of the training device's idle resources, refer to the description of Figure 4 above; it is not repeated here.
Illustratively, the notification information may include the amount of idle storage resources in the training device; optionally, it may also include the amount of idle processor resources in the training device, or the amounts of other types of idle resources, as determined by the actual application scenario.
In this embodiment, after suspending execution of the second training task, the training device can notify the second device of its idle resources, so the second device not only learns that the second training task has been suspended, but can also decide, based on those idle resources, how to handle the situation, which helps obtain a processing approach better matched to the current state of the training device.
505. The second device sends the processing method for the second training task to the training device.
The processing method for the second training task may be: terminate, wait, or adjust occupied resources.
In this embodiment, step 505 is optional. After determining that the training device has suspended execution of the second training task, the second device can determine the processing method for the second training task and then send it to the training device.
Illustratively, the second device sending the processing method for the second training task to the training device can be understood as each device included in the second device sending the processing method for any one of the training tasks in the second training task to the training device; the processing method for the second training task can be understood as the processing method for any one of the training tasks included in the second training task.
Illustratively, when the processing method for the second training task is terminate, the second device can send the training device an indication value corresponding to termination; for example, the indication value corresponding to termination could be 000000, DDDDD, or another value. Alternatively, the second device can send feedback information indicating "cancel execution of this training task", and so on; the specific implementation can be determined according to the actual application scenario.
When the processing method for the second training task is wait, the second device can send the training device an indication value corresponding to waiting, or feedback information indicating "wait", and so on; this is not limited here.
When the processing method for the second training task is adjust occupied resources, in one case this may further include: reducing the number of parameters of the model corresponding to the second training task, lowering the precision used when executing the second training task, or reducing the batch training size used when executing the second training task. Illustratively, reducing the number of parameters of the model corresponding to the second training task can be achieved by pruning the second model, with the specific pruning algorithm chosen flexibly according to the actual application scenario; this reduces not only the occupied storage resources but also the occupied processor resources. Lowering the precision used when executing the second training task can mean switching from a higher-precision data format to a lower-precision one, for example from FP32 to FP16 or from FP64 to mixed precision (these examples are given only to aid understanding); this likewise reduces both storage and processor resources. Reducing the batch training size used when executing the second training task can mean reducing the number of training samples used in a single batch, thereby reducing the storage resources needed during the training of a single batch.
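The three adjustment options can be expressed as changes to an illustrative training configuration, as in the sketch below; the configuration fields, the 50% pruning ratio and the precision ladder are assumptions, not values prescribed by this application.

```python
# Hypothetical sketch: the three "adjust occupied resources" options above applied
# to an illustrative training configuration.

def adjust_occupied_resources(config: dict, mode: str) -> dict:
    adjusted = dict(config)
    if mode == "prune_parameters":
        adjusted["num_parameters"] = int(config["num_parameters"] * 0.5)  # assumed 50% pruning
    elif mode == "lower_precision":
        adjusted["dtype"] = {"fp64": "fp32", "fp32": "fp16"}.get(config["dtype"], "fp16")
    elif mode == "reduce_batch_size":
        adjusted["batch_size"] = max(1, config["batch_size"] // 2)
    else:
        raise ValueError(f"unknown adjustment mode: {mode}")
    return adjusted

cfg = {"num_parameters": 1_000_000, "dtype": "fp32", "batch_size": 64}
print(adjust_occupied_resources(cfg, "lower_precision"))   # dtype becomes fp16
```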
Illustratively, the second device may send first feedback information to the training device, the first feedback information instructing the training device to process the second training task in a first way, where the first way is to reduce the number of parameters of the model corresponding to the second training task, lower the precision used when executing the second training task, or reduce the batch training size used when executing the second training task. That is, the second device not only indicates that the processing method for the second training task is to adjust occupied resources, but also indicates how the resources occupied when executing the second training task should be adjusted.
In this embodiment, after learning that execution of the second training task has been suspended, the second device can send the processing method for the second training task to the training device. Because the second device has a clearer picture of the requirements on the second model, it can decide, based on those requirements, whether the processing method for the second training task should be terminate, wait or adjust occupied resources, which helps improve the fit between the processing method for the second training task and the specific application scenario.
When the processing method for the second training task is adjust occupied resources, the second device can further indicate how to adjust the resources occupied when executing the second training task, which helps the finally obtained second model better match its application scenario; that is, it helps obtain a more satisfactory second model given limited resources.
In another case, the second device can send the training device an indication value corresponding to adjusting occupied resources, or feedback information indicating "adjust occupied resources", and so on, so that the training device knows the processing method for the second training task is to adjust occupied resources; the training device can then decide how to adjust the resources occupied when executing the second training task.
Optionally, the factors determining the processing method for the second training task include: the immediacy requirement of the second model, the accuracy requirement of the second model, and/or the degree to which the accuracy of the second model is reduced by adjusting the occupied resources.
Optionally, for a second model with very high immediacy and accuracy requirements, the processing method of terminate can be adopted, so that another device can be promptly requested to execute the second training task. For a second model with a low immediacy requirement, the processing method of wait can be adopted. For a second model with a very high immediacy requirement but a less demanding accuracy requirement, the processing method of adjusting occupied resources can be adopted.
For the immediacy requirement of the second model, refer to the description of "the immediacy requirement of the first model" in the embodiment corresponding to Figure 4 above, with "first model" replaced by "second model"; this is not repeated here. "The accuracy requirement of the second model" can be understood as the accuracy requirement on the second model after the second training task has been executed, or as the performance requirement on that model; "model accuracy" or "model performance" can be understood as the accuracy of the prediction information generated by the model.
The degree to which the accuracy of the second model is reduced by adjusting the occupied resources can be determined by at least one of the following parameters: a first accuracy range of the second model obtained after adjusting the resources occupied when executing the second training task, and/or a second accuracy range, which represents the amount by which the accuracy of the finally obtained second model drops as a result of adjusting the occupied resources; other parameters may also be included and are not exhaustively listed in this embodiment.
Illustratively, the immediacy requirement of the second model is that the second model must go online within a second duration after the second training task has been executed, and the accuracy requirement of the second model is that its accuracy after the second training task must fall within a first preset accuracy range. The second device can determine whether the second duration is greater than a preset duration; if it is, the second device can set the processing method for the second training task to wait. If the second duration is less than the preset duration, then in one implementation the second device can determine whether the first accuracy range lies within the first preset accuracy range; if it does, the second device can set the processing method to adjust occupied resources. If the first accuracy range does not lie within the first preset accuracy range, that is, the first accuracy range contains accuracies outside the first preset accuracy range, the second device can set the processing method to terminate and can then request another device to execute the second training task.
Another implementation is described below. The second device can determine whether the second accuracy range lies within a second preset accuracy range; if it does, the second device can set the processing method for the second training task to adjust occupied resources. If the second accuracy range does not lie within the second preset accuracy range, that is, the second accuracy range contains accuracies outside the second preset accuracy range, the second device can set the processing method to terminate and can then request another device to execute the second training task.
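The decision logic of the preceding paragraphs can be sketched as follows, using the first implementation (comparing the first accuracy range against the first preset accuracy range); the thresholds, ranges and return labels are illustrative assumptions.

```python
# Hypothetical sketch: choose "wait" when the model is not urgent, "adjust" when
# the accuracy after adjustment stays within the required range, and "terminate"
# (so another device can be asked to train the model) otherwise.

def decide_processing(time_to_online_s: float, urgent_threshold_s: float,
                      adjusted_accuracy_range: tuple,
                      required_accuracy_range: tuple) -> str:
    if time_to_online_s > urgent_threshold_s:
        return "wait"
    lo_req, hi_req = required_accuracy_range
    lo_adj, hi_adj = adjusted_accuracy_range
    if lo_adj >= lo_req and hi_adj <= hi_req:      # adjusted accuracy stays acceptable
        return "adjust_occupied_resources"
    return "terminate"

print(decide_processing(3600, 600, (0.90, 0.95), (0.88, 1.00)))  # wait
print(decide_processing(300, 600, (0.90, 0.95), (0.88, 1.00)))   # adjust_occupied_resources
print(decide_processing(300, 600, (0.80, 0.86), (0.88, 1.00)))   # terminate
```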
In this embodiment, the second device can determine the processing method for the second training task based on the immediacy requirement of the second model, the accuracy requirement of the second model and/or the degree to which adjusting the occupied resources reduces the accuracy of the second model; that is, the processing method for the second training task is determined along both the time and accuracy dimensions, which helps obtain a better processing approach given limited resources.
506. The training device performs processing based on the processing method for the second training task.
In this embodiment, step 506 is optional. Step 506 may include: when the processing method for the second training task is terminate, the training device deletes the second request information corresponding to the second training task; illustratively, the training device may delete the second request information corresponding to the second training task from the waiting queue. Or,
when the processing method for the second training task is wait, the training device waits until its idle resources are greater than or equal to the resources required by the second training task and then continues executing the second training task. Optionally, referring to the description in the embodiment corresponding to Figure 4 above, after suspending execution of the second training task the training device may place the second request information corresponding to the second training task into the waiting queue; when the head of the waiting queue is the second request information and the idle resources of the training device are greater than or equal to the resources required by the second training task, execution of the second training task can continue. Or,
step 506 may include: when the processing method for the second training task is adjust occupied resources, whether the second device or the training device decides how to adjust the resources occupied when executing the second training task, the training device learns which adjustment is to be used; the training device can then reduce the number of parameters of the model corresponding to the second training task, lower the precision used when executing the second training task, or reduce the batch training size used when executing the second training task. For the specific implementations of these three approaches, refer to the description in step 505 above; they are not repeated here.
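A minimal sketch of step 506 follows, acting on whichever processing method was returned; the waiting-queue structure, the field names and the halving of the batch size as the example adjustment are all assumptions made for illustration.

```python
# Hypothetical sketch of step 506: terminate -> drop the request; wait -> resume
# once the request is at the queue head and idle resources suffice; adjust ->
# shrink the task's resource footprint (here, by halving the batch size).

from collections import deque

def handle_processing_method(waiting: deque, running: dict, request_info: dict,
                             method: str, idle: float) -> None:
    if method == "terminate":
        waiting.remove(request_info)                      # delete the second request information
    elif method == "wait":
        head_ready = waiting and waiting[0] is request_info
        if head_ready and idle >= request_info["required"]:
            waiting.popleft()                             # resume the second training task
            running[request_info["task_id"]] = request_info
    elif method == "adjust_occupied_resources":
        request_info["batch_size"] = max(1, request_info["batch_size"] // 2)

waiting = deque([{"task_id": "task2", "required": 3.0, "batch_size": 64}])
running = {}
handle_processing_method(waiting, running, waiting[0], "wait", idle=4.0)
print(list(running))   # ['task2']
```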
In this embodiment, different processing methods for the second training task fed back by the second device are provided, together with the training device's concrete handling for each, so that whatever processing method the second device feeds back, the training device has a corresponding handling scheme; this helps improve the smoothness and stability of this application during execution. Further, when the processing method for the second training task is terminate, the second request information is deleted, so all resources in the training device related to the second request information are released in time, which helps avoid wasting the training device's resources.
Optionally, after adjusting the resources occupied by the second training task, the training device may also adjust the priority of the second training task, for example by raising it.
Optionally, on the basis of the embodiments corresponding to Figures 4 and 5 above, refer to Figure 6, which is another schematic diagram of the training task processing method provided by the embodiments of this application. As shown in Figure 6, the training task processing method provided by this application may include:
601. The second device sends second request information to the training device.
In this embodiment, for the meaning of the terms in step 601 and the specific implementation of this step, refer to the description of step 501 in the embodiment corresponding to Figure 5 above; this is not repeated here.
602. The training device receives first request information from the first device, the first request information being used to request execution of the training task of the first model.
603. If the priority of the training task of the first model is lower than or equal to the priority of the second training task, the training device sends response information to the first device.
The training task of the first model can be referred to as the first training task. The response information may be used to notify the first device to delay or refuse to execute the training task of the first model, or the response information may include information about the second training task.
In this embodiment, for the meaning of the terms in steps 602 and 603 and the specific implementation of these steps, refer to the description of steps 401 and 403 in the embodiment corresponding to Figure 4 above; this is not repeated here.
604. The first device sends the processing method for the training task of the first model to the training device.
The processing method for the training task of the first model may be: termination, waiting, or adjustment of the occupied resources.
605. The training device performs processing based on the processing method for the first training task.
In this embodiment, the meanings of the terms in steps 604 and 605 and the specific implementation of the steps can be found in the descriptions of steps 505 and 506 in the embodiment corresponding to Figure 5 above. The difference is that "second device" is replaced with "first device", "second training task" is replaced with "first training task", "second model" is replaced with "first model", and "second request information" is replaced with "first request information". These will not be elaborated on here.
Based on the embodiments corresponding to Figures 4 to 6, in order to better implement the above-described solutions of the embodiments of this application, related devices for implementing those solutions are also provided below. Specifically, referring to Figure 7, Figure 7 is a schematic structural diagram of a training task processing apparatus provided in an embodiment of this application. The training task processing apparatus 700 can be applied in a training device and includes: a receiving module 701, used to receive request information from a first device, the request information being used to request execution of a training task of a first model; an execution module 702, used to execute the training task of the first model and suspend execution of the second training task if the priority of the training task of the first model is higher than the priority of the second training task; and/or, a sending module 703, used to send response information to the first device if the priority of the training task of the first model is lower than or equal to the priority of the second training task, the response information being used to notify the first device to delay or refuse execution of the training task of the first model, or the response information including information about the second training task; wherein the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
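The priority comparison carried out by the receiving, execution and sending modules can be summarized in a short sketch. The class names, the convention that a larger number means a higher priority, and the dictionary-shaped result are illustrative assumptions, not part of the claimed apparatus.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrainingTask:
    task_id: str
    priority: int                       # assumption: larger value = higher priority

@dataclass
class TrainingDevice:
    running: list = field(default_factory=list)   # tasks currently being executed

    def lowest_priority_task(self) -> Optional[TrainingTask]:
        return min(self.running, key=lambda t: t.priority, default=None)

    def on_request(self, new_task: TrainingTask) -> dict:
        """Handle new request information roughly as apparatus 700 does."""
        second = self.lowest_priority_task()
        if second is None or new_task.priority > second.priority:
            # execution module 702: run the new task and suspend the second task
            if second is not None:
                self.running.remove(second)
            self.running.append(new_task)
            return {"action": "execute", "suspended": second.task_id if second else None}
        # sending module 703: tell the first device to delay or refuse execution,
        # or include information about the second training task (here, its priority)
        return {"action": "respond", "second_task_priority": second.priority}
```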
Optionally, the request information includes the priority of the training task of the first model.
Optionally, the information about the second training task includes the priority of the second training task.
Optionally, the training task processing apparatus 700 further includes: a notification module 704, used to notify the second device corresponding to the second training task to suspend execution of the second training task; or, the notification module 704 is used to notify the second device corresponding to the second training task of the idle resources of the training device.
Optionally, the notification module 704 is specifically used to send a reason value to the second device, the reason value being used to indicate the reason for suspending execution of the second training task.
Optionally, the reason includes that the second training task is the training task with the lowest priority on the training device.
Optionally, the reason value includes the priority of each of at least one third training task currently being executed by the training device, where the priority of each third training task is higher than the priority of the second training task.
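A small sketch of how such a notification with a reason value could be assembled is given below; the message field names are hypothetical and only illustrate that the reason value can carry the priorities of the third training tasks.

```python
def build_pause_notification(second_priority: int, running_priorities: list) -> dict:
    """Notify the second device that its task is paused, with a reason value.

    The reason value lists the priority of every third training task, i.e. every
    currently executed task whose priority is higher than the second task's.
    """
    third_task_priorities = [p for p in running_priorities if p > second_priority]
    return {
        "notification": "pause_second_training_task",
        "reason_value": {"third_task_priorities": third_task_priorities},
    }
```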
Optionally, the receiving module 701 is also used to receive the processing method for the second training task from the second device, the processing method for the second training task being: termination, waiting, or adjustment of the occupied resources.
Optionally, the training task processing apparatus 700 further includes: a deletion module 705, used to delete the request information corresponding to the second training task when the processing method for the second training task is termination; or, the execution module 702 is further used, when the processing method for the second training task is waiting, to wait until the idle resources are greater than or equal to the resources required by the second training task and then continue executing the second training task; or, an adjustment module 706, used, when the processing method for the second training task is to adjust the occupied resources, to reduce the number of parameters of the model corresponding to the second training task, or lower the precision used when executing the second training task, or reduce the batch training scale used when executing the second training task.
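A sketch of how the three processing methods received from the second device might be dispatched on the training device follows; `delete_request`, `idle_resources`, `required_resources`, `resume` and `shrink` are hypothetical helpers introduced only to make the three branches concrete.

```python
import time

def handle_processing_method(device, task, method: str) -> None:
    """Apply the processing method fed back for the second training task."""
    if method == "terminate":
        # deletion module 705: drop the request information for the task and
        # free everything associated with it
        device.delete_request(task)
    elif method == "wait":
        # execution module 702: continue only once the idle resources are greater
        # than or equal to what the task needs (polling stands in for whatever
        # scheduling mechanism the device actually uses)
        while device.idle_resources < task.required_resources:
            time.sleep(1.0)
        device.resume(task)
    elif method == "adjust":
        # adjustment module 706: fewer parameters, lower precision, or smaller batches
        device.shrink(task)
    else:
        raise ValueError(f"unknown processing method: {method}")
```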
Optionally, the method is applied in the following scenarios: the resources required by the first training task are greater than the idle resources of the training device; or, the number of training tasks currently being executed by the training device is greater than a preset threshold.
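The two trigger scenarios can be expressed as a single check; the argument names and the boolean return are illustrative.

```python
def priority_handling_needed(required_resources: float, idle_resources: float,
                             running_task_count: int, preset_threshold: int) -> bool:
    """True when the first training task needs more than the idle resources,
    or the device already executes more tasks than the preset threshold."""
    return required_resources > idle_resources or running_task_count > preset_threshold
```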
Optionally, adjusting the occupied resources includes: reducing the number of parameters of the model corresponding to the second training task, lowering the precision used when executing the second training task, or reducing the batch training scale used when executing the second training task.
Optionally, each training task currently being executed by the training device is a task of training a model in the communications field.
It should be noted that the information interaction and execution processes between the modules/units in the training task processing apparatus 700 are based on the same concept as the method embodiments corresponding to Figures 4 to 6 in this application. For details, refer to the descriptions in the method embodiments shown above, which are not repeated here.
Please continue to refer to Figure 8. Figure 8 is another schematic structural diagram of the training task processing apparatus provided in an embodiment of this application. The training task processing apparatus 800 can be applied in a first device and includes: a sending module 801, used to send request information to the training device, the request information being used to request execution of a training task of a first model; and a receiving module 802, used to receive response information from the training device, the response information being used to notify the first device to delay or refuse execution of the training task of the first model, or the response information including information about a second training task; wherein the priority of the training task of the first model is lower than or equal to the priority of the second training task, and the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
Optionally, the sending module 801 is also used to send the processing method for the training task of the first model to the training device, the processing method for the training task of the first model being: termination, waiting, or adjustment of the occupied resources.
Optionally, adjusting the occupied resources includes: reducing the number of parameters of the model corresponding to the training task of the first model, lowering the precision used when executing the training task of the first model, or reducing the batch training scale used when executing the training task of the first model.
Optionally, the factors for determining the processing method for the training task of the first model include: the immediacy requirement of the first model, the accuracy requirement of the first model, and/or the degree to which the accuracy of the first model is reduced by adjusting the occupied resources.
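One way the first device could turn those factors into a processing method is sketched below; the ordering of the checks and the numeric comparison are assumptions, since the application lists the factors but does not fix a decision rule.

```python
def choose_processing_method(immediacy_required: bool,
                             required_accuracy: float,
                             accuracy_after_adjustment: float) -> str:
    """Pick termination, waiting, or adjustment of occupied resources."""
    if accuracy_after_adjustment >= required_accuracy:
        # the model still meets its accuracy requirement with reduced resources
        return "adjust"
    if not immediacy_required:
        # accuracy matters more than latency, so wait for resources to free up
        return "wait"
    # neither a degraded model nor a delayed one is acceptable
    return "terminate"
```

For example, under this illustrative rule `choose_processing_method(True, 0.9, 0.85)` returns "terminate", because the adjusted model would miss its accuracy requirement and the task cannot tolerate waiting.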
It should be noted that the information interaction and execution processes between the modules/units in the training task processing apparatus 800 are based on the same concept as the method embodiments corresponding to Figures 4 to 6 in this application. For details, refer to the descriptions in the method embodiments shown above, which are not repeated here.
Based on the embodiments corresponding to Figures 4 to 6, in order to better implement the above-described solutions of this application, related devices for implementing those solutions are also provided below. Specifically, referring to Figure 9, Figure 9 is another schematic structural diagram of the training task processing apparatus provided in an embodiment of this application. The training task processing apparatus 900 can be applied in a second device and includes: a sending module 901, used to send request information to the training device, the request information being used to request the training device to execute a training task of a second model; and a receiving module 902, used to receive notification information from the training device, the notification information indicating that execution of the training task of the second model is suspended, or the notification information indicating the idle resources of the training device, where the second training task includes the training task of the second model; wherein the priority of the second training task is lower than the priority of the first training task, and the first training task is a training task newly added on the training device.
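On the second device, handling the notification and replying with a processing method could look like the sketch below; the message fields and the `choose_method` callback are hypothetical, and the callback could be a rule such as the one sketched after the decision factors above.

```python
def on_notification(notification: dict, choose_method) -> dict:
    """Second-device-side reaction to the training device's notification.

    `choose_method` takes the reason value (or None) and returns "terminate",
    "wait" or "adjust"; all field names below are illustrative.
    """
    if notification.get("type") == "idle_resources":
        # the training device reported its idle resources rather than pausing;
        # if they still cover the second training task, nothing needs to change
        if notification["idle_resources"] >= notification.get("required_resources", 0):
            return {"processing_method": None}
    # otherwise the second model's training task was paused (or the idle
    # resources do not suffice): decide how to proceed and reply
    return {"processing_method": choose_method(notification.get("reason_value"))}
```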
Optionally, the notification information includes a reason value, the reason value being used to indicate the reason for suspending execution of the training task of the second model.
Optionally, the sending module 901 is also used to send the processing method for the training task of the second model to the training device, the processing method for the training task of the second model being: termination, waiting, or adjustment of the occupied resources.
It should be noted that the information interaction and execution processes between the modules/units in the training task processing apparatus 900 are based on the same concept as the method embodiments corresponding to Figures 4 to 6 in this application. For details, refer to the descriptions in the method embodiments shown above, which are not repeated here.
Please refer to Figure 10, which is a schematic structural diagram of a device provided in an embodiment of this application. The device 1000 can specifically be the training device, the first device, or the second device in the above embodiments. Referring to Figure 10, the device 1000 includes at least one processor 1001 and at least one memory 1002. The processor 1001 and the memory 1002 are connected, for example, through a bus.
The memory 1002 is primarily used to store software programs. The memory 1002 can exist independently and be connected to the processor 1001. Optionally, the memory 1002 can be integrated with the processor 1001, for example, within a single chip. The memory 1002 can store program code that executes the technical solutions of the embodiments of this application, and execution of that code is controlled by the processor 1001; the various types of computer program code being executed can also be regarded as drivers for the processor 1001.
The processor 1001 is primarily used to execute the software programs stored in the memory 1002, so as to implement the functions corresponding to the training device, the first device, or the second device in any of the embodiments shown in Figures 4 to 6.
Figure 10 shows only one memory and one processor. In an actual device, there can be multiple processors and multiple memories. The memory can also be called a storage medium or a storage device, among other names. The memory can be a storage element on the same chip as the processor, i.e., an on-chip storage element, or it can be a separate storage element; this is not limited in the embodiments of this application.
An embodiment of this application also provides a computer-readable storage medium storing a program that, when run on a computer, causes the computer to perform the steps executed by the training device in the methods described in the embodiments shown in Figures 4 to 6, or causes the computer to perform the steps executed by the first device in the methods described in the embodiments shown in Figures 5 to 6, or causes the computer to perform the steps executed by the second device in the methods described in the embodiments shown in Figures 5 to 6.
An embodiment of this application also provides a computer program product, which includes a program that, when run on a computer, causes the computer to perform the steps executed by the training device in the methods described in the embodiments shown in Figures 4 to 6, or causes the computer to perform the steps executed by the first device in the methods described in the embodiments shown in Figures 5 to 6, or causes the computer to perform the steps executed by the second device in the methods described in the embodiments shown in Figures 5 to 6.
An embodiment of this application also provides a circuit system. The circuit system includes a processing circuit configured to perform the methods described in the embodiments shown in Figures 4 to 6 above.
An embodiment of this application also provides a training processing system, which includes a training device, a first device, and a second device. The training device is used to execute the steps performed by the training device in the methods described in the embodiments shown in Figures 4 to 6, the first device is used to execute the steps performed by the first device in the methods described in the embodiments shown in Figures 5 to 6, and the second device is used to execute the steps performed by the second device in the methods described in the embodiments shown in Figures 5 to 6.
The training device, first device, second device, or training task processing apparatus provided in the embodiments of this application can specifically be a chip. The chip includes a processing unit and a communication unit. The processing unit can be, for example, a processor, and the communication unit can be, for example, an input/output interface, pins, or a circuit. The processing unit can execute the computer-executable instructions stored in a storage unit to cause the chip to perform the methods described in the embodiments shown in Figures 1 to 6. Optionally, the storage unit is a storage unit within the chip, such as a register or a cache; the storage unit can also be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).
The processor mentioned in any of the above places can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control program execution of the method of the first aspect above.
It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solutions of the embodiments. In addition, in the drawings of the device embodiments provided in this application, the connection relationships between modules indicate that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines.
From the description of the above implementations, those skilled in the art can clearly understand that this application can be implemented by software plus the necessary general-purpose hardware, and of course it can also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structures used to implement the same function can also be diverse, for example analog circuits, digital circuits, or dedicated circuits. However, for this application, a software program implementation is in most cases the better implementation. Based on this understanding, the technical solutions of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.
The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When software is used for implementation, they can be implemented, in whole or in part, in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) means. The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive (Solid State Disk, SSD)).
Claims (25)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410547036.5 | 2024-04-30 | ||
| CN202410547036.5A CN120872424A (en) | 2024-04-30 | 2024-04-30 | A method for processing training tasks and related equipment |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025227845A1 true WO2025227845A1 (en) | 2025-11-06 |
Family
ID=97457801
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2025/072421 Pending WO2025227845A1 (en) | 2024-04-30 | 2025-01-15 | Training task processing method and related device |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN120872424A (en) |
| WO (1) | WO2025227845A1 (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108027889A (en) * | 2016-01-25 | 2018-05-11 | 华为技术有限公司 | A training and scheduling method and related equipment for incremental learning cloud system |
| CN113051054A (en) * | 2021-03-24 | 2021-06-29 | 依瞳科技(深圳)有限公司 | Method, apparatus and computer readable storage medium for scheduling artificial intelligence platform resources |
| US20230168938A1 (en) * | 2021-11-29 | 2023-06-01 | International Business Machines Corporation | Performing batched training for machine-learning pipelines |
| CN117828341A (en) * | 2022-09-27 | 2024-04-05 | 华为技术有限公司 | A method, device and system for model training management |
Also Published As
| Publication number | Publication date |
|---|---|
| CN120872424A (en) | 2025-10-31 |