
WO2025227845A1 - Training task processing method and related device - Google Patents

Training task processing method and related device

Info

Publication number
WO2025227845A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
training task
task
model
priority
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2025/072421
Other languages
English (en)
Chinese (zh)
Inventor
张浩男
马川
李贤明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of WO2025227845A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 - Arrangements for executing specific programs
    • G06F 9/4401 - Bootstrapping
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method for processing training tasks and related equipment.
  • In this application, artificial intelligence is abbreviated as AI, and machine learning is abbreviated as ML.
  • The device requesting a machine learning model can send a request message to the training device to request that the training device perform the task of training the machine learning model.
  • the training device can perform the training task according to the received request message.
  • a single training device can receive requests from multiple requesting devices.
  • the training tasks currently being executed by the same training device may therefore include multiple training tasks.
  • Because the computing resources in the training device are limited, if the training device executes multiple training tasks simultaneously, it may become overloaded, which can interrupt the processes used to execute the training tasks or cause data loss.
  • This application provides a method for processing training tasks and related equipment.
  • the training device can use the priority of training tasks to decide whether to pause, delay, or refuse to execute certain training tasks, which helps to avoid overloading the training device, thereby helping to avoid interruption of the processes used to execute training tasks or data loss, and improving the stability of the training device while executing training tasks.
  • embodiments of this application provide a method for processing training tasks.
  • This method can be applied to the training phase of a model.
  • the method includes: a training device receives request information from a first device (hereinafter referred to as the "first request information" for ease of distinction), where the first request information requests that the training device execute a training task of a first model (hereinafter referred to as the first training task; "training task of the first model" and "first training task" are used interchangeably). If the priority of the training task of the first model is higher than the priority of a second training task, the training device executes the training task of the first model and suspends execution of the second training task. The second training task is one or more training tasks currently being executed by the training device, or the training task with the lowest priority among the training tasks currently being executed by the training device.
  • When the second training task includes multiple training tasks, "the priority of the first training task is higher than the priority of the second training task" means that the priority of the first training task is higher than the priority of each training task included in the second training task.
  • "The training device starting to execute the first training task" means that the first training task has been added to the training tasks currently being executed by the training device; "the training device pausing execution of the second training task" can also be described as the training device suspending the second training task, so that the second training task is no longer included in the training tasks currently being executed by the training device.
  • After receiving the first request information from the first device, if the priority of the training task of the first model is lower than or equal to the priority of the second training task, the training device sends response information to the first device.
  • the response information is used to notify the first device to delay or refuse to execute the training task of the first model, or the response information includes information about the second training task; wherein, the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
  • the training device can also determine whether to execute the training task of the first model based on the priority of the training task of the first model. If the priority of the training task of the first model is higher than the priority of the second training task, the training task of the first model can be executed and the execution of the second training task can be suspended. If the priority of the training task of the first model is lower than or equal to the priority of the second training task, the execution of the training task of the first model can be delayed or rejected.
  • the second training task is the training task currently being executed by the training device.
  • the training device can use the priority of the training task to decide to pause, delay or reject the execution of certain training tasks, which helps to avoid the overload of the training device, thereby helping to avoid the interruption of the process used to execute the training task or the loss of data, and improving the stability of the training device in the process of executing the training task.
  • the second training task may include one or more training tasks with the lowest priority among all currently executed training tasks, and the sum of the idle resources of the training device and the occupied resources of the second training task is greater than or equal to the required resources of the training task of the first model.
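  • For illustration only, the following Python sketch shows one way a training device could pick the lowest-priority running tasks to suspend so that its idle resources plus the freed resources cover the resources required by the first training task; the class, function, and resource model are assumptions, not part of the claimed method.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrainingTask:
    task_id: str
    priority: int            # higher value means higher priority (assumed convention)
    occupied_resources: int  # e.g. memory units currently held by the task

def select_tasks_to_suspend(running: List[TrainingTask],
                            idle_resources: int,
                            required_resources: int,
                            new_priority: int) -> Optional[List[TrainingTask]]:
    """Pick the lowest-priority running tasks (all with priority below the new
    task's) whose freed resources, together with the idle resources, cover the
    new task's requirement; return None if that is not possible."""
    candidates = sorted((t for t in running if t.priority < new_priority),
                        key=lambda t: t.priority)
    chosen, freed = [], 0
    for task in candidates:
        if idle_resources + freed >= required_resources:
            break
        chosen.append(task)
        freed += task.occupied_resources
    if idle_resources + freed >= required_resources:
        return chosen   # suspend these tasks, then start the new training task
    return None         # otherwise delay or refuse the new training task
```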
  • the first request information includes the priority of the training task of the first model, which can also be understood as the first request information including the priority of the first training task; for example, the first device determines the priority of the first training task based on a first factor, which may include the inference type of the first model; optionally, the first factor may also include: the resource requirements of the first training task and the immediacy requirements of the first model.
  • the training device can directly obtain the priority of the first training task from the first request information. This allows for a faster comparison between the priority of the first training task and the priorities of other training tasks. Since the decision on whether to execute the first training task can only be made after determining the comparison result between the priority of the first training task and the priority of the second training task, this also facilitates a faster determination of whether to execute the first training task. Furthermore, obtaining the priority of the first training task directly from the first request information reduces the resources consumed in the process of "the training device determining the priority of the first training task," allowing the training device to allocate more resources to executing the training task and obtain a model that has completed the training task as quickly as possible.
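  • As a hedged illustration of the first request information carrying a priority, the sketch below shows a hypothetical request structure and a hypothetical rule by which a first device might derive the priority from the first factor (inference type, resource requirements, immediacy); the field names and the scoring rule are assumptions only.

```python
from dataclasses import dataclass

@dataclass
class FirstRequestInfo:
    model_id: str
    priority: int            # determined by the first device before sending
    required_resources: int  # resources the first training task is expected to need

def determine_priority(inference_type: str, required_resources: int, immediacy: int) -> int:
    """Hypothetical scoring rule combining the components of the 'first factor';
    the actual mapping is left open by this application."""
    type_weight = {"safety_critical": 3, "network_optimization": 2}.get(inference_type, 1)
    return 10 * type_weight + immediacy - required_resources // 100
```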
  • the information of the second training task includes the priority of each training task included in the second training task; optionally, the information of the second training task may also include identification information of each training task in the second training task; optionally, it may also include the inference type corresponding to each training task in the second training task.
  • the second request information corresponding to the second training task may include the priority of the second training task, and optionally, the training device may also adjust the priority of the second training task.
  • the training device will delay or refuse to execute the first training task.
  • the response information sent by the training device to the first device includes the priority of the second training task.
  • the first device can not only know that the training device has decided to delay or refuse to execute the first training task, but also know that the training device is executing the second training task with a higher priority, so it decides to delay or refuse to execute the first training task.
  • the first device can know the current status of the training device, which makes it easier for the first device to determine a more suitable processing method by combining the current status of the training device and the needs of the first model. It also helps to make the use of resources in the training device better meet the needs of the current scenario.
  • the method further includes: the training device notifying the second device corresponding to the second training task to suspend the execution of the second training task; or, the training device notifying the second device corresponding to the second training task of the training device's idle resources.
  • the idle resources of the training device may include idle storage resources in the training device; optionally, it may also include idle processor resources in the training device.
  • After suspending execution of the second training task, the training device promptly notifies the second device corresponding to the second training task that the training device has suspended execution of the second training task, thereby enabling the second device to quickly learn that the second training task has been suspended and making the execution process of the second training task more controllable.
  • the training device notifies the second device corresponding to the second training task to suspend the execution of the second training task, including: the training device sending a reason value to the second device, the reason value indicating the reason for suspending the execution of the second training task.
  • the training device sends a reason value to the second device, which indicates the reason for suspending the execution of the second training task.
  • the second device can not only promptly know that the second training task has been suspended, but also know the reason for the suspension. This facilitates the second device in promptly determining the handling method for the second training task after confirming that it has been suspended, and also allows the second device to determine a more suitable handling method based on the reason for suspending the second training task.
  • the reason is that the second training task is the lowest priority training task on the training device.
  • the second training task whose execution the training device pauses is the lowest-priority training task on the training device; that is, the resources in the training device can be allocated to higher-priority training tasks as much as possible, so that the resources in the training device are used more efficiently.
  • the reason value includes the priority of each of at least one third training task currently being executed by the training device, wherein the priority of each third training task is higher than the priority of the second training task.
  • "At least one third training task" can include all training tasks currently being executed by the training device. It should be noted that, since the training device may pause executing old training tasks or begin executing new training tasks, the training tasks included in "the training tasks currently being executed by the training device" in this application can vary. If, when the training device sends a reason value to the second device, it has paused executing the second training task and begun executing the first training task, then the training tasks currently being executed by the training device can include the first training task; that is, at least one third training task can include the first training task. Alternatively, "at least one third training task" can include only the first training task, or a preset number of training tasks selected from all training tasks currently being executed by the training device, etc.
  • the above reason value may also include identification information for each third training task; alternatively, it may also include the inference type of each third training task.
  • the reason value sent by the training device to the second device includes the priority of each of the at least one third training task currently being executed by the training device.
  • the priority of each training task is higher than that of the second training task.
  • the second device can not only know that the execution of the second training task is suspended due to the low priority, but also know the priority of the third training task that is currently occupying the resources of the training device. That is, the second device can know a more detailed resource usage of the training device. This makes it easier for the second device to determine a more suitable processing method by combining the resource usage of the training device with the needs of the second model. It also helps to make the resource usage of the training device better meet the needs of the current scenario.
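  • Purely as an illustration, the pause notification and reason value described above could be carried in a structure like the following; every field name here is hypothetical and not defined by this application.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ThirdTaskInfo:
    priority: int             # higher than the suspended (second) task's priority
    task_id: str = ""         # optional identification information
    inference_type: str = ""  # optional inference type of the third training task

@dataclass
class SuspendNotification:
    suspended_task_id: str
    reason: str = "lowest_priority_on_training_device"
    # priorities (and optional details) of tasks currently occupying the device
    higher_priority_tasks: List[ThirdTaskInfo] = field(default_factory=list)
```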
  • the method further includes: the training device receiving a processing method for the second training task from the second device, wherein the processing method for the second training task is: termination, waiting, or adjustment of resource usage.
  • the second device can send a processing method for the second training task to the training device. Since the second device has a clearer understanding of its needs for the second model, it can determine whether to terminate, wait, or adjust resource usage based on its needs for the second model. This helps improve the adaptability of the processing method for the second training task to specific application scenarios.
  • the method further includes: when the processing mode of the second training task is termination, the training device deletes the second request information corresponding to the second training task.
  • the training device waits until the idle resources are greater than or equal to the resources required by the second training task before continuing to execute the second training task.
  • the training device reduces the number of parameters of the model corresponding to the second training task, or reduces the accuracy when executing the second training task, or reduces the batch training scale when executing the second training task.
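  • A minimal, self-contained sketch of how a training device might act on the processing method returned by the second device; the data structures and return values are assumptions only.

```python
def handle_processing_method(method: str,
                             request_queue: dict,
                             task_id: str,
                             idle_resources: int,
                             required_resources: int) -> str:
    """Illustrative dispatch on the processing method chosen by the second device."""
    if method == "terminate":
        request_queue.pop(task_id, None)  # delete the second request information
        return "terminated"
    if method == "wait":
        # keep the request; resume only once idle resources cover the task's needs
        return "resumed" if idle_resources >= required_resources else "waiting"
    if method == "adjust":
        return "adjusted"                 # reduce parameters, precision, or batch size
    return "unknown"

# usage (hypothetical values):
# handle_processing_method("wait", {"task-2": "request info"}, "task-2",
#                          idle_resources=4, required_resources=8)
```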
  • a training task involves multiple training processes on a model.
  • Batch training refers to dividing the multiple training processes of a training task into multiple batches.
  • n training samples are obtained from the training dataset, where n is an integer greater than or equal to 1.
  • the batch size represents the number of training samples used in the training process of a single batch, i.e., the value of n.
  • the "batch size" affects the memory resources required by the model in the training process of a single batch.
  • "reducing the number of parameters in the model corresponding to the second training task” can be achieved by pruning the second model.
  • the specific pruning algorithm can be flexibly determined based on the actual application scenario, thereby reducing not only the storage resources used but also the processor resources used.
  • "Reducing the precision when performing the second training task” can be achieved by replacing the high-precision data format with a low-precision data format when performing the second training task, such as reducing from FP32 to FP16, or from FP64 to mixed precision, etc. This example is only for the purpose of understanding the solution, thereby reducing not only the storage resources used but also the processor resources used.
  • "Reducing the batch training size when performing the second training task” can be achieved by reducing the number of training samples used in a single batch of training, thereby reducing the storage resources required for a single batch of training.
  • For each of the different processing methods of the second training task fed back by the second device, the training device has a corresponding processing scheme, which helps improve the smoothness and stability of this solution during execution.
  • When the processing method of the second training task is termination, the second request information is deleted, thereby releasing all resources related to the second request information in the training device in a timely manner, which helps to avoid wasting resources in the training device.
  • the method is applied to the following scenarios: the resource requirements of the first training task exceed the idle resources of the training device; or, the number of training tasks currently being executed by the training device exceeds a preset threshold.
  • the resource requirements of the first training task may include the storage resources required by the first training task, and optionally, may also include the processor resources required by the first training task.
  • the storage resources include video memory resources and/or system memory resources.
  • the training task processing method provided in this application is started when the required resources of the first training task exceed the idle resources of the training device, or when the number of training tasks currently being executed by the training device exceeds a preset threshold. That is, in the aforementioned scenarios, the training device will pause, delay, or refuse to execute certain training tasks, providing two application scenarios for this application and improving the implementation flexibility of this solution. In addition, in the aforementioned two application scenarios, the load limit of the training device is almost reached, which is conducive to avoiding overload of the training device while making the maximum use of the resources in the training device.
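  • A one-line sketch of the two trigger scenarios described above, with assumed parameter names:

```python
def should_apply_priority_scheduling(required_resources: int,
                                     idle_resources: int,
                                     num_running_tasks: int,
                                     task_threshold: int) -> bool:
    """True when the new task needs more than the idle resources, or when the
    number of currently executed training tasks exceeds a preset threshold."""
    return required_resources > idle_resources or num_running_tasks > task_threshold
```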
  • adjusting resource usage includes: reducing the number of parameters in the model corresponding to the second training task, reducing the accuracy when performing the second training task, or reducing the batch training scale when performing the second training task.
  • the second device can further indicate how to adjust the resources used when performing the second training task. This helps to make the final second model more compatible with the application scenario of the second model, that is, it helps to obtain a more satisfactory second model under the premise of limited resources.
  • each training task currently performed by the training device is a task of training a model in the communication domain.
  • This implementation provides a specific application domain for the method of this application, increasing the degree of integration between this application and a specific application domain.
  • this application provides a method for processing training tasks, which can be applied to the training phase of a model.
  • the method includes: a first device sending request information to a training device, the request information being used to request the execution of a training task of a first model; the first device receiving response information from the training device, the response information being used to notify the first device to delay or refuse to execute the training task of the first model, or the response information including information of a second training task; wherein the priority of the training task of the first model is lower than or equal to the priority of the second training task, the second training task being one or more training tasks currently being executed by the training device, or the second training task being the lowest priority training task among the training tasks currently being executed by the training device.
  • the first device sends the processing method of the training task of the first model to the training device.
  • the processing method of the training task of the first model is: termination, waiting, or adjusting the occupied resources.
  • adjusting resource usage includes: reducing the number of parameters of the model corresponding to the training task of the first model, reducing the accuracy when performing the first training task, or reducing the batch training size when performing the first training task.
  • the determining factors for the processing method of the first model's training task include: the immediacy requirement of the first model, the accuracy requirement of the first model, and/or, the degree of reduction in the accuracy of the first model caused by adjusting resource usage.
  • the "immediacy requirement of the first model” can be understood as the time requirement for the first model after completing the first training task, or as how quickly the first model needs to be deployed. The shorter the deployment time, the higher the immediacy requirement of the first model.
  • the "degree of reduction in the accuracy of the first model caused by adjusting resource usage” can be determined by at least one of the following parameters: the first accuracy range of the first model obtained after adjusting the resources used during the execution of the first training task, and/or, the second accuracy range, where the second accuracy range represents the decrease in accuracy of the final obtained first model due to the adjustment of resource usage by a second accuracy range.
  • the processing method for the first model's training task is determined from two dimensions: time requirements and accuracy requirements. This approach is beneficial for obtaining a better processing method under the premise of limited resources.
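  • As a purely hypothetical example of weighing the factors above, a requesting device might map immediacy and accuracy requirements to a processing method roughly as follows; the real policy is left open by this application.

```python
def choose_processing_method(immediacy_is_high: bool,
                             accuracy_is_critical: bool,
                             accuracy_drop_acceptable: bool) -> str:
    """Assumed mapping from the determining factors to termination, waiting,
    or adjusting the occupied resources."""
    if immediacy_is_high and accuracy_drop_acceptable:
        return "adjust"     # accept a smaller/cheaper run so the model is ready sooner
    if accuracy_is_critical and not immediacy_is_high:
        return "wait"       # keep full accuracy and tolerate the delay
    return "terminate"      # neither waiting nor accuracy degradation is acceptable
```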
  • the first device may also perform the steps performed by the first device in the first aspect and various implementations of the first aspect.
  • the specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the second aspect can all be found in the first aspect, and will not be repeated here.
  • this application provides a method for processing training tasks, which can be applied to the training phase of a model.
  • the method includes: a second device sending a request message to a training device, the request message being used to request the training device to execute a training task of a second model (hereinafter referred to as the second training task); the second device receiving a notification message from the training device, the notification message indicating to suspend the execution of the training task of the second model, or the notification message indicating the idle resources of the training device; wherein, the priority of the second training task is lower than the priority of the first training task, the first training task being a training task added by the training device.
  • the notification information includes a reason value, which indicates the reason for pausing the training task of the second model.
  • the method further includes: the second device sending the processing method of the training task of the second model to the training device, wherein the processing method of the training task of the second model is: termination, waiting, or adjusting the occupied resources.
  • the determining factors for how the training task of the second model is handled include: the immediacy requirements of the second model, the accuracy requirements of the second model, and/or, the degree of reduction in the accuracy of the second model caused by adjusting resource usage.
  • the second device may also perform the steps performed by the second device in the first aspect and various implementations of the first aspect.
  • the specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the third aspect can all be found in the first aspect, and will not be repeated here.
  • the training task processing apparatus includes: a receiving module, configured to receive request information from a first device, the request information being used to request the execution of a training task of a first model; an execution module, configured to execute the training task of the first model and suspend the execution of the second training task if the priority of the training task of the first model is higher than the priority of the second training task; and/or, a sending module, configured to send response information to the first device if the priority of the training task of the first model is lower than or equal to the priority of the second training task, the response information being used to notify the first device to delay or refuse the execution of the training task of the first model, or the response information including information of the second training task; wherein the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
  • the request information includes the priority of the training task of the first model.
  • the information for the second training task includes the priority of the second training task.
  • the training task processing device further includes: a notification module for notifying the second device corresponding to the second training task to suspend the execution of the second training task; or, a notification module for notifying the second device corresponding to the second training task of the idle resources of the training device.
  • the notification module is specifically used to send a reason value to the second device, the reason value indicating the reason for suspending the execution of the second training task.
  • the reason is that the second training task is the lowest priority training task on the training device.
  • the cause value includes the priority of each of at least one third training task currently being performed by the training device, wherein the priority of each third training task is higher than the priority of the second training task.
  • the receiving module is further configured to receive the processing method of the second training task from the second device, wherein the processing method of the second training task is: termination, waiting, or adjustment of occupied resources.
  • the training task processing apparatus further includes: a deletion module, used to delete the request information corresponding to the second training task when the processing mode of the second training task is termination; or, an execution module, used to wait until the idle resources are greater than or equal to the resources required by the second training task when the processing mode of the second training task is waiting, and then continue to execute the second training task; or, an adjustment module, used to reduce the number of parameters of the model corresponding to the second training task, or reduce the accuracy when executing the second training task, or reduce the batch training scale when executing the second training task when the processing mode of the second training task is adjusting the occupied resources.
  • the method is applied to the following scenarios: the resource requirements of the first training task are greater than the available resources of the training device; or, the number of training tasks currently being executed by the training device is greater than a preset threshold.
  • adjusting resource usage includes: reducing the number of parameters in the model corresponding to the second training task, reducing the accuracy when performing the second training task, or reducing the batch training size when performing the second training task.
  • each training task currently performed by the training device is a task of training a model in the field of communications.
  • the training task processing device can also execute the steps performed by the training device in the first aspect and various implementations of the first aspect.
  • the specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the fourth aspect can all be found in the first aspect, and will not be repeated here.
  • the training task processing apparatus includes: a sending module, configured to send request information to the training device, the request information being used to request the execution of a training task for a first model; and a receiving module, configured to receive response information from the training device, the response information being used to notify the first device to delay or refuse to execute the training task for the first model, or the response information including information about a second training task; wherein the priority of the first training task is lower than or equal to the priority of the second training task, the second training task is one or more training tasks currently being executed by the training device, or the second training task is the lowest priority training task among the training tasks currently being executed by the training device.
  • the sending module is further configured to send the processing method of the first training task to the training device, wherein the processing method of the first training task is: termination, waiting, or adjustment of occupied resources.
  • adjusting resource usage includes: reducing the number of parameters in the model corresponding to the first training task, reducing the accuracy when performing the first training task, or reducing the batch training size when performing the first training task.
  • the determining factors for how the first training task is handled include: the immediacy requirements of the first model, the accuracy requirements of the first model, and/or, the degree of reduction in the accuracy of the first model caused by adjusting resource usage.
  • the training task processing device can also execute the steps executed by the first device in the second aspect and various implementations of the second aspect.
  • the specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the fifth aspect can all be found in the second aspect, and will not be repeated here.
  • Embodiments of this application provide a training task processing apparatus, which can be applied to the training phase of a model and can be used in a second device.
  • the training task processing apparatus includes: a sending module, used to send request information to the training device, the request information being used to request the training device to execute a training task of a second model; and a receiving module, used to receive notification information from the training device, the notification information indicating to suspend the execution of the training task of the second model, or the notification information indicating the idle resources of the training device; wherein, the priority of the second training task is lower than the priority of the first training task, and the first training task is a training task added by the training device.
  • the notification information includes a reason value, which indicates the reason for pausing the training task of the second model.
  • the sending module is further configured to send the processing method of the training task of the second model to the training device, wherein the processing method of the training task of the second model is: termination, waiting, or adjustment of resource usage.
  • the training task processing device can also execute the steps performed by the second device in the third aspect and various implementations of the third aspect.
  • the specific implementation methods, the meanings of the terms, and the beneficial effects of the steps in the sixth aspect can all be found in the second aspect, and will not be repeated here.
  • embodiments of this application provide an apparatus including a processor and a memory, the processor being coupled to the memory, the memory being used to store a program; and the processor being used to execute the program in the memory, causing the apparatus to perform the methods described in the first, second, or third aspects above.
  • embodiments of this application provide a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the methods described in the first, second, or third aspects above.
  • embodiments of this application provide a computer program product, which includes a program that, when run on a computer, causes the computer to perform the methods described in the first, second, or third aspects above.
  • this application provides a chip system including a processor for supporting the implementation of the functions involved in the foregoing aspects, such as transmitting or processing data and/or information involved in the foregoing methods.
  • the chip system further includes a memory for storing program instructions and data necessary for a terminal device or communication device.
  • This chip system may be composed of chips or may include chips and other discrete devices.
  • Figure 1 is a schematic diagram of a structure
  • Figure 2 is a schematic diagram of the training and application phases of the model
  • Figure 3a is a schematic diagram of an architecture for a training task processing system
  • Figure 3b is a schematic diagram of another architecture for the training task processing system
  • Figure 3c is a schematic diagram of another architecture for the training task processing system
  • Figure 4 is a schematic diagram of a training task processing method provided in an embodiment of this application.
  • FIG. 5 is another schematic diagram of the training task processing method provided in the embodiments of this application.
  • Figure 6 is another schematic diagram of the training task processing method provided in the embodiments of this application.
  • Figure 7 is a schematic diagram of a training task processing device provided in an embodiment of this application.
  • Figure 8 is a schematic diagram of another structure of the training task processing device provided in the embodiment of this application.
  • Figure 9 is a schematic diagram of another structure of the training task processing device provided in the embodiment of this application.
  • Figure 10 is a schematic diagram of the structure of a device provided in an embodiment of this application.
  • “send” and “receive” refer to the direction of signal transmission.
  • “send information to device XX” can be understood as the destination of the information being device XX, which may include direct transmission via the air interface or indirect transmission by other units or modules via the air interface.
  • “Receive information from device YY” can be understood as the source of the information being device YY, which may include direct reception from device YY via the air interface or indirect reception from device YY via other units or modules via the air interface.
  • “Send” can also be understood as the "output” of the chip interface
  • “receive” can also be understood as the "input” of the chip interface.
  • sending and receiving can occur between devices or within devices, for example, through buses, traces, or interfaces between components, modules, chips, software modules, or hardware modules within a device. It is understood that information may undergo necessary processing, such as encoding and modulation, between the source and destination of information transmission, but the destination can understand the valid information from the source. Similar expressions in this application can be understood in a similar way and will not be elaborated further.
  • instruction can include direct and indirect instructions, as well as explicit and implicit instructions.
  • the information indicated by a certain piece of information (hereinafter referred to as instruction information) is called the information to be instructed.
  • there are many ways to indicate the information to be instructed such as, but not limited to, directly indicating the information to be instructed, such as the information to be instructed itself or its index. It can also indirectly indicate the information to be instructed by indicating other information, where there is an association between the other information and the information to be instructed; or it can indicate only a part of the information to be instructed, while the other parts are known or pre-agreed upon.
  • the instruction can be implemented by using a pre-agreed (e.g., protocol predefined) arrangement of various information, thereby reducing the instruction overhead to a certain extent.
  • This application does not limit the specific method of instruction. It is understood that for the sender of the instruction information, the instruction information can be used to indicate the information to be instructed; for the receiver of the instruction information, the instruction information can be used to determine the information to be instructed.
  • Figure 1 is a schematic structural diagram of an artificial intelligence main framework.
  • This framework is elaborated below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "Intelligent Information Chain” reflects a series of processes from data acquisition to processing. For example, it could be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output.
  • the "IT Value Chain” reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of human intelligence and information (provided and processed by technology) to the industrial ecosystem of the system.
  • the infrastructure provides computing power to support artificial intelligence systems, enabling communication with the external world and providing support through a basic platform. Communication with the outside world is achieved through sensors; computing power is provided by intelligent chips, which can specifically employ hardware acceleration chips such as central processing units (CPUs), neural network processing units (NPUs), graphics processing units (GPUs), application-specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs).
  • the basic platform includes distributed computing frameworks and related platform guarantees and support, which may include cloud storage and computing, interconnected networks, etc. For example, sensors communicate with the outside world to acquire data, and this data is provided to intelligent chips in the distributed computing system provided by the basic platform for computation.
  • The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, text, and IoT data from traditional devices, including business data from existing systems and sensor data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing typically includes methods such as data training, machine learning, deep learning, search, reasoning, and decision-making.
  • machine learning and deep learning can perform intelligent information modeling, extraction, preprocessing, and training on data, including symbolization and formalization.
  • Reasoning refers to the process in which, in a computer or intelligent system, the machine thinks and solves problems by simulating human intelligent reasoning, based on reasoning control strategies and using formalized information. Typical functions include search and matching.
  • Decision-making refers to the process of making decisions based on intelligent information after reasoning, and it typically provides functions such as classification, sorting, and prediction.
  • the results of the data processing can be used to form some general capabilities, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
  • Intelligent products and industry applications refer to products and applications of artificial intelligence systems in various fields. They encapsulate overall artificial intelligence solutions, productize intelligent information decision-making, and realize practical applications. Their application areas mainly include: communications, intelligent driving, intelligent terminals, intelligent transportation, smart homes, intelligent healthcare, intelligent security, intelligent manufacturing, and smart cities.
  • the method provided in this application can be applied to various application fields of artificial intelligence technology, specifically to the training stage of models in various application fields.
  • the method provided in this application can be applied to application scenarios where multiple training tasks are performed by the same training device, with each training task being a task of training a model.
  • the "model” in this application can also be called a "machine learning model.”
  • the machine learning model in this application can specifically be represented as a neural network, or as a non-neural network model, etc., which can be determined based on the actual application scenario.
  • models can be deployed in core network equipment, network equipment, and/or terminal equipment.
  • the functions implemented by the models may include, but are not limited to: predicting the movement trajectory of terminal equipment, compressing the codebook in the channel state information reference signal (CSI-RS), decompressing the codebook in the CSI-RS, beamforming, load balancing of network equipment, or other functions.
  • the specific devices in which models are deployed to implement which functions can be determined based on the actual application scenario.
  • core network equipment refers to cloud servers that carry various network functions.
  • the functions implemented through core network equipment include, but are not limited to: network data analytics function (NWDAF), access and mobility management function (AMF), session management function (SMF), authentication server function (AUSF), network exposure function (NEF), network repository function (NRF), network slice selection function (NSSF), unified data management (UDM), user plane function (UPF), etc., which will not be listed exhaustively here.
  • Network equipment can refer to devices that provide wireless access services in a wireless network.
  • a network device can be a device that connects terminal devices to a wireless network, and can also be called a base station; the aforementioned base station can be various forms of macro base stations, micro base stations, relay stations, or access points, etc.
  • the names of network devices with base station functions may differ.
  • a base station can be called an evolved Node B (eNB), a Node B (NB), the next-generation Node B (gNB) in a 5th generation (5G) communication system, a home base station (e.g., home evolved Node B, or home Node B, HNB), a base band unit (BBU), a wireless fidelity (Wi-Fi) access point (AP), a transmission reception point (TRP), or a radio network controller (RNC), etc.
  • the terminal device can achieve wireless access through the cooperation of multiple network nodes, with each network node performing a portion of the base station's functions.
  • network nodes can be central units (CU), distributed units (DU), CU-control plane (CP), CU-user plane (UP), or radio units (RU), etc.
  • CU and DU can be set up separately or included in the same network element, such as a baseband unit (BBU).
  • RU can be included in radio frequency equipment or radio frequency units, such as remote radio units (RRU), active antenna units (AAU), or remote radio heads (RRH).
  • A CU (or CU-CP and CU-UP), DU, or RU may have different names, but their meanings will be understood by those skilled in the art.
  • a CU can also be called an open CU (O-CU), a DU can also be called an open DU (O-DU), a CU-CP can also be called an open CU-CP (O-CU-CP), a CU-UP can also be called an open CU-UP (O-CU-UP), and an RU can also be called an open RU (O-RU).
  • Any of the CU (or CU-CP, CU-UP), DU, and RU units can be implemented through software modules, hardware modules, or a combination of software and hardware modules. This application does not limit the specific device form of the network equipment.
  • a terminal device refers to a wireless terminal device capable of receiving scheduling and instruction information sent by network devices.
  • Terminal devices can be handheld devices, vehicle-mounted devices, wearable devices, computing devices, or other processing devices with wireless communication capabilities, etc., without exhaustive list.
  • Terminal devices can communicate with one or more core network devices or the Internet via a radio access network (RAN).
  • terminal devices can be portable, pocket-sized, handheld, computer-embedded, or vehicle-mounted mobile devices that exchange voice and/or data with the RAN.
  • a terminal device can be a user agent, a cellular phone, a smartphone, a personal digital assistant (PDA), a tablet PC, a modem, a handset, a laptop computer, a personal communication service (PCS) phone, a remote station, an access point (AP), a remote terminal, an access terminal, a customer premises equipment (CPE), a terminal, a user equipment (UE), or a mobile terminal (MT), etc.
  • terminal devices can also be wearable devices, such as glasses, gloves, watches, clothing, and shoes.
  • terminal devices can also be drones, robots, terminal devices in device-to-device (D2D) communication, vehicle-to-everything (V2X) communication, virtual reality (VR) devices, augmented reality (AR) devices, wireless terminals in industrial control, terminal devices in self-driving, remote medical care, smart grids, smart cities, and smart homes.
  • the terminal device can also be a terminal device in a communication system after 5G (such as a sixth-generation (6G) communication system) or a terminal device in a future evolved public land mobile network (PLMN), etc.
  • functions implemented using models include: predicting the trajectory of obstacles around the vehicle, determining the vehicle's lateral and longitudinal directions, predicting the location areas the vehicle may reach in the future, performing trajectory planning, and other functions.
  • functions implemented using models include: image style transfer, image inpainting, predicting the category of objects in an image, speech recognition, text translation, and other functions.
  • Figure 2 is a schematic diagram of the training and application phases of the model. As shown in Figure 2, the entire system may include a requesting device 200, a training device 210, a database 220, an execution device 230, and a data storage system 240.
  • the execution device 230 includes a computing module 231.
  • the requesting device 200 and the training device 210 can be communicatively connected.
  • the requesting device 200 is a device that sends request information to the training device 210, which is used to request the execution of a training task for the model.
  • the requesting device 200 can be a terminal device, a network device, or a core network device.
  • the requesting device 200 can be a terminal device responsible for the operation, management, and maintenance of the network. Technicians can directly interact with the aforementioned terminal device to achieve network management and maintenance.
  • Database 220 stores a training dataset.
  • training device 210 acquires model 201 before the training operation is performed, and uses the training dataset to iteratively train model 201 until the preset convergence condition is met, thus obtaining model 201 after the training operation is performed.
  • Model 201 after the training operation is performed can also be called model 201 after training.
  • iteratively training model 201 can be understood as updating the weight parameters in model 201 multiple times.
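  • A minimal, self-contained sketch of iterative training as described here (fitting a single weight by gradient descent), not the specific training procedure of this application; the convergence tolerance and learning rate are assumed values.

```python
def train_until_converged(dataset, lr=0.01, tolerance=1e-8, max_iters=10_000):
    """Update the weight parameter repeatedly until a preset convergence
    condition is met (here: the change in mean squared error becomes tiny)."""
    w, prev_loss = 0.0, float("inf")
    for _ in range(max_iters):
        grad = sum(2 * (w * x - y) * x for x, y in dataset) / len(dataset)
        w -= lr * grad                                     # one weight update
        loss = sum((w * x - y) ** 2 for x, y in dataset) / len(dataset)
        if abs(prev_loss - loss) < tolerance:              # convergence condition met
            break
        prev_loss = loss
    return w

# usage: train_until_converged([(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)])
```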
  • the trained model 201 can be deployed to the computing module 231 of the execution device 230.
  • the execution device 230 can access data, code, etc., from the data storage system 240, and can also store data, instructions, etc., in the data storage system 240.
  • the data storage system 240 can be located within the execution device 230, or it can be an external storage device relative to the execution device 230.
  • the execution device 230 can input the data to be processed into the model 201 to obtain the prediction information generated by the model 201, thereby realizing the function of the model 201.
  • Figure 3a is a schematic diagram of an architecture of a training task processing system.
  • the training device is taken as a core network device in the communication field
  • the execution device is taken as a base station in the communication field.
  • the requesting device can be a terminal device that communicates with the core network device.
  • the requesting device can send request information to the core network device. This request information is used to request the execution of the training task.
  • When the core network device completes the training task based on the request information, it can obtain a model that has undergone training operations.
  • the model that has undergone training operations is deployed to multiple base stations. It should be understood that the example in Figure 3a is only for the convenience of understanding this scheme and is not intended to limit this scheme.
  • Figure 3a is merely a schematic diagram of one architecture for the training task processing system, and the positional relationships between the devices, components, modules, etc., shown in the figure do not constitute any limitation.
  • the training device 210 and the execution device 230 can also be integrated into the same device.
  • Figure 3b is a schematic diagram of another architecture for the training task processing system.
  • the training device and the execution device are both the same core network device as an example.
  • the requesting device can be a terminal device that communicates with the core network device. As shown in Figure 3b, the requesting device can send request information to the core network device.
  • When the core network device completes the training task based on the request information, it can obtain the model that has undergone training operations.
  • the model that has undergone training operations is deployed to the core network device. It should be understood that the example in Figure 3b is only for the convenience of understanding this solution and is not intended to limit this solution.
  • Figure 3c is a schematic diagram of another architecture of the training task processing system.
  • the training device and the execution device are both the same base station as an example.
  • the requesting device can be a terminal device that communicates with the core network device.
  • the requesting device can send request information to the core network device, and the core network device will then forward the request information to each base station.
  • After receiving the request information, if the base station completes the training task based on the request information, it can obtain the trained model.
  • the model that has been trained is deployed to the base station.
  • the requesting device can also be integrated with the training device in the same device.
  • When the core network device determines to perform a training operation on a certain model, it can also generate the request information itself and then perform the training task based on that request information. That is, the requesting device and the training device can be integrated in the same core network device.
  • Similarly, when the base station determines to perform a training operation on a certain model, it can also generate the request information itself and then perform the training task based on that request information. That is, the requesting device and the training device can be integrated in the same base station, etc.
  • the specific product forms of the "requesting device", "training device" and "execution device" can be determined in combination with the actual application scenario.
  • Figures 3a, 3b, and 3c above are only examples of applications in the field of communication.
  • the method provided by this application can also be applied to other fields.
  • For example, the requesting device and the training device may be integrated into the same device, such as a cloud server, while the executing device is a vehicle, etc.
  • the situations in each application field will not be listed here one by one.
  • this application discloses that after receiving a request information for executing a training task of a first model (hereinafter referred to as "first request information" for ease of distinction), the training device can determine whether to execute the first training task based on its priority.
  • the first training task includes the training task of the first model. If the priority of the first training task is higher than the priority of the second training task, the first training task can be executed, and the execution of the second training task can be suspended.
  • If the priority of the first training task is lower than or equal to the priority of the second training task, the execution of the first training task can be delayed or rejected.
  • the second training task is the training task currently being executed by the training device.
  • the training device can use the priority of training tasks to decide whether to pause, delay, or reject the execution of certain training tasks, which helps to avoid overloading the training device, thereby preventing process interruption or data loss and improving the stability of the training device during the execution of training tasks.
  • Figure 4 is a schematic diagram of a training task processing method provided in an embodiment of this application.
  • the training task processing method provided in this application may include:
  • the training device receives a first request message from the first device, the first request message being used to request the execution of the training task of the first model.
  • the first request information can be a single message or information within a message, without restriction.
  • The terms "training task of the first model" and "first training task" can be used interchangeably in this application, without restriction.
  • When the first device determines to train the first model, it can send the first request information to the training device.
  • the training device receives the first request information and, based on it, determines that it requests to execute the training task of the first model. Therefore, it can create a first training task, which includes the training task of the first model. It should be noted that the creation of the first training task by the training device does not mean that the training device immediately begins executing the first training task.
  • the training device can determine whether to begin executing the first training task in subsequent steps; the specific determination process will be described in subsequent steps.
  • the training device creating a first training task may include: the training device generating identification information for the first training task.
  • the training device can then place the first training task into a waiting queue; for example, the training device can place the identification information of the first training task into the waiting queue, and optionally, the training device can place the identification of the first training task and first request information into the waiting queue.
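  • As an illustrative sketch only (not part of the claimed method), the task creation and queuing described above could be organized as follows in Python; the class, field, and function names are assumptions introduced for illustration:

```python
import itertools
from collections import deque
from dataclasses import dataclass

_task_ids = itertools.count(1)          # simple source of identification information

@dataclass
class TrainingTask:
    # Hypothetical container for a training task; the field names are illustrative only.
    task_id: int
    request_info: dict                  # the first request information as received
    priority: int = 0                   # may later be set from the request or a first factor

waiting_queue: deque = deque()          # tasks that were created but not yet started

def create_training_task(request_info: dict) -> TrainingTask:
    """Create the first training task: generate identification information and place
    the task, together with its request information, into the waiting queue."""
    task = TrainingTask(task_id=next(_task_ids), request_info=request_info)
    waiting_queue.append(task)          # creating the task does not start it immediately
    return task

task = create_training_task({"inference_type": "trajectory_prediction"})
print(task.task_id, len(waiting_queue))  # -> 1 1
```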
  • The first device can be a requesting device; specifically, the first device (i.e., the requesting device) can be a terminal device, a network device, a core network device, or another type of device, depending on the actual application scenario.
  • the training device can be a network device, a core network device, a cloud server, or other types of devices.
  • the first training task is the training task of the first model, which can also be called the first machine learning model.
  • the first model can be a neural network or a non-neural network model.
  • the first request information may include information indicating the inference type of the first model; the training device can determine what kind of model the first training task is based on the inference type of the first model.
  • the inference type of the first model can also be referred to as the type of task performed by the first model, or other names, etc., which will not be exhaustively listed here.
  • the inference type of the first model can indicate the function implemented by the first model.
  • For example, if the inference type of the first model is predicting the movement trajectory of the terminal device, the function implemented by the first model is to predict the movement trajectory of the terminal device; if the inference type of the first model is compressing the codebook in CSI-RS, the function implemented by the first model is to compress the codebook in CSI-RS; and if the inference type of the first model is image classification, the function implemented by the first model is to predict the category of objects in the image, and so on.
  • the first request information indicates the inference type of the first model by including at least one of the following: a description of the inference type of the first model, an identifier of the inference type of the first model, a first model that has not yet been trained, an identifier of the first model, or other information.
  • Upon acquiring the untrained first model, the training device can determine what kind of information the first model outputs, that is, determine the inference type of the first model. For example, if the first model outputs the movement trajectory of the terminal device over a future period, the inference type of the first model is predicting the movement trajectory of the terminal device; as another example, if the first model outputs compressed information of the codebook in CSI-RS, the inference type of the first model is compressing the codebook in CSI-RS; as yet another example, if the first model outputs the category of an object in an image, the inference type of the first model is image classification, etc. These examples are only provided to facilitate understanding of how to determine the inference type of the first model.
  • If the first request information includes the identification information of the first model and the training device stores a correspondence between identification information and models, the training device can determine the first model corresponding to the identification information of the first model according to that correspondence. After obtaining the first model, it can determine what kind of information the first model outputs, that is, it can determine the inference type of the first model.
  • the first request information may further include: the accuracy requirement of the first training task, and/or, the batch training size requirement of the first training task.
  • the "accuracy requirement of the first training task" can be used to indicate the precision to be used when performing the first training task.
  • the precision used when performing a training task can be single-precision floating-point (FP32), half-precision floating-point (FP16), double-precision floating-point (FP64), mixed precision, or other precision, etc., which are not limited here.
  • a training task can include multiple training processes for a model.
  • Batch training refers to dividing the multiple training processes of a training task into multiple batches.
  • In the training process of a single batch, n training samples are obtained from the training dataset, where n is an integer greater than or equal to 1.
  • The batch size represents the number of training samples used in the training process of a single batch, that is, the value of n. For example, the batch size affects the GPU memory resources required by the model in the training process of a single batch.
  • the training device can determine the required precision when performing the first training task based on the precision requirements of the first training task, and/or, the training device can determine the number of training samples used in the training process of a single batch based on the batch training scale requirements of the first training task.
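  • The following minimal sketch (in Python, with assumed names and a simplified dtype mapping) illustrates how the batch training scale requirement (the value n) and the accuracy requirement could be applied to the training process of a single batch; it is not a prescribed implementation:

```python
# Illustrative only: applying the batch training scale requirement (n) and the accuracy
# requirement from the first request information to one pass over the training dataset.
PRECISION_MAP = {"FP16": "float16", "FP32": "float32", "FP64": "float64"}

def iterate_batches(training_dataset, batch_size):
    """Split the training passes into batches of n = batch_size samples each."""
    for start in range(0, len(training_dataset), batch_size):
        yield training_dataset[start:start + batch_size]

def run_one_epoch(training_dataset, batch_size, precision="FP32"):
    dtype = PRECISION_MAP.get(precision, "float32")   # precision used when performing the task
    for batch in iterate_batches(training_dataset, batch_size):
        # a real implementation would run forward/backward passes in `dtype` here; the
        # batch size chiefly determines the memory needed for this single-batch step
        _ = (len(batch), dtype)

run_one_epoch(list(range(10)), batch_size=4, precision="FP16")
```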
  • the first request information may also include the priority of the training task of the first model, which can also be understood as the first request information including the priority of the first training task; for example, the priority of the first training task may be expressed as a number, such as 1, 2, 3, 4, 5 or other numbers, with the larger the number, the higher the priority; or, the priority of the first training task may be expressed as text, such as first level, second level, third level, etc.
  • the first device determines the priority of the first training task before sending the first request information to the training device.
  • the determination of this priority can refer to the following two cases.
  • Scenario 1: The first device determines the priority of the first training task based on a first factor.
  • the first factor may include at least one of the following: the inference type of the first model, the resource requirements of the first training task, or the immediacy requirements of the first model.
  • The impact of the inference type of the first model on the priority of the first training task is described below. For example, the fewer resources the first training task requires, the higher its priority can be; the more resources the first training task requires, the lower its priority can be. Similarly, the higher the immediacy requirement of the first model, the higher its priority can be; the lower the immediacy requirement of the first model, the lower its priority can be.
  • the resource requirements for the first training task may include storage resources needed to execute the first training task, which may include video memory and/or system memory resources, without limitation.
  • the resource requirements for the first training task may also include processor resources or other resources needed to execute the first training task.
  • the immediacy requirement of the first model can be understood as the time requirement for the first model after completing the first training task, or as the time within which the first model needs to be deployed.
  • The shorter the deployment time, the higher the immediacy requirement of the first model.
  • For example, if Model 1 is used to compress or decompress the codebook in CSI-RS, then Model 1 has a high immediacy requirement; as another example, if Model 2 is used to implement load balancing of base stations, since Model 2 is updated periodically, Model 2 has a lower immediacy requirement, etc.
  • the examples here are only for the convenience of understanding this scheme.
  • the first factor includes the inference type of the first model; assuming the first device stores a correspondence between inference types and priorities, the first device can determine the priority corresponding to the inference type of the first model, i.e., the priority of the first training task, based on the correspondence.
  • the first factor includes not only the inference type of the first model, but also: the resource requirements of the first training task, and/or the immediacy requirements of the first model.
  • the first device acquires a first score corresponding to the inference type of the first model, and acquires a second score corresponding to the resource requirements of the first training task, and/or acquires a third score corresponding to the immediacy requirements of the first model.
  • the first device can perform a weighted summation of all the acquired scores to obtain the total score of the first training task, thereby determining the priority of the first training task. All acquired scores may include the first score, as well as the second and/or third scores. The higher the total score of the first training task, the higher its priority; the lower the total score of the first training task, the lower its priority.
  • If the first device stores a correspondence between inference types and scores, the first device can determine the first score corresponding to the inference type of the first model based on that correspondence.
  • the first device can store a correspondence between required resources and scores, and then the first device can determine the second score corresponding to the required resources of the first training task based on this correspondence.
  • the first device inputs the required resources of the first training task into a first preset algorithm to obtain the second score corresponding to the required resources of the first training task, whereby the first preset algorithm indicates the mapping relationship between required resources and scores.
  • the first device can store the correspondence between immediacy requirements and scores, and then the first device can determine the third score corresponding to the immediacy requirement of the first model based on the correspondence.
  • the first device inputs the immediacy requirement of the first model into a second preset algorithm to obtain the third score corresponding to the immediacy requirement of the first model.
  • the second preset algorithm indicates the mapping relationship between immediacy requirements and scores.
  • the first factor may include more or fewer elements.
  • the examples above are only for the convenience of understanding this solution and are not intended to limit this solution.
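  • As a hedged illustration of the weighted summation described above, the sketch below combines a first score (inference type), a second score (required resources), and a third score (immediacy requirement) into a total score; the concrete scores, weights, and mapping functions are assumptions, not values prescribed by this application:

```python
# Hypothetical score table and preset algorithms; higher total score -> higher priority.
INFERENCE_TYPE_SCORES = {"csi_rs_codebook_compression": 90,
                         "trajectory_prediction": 70,
                         "load_balancing": 40}

def resource_score(required_gib: float) -> float:
    # fewer required resources -> higher score (hypothetical first preset algorithm)
    return max(0.0, 100.0 - 2.0 * required_gib)

def immediacy_score(deploy_within_minutes: float) -> float:
    # shorter deployment time -> higher immediacy -> higher score (hypothetical second preset algorithm)
    return max(0.0, 100.0 - deploy_within_minutes)

def task_priority(inference_type, required_gib, deploy_within_minutes,
                  weights=(0.5, 0.2, 0.3)) -> float:
    first = INFERENCE_TYPE_SCORES.get(inference_type, 50)
    second = resource_score(required_gib)
    third = immediacy_score(deploy_within_minutes)
    w1, w2, w3 = weights
    return w1 * first + w2 * second + w3 * third    # weighted summation of all acquired scores

print(task_priority("csi_rs_codebook_compression", required_gib=8, deploy_within_minutes=5))
```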
  • Scenario 2: The first device determines the priority of the first training task based on a first operation.
  • the first device can receive a first operation input by the user, and then determine the priority of the first training task based on the first operation.
  • the first operation can be a selection operation for the priority of the first training task, or it can be the priority of the first training task input through a text box, or it can be the priority of the first training task input by voice, etc.
  • the specific operation can be determined according to the actual product form.
  • the first request information may also include other information, such as the training data set of the first model, or the performance requirements of the first model, etc., which will not be listed here.
  • the training device can create a first training task based on the first request information, and then determine whether to execute the first training task.
  • the training device can determine whether to execute the first training task based on the priority of the first training task; when the training device is not in a first scenario, the training device can start executing the first training task; for example, the first scenario includes: the resource requirement of the first training task is greater than the idle resources of the training device, and/or, the number of training tasks currently being executed by the training device is greater than a preset threshold, which will be described below.
  • Scenario 1: The first scenario includes the resource requirement of the first training task being greater than the idle resources of the training device.
  • The training device can determine whether the resource requirement of the first training task is greater than the idle resources of the training device. If the resource requirement of the first training task is greater than the idle resources of the training device, the training device can determine the second training task from all currently executing training tasks, and then determine whether the priority of the first training task is higher than the priority of the second training task. If the priority of the first training task is higher than the priority of the second training task, then proceed to step 402; if the priority of the first training task is lower than or equal to the priority of the second training task, then proceed to step 403. If the resource requirement of the first training task is less than or equal to the idle resources of the training device, then the first training task can be started. A decision flow along these lines is sketched below.
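  • The sketch below illustrates this decision flow under simplified assumptions; the dictionary fields and helper name are hypothetical and chosen only for illustration:

```python
def handle_new_task(new_task, running_tasks, idle_resources):
    """Return one of "start", "preempt", or "delay_or_reject"."""
    if new_task["required"] <= idle_resources:
        return "start"                      # enough idle resources: start immediately
    # not enough idle resources: compare against the lowest-priority running task
    second_task = min(running_tasks, key=lambda t: t["priority"])
    if new_task["priority"] > second_task["priority"]:
        return "preempt"                    # step 402: run the new task, suspend the second task
    return "delay_or_reject"                # step 403: send response information instead

running = [{"name": "A", "priority": 3, "required": 4},
           {"name": "B", "priority": 1, "required": 6}]
print(handle_new_task({"name": "C", "priority": 2, "required": 8}, running, idle_resources=2))
```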
  • the training device can determine the resource requirements of the first training task in the following three ways.
  • In the first way, the first request information may further include information indicating the required resources for the first training task, so that the training device can determine the required resources for the first training task based on the first request information.
  • In the second way, the training device can determine the resource requirements of the first training task based on the inference type of the first model, the accuracy requirements of the first training task, and the batch training scale requirements of the first training task.
  • the inference type of the first model indicates which type of first model to train; the more parameters the first model has, the more resources the first training task requires; the higher the accuracy requirements of the first training task, the more resources it requires; and the larger the batch training scale requirements of the first training task, the more GPU memory resources are required during the training of a single batch.
  • In the third way, the training device can store a correspondence between inference types and required resources, and then the training device can determine the required resources of the first training task corresponding to the inference type of the first model based on that correspondence.
  • the idle resources of the training device may include idle storage resources in the training device; optionally, it may also include idle processor resources in the training device, or it may include other types of idle resources, etc., which can be determined according to the actual application scenario.
  • the training device's determination of whether the resources required by the first training task are greater than the idle resources of the training device may include: the training device determining whether the storage resources required by the first training task are greater than the idle storage resources in the training device, for example, the aforementioned storage resources may include video memory resources and/or system memory resources.
  • it may also include: the training device determining whether the processor resources required by the first training task are greater than the idle processor resources in the training device, etc., which can be set according to the actual application scenario.
  • The second training task is the training task with the lowest priority among all training tasks currently being executed by the training device. Therefore, the first training task having a higher priority than the second training task can also be understood as the first training task having a higher priority than the lowest-priority training task among all currently executing training tasks.
  • the second training task comprises S training tasks currently being executed by the training device, where S is an integer greater than or equal to 1. That is, the second training task includes one or more training tasks currently being executed by the training device.
  • The first training task having a higher priority than the second training task indicates that the first training task has a higher priority than any of the S training tasks.
  • The first training task having a priority lower than or equal to that of the second training task indicates that the first training task has a priority lower than or equal to that of any of the S training tasks.
  • the training device can determine a first difference between the resource requirement of the first training task and the idle resources of the training device. Based on the first difference, the training device determines S training tasks from all currently executed training tasks, which are the second training tasks.
  • the S training tasks can be the S lowest priority training tasks among all currently executed training tasks, and the sum of the resources occupied by the S training tasks is greater than or equal to the first difference, that is, the sum of the idle resources of the training device and the resources occupied by all the training tasks among the S training tasks is greater than or equal to the resource requirement of the first training task.
  • the resources occupied by the training task may include the storage resources occupied by the training task, and optionally, the processor resources occupied by the training task, or other types of resources, which can be determined according to the actual application scenario.
  • S can be a preset value, and the training device can determine the S lowest priority training tasks from all currently executed training tasks.
  • Alternatively, S can be a preset value, and the training device can determine the S most resource-intensive training tasks from all currently executed training tasks.
  • Another implementation can also involve the training device randomly determining S training tasks from all currently executed training tasks, etc. The specific implementation method for the training device to determine the second training task can be determined based on the actual application scenario.
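  • One possible (non-limiting) way to select the S lowest-priority training tasks whose occupied resources cover the first difference is sketched below; the task structure and field names are assumptions used only to make the example concrete:

```python
def select_second_training_tasks(running_tasks, required, idle_resources):
    """Pick lowest-priority running tasks until the freed resources plus the idle
    resources cover the resource requirement of the first training task."""
    first_difference = required - idle_resources        # shortfall that must be freed
    selected, freed = [], 0
    for task in sorted(running_tasks, key=lambda t: t["priority"]):   # lowest priority first
        if freed >= first_difference:
            break
        selected.append(task)
        freed += task["occupied"]
    return selected if freed >= first_difference else []  # empty: cannot free enough

running = [{"name": "A", "priority": 3, "occupied": 4},
           {"name": "B", "priority": 1, "occupied": 2},
           {"name": "C", "priority": 2, "occupied": 3}]
print([t["name"] for t in select_second_training_tasks(running, required=7, idle_resources=3)])
# -> ['B', 'C']
```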
  • Scenario 2: The first scenario includes the number of training tasks currently being executed by the training device being greater than a preset threshold.
  • the training device can determine whether the number of training tasks currently being executed by the training device is greater than a preset threshold. If the number of training tasks currently being executed by the training device is greater than the preset threshold, the training device determines whether the priority of the first training task is higher than the priority of the second training task. If the priority of the first training task is higher than the priority of the second training task, then proceed to step 402; if the priority of the first training task is lower than or equal to the priority of the second training task, then proceed to step 403. If the number of training tasks currently being executed by the training device is less than or equal to the preset threshold, then the first training task can be started.
  • the preset threshold can be an integer greater than or equal to 1, for example, the preset threshold can be 4, 5, 6 or other values, which can be determined according to the actual application scenario.
  • Scenario 3: The first scenario includes the resource requirement of the first training task being greater than the idle resources of the training device, and the number of training tasks currently being executed by the training device being greater than a preset threshold.
  • the training device can determine whether the resource requirement of the first training task is greater than the idle resources of the training device, and whether the number of training tasks currently being executed by the training device is greater than a preset threshold. If the resource requirement of the first training task is greater than the idle resources of the training device, or the number of training tasks currently being executed by the training device is greater than the preset threshold, the training device will determine whether the priority of the first training task is higher than the priority of the second training task. If the resource requirement of the first training task is less than or equal to the idle resources of the training device, and the number of training tasks currently being executed by the training device is less than or equal to the preset threshold, then the first training task can be started.
  • the training task processing method provided in this application is started when the required resources of the first training task are greater than the idle resources of the training device, or when the number of training tasks currently being executed by the training device is greater than a preset threshold. That is, in the aforementioned scenarios, the training device will pause, delay, or refuse to execute certain training tasks, providing two application scenarios for this application and improving the implementation flexibility of this solution. In addition, in the aforementioned two application scenarios, the load limit of the training device is almost reached, which is conducive to avoiding overload of the training device while making the maximum use of the resources in the training device.
  • the training device can obtain the priority of the first training task from the first request information. In another implementation, the training device can determine the priority of the first training task based on a first factor. The specific implementation methods of the aforementioned steps can be found in the above description of the first device determining the priority of the first training task based on the first factor, and will not be repeated here.
  • the training device can also adjust the priority of the first training task.
  • the priority in the first request information and the priority of the first training task determined by the training device based on the first factor can both be understood as the initial priority of the first training task.
  • the training device can increase the initial priority of the first training task according to the first duration to obtain the current priority of the first training task.
  • the first duration is the duration between the time when the training device receives the first request information and the current time. For example, the longer the first duration, the higher the current priority of the first training task, and the shorter the first duration, the lower the current priority of the first training task.
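  • A minimal sketch of such priority aging, assuming a simple linear boost per minute of waiting (the rate and function names are illustrative, not prescribed by this application):

```python
import time

def current_priority(initial_priority, received_at, boost_per_minute=0.1):
    """Raise the initial priority according to the first duration, i.e. the time elapsed
    since the request information for the task was received."""
    first_duration = time.time() - received_at           # seconds since the request arrived
    return initial_priority + boost_per_minute * (first_duration / 60.0)

received = time.time() - 600                             # request received 10 minutes ago
print(round(current_priority(initial_priority=2.0, received_at=received), 2))  # -> ~3.0
```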
  • the specific implementation method for determining the priority of each training task included in the second training task by the training device is similar to the specific implementation method for determining the priority of the first training task by the training device, and will not be repeated here.
  • the training device can directly obtain the priority of the first training task from the first request information. This allows for a faster comparison between the priority of the first training task and the priorities of other training tasks. Since the decision on whether to execute the first training task can only be made after determining the comparison result between the priority of the first training task and the priority of the second training task, this also facilitates a faster determination of whether to execute the first training task. Furthermore, obtaining the priority of the first training task directly from the first request information reduces the resources consumed by the training device in determining the priority of the first training task. This allows the training device to allocate more resources to executing the training task, which is beneficial for obtaining a model that has completed the training task as quickly as possible.
  • each training task currently performed by the training device is a task of training a model in the communications domain.
  • The description of models in the communications domain can be found above and will not be repeated here. This provides a specific application domain for the method in this application, improving the degree of integration between this application and specific application domains.
  • the training device executes the training task of the first model and suspends the execution of the second training task.
  • the second training task can be one or more training tasks currently being executed by the training device, or the second training task can be the training task with the lowest priority among the training tasks currently being executed by the training device.
  • the execution of the first training task by the training device means that the first training task has been added to the training tasks that the training device is currently executing;
  • the suspension of the execution of the second training task by the training device can be referred to as the suspension of the second training task by the training device, so that the second training task is no longer included in the training tasks currently being executed by the training device.
  • the training device may also maintain a waiting queue.
  • the training device pauses the execution of the second training task, the second request information corresponding to the second training task can be put into the waiting queue.
  • the training device can also store information about the first training task.
  • This information may include the identification information of the first training task, and optionally, it may also include the priority of the first training task. If the training device can adjust the priority of the first training task, then the information about the first training task includes its current priority.
  • the information about the first training task may also include other types of information, such as the inference type of the first model corresponding to the first training task.
  • the specific information that the first training task's information may include can be determined based on the actual application scenario.
  • the training device sends a response message to the first device.
  • the response information can be used to notify the first device to delay or refuse to execute the training task of the first model, or the response information can include information about the second training task.
  • the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
  • the first device can receive response information from the training device, which indicates that the training device has not directly executed the first training task.
  • the response information may include an indication value corresponding to delay or rejection.
  • For example, the indication value corresponding to delay or rejection could be 1111, representing delaying or rejecting the execution of the first training task; or, the response information may directly carry a "delay" or "reject" indication, thereby informing the first device to delay or reject the execution of the first training task, etc.
  • the information of the second training task may include the information of each of the S training tasks, and the information of each of the S training tasks may include the identification information of each of the S training tasks; optionally, it may also include the priority of each of the S training tasks, that is, the information of the second training task may include the priority of the second training task; optionally, it may also include the inference type corresponding to each of the S training tasks, etc., all of which can be determined in combination with the actual application scenario.
  • the response information may include information about all training tasks currently being executed by the training device that have a higher priority than the first training task, wherein all training tasks currently being executed by the training device that have a higher priority than the first training task include the second training task.
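  • Purely as an illustration, the response information of step 403 could be assembled as follows; the field names are assumptions introduced here, and the indication value 1111 follows the example given above:

```python
def build_response(second_training_tasks, include_priorities=True):
    """Assemble illustrative response information: an indication value plus, optionally,
    information about the second training task(s) (identifier, priority, inference type)."""
    response = {"indication": 1111}     # delay or reject executing the first training task
    response["second_training_tasks"] = [
        {"task_id": t["task_id"],
         **({"priority": t["priority"]} if include_priorities else {}),
         "inference_type": t.get("inference_type")}
        for t in second_training_tasks
    ]
    return response

print(build_response([{"task_id": 7, "priority": 4, "inference_type": "load_balancing"}]))
```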
  • the training device can use the priority of training tasks to decide to pause, delay, or refuse to execute certain training tasks, which helps to avoid overloading of the training device, thereby helping to avoid interruption of the process used to execute training tasks or to avoid data loss, and improving the stability of the training device in the process of executing training tasks.
  • the training device will delay or refuse to execute the first training task.
  • the response information sent by the training device to the first device includes the priority of the second training task.
  • the first device can not only know that the training device has decided to delay or refuse to execute the first training task, but also know that the training device is executing the second training task with a higher priority.
  • the first device can know the current status of the training device, which makes it easier for the first device to determine a more suitable processing method by combining the current status of the training device and the needs of the first model. This also helps to make the use of resources in the training device better meet the needs of the current scenario.
  • Figure 5 is another schematic diagram of the training task processing method provided by the embodiments of this application.
  • the training task processing method provided by this application may include:
  • the second device sends a second request message to the training device.
  • the second device and the first device are different requesting devices.
  • the second device determines to train the second model, it can send a second request message to the training device.
  • the second request information is used to request the execution of the training task for the second model.
  • the training task for the second model can be called the second training task, and the two can be interchanged.
  • the training device can receive the second request information from the second device, and then determine the second training task based on the second request information; the meaning of the terms in step 501 and the specific implementation of the steps can be referred to the description of step 401 in the embodiment corresponding to Figure 4 above, the difference being that "first device” is replaced with “second device”, “first model” is replaced with “second model”, and “first request information” is replaced with “second request information”, which will not be repeated here.
  • the second training task may include one or more training tasks.
  • the second training task includes one training task, in which case the second request information may include one request message, and the second device may include one device.
  • the second training task includes at least two training tasks, in which case the second request information may also include at least two request messages corresponding one-to-one with the at least two training tasks.
  • Each of the at least two request messages is used to request the execution of one of the training tasks in the second training task, and the second device may include all devices that sent the aforementioned at least two request messages.
  • the training device can determine whether to start executing the second training task, and then start executing the second training task. For example, after receiving each request information included in the second request information, the training device can determine whether to start executing the training task corresponding to each request information, and then start executing the training task corresponding to each request information.
  • the process by which the training device determines whether to start executing the training task can be referred to the description of the training device determining whether to start executing the first training task in the embodiment corresponding to Figure 4 above, and will not be repeated here.
  • the training device receives the first request information from the first device.
  • the first request information can be used to request the execution of the training task of the first model.
  • the training device shall execute the training task of the first model and suspend the execution of the second training task.
  • the training task of the first model can be referred to as the first training task.
  • The training device sends a notification message to the second device corresponding to the second training task.
  • the notification information can be used to notify the second device to suspend the training task of the second model, or it can be used to notify the second device of the idle resources of the training device.
  • step 504 is an optional step.
  • the training device can also send a notification message to the second device corresponding to the second training task, and the second device can receive the notification message from the training device.
  • the notification information is used to instruct the second device to suspend the execution of the second training task, that is, in step 504, the training device notifies the second device corresponding to the second training task to suspend the execution of the second training task.
  • the training device notifying the second device to suspend the execution of the second training task may include: the training device sending a reason value to the second device, that is, the notification information includes a reason value, which indicates the reason for suspending the execution of the second training task.
  • the second device may receive the reason value from the training device.
  • the sending of a cause value by the training device to the second device can be understood as the training device sending the cause value to each device included in the second device.
  • the receiving of a cause value from the training device by the second device can be understood as each device included in the second device receiving a cause value from the training device.
  • this reason includes the second training task being the lowest priority training task on the training device.
  • The second training task whose execution the training device suspends is the lowest-priority training task on the training device, meaning that the resources in the training device can be allocated to higher-priority training tasks as much as possible, so that the resources in the training device can be used more efficiently.
  • the aforementioned cause value includes the priority of each of at least one third training task currently being performed by the training device, where the priority of each third training task is higher than the priority of the second training task.
  • At least one third training task may include all training tasks currently being executed by the training device. It should be noted that since the training device may pause the execution of old training tasks or begin executing new training tasks, the training tasks currently being executed by the training device in this application can vary. When the training device sends a cause value to the second device, the training device has paused the execution of the second training task and begun executing the first training task; therefore, the training tasks currently being executed by the training device may include the first training task, that is, at least one third training task may include the first training task. Alternatively, at least one third training task may include only the first training task. Or, at least one third training task may include a preset number of training tasks selected from all training tasks currently being executed by the training device, etc., which can be determined based on the actual application scenario.
  • the above-mentioned cause value may also include the identification information of each third training task; alternatively, the above-mentioned cause value may also include the inference type of each third training task, etc., which can be determined in combination with the actual application scenario.
  • the reason value sent by the training device to the second device includes the priority of each of the at least one third training tasks currently being executed by the training device.
  • the priority of each training task is higher than that of the second training task.
  • the second device can not only know that the execution of the second training task is suspended due to the low priority, but also know the priority of the third training task that is occupying the resources of the training device. That is, the second device can know a more detailed resource usage of the training device, which makes it easier for the second device to determine a more suitable processing method by combining the resource usage of the training device and the needs of the second model. It also helps to make the resource usage of the training device better meet the needs of the current scenario.
  • the aforementioned reason value can also be represented by a letter, such as "LL", which means that the execution of the second training task is paused because the second training task is the lowest priority training task on the training device.
  • the aforementioned reason value can also be expressed as a number.
  • the reason value can be expressed as "000000", which means that the second training task is paused because it is the lowest priority training task on the training device.
  • the aforementioned reason value can also be expressed in other forms, which can be set according to the actual application scenario.
  • After determining to pause the second training task, the training device sends a reason value to the second device.
  • This reason value indicates the reason for pausing the second training task. This allows the second device to promptly know that the second training task has been paused and, more importantly, the reason for the pause. This facilitates the second device in determining the appropriate handling method for the second training task after confirming its pause, and also allows it to determine a more suitable handling method based on the reason for the pause.
  • the notification information may include an indication value corresponding to the pause.
  • the indication value corresponding to the pause may be 2222, representing the pause of the second training task.
  • the notification information may include "pause execution," thereby informing the second device to pause the execution of the second training task. The specific details can be determined based on the actual application scenario.
  • After the training device suspends the execution of the second training task, it will promptly notify the second device corresponding to the second training task that the training device has suspended the execution of the second training task. This allows the second device to quickly know that the training device has suspended the execution of the second training task, making the execution process of the second training task more controllable.
  • the notification information is used to inform the second device of the idle resources of the training device. That is, in step 504, the training device notifies the second device of the idle resources of the training device.
  • the second device can determine that the training device has paused the execution of the second training task, and the second device can know how many idle resources are in the training device.
  • the concept of the idle resources of the training device can be referred to the description in Figure 4 above, and will not be repeated here.
  • the notification information may include the number of idle storage resources in the training device; optionally, the notification information may also include the number of idle processor resources in the training device, or the notification information may also include the number of other types of idle resources in the training device, which can be determined in combination with the actual application scenario.
  • After the training device suspends the execution of the second training task, it can notify the second device of the idle resources of the training device.
  • the second device can not only know that the training device has suspended the execution of the second training task, but also determine how to handle the situation of the training device suspending the execution of the second training task based on the idle resources of the training device, which is conducive to obtaining a processing method that is more suitable for the current state of the training device.
  • The manner in which the second device sends the processing method of the second training task to the training device is not limited.
  • the second training task can be handled by terminating, waiting, or adjusting the resources used.
  • step 505 is an optional step. After determining that the training device has paused the execution of the second training task, the second device can determine the processing method for the second training task and then send the processing method for the second training task to the training device.
  • The second device sending the processing method of the second training task to the training device can be understood as each device included in the second device sending, to the training device, the processing method of one of the training tasks in the second training task.
  • Correspondingly, the processing method of the second training task can be understood as the processing method of any one of the training tasks included in the second training task.
  • the second device can send an indication value corresponding to termination to the training device.
  • the indication value corresponding to termination can be 000000, DDDDD, or other types of indication values.
  • the second device can send feedback information to the training device to indicate "cancel the execution of this training task".
  • the specific implementation method can be determined according to the actual application scenario.
  • the second device can send an indication value corresponding to waiting to the training device; or, the second device can send feedback information to the training device to indicate "waiting", etc., without limitation.
  • this resource adjustment can further include: reducing the number of parameters in the model corresponding to the second training task, reducing the accuracy when executing the second training task, or reducing the batch training scale when executing the second training task.
  • reducing the number of parameters in the model corresponding to the second training task can be achieved by pruning the second model.
  • the specific pruning algorithm can be flexibly determined based on the actual application scenario, thereby reducing not only storage resources but also processor resources. Reducing the accuracy when executing the second training task can be achieved by replacing the high-precision data format with a low-precision data format, such as reducing from FP32 to FP16, or from FP64 to mixed precision.
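  • The three resource-adjustment options named above can be sketched as follows; the reduction factors, the task structure, and the precision-downgrade table are assumptions chosen only to make the example concrete, not the claimed procedure:

```python
PRECISION_DOWNGRADE = {"FP64": "mixed", "FP32": "FP16"}   # high-precision -> lower-precision format

def adjust_occupied_resources(task, mode):
    """Adjust the resources occupied when executing a training task (illustrative only)."""
    if mode == "prune":                 # reduce the number of parameters of the model
        task["num_parameters"] = int(task["num_parameters"] * 0.7)
    elif mode == "precision":           # replace the high-precision data format with a lower one
        task["precision"] = PRECISION_DOWNGRADE.get(task["precision"], task["precision"])
    elif mode == "batch":               # reduce the batch training scale
        task["batch_size"] = max(1, task["batch_size"] // 2)
    return task

print(adjust_occupied_resources({"num_parameters": 1_000_000, "precision": "FP32",
                                 "batch_size": 64}, mode="precision"))
```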
  • the second device can send first feedback information to the training device.
  • the first feedback information instructs the training device to process the second training task in a first manner.
  • the first manner is to reduce the number of parameters of the model corresponding to the second training task, reduce the accuracy when performing the second training task, or reduce the batch training scale when performing the second training task. That is, the second device not only instructs the processing method of the second training task to adjust the occupied resources, but also instructs what method to use to adjust the occupied resources when performing the second training task.
  • the second device can send the processing method of the second training task to the training device. Since the second device has a clearer understanding of the requirements of the second model, it can determine whether to terminate, wait, or adjust the resource usage of the second training task based on the requirements of the second model. This is beneficial to improving the adaptability between the processing method of the second training task and the specific application scenario.
  • the second device can further indicate how to adjust the resources occupied when executing the second training task. This is beneficial to make the final second model more compatible with the application scenario of the second model, that is, to obtain a more satisfactory second model under the premise of limited resources.
  • the second device can send an indication value to the training device corresponding to adjusting the occupied resources; or, the second device can send feedback information to the training device to indicate "adjusting the occupied resources," so that the training device can know that the processing method of the second training task is to adjust the occupied resources, and then the training device can determine how to adjust the occupied resources when performing the second training task.
  • the factors determining the processing method of the second training task include: the immediacy requirements of the second model, the accuracy requirements of the second model, and/or, the degree of reduction in the accuracy of the second model caused by adjusting the resources used.
  • For example, if the processing method is termination, the training device terminates the second training task, allowing the second device to promptly request other devices to execute it; alternatively, the processing method can be to wait, or to adjust the occupied resources.
  • the immediacy requirement of the second model can be referred to the description of the "immediacy requirement of the first model” in the corresponding embodiment of Figure 4 above, except that "first model” is replaced with “second model”, which will not be repeated here.
  • the "accuracy requirement of the second model” can be understood as the accuracy requirement of the second model after performing the second training task, or it can also be understood as the performance requirement of the second model after performing the second training task, etc.
  • "Model accuracy" or "model performance" can be understood as the accuracy of the prediction information generated by the model.
  • the degree of reduction in the accuracy of the second model caused by adjusting the occupied resources can be determined by at least one of the following parameters: the first accuracy range of the second model obtained after adjusting the resources occupied when performing the second training task, and/or the second accuracy range, where the second accuracy range represents the decrease in the accuracy of the final second model caused by adjusting the occupied resources, or may include other parameters, etc., which are not exhaustively listed in this application embodiment.
  • the immediacy requirement of the second model is that the second model, after completing the second training task, needs to be online within a second duration.
  • the accuracy requirement of the second model is that the accuracy of the second model after completing the second training task needs to be within a first preset accuracy range.
  • the second device can determine whether the second duration is greater than the preset duration. If the second duration is greater than the preset duration, the second device can determine that the processing method for the second training task is to wait. If the second duration is less than the preset duration, in one implementation, the second device can determine whether the first accuracy range is within the first preset accuracy range. If the first accuracy range is within the first preset accuracy range, the second device can determine that the processing method for the second training task is to adjust the occupied resources.
  • the second device can determine that the processing method for the second training task is to terminate, and then the second device can request other devices to execute the second training task.
  • the second device can determine whether the second precision range is within the second preset precision range. If the second precision range is within the second preset precision range, the second device can determine that the processing method for the second training task is to adjust the occupied resources. If the second precision range is not within the second preset precision range, that is, if there is a precision outside the second preset precision range, the second device can determine that the processing method for the second training task is to terminate, and then the second device can request other devices to execute the second training task.
  • the second device can determine the processing method of the second training task based on the immediacy requirements of the second model, the accuracy requirements of the second model, and/or the degree of reduction in the accuracy of the second model caused by the use of resources. That is, the processing method of the second training task is determined from the two dimensions of time requirements and accuracy requirements, which is beneficial to obtain a better processing method under the premise of limited resources.
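  • Under the stated assumptions (accuracy ranges given as (low, high) tuples and a single preset duration), the second device's decision described above could be sketched as follows; names and thresholds are illustrative only:

```python
def within(inner, outer):
    """True if the inner (low, high) accuracy range lies inside the outer range."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def choose_processing_method(second_duration, preset_duration,
                             first_accuracy_range, first_preset_accuracy_range):
    if second_duration > preset_duration:
        return "wait"                              # enough time to wait for idle resources
    if within(first_accuracy_range, first_preset_accuracy_range):
        return "adjust_occupied_resources"         # accuracy after adjustment still acceptable
    return "terminate"                             # request another device to train instead

print(choose_processing_method(second_duration=30, preset_duration=60,
                               first_accuracy_range=(0.88, 0.92),
                               first_preset_accuracy_range=(0.85, 1.0)))
# -> adjust_occupied_resources
```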
  • The training device performs processing based on the processing method of the second training task.
  • step 506 is an optional step.
  • Step 506 may include: when the processing mode of the second training task is termination, the training device deletes the second request information corresponding to the second training task; for example, the training device may delete the second request information corresponding to the second training task from the waiting queue.
  • When the processing method of the second training task is to wait, the training device waits until its idle resources are greater than or equal to the resources required by the second training task before continuing to execute the second training task.
  • the second request information corresponding to the second training task can be placed in a waiting queue.
  • When the head of the waiting queue is the second request information and the idle resources of the training device are greater than or equal to the resources required by the second training task, the second training task can continue to be executed.
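  • A minimal sketch (with assumed names) of resuming the second training task from the head of the waiting queue once the idle resources of the training device again cover what the task requires:

```python
from collections import deque

def try_resume(waiting_queue: deque, idle_resources: float):
    """Resume the task at the head of the waiting queue if idle resources suffice."""
    if waiting_queue and waiting_queue[0]["required"] <= idle_resources:
        return waiting_queue.popleft()   # continue executing the second training task
    return None                          # otherwise keep waiting

queue = deque([{"task_id": 2, "required": 6}])
print(try_resume(queue, idle_resources=8))   # -> {'task_id': 2, 'required': 6}
```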
  • Step 506 may include: when the processing method of the second training task is to adjust the occupied resources, regardless of whether the second device or the training device determines the method to adjust the occupied resources when executing the second training task, the training device can know the method to adjust the occupied resources when executing the second training task. In this way, the training device can reduce the number of parameters of the model corresponding to the second training task, or reduce the accuracy when executing the second training task, or reduce the batch training scale when executing the second training task.
  • the specific implementation methods of the above three methods can be referred to the description in step 505 above, which will not be repeated here.
  • different processing methods for the second training task fed back by the second device are provided, and specific processing schemes for the training device are provided.
  • the training device has a corresponding processing scheme, which helps to improve the smoothness and stability of the execution process of this application.
  • When the processing method of the second training task is termination, the second request information is deleted, thereby releasing all resources related to the second request information in the training device in a timely manner, which helps to avoid wasting resources in the training device.
  • the training device can also adjust the priority of the second training task, for example, by increasing the priority of the second training task.
  • Figure 6 is another schematic diagram of the training task processing method provided by the embodiments of this application.
  • the training task processing method provided by this application may include:
  • the second device sends a second request message to the training device.
  • The meaning of the terms in step 601 and the specific implementation of the steps can be found in the description of step 501 in the embodiment corresponding to Figure 5 above, and will not be repeated here.
  • the training device receives a first request message from the first device, the first request message being used to request the execution of the training task of the first model.
  • the training device sends a response message to the first device.
  • the training task of the first model may also be referred to as the first training task.
  • the response information can be used to notify the first device to delay or refuse to execute the training task of the first model, or the response information may include information about the second training task.
  • the training task of the first model can be handled by terminating, waiting, or adjusting the resources used.
  • the training device performs processing based on the processing method of the first training task.
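The priority comparison underlying these steps can be summarised by the sketch below: compare the priority of the newly requested first training task with the lowest-priority running task, then either preempt that task or answer the first device with response information. The function and field names are illustrative assumptions, not the claimed implementation.

```python
def handle_new_request(new_task: dict, running_tasks: list) -> dict:
    """Decide, on the training device, how to react to new request information."""
    lowest = min(running_tasks, key=lambda t: t["priority"])
    if new_task["priority"] > lowest["priority"]:
        return {"action": "execute_new_and_suspend", "suspended": lowest["name"]}
    # Lower or equal priority: notify the first device to delay or refuse, or
    # include information about the second training task (e.g. its priority).
    return {"action": "send_response", "second_task_priority": lowest["priority"]}

running = [{"name": "second_training_task", "priority": 2},
           {"name": "third_training_task", "priority": 5}]
print(handle_new_request({"name": "first_training_task", "priority": 3}, running))
print(handle_new_request({"name": "first_training_task", "priority": 1}, running))
```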
  • Figure 7 is a schematic diagram of a training task processing device provided in an embodiment of this application.
  • the training task processing device 700 can be applied in a training device.
  • the training task processing device 700 includes: a receiving module 701, used to receive request information from a first device, the request information being used to request the execution of a training task of a first model; an execution module 702, used to execute the training task of the first model and suspend the execution of the second training task if the priority of the training task of the first model is higher than the priority of the second training task; and/or, a sending module 703, used to send response information to the first device if the priority of the training task of the first model is lower than or equal to the priority of the second training task, the response information being used to notify the first device to delay or refuse the execution of the training task of the first model, or the response information including information about the second training task; wherein, the second training task is one or more training tasks currently being executed by the training device, or, the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
  • the request information may include the priority of the training task of the first model.
  • the information for the second training task includes the priority of the second training task.
  • the training task processing device 700 further includes: a notification module 704, used to notify the second device corresponding to the second training task to suspend the execution of the second training task; or, the notification module 704, used to notify the second device corresponding to the second training task of the idle resources of the training device.
  • the notification module 704 is specifically used to send a reason value to the second device, the reason value being used to indicate the reason for suspending the execution of the second training task.
  • the reason may include the second training task being the lowest priority training task on the training device.
  • the reason value includes the priority of each of the at least one third training task currently being performed by the training device, wherein the priority of each third training task is higher than the priority of the second training task.
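A possible shape for such a reason value is sketched below; the field names are assumptions, and the two branches correspond to the two cases described above.

```python
def build_reason_value(second_priority: int, third_task_priorities: list) -> dict:
    reason = {"suspended_task_priority": second_priority}
    if not third_task_priorities:
        # Case 1: the second training task is simply the lowest-priority task.
        reason["reason"] = "lowest_priority_on_training_device"
    else:
        # Case 2: report the priorities of the higher-priority third training tasks.
        assert all(p > second_priority for p in third_task_priorities)
        reason["reason"] = "preempted_by_higher_priority_tasks"
        reason["third_task_priorities"] = third_task_priorities
    return reason

print(build_reason_value(2, []))
print(build_reason_value(2, [4, 5]))
```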
  • the receiving module 701 is also used to receive the processing mode of the second training task from the second device, wherein the processing mode of the second training task is: termination, waiting, or adjusting the occupied resources.
  • the training task processing device 700 further includes: a deletion module 705, used to delete the request information corresponding to the second training task when the processing mode of the second training task is termination; or, an execution module 702, used to wait until the idle resources are greater than or equal to the resources required by the second training task when the processing mode of the second training task is waiting, and then continue to execute the second training task; or, an adjustment module 706, used to reduce the number of parameters of the model corresponding to the second training task, or reduce the accuracy when executing the second training task, or reduce the batch training scale when executing the second training task when the processing mode of the second training task is adjusting the occupied resources.
  • the method can be applied to the following scenarios: the resource requirements of the first training task are greater than the available resources of the training device; or, the number of training tasks currently being executed by the training device is greater than a preset threshold.
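The two trigger scenarios can be checked with a one-line predicate, sketched below; the threshold value and function name are illustrative assumptions.

```python
def needs_priority_arbitration(required_resources: int,
                               available_resources: int,
                               num_running_tasks: int,
                               max_tasks: int = 4) -> bool:
    """True if the new task cannot simply be admitted alongside the running tasks."""
    return (required_resources > available_resources) or (num_running_tasks > max_tasks)

print(needs_priority_arbitration(16, 8, 2))  # True: not enough available resources
print(needs_priority_arbitration(4, 8, 6))   # True: too many running tasks
print(needs_priority_arbitration(4, 8, 2))   # False
```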
  • adjusting the resources used may include: reducing the number of parameters of the model corresponding to the second training task, reducing the accuracy when performing the second training task, or reducing the batch training size when performing the second training task.
  • each training task currently being performed by the training device is a task of training a model in the field of communications.
  • Figure 8 is a schematic diagram of another structure of the training task processing device provided in an embodiment of this application.
  • the training task processing device 800 can be applied in a first device.
  • the training task processing device 800 includes: a sending module 801, used to send request information to the training device, the request information being used to request the execution of a training task of a first model; and a receiving module 802, used to receive response information from the training device, the response information being used to notify the first device to delay or refuse to execute the training task of the first model, or the response information including information of a second training task; wherein, the priority of the training task of the first model is lower than or equal to the priority of the second training task, the second training task is one or more training tasks currently being executed by the training device, or the second training task is the training task with the lowest priority among the training tasks currently being executed by the training device.
  • the sending module 801 is also used to send the processing method of the training task of the first model to the training device.
  • the processing method of the training task of the first model is: termination, waiting, or adjusting the occupied resources.
  • adjusting the resources used may include: reducing the number of parameters of the model corresponding to the training task of the first model, reducing the accuracy when performing the training task of the first model, or reducing the batch training scale when performing the training task of the first model.
  • the determining factors for how the training task of the first model is handled include: the immediacy requirements of the first model, the accuracy requirements of the first model, and/or, the degree of reduction in the accuracy of the first model caused by adjusting the resources used.
  • Figure 9 is a schematic diagram of another structure of the training task processing device provided in an embodiment of this application.
  • the training task processing device 900 can be applied to a second device.
  • the training task processing device 900 includes: a sending module 901, used to send request information to the training device, the request information being used to request the training device to execute a training task of a second model; a receiving module 902, used to receive notification information from the training device, the notification information indicating to suspend the execution of the training task of the second model, or the notification information indicating idle resources of the training device; the second training task includes the training task of the second model; wherein, the priority of the second training task is lower than the priority of the first training task, the first training task being a newly added training task by the training device.
  • the notification information includes a reason value, which indicates the reason for pausing the training task of the second model.
  • the sending module 901 is also used to send the processing method of the training task of the second model to the training device.
  • the processing method of the training task of the second model is: termination, waiting, or adjusting the occupied resources.
  • Figure 10 is a schematic diagram of the structure of a device provided in an embodiment of this application.
  • the device 1000 can specifically be the training device, the first device, or the second device in the above embodiments.
  • the device 1000 includes at least one processor 1001 and at least one memory 1002.
  • the processor 1001 and the memory 1002 are connected, for example, through a bus.
  • the memory 1002 is primarily used to store software programs.
  • the memory 1002 can exist independently and be connected to the processor 1001.
  • the memory 1002 can be integrated with the processor 1001, for example, integrated within a single chip.
  • the memory 1002 can store program code that executes the technical solutions of the embodiments of this application, and its execution is controlled by the processor 1001.
  • the various types of computer program code being executed can also be considered as drivers for the processor 1001.
  • the processor 1001 mainly executes the software program stored in the memory 1002 to implement the functions corresponding to the training device, the first device, or the second device in any of the embodiments shown in Figures 4 to 6.
  • Figure 10 shows only one memory and one processor. In actual devices, there can be multiple processors and multiple memories. Memory can also be called storage medium or storage device, etc. Memory can be a storage element on the same chip as the processor, i.e., an on-chip storage element, or it can be a separate storage element; this application does not limit this.
  • This application also provides a computer-readable storage medium storing a program that, when run on a computer, causes the computer to perform the steps executed by the training device in the methods described in the embodiments shown in Figures 4 to 6, or causes the computer to perform the steps executed by the first device in the methods described in the embodiments shown in Figures 5 to 6, or causes the computer to perform the steps executed by the second device in the methods described in the embodiments shown in Figures 5 to 6.
  • This application also provides a computer program product, which includes a program that, when run on a computer, causes the computer to perform the steps performed by the training device in the methods described in the embodiments shown in Figures 4 to 6, or causes the computer to perform the steps performed by the first device in the methods described in the embodiments shown in Figures 5 to 6, or causes the computer to perform the steps performed by the second device in the methods described in the embodiments shown in Figures 5 to 6.
  • This application also provides a circuit system including a processing circuit configured to perform the method described in the embodiments shown in Figures 4 to 6 above.
  • This application also provides a training processing system, which includes a training device, a first device, and a second device.
  • the training device is used to execute the steps performed by the training device in the methods described in the embodiments shown in Figures 4 to 6.
  • the first device is used to execute the steps performed by the first device in the methods described in the embodiments shown in Figures 5 to 6.
  • the second device is used to execute the steps performed by the second device in the methods described in the embodiments shown in Figures 5 to 6.
  • the training device, first device, second device, or training task processing apparatus provided in this application embodiment can specifically be a chip.
  • the chip includes a processing unit and a communication unit.
  • the processing unit can be, for example, a processor, and the communication unit can be, for example, an input/output interface, pins, or circuits.
  • the processing unit can execute computer-executable instructions stored in the storage unit to cause the chip to execute the methods described in the embodiments shown in Figures 1 to 6.
  • the storage unit is a storage unit within the chip, such as a register or cache.
  • Alternatively, the storage unit can be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).
  • the processor mentioned above can be a general-purpose central processing unit, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program of the method in the first aspect.
  • the device embodiments described above are merely illustrative.
  • the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units.
  • Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
  • the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.
  • the computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc, and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the various embodiments of this application.
  • the above embodiments can be implemented, in whole or in part, through software, hardware, firmware, or any combination thereof.
  • when implemented in software, the embodiments can be implemented, in whole or in part, in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means.
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media.
  • the available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)).

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)
  • Computer And Data Communications (AREA)

Abstract

Embodiments of the present application provide a training task processing method and a related device. The method can be applied to a model training phase and includes: receiving request information for requesting execution of a training task for a first model; if the priority of the training task for the first model is higher than the priority of a second training task, executing the training task for the first model and suspending execution of the second training task; and if the priority of the training task for the first model is lower than or equal to the priority of the second training task, sending response information, the response information being used to indicate a delay of or refusal to execute the training task for the first model, or the response information including information about the second training task, the second training task being a training task currently being executed by a training device. The priorities of training tasks are used to decide whether to suspend, delay, or refuse execution of certain training tasks, which helps prevent the training device from being overloaded and helps avoid interruptions to the execution of training tasks or loss of data.
PCT/CN2025/072421 2024-04-30 2025-01-15 Procédé de traitement de tâche d'entraînement et dispositif associé Pending WO2025227845A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202410547036.5 2024-04-30
CN202410547036.5A CN120872424A (zh) 2024-04-30 2024-04-30 一种训练任务的处理方法以及相关设备

Publications (1)

Publication Number Publication Date
WO2025227845A1 true WO2025227845A1 (fr) 2025-11-06

Family

ID=97457801

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2025/072421 Pending WO2025227845A1 (fr) 2024-04-30 2025-01-15 Procédé de traitement de tâche d'entraînement et dispositif associé

Country Status (2)

Country Link
CN (1) CN120872424A (fr)
WO (1) WO2025227845A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108027889A (zh) * 2016-01-25 2018-05-11 华为技术有限公司 一种用于增量式学习云系统的训练、调度方法及相关设备
CN113051054A (zh) * 2021-03-24 2021-06-29 依瞳科技(深圳)有限公司 调度人工智能平台资源的方法、设备和计算机可读存储介质
US20230168938A1 (en) * 2021-11-29 2023-06-01 International Business Machines Corporation Performing batched training for machine-learning pipelines
CN117828341A (zh) * 2022-09-27 2024-04-05 华为技术有限公司 一种模型训练管理的方法、装置和系统

Also Published As

Publication number Publication date
CN120872424A (zh) 2025-10-31

Similar Documents

Publication Publication Date Title
US20230232213A1 (en) Information transmission methods and apparatuses, and communication devices and storage medium
US20240106764A1 (en) Computing power resource scheduling method and related apparatus
CN114303347A (zh) 通信网络中与机器学习相关的方法、装置和机器可读介质
EP4580230A1 (fr) Procédé et appareil de communication
US20250086473A1 (en) Model training method and apparatus
WO2025007648A1 (fr) Procédé de planification de tâche informatique et appareil de communication
US20230112127A1 (en) Electronic device for deploying application and operation method thereof
US20250321788A1 (en) Computing task processing method and related apparatus
WO2025227845A1 (fr) Procédé de traitement de tâche d'entraînement et dispositif associé
CN113692052A (zh) 一种网络边缘机器学习训练方法
KR102382170B1 (ko) 영상 스트리밍 방법 및 장치
WO2024036526A1 (fr) Procédé et appareil de planification de modèle
CN114301924A (zh) 一种云边协同环境的应用任务调度方法及节点设备
EP4648375A1 (fr) Procédé et appareil de quantification de réseau, et dispositif associé
WO2025157098A1 (fr) Procédé et appareil d'inférence basés sur des informations de contexte pré-stockées dans un grand modèle
CN121052291A (en) Communication method and device
WO2025124135A1 (fr) Procédé et appareil de communication
CN119450415A (zh) 业务处理方法、装置、通信设备及可读存储介质
WO2023125934A1 (fr) Procédé et appareil de transmission d'informations de réseau d'ia, et dispositif de communication
CN120729694A (zh) 一种分布式ai任务的配置方法、管理节点以及计算节点
WO2025227698A1 (fr) Procédé de communication et appareil associé
WO2025098104A1 (fr) Procédé et appareil de communication, et support de stockage lisible
WO2025189831A1 (fr) Procédé de communication et appareil associé
CN120676371A (zh) 任务管理方法、装置、终端及网络侧设备
WO2025227699A1 (fr) Procédé de communication et appareil associé