
WO2022069059A1 - Variable importance learning for multi-task models - Google Patents

Variable importance learning for multi-task models

Info

Publication number
WO2022069059A1
Authority
WO
WIPO (PCT)
Prior art keywords
tasks
model
task
training
model parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2020/077651
Other languages
English (en)
Inventor
Rafail Nikolaos KOURDIS
Philip John GORINSKI
Gabriel Peter Albert GORDON-HALL
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to PCT/EP2020/077651 (WO2022069059A1)
Priority to CN202080103533.6A (CN116057537B)
Publication of WO2022069059A1
Anticipated expiration
Ceased (current legal status)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0985 Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • This invention relates to multi-task models, in particular to improving the training of such models.
  • Multi-task learning may be used to train models applicable in a variety of fields.
  • such models may be used in Natural Language Understanding (NLU) problems.
  • problems include, but are not limited to, Natural Language Inference (NLI), Entailment, or Word Sense Disambiguation (WSD).
  • All problems come with their own datasets for training, development, and test purposes, typically consisting of pairs of the Natural Language input (for example, sentences or sentence pairs) and an associated discrete label (for example, “Yes” or “No”) where the label is one of a pre-defined closed set of possible labels (classification problems), or an associated continuous score, for example in the range between 0 and 5 (regression problems).
  • the goal of any NLU system is to learn, from the provided data, to predict the correct output for a new input.
  • multi-task learning may train a model for a target task, where one of the available tasks is treated as the task to optimize performance for, and all other tasks are treated only as auxiliary, i.e., their contributions to model updates are only of interest in so far as they improve the performance of the target task.
  • prior methods generally estimate the task weights based on, and apply them to, the gradients derived from training input-output pairs. They are limited to very simple model optimization methods, typically only Stochastic Gradient Descent (SGD), as manipulating gradients before an optimization step can easily interfere with the internal state of more sophisticated optimizers such as Adam. In effect, any such method simply learns task-specific learning rates to be used with SGD, and is not easily (or at all) applicable to more advanced optimizers.
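The point that gradient-level task weights reduce to task-specific learning rates under plain SGD can be written out in one line. The symbols here are notation assumed for illustration and do not come from the source: η is the learning rate, ω_t the weight of task t, and g_t the gradient computed on data of task t.

```latex
\theta' \;=\; \theta - \eta\,(\omega_t\, g_t) \;=\; \theta - (\eta\,\omega_t)\, g_t
```

For an optimizer such as Adam, the scaled gradient also feeds the running moment estimates, so the same identity does not hold.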
  • a device comprising one or more processors configured to train a machine learning model having an initial set of model parameters, the device being configured to train the model using a respective training data set for each of a plurality of tasks, the plurality of tasks comprising a target task for which the model is to be trained and one or more auxiliary tasks, the device being configured to train the model by performing the following steps: assign at least one candidate scaling factor to each of the plurality of tasks; for each of one or more of the plurality of tasks, perform one or more optimization steps for the respective task in dependence on the respective candidate scaling factor to form a respective refined set of model parameters; perform a machine learning operation, by means of one or more predetermined evaluation criteria and based on one or more predetermined constraints, to assess the performance of the refined set of model parameters of the one or more of the plurality of tasks on the target task and thereby determine a set of mixing weights; update the set of model parameters of the machine learning model in accordance with the refined model parameters as weighted by the set of mixing weights; and update the candidate scaling factors in dependence on the set of mixing weights.
  • the use of the one or more auxiliary tasks during training may be so as to optimize the performance of the target task. This may allow the training process for the target task to gain information from one or more auxiliary tasks that is mutually beneficial.
  • the set of mixing weights may include one or more of a positive value, a zero value and a negative value. This may allow for the use of negatively related tasks for improving the performance of the target task. In cases where model updates of an auxiliary task are detrimental to the target task, their data may still be made use of by setting the mixing ratio of that task to a negative number, leading to updates that can instead be beneficial to the target task. In cases where an auxiliary task cannot be leveraged at all, the method can still set the weight to 0, ignoring the task.
  • the device may be configured to assign multiple candidate scaling factors to each of the plurality of tasks. This may allow different layers for each task to be assigned a different candidate scaling factor.
  • the device may be further configured to, for each task of the plurality of tasks, determine an optimal set of mixing weights with respect to the performance of the target task. This may allow the device to train a model having good performance for the target task.
  • the device may be further configured to, for each task of the plurality of tasks, adjust the at least one scaling factor based on the optimal set of mixing weights. This may allow the device to use the optimal scaling factors for each task.
  • the model architecture may comprise a plurality of layers and the device may be configured to, for at least one of the plurality of tasks, apply a different scaling factor to each of at least some of the plurality of layers.
  • the model architecture may comprise a plurality of layers and the device may be configured to, for each of the plurality of tasks, apply a single scaling factor to all layers of the model.
  • the α variables assigned to the tasks can either be “global”, applying exactly one mixing ratio per task to all layers of the model, or “local”, where one variable per task per model layer is used. In either case, the number of additional parameters introduced to the model is minimal, and often insignificant when compared against the parameters of the original multi-task model.
  • the device may be configured to, at the start of the training, assign a candidate scaling factor having a value of 1 to each of the plurality of tasks. This may provide a convenient initial value that may be optimized in successive iterations.
  • the model architecture may comprise an embedder comprising multiple layers, and at least one encoder block, each encoder block comprising a self-attention mechanism and having at least one linear layer and at least one normalization layer.
  • a pre-trained RoBERTa model may be used as an input encoder, having encoder layers and task-specific classification heads and comprising simple feed-forward layers on top.
  • the one or more of the plurality of tasks may be sampled at random from the plurality of tasks. This may allow the influence of different tasks on the training of the target task to be evaluated effectively.
  • the target task may be a predetermined task to be optimized by the training of the model. This may allow a model to be trained using datasets from various related tasks, where classifiers are trained for each of the tasks, under the assumption that information gained from training on the different tasks is mutually beneficial and improves performance with respect to the target task.
  • the model may be for performing Natural Language Understanding (NLU) tasks.
  • the model may allow a computer to interpret and use human language input.
  • problems include, but are not limited to, Natural Language Inference (NLI), Entailment, or Word Sense Disambiguation (WSD).
  • the plurality of tasks may correspond to related problems in the field of human language understanding.
  • the training datasets for each task preferably comprise pairs of the Natural Language input (for example, sentences or sentence pairs) and an associated discrete label (for example, “Yes” or “No”).
  • the label may be one of a pre-defined closed set of possible labels (classification problems), or an associated continuous score, for example in the range between 0 and 5 (regression problems).
  • the goal of such a Natural Language Understanding system may be to learn, from the provided data, to predict the correct output for a new input.
  • Natural Language Understanding tasks may include BoolQ, CommitBank, CoPA, RTE, WiC, MRPC, WNLI, CNLI, CoLA.
  • the use of arbitrary auxiliary tasks can boost target-task performance for NLU tasks.
  • the updated candidate scaling factors may indicate the benefit of a respective task to the training performance of the target task. This may allow the tasks to be weighted according to their effect on the training of the target task.
  • the step of performing a machine learning operation to assess the performance of the refined set of model parameters may comprise assessing the performance of a combination of the refined set of model parameters. This may allow the mixing weights to be more accurately determined than through a heuristic approach.
  • a method of training a machine learning model having an initial set of model parameters using a respective training data set for each of a plurality of tasks, the plurality of tasks comprising a target task for which the model is to be trained and one or more auxiliary tasks, the method comprising: assigning at least one candidate scaling factor to each of the plurality of tasks; for each of one or more of the plurality of tasks, performing one or more optimization steps for the respective task in dependence on the respective scaling factor to form a respective refined set of model parameters; performing a machine learning operation, by means of one or more predetermined evaluation criteria and based on one or more predetermined constraints, to assess the performance of the refined set of model parameters of the one or more of the plurality of tasks on the target task and thereby determine a set of mixing weights; updating the set of model parameters of the machine learning model in accordance with the refined set of model parameters as weighted by the set of mixing weights; and updating the candidate scaling factors in dependence on the set of mixing weights.
  • the method can therefore make use of model updates for auxiliary tasks if and when they help push the target task’s performance.
  • a device configured to implement a model trained by the method described above.
  • the device may be used to implement models relating to tasks including Natural Language Processing.
  • Figure 1 schematically illustrates a multi-task model.
  • Figure 2 schematically illustrates a single Transformer Block (courtesy of "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/).
  • Figure 3(a) schematically illustrates the model training of the training phase.
  • Figure 3(b) schematically illustrates the α variable updates of the training phase.
  • Figure 4 shows an exemplary algorithm for training the model.
  • Figure 5 schematically illustrates a device comprising a processor and a memory configured to train a machine learning model.
  • Figure 6 illustrates an example of a method for training a machine learning model.
  • the method described herein uses multi-task learning to train a model for a target task.
  • the target task is one of the available tasks that is treated as the task to optimize performance for, and all other tasks are treated as auxiliary, meaning that their contributions to model updates are only of interest in so far as they improve the performance of the target task.
  • the use of arbitrary auxiliary tasks during training can therefore boost target task performance.
  • parameters θ of a deep neural network are trained on the data of all tasks in T while minimizing
  • the cross-entropy loss for model training is used on all tasks in the set T of tasks, and the task-specific weights are used to augment the standard loss function, where y_t is the one-hot vector encoding the correct label of an instance of task t ∈ T, ŷ_t is the classifier prediction for that instance, ⟨·,·⟩ denotes the inner product of true label and prediction, and ω_t is the current weight associated with the task.
  • the method described herein provides a way of estimating the influence of tasks on the target’s performance, and of adjusting weights accordingly during model training.
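A weighted cross-entropy consistent with the symbols defined in the preceding item can be written as follows. This is a sketch only: the placement of the logarithm and the summation over tasks are assumptions made here, not a reproduction of the original formula.

```latex
\mathcal{L}(\theta) \;=\; -\sum_{t \in T} \omega_t \,\bigl\langle y_t,\ \log \hat{y}_t \bigr\rangle
```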
  • the method replaces the traditional estimation of task weights with a method based on the model updates, as expressed by the difference of the model’s parameters before and after one or more, preferably multiple, optimization steps (or an optimization procedure), henceforth referred to as the model “deltas”.
  • the method described herein estimates the optimal mixing ratio of task-specific model deltas with respect to performance of the target task with a meta-optimizer, and uses the mixing ratio to scale the deltas during training. Based on the determined optimal mixing ratio of deltas, the task-specific weights are then updated.
  • whereas weights of tasks can be limited to be positive, the mixing ratio of any given task may be negative, and the method may instead allow the weights to become arbitrarily large or small, including negative.
  • variables are introduced to the model training process, which are used to scale the individual tasks’ model updates by finding an optimal mixing ratio with respect to a performance measure on the target task. These variables are associated with the actually realized task-specific model updates.
  • variables can either be “global”, applying exactly one mixing ratio per task to all layers of the model, or “local”, where one variable per task per model layer is used.
  • the number of additional parameters introduced to the model are minimal, and often insignificant when compared against the parameters of the original multi-task model.
  • the original model has a total of over 124 million parameters, while this method introduces an additional 8 (global, 8 tasks) or 1,773 (local, 9 tasks × 197 weights and biases) α variables.
  • although some additional data of the target task can be set aside to estimate the α variables, in practice it is sufficient to estimate them on random subsets of the training data or the development data, so that no additional data is required, such as a new tuning set collected specifically for the α variables.
  • one implementation of the training method described herein can use a pre-trained RoBERTa model as an input encoder 101, having encoder layers 1-n, as illustrated at 102-104, and task-specific classification heads 1-t (i.e. each task has its own classification head), as illustrated at 105-107, comprising simple feed-forward layers on top.
  • RoBERTa is based on BERT, which is a well-established model in Natural Language Processing (NLP), built upon the Transformer architecture, a block of which is shown in Figure 2.
  • the encoder block 201 comprises a self-attention mechanism 202, a normalization layer 203, a feed-forward layer 204, 205 and a second normalization layer 206.
  • RoBERTa-base comprises an embedder (3 embedding layers + 1 normalization layer), and 12 encoder blocks, each comprising a self-attention mechanism (3 linear layers), 1 self-output layer (1 linear layer + 1 normalization layer), 1 intermediate layer (1 linear layer), and 1 output module (1 linear layer + 1 normalization layer).
  • Each linear and normalization layer comes with an additional bias term, for a total of 197 weights and biases.
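The figure of 197 weights and biases can be checked from the layer itemisation above. The short sketch below does the arithmetic; it assumes (this is not stated in the text) that each embedding layer contributes a weight tensor only, while each linear and normalization layer contributes a weight and a bias. All names are illustrative.

```python
# Count the weight and bias tensors of RoBERTa-base as itemised in the text.
EMBEDDING_LAYERS = 3   # assumption: embedding tables carry a weight but no bias
EMBEDDER_NORMS = 1
ENCODER_BLOCKS = 12

# Per encoder block: self-attention (3), self-output (1), intermediate (1), output (1).
LINEAR_PER_BLOCK = 3 + 1 + 1 + 1
# Per encoder block: one normalization in the self-output layer, one in the output module.
NORM_PER_BLOCK = 1 + 1


def count_weight_and_bias_tensors() -> int:
    """Total number of weight and bias tensors that could each receive a scaling factor."""
    embedder = EMBEDDING_LAYERS + 2 * EMBEDDER_NORMS
    per_block = 2 * (LINEAR_PER_BLOCK + NORM_PER_BLOCK)  # weight + bias for each layer
    return embedder + ENCODER_BLOCKS * per_block


def alpha_count(num_tasks: int, local: bool) -> int:
    """Number of scaling variables: one per task (global) or one per task per tensor (local)."""
    return num_tasks * count_weight_and_bias_tensors() if local else num_tasks


print(count_weight_and_bias_tensors())   # 197
print(alpha_count(9, local=True))        # 1773
```

Either count is negligible next to the roughly 124 million parameters of the base model.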
  • Scaling factors, herein referred to as α variables, for each task are injected into the model at each embedding, linear, and normalization layer, and each bias.
  • to optimize task mixing globally, all α variables of a given task are optimized to have the same value, while local optimization targets all individual layer and bias variables on their own.
  • a single scaling factor is applied to all layers of the model.
  • a different scaling factor is applied to each of at least some of the plurality of layers.
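The difference between a single ("global") scaling factor and per-layer ("local") scaling factors can be sketched as follows. This is a toy illustration: each named tensor is reduced to one float, and the function and tensor names are hypothetical rather than taken from the source.

```python
from typing import Dict, Union

Delta = Dict[str, float]              # toy stand-in: one scalar per named tensor
Alpha = Union[float, Dict[str, float]]


def scale_delta(delta: Delta, alpha: Alpha) -> Delta:
    """Scale one task's parameter update by a global factor or by per-tensor factors."""
    if isinstance(alpha, dict):
        # Local: every weight and bias tensor has its own scaling factor.
        return {name: alpha[name] * value for name, value in delta.items()}
    # Global: the same scaling factor is applied to all tensors.
    return {name: alpha * value for name, value in delta.items()}


delta = {"encoder.0.weight": 0.2, "encoder.0.bias": -0.1}
global_scaled = scale_delta(delta, 0.5)
local_scaled = scale_delta(delta, {"encoder.0.weight": 1.0, "encoder.0.bias": -2.0})
```

In the local case a task can contribute positively in one layer and negatively in another, at the cost of more variables to estimate.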
  • the training of the model comprises two main phases, carried out in alternation, as schematically illustrated in Figures 3(a) and 3(b).
  • Figure 3(a) schematically illustrates phase 1 of the training: the model training phase. Any number of tasks are sampled from the plurality of available tasks (i.e. the target task and at least one auxiliary task). Each task of the one or more sampled tasks is used to update the underlying model on its own using the task’s current weight. After this, the differences in model parameters (deltas) are collected. The model is then reset.
  • In phase 1, after assigning at least one candidate scaling factor to each of the plurality of tasks (the target task and at least one auxiliary task), taking the model with its initial set of model parameters, one or more optimization steps are performed for one or more of the plurality of tasks using the tasks’ respective sets of training data in dependence on the respective candidate scaling factor(s) for the task to form a respective refined set of model parameters.
  • the one or more optimization steps are performed in isolation, i.e., more than one task may be sampled, but the individual deltas are based on optimization steps with each individual task.
  • the model deltas Δ = {δ1, δ2, δ3, ...} can then be determined as the difference between the refined set of model parameters and the initial set of model parameters (θ).
  • Task 3 is the target task and the other tasks are auxiliary tasks.
  • One or more optimization steps are performed for task 1 (301) and task 4 (304) in isolation.
  • the differences between the previous model parameters and the new model parameters are saved in a delta for the task at 307.
  • the model is reset and trained with the remaining task, 306.
  • the delta for this task is also saved at 307.
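The delta collection of phase 1 can be sketched on a toy model. Everything below is an illustrative assumption rather than the method of the source: the model is a dictionary of floats, each task's training loss is a simple quadratic pulling towards a task-specific optimum, the inner optimizer is plain gradient descent, and the task's current weight is applied by scaling the resulting delta.

```python
from typing import Dict, List

Params = Dict[str, float]


def gradient_steps(params: Params, optimum: Params, steps: int = 3, lr: float = 0.1) -> Params:
    """A few gradient steps on the toy loss 0.5 * sum((p - optimum)^2)."""
    p = dict(params)  # work on a copy so the caller's parameters stay untouched
    for _ in range(steps):
        p = {k: p[k] - lr * (p[k] - optimum[k]) for k in p}
    return p


def collect_deltas(params: Params,
                   task_optima: Dict[str, Params],
                   task_weights: Dict[str, float],
                   sampled: List[str]) -> Dict[str, Params]:
    """Train each sampled task in isolation from the same starting point and keep the deltas."""
    deltas = {}
    for task in sampled:
        refined = gradient_steps(params, task_optima[task])
        # Delta = refined parameters minus starting parameters, scaled by the task weight
        # (how the weight enters is an assumption of this sketch).
        deltas[task] = {k: task_weights[task] * (refined[k] - params[k]) for k in params}
        # "Resetting" the model is implicit: params itself is never modified.
    return deltas


params = {"w": 0.0}
task_optima = {"task1": {"w": 1.0}, "task4": {"w": -1.0}}
task_weights = {"task1": 1.0, "task4": 1.0}
deltas = collect_deltas(params, task_optima, task_weights, sampled=["task1", "task4"])
```

Because only the before/after difference is recorded, the inner optimizer could be swapped for any other without changing `collect_deltas`.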
  • In phase 2, as illustrated in Figure 3(b), showing the α variable update phase, the previously collected deltas from phase 1 are used to find the optimal mixing ratio with respect to model performance on the target task. Initially, at the beginning of phase 2, all α variables are set to 1, indicating use of the individual tasks’ full updates.
  • optimization steps are performed with respect to the α variables and their impact on the target task performance (task 3, 303), with a “meta optimizer”, as shown at 308.
  • the optimized α variables represent an optimal “mixing ratio” of the deltas collected in phase 1.
  • Each α can have an arbitrary value. Values above 1.0 indicate that the corresponding delta should be scaled up, that is, the corresponding task is more beneficial to the target task. Values below 1.0 indicate that a task is less beneficial and its delta should be scaled down. Negative α values indicate that a task is to the detriment of the target task, and its updates should be performed in the opposite direction. These mixing weights are used to interpolate the collected model deltas, which are then applied to the model.
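The interpolation described here can be written compactly. The notation is assumed for illustration: S is the set of tasks sampled in phase 1, δ_t the delta collected for task t, α_t its mixing weight, and L_{t*} the loss on the target task t*.

```latex
\alpha^{*} \;=\; \arg\min_{\alpha}\; \mathcal{L}_{t^{*}}\!\Bigl(\theta + \sum_{t \in S} \alpha_t\, \delta_t\Bigr),
\qquad
\theta_{\text{new}} \;=\; \theta + \sum_{t \in S} \alpha^{*}_t\, \delta_t
```

With all α_t = 1 this reduces to applying every task's full update; α_t = 0 drops the task, and α_t < 0 applies its update in the opposite direction.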
  • the individual task weights are updated based on the newly found optimal mixing ratio. Therefore, in phase 2, a machine learning operation is performed, by means of one or more predetermined evaluation criteria and based on one or more predetermined constraints, to assess the performance of the refined set of model parameters of the one or more of the plurality of tasks (that underwent optimization steps in phase 1) on the target task and thereby determine a set of mixing weights.
  • the set of model parameters of the machine learning model are then updated in accordance with the refined set of model parameters as weighted by the set of mixing weights (α variables) and the candidate scaling factors are updated in dependence on the set of mixing weights.
  • Training then returns to phase 1 (Figure 3(a)), and the cycle continues for a pre-defined number of iterations over the training data, or until model performance on the target task converges.
  • the optimal set of model parameters are then selected and used as the final model parameters for inference after training is complete.
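The mixing-weight search of phase 2 can be sketched on a toy problem. The sketch assumes a quadratic stand-in for the target-task loss (so the gradient with respect to each α is available in closed form) and uses plain gradient descent as the meta-optimizer; the names, learning rate and step count are illustrative choices, not values from the source.

```python
from typing import Dict

Params = Dict[str, float]


def mix(params: Params, deltas: Dict[str, Params], alphas: Dict[str, float]) -> Params:
    """Interpolate the collected deltas: theta + sum_t alpha_t * delta_t."""
    return {k: params[k] + sum(alphas[t] * deltas[t][k] for t in deltas) for k in params}


def optimize_alphas(params: Params,
                    deltas: Dict[str, Params],
                    target_optimum: Params,
                    steps: int = 100,
                    lr: float = 5.0) -> Dict[str, float]:
    """Gradient descent on the alphas for the toy target loss 0.5 * sum((theta - optimum)^2)."""
    alphas = {t: 1.0 for t in deltas}  # start from the tasks' full updates
    for _ in range(steps):
        mixed = mix(params, deltas, alphas)
        residual = {k: mixed[k] - target_optimum[k] for k in params}
        # d(loss)/d(alpha_t) = <residual, delta_t> for the quadratic toy loss.
        grads = {t: sum(residual[k] * deltas[t][k] for k in params) for t in deltas}
        alphas = {t: alphas[t] - lr * grads[t] for t in deltas}
    return alphas


params = {"w": 0.0}
deltas = {"helpful": {"w": 0.2}, "harmful": {"w": -0.2}}
alphas = optimize_alphas(params, deltas, target_optimum={"w": 1.0})
new_params = mix(params, deltas, alphas)
```

In this toy run the delta that points towards the target optimum ends with a mixing weight above 1, while the delta that points away from it ends with a negative weight, so its update is applied in the opposite direction.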
  • the method can therefore make use of auxiliary updates if and when they help push the target task’s performance, but ignore them if they do not.
  • the approach first performs weighted task-specific model updates on a proportion of the available training data, starting from the current model parameters for each individual task. It collects the resulting model deltas, i.e., the differences between the model’s parameters before and after the single task update, and resets the model. After this delta collection phase, α variables are used as additional parameters which are optimized via gradient descent to find a good interpolation of the individual tasks’ model updates, with respect to the loss on the target task’s development data.
  • the best-found interpolation of deltas is used to update the model parameters, and the task-specific weights are updated using the new interpolation parameters.
  • An exemplary algorithm for training the model is shown in Figure 4.
  • the result is an updated set of model parameters optimized for performance on the target task t*.
  • Figure 5 is a schematic representation of a device 500 configured to perform the methods described herein.
  • the device 500 may be implemented on a device, such as a laptop, tablet, smart phone or TV.
  • the device 500 comprises a processor 501 configured to process the datasets in the manner described herein.
  • the processor 501 may be implemented as a computer program running on a programmable device such as a GPU or a Central Processing Unit (CPU).
  • the device 500 comprises a memory 502 which is arranged to communicate with the processor 501.
  • Memory 502 may be a non-volatile memory.
  • the processor 501 may also comprise a cache (not shown in Figure 5), which may be used to temporarily store data from memory 502.
  • the device may comprise more than one processor and more than one memory.
  • the memory may store data that is executable by the processor.
  • the processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine readable storage medium.
  • the computer program may store instructions for causing the processor to perform its methods in the manner described herein.
  • a device such as 500 may also implement a model trained by the method described herein.
  • Figure 6 shows a flowchart summarising an example of a method for training a machine learning model having an initial set of model parameters using a respective training data set for each of a plurality of tasks, the plurality of tasks comprising a target task for which the model is to be trained and one or more auxiliary tasks.
  • the method comprises performing the following steps 601-605.
  • at step 601, the method comprises assigning at least one candidate scaling factor to each of the plurality of tasks.
  • at step 602, the method comprises, for each of one or more of the plurality of tasks, performing one or more optimization steps for the respective task in dependence on the respective candidate scaling factor to form a respective refined set of model parameters.
  • at step 603, the method comprises performing a machine learning operation, by means of one or more predetermined evaluation criteria and based on one or more predetermined constraints, to assess the performance of the refined set of model parameters of the one or more of the plurality of tasks on the target task and thereby determine a set of mixing weights.
  • at step 604, the method comprises updating the set of model parameters of the machine learning model in accordance with the refined model parameters as weighted by the set of mixing weights.
  • at step 605, the method comprises updating the candidate scaling factors in dependence on the set of mixing weights.
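The control flow of the five steps can be sketched as a loop with pluggable parts. This is a skeleton under stated assumptions, not the algorithm of Figure 4: the model is a single float, every task is processed in each iteration instead of a random sample, the multiplicative update of the scaling factors is a guess at one possible rule, and `refine` and `fixed_mixing` are hypothetical stand-ins for task training and for the mixing-weight search.

```python
from typing import Callable, Dict, List, Tuple


def train(theta: float,
          tasks: List[str],
          refine: Callable[[float, str, float], float],
          find_mixing_weights: Callable[[float, Dict[str, float]], Dict[str, float]],
          iterations: int = 2) -> Tuple[float, Dict[str, float]]:
    # Assign a candidate scaling factor of 1 to every task.
    scaling = {t: 1.0 for t in tasks}
    for _ in range(iterations):
        # Optimize each task in isolation, always starting from the same theta.
        deltas = {t: refine(theta, t, scaling[t]) - theta for t in tasks}
        # Assess the refined parameters on the target task to obtain mixing weights.
        alphas = find_mixing_weights(theta, deltas)
        # Update the model with the deltas as weighted by the mixing weights.
        theta = theta + sum(alphas[t] * deltas[t] for t in tasks)
        # Update the scaling factors (multiplicative rule is an assumption of this sketch).
        scaling = {t: scaling[t] * alphas[t] for t in tasks}
    return theta, scaling


# Hypothetical stand-ins so that the loop can be run end to end.
directions = {"target": 1.0, "aux": -1.0}


def refine(theta: float, task: str, scale: float) -> float:
    """Pretend training: move theta along the task's direction, scaled by its factor."""
    return theta + scale * directions[task]


def fixed_mixing(theta: float, deltas: Dict[str, float]) -> Dict[str, float]:
    """Pretend meta-optimizer: keep the target update, ignore the auxiliary one."""
    return {"target": 1.0, "aux": 0.0}


theta, scaling = train(0.0, ["target", "aux"], refine, fixed_mixing)
```

In a real setting `refine` would run an optimizer such as AdamW on the task's data and `find_mixing_weights` would run the meta-optimizer on target-task development data.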
  • the machine learning system described herein therefore makes direct use of auxiliary training data and uses scaling factors per task in the form of α variables assigned to the model layers.
  • the model updates are performed based on tasks’ data and the α parameters are used to scale the model updates (the difference between model parameters θ before and after an update).
  • the deltas of the model parameters are collected between phases 1 and 2, as shown in Figure 3(a).
  • the α updates are then estimated in phase 2.
  • model parameter updates are therefore used to estimate the weights.
  • indirect weights are learned, e.g. via gating mechanisms.
  • Direct learning is usually performed using the gradient weights, whereas the method described herein uses parameter updates.
  • RoBERTa base has around 124 million parameters.
  • the additional parameters required using the method described herein are equal to the number of tasks, or
  • Optimal use of available training data can also be made by treating random samples from a training or development set as tuning data.
  • One advantage of this solution is that it allows the use of sophisticated optimizers during task training, rather than being limited to stochastic gradient descent (SGD).
  • Optimizers such as Adam or AdamW (see https://arxiv.org/abs/1711.05101) have been shown to lead to faster optimization, and higher final model performance, over SGD.
  • the algorithm for target task multi-task training is therefore able to use other optimizers in addition to standard SGD, which may improve the performance of the model.
  • Another advantage is that the method allows for the use of negatively related tasks, for improving the performance of the target task.
  • in cases where model updates of an auxiliary task are detrimental to the target task, their data may still be made use of by setting the mixing ratio of that task to a negative number, leading to updates that can instead be beneficial to the target task.
  • in cases where an auxiliary task cannot be leveraged at all, the method can still set the weight to 0, ignoring the task.
  • One implementation uses AdamW for model and head optimization, and SGD with momentum for α variable optimization.
  • the model deltas are collected for 25% of any training dataset, after which a variable number of α optimization steps are performed with the collected model deltas.
  • These settings, as well as the actual optimizers used, are hyperparameters and subject to empirical tuning.
  • the training scheme itself makes no assumptions about these hyperparameters, so entirely different optimizers than the proposed ones could be used, for example, for the base model, heads, or α optimization.
  • the method described herein may in some implementations be model agnostic, task agnostic and optimizer agnostic.
  • the method may be conveniently implemented for target task oriented multi-task learning for Natural Language Understanding problems.
  • the model may be used to allow a computer to interpret and use human language input.
  • problems include, but are not limited to, Natural Language Inference (NLI), Entailment, or Word Sense Disambiguation (WSD). All such problems come with their own datasets for training, development, and test purposes, typically consisting of pairs of the Natural Language input (for example, sentences or sentence pairs) and an associated discrete label (for example, “Yes” or “No”) where the label is one of a predefined closed set of possible labels (classification problems), or an associated continuous score, for example in the range between 0 and 5 (regression problems).
  • the goal of any NLU system is to learn, from the provided data, to predict the correct output for a new input.
  • Natural Language Understanding tasks may include BoolQ, CommitBank, CoPA, RTE, WiC, MRPC, WNLI, CNLI, CoLA, which are well-established tasks in the Natural Language Understanding research community, and are also represented in the GLUE and/or SuperGLUE NLU benchmarks.
  • the use of arbitrary auxiliary tasks can boost target-task performance for NLU tasks.
  • the model may also be used for classification tasks in such a field. For example, from predefined possible categories, the model may be used to determine the correct category for given Natural Language input.
  • the model may be used in an Entailment vs Contradiction scenario, as illustrated below:
  • The method may be applied to train a variety of tasks.
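
The dataset format described in the list above (a Natural Language input paired with either a discrete label from a closed set or a continuous score) can be sketched in Python as follows. This is an illustrative sketch only: the class name `NLUExample`, the label set `LABELS`, the helper functions, and the example sentences are all hypothetical and are not taken from the patent or from any benchmark dataset.

```python
from dataclasses import dataclass
from typing import Sequence, Union

# Hypothetical closed label set for an entailment-vs-contradiction task.
LABELS = ("entailment", "contradiction")


@dataclass(frozen=True)
class NLUExample:
    """One training pair: natural-language input plus its annotation."""
    premise: str
    hypothesis: str
    target: Union[str, float]  # discrete label (classification) or score (regression)


def is_classification(example: NLUExample) -> bool:
    """A string target drawn from LABELS marks a classification example."""
    return isinstance(example.target, str) and example.target in LABELS


def accuracy(predictions: Sequence[str], examples: Sequence[NLUExample]) -> float:
    """Fraction of predicted labels that match the annotated labels."""
    if len(predictions) != len(examples):
        raise ValueError("predictions and examples must have the same length")
    if not examples:
        return 0.0
    correct = sum(p == ex.target for p, ex in zip(predictions, examples))
    return correct / len(examples)


# Invented sentence pairs, for illustration only.
nli_data = [
    NLUExample("A man is playing a guitar.", "A person is making music.", "entailment"),
    NLUExample("A man is playing a guitar.", "Nobody is playing an instrument.", "contradiction"),
]
# A regression-style example with a score in the range 0 to 5.
similarity_example = NLUExample("The cat sleeps.", "A cat is asleep.", 4.5)
```

A trained system would be scored by passing its predicted labels for held-out examples to a metric such as `accuracy`; regression tasks would need a different metric, which is omitted here.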

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Feedback Control In General (AREA)

Abstract

A device (500) comprising one or more processors (501) configured to train a machine learning model having an initial set of model parameters, the device being configured to train the model using a respective training dataset for each task of a plurality of tasks (301, 302, 303, 304, 305), the plurality of tasks comprising a target task (303) for which the model is to be trained and one or more auxiliary tasks (301, 302, 304, 305), the device being configured to train the model by performing the following steps: assigning (601) at least one candidate scaling factor to each task of the plurality of tasks; for each task of one or more tasks (301, 304) of the plurality of tasks, performing (602) one or more optimization steps for the respective task in dependence on the respective candidate scaling factor to form a respective refined set of model parameters; performing (603) a machine learning operation, by means of one or more predefined evaluation criteria and based on one or more predefined constraints, to evaluate the performance of the refined set of model parameters of the one or more tasks of the plurality of tasks on the target task and thereby determine a set of mixing weights; updating (604) the set of model parameters of the machine learning model in dependence on the refined model parameters weighted by the set of mixing weights; and updating (605) the candidate scaling factors in dependence on the set of mixing weights.
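
Steps (601) to (605) of the abstract can be sketched as a toy training loop. This is a minimal illustration under stated assumptions, not the claimed implementation. The assumptions are: each task's loss is a quadratic with a known optimum (the rows of `centers`); the optimization steps are plain gradient descent whose learning rate is multiplied by the task's scaling factor; the mixing weights are found by projected gradient descent on the target-task loss, starting from zero and constrained to the box [0, 2]; and the scaling factors are updated additively by `alpha - 1` and clipped. All function names, parameter names, and constants are hypothetical.

```python
import numpy as np


def target_loss(theta, target_center):
    """Toy target-task loss: half the squared distance to the task optimum."""
    return 0.5 * float(np.sum((theta - target_center) ** 2))


def train(centers, target_idx, epochs=10, inner_steps=3, lr=0.1, alpha_steps=20):
    """Toy loop over steps (601)-(605); every design choice here is an assumption."""
    centers = np.asarray(centers, dtype=float)
    n_tasks, dim = centers.shape
    theta = np.zeros(dim)      # initial set of model parameters
    scale = np.ones(n_tasks)   # (601) one candidate scaling factor per task

    for _ in range(epochs):
        # (602) a few scaled gradient steps per task, each starting from the
        # shared parameters, giving one refined parameter set per task.
        deltas = np.zeros((n_tasks, dim))
        for t in range(n_tasks):
            theta_t = theta.copy()
            for _ in range(inner_steps):
                theta_t -= lr * scale[t] * (theta_t - centers[t])
            deltas[t] = theta_t - theta

        # (603) choose mixing weights that lower the target-task loss of the
        # mixed parameters, subject to the assumed constraint 0 <= alpha <= 2.
        alpha = np.zeros(n_tasks)
        step = 1.0 / (np.sum(deltas ** 2) + 1e-12)  # conservative step size
        for _ in range(alpha_steps):
            mixed = theta + alpha @ deltas
            grad_alpha = deltas @ (mixed - centers[target_idx])
            alpha = np.clip(alpha - step * grad_alpha, 0.0, 2.0)

        # (604) update the model with the refined parameters, weighted by alpha.
        theta = theta + alpha @ deltas

        # (605) update the scaling factors from the mixing weights
        # (assumed additive rule, clipped to stay positive and bounded).
        scale = np.clip(scale + (alpha - 1.0), 0.1, 2.0)

    return theta, scale


# Task 0 is the target; task 1 is a similar auxiliary task; task 2 pulls the
# parameters in the opposite direction.
centers = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
theta, scale = train(centers, target_idx=0)
```

Because the mixing weights start at zero and the step size is conservative, each epoch in this toy setting cannot increase the target-task loss. Other initializations, constraints, or update rules are equally compatible with the abstract.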
PCT/EP2020/077651 2020-10-02 2020-10-02 Variable importance learning for multi-task models Ceased WO2022069059A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2020/077651 WO2022069059A1 (fr) 2020-10-02 2020-10-02 Variable importance learning for multi-task models
CN202080103533.6A CN116057537B (zh) 2020-10-02 2020-10-02 Variable importance learning for multi-task models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/077651 WO2022069059A1 (fr) 2020-10-02 2020-10-02 Variable importance learning for multi-task models

Publications (1)

Publication Number Publication Date
WO2022069059A1 (fr) 2022-04-07

Family

ID=72840491

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/077651 Ceased WO2022069059A1 (fr) Variable importance learning for multi-task models 2020-10-02 2020-10-02

Country Status (2)

Country Link
CN (1) CN116057537B (fr)
WO (1) WO2022069059A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
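CN118940299A (zh) * 2024-10-09 2024-11-12 天津中科闻歌科技有限公司 Method for improving the security of a target model, electronic device and storage medium
CN119646373A (zh) * 2024-11-20 2025-03-18 人工智能与数字经济广东省实验室(深圳) Method and system for confirming task adaptation results based on large language model fine-tuning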

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130275A1 (en) * 2017-10-26 2019-05-02 Magic Leap, Inc. Gradient normalization systems and methods for adaptive loss balancing in deep multitask networks

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
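CN117709426B (zh) * 2017-02-24 2024-11-15 渊慧科技有限公司 Method, system and computer storage medium for training a machine learning model
CN110188358B (zh) * 2019-05-31 2023-10-24 鼎富智能科技有限公司 Training method and apparatus for a natural language processing model
CN111079847B (zh) * 2019-12-20 2023-05-02 郑州大学 Deep-learning-based automatic annotation method for remote sensing images
CN111724083B (zh) * 2020-07-21 2023-10-13 腾讯科技(深圳)有限公司 Training method, apparatus, computer device and medium for a financial risk identification model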

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ABBAS WASEEM ET AL: "Adaptively Weighted Multi-task Learning Using Inverse Validation Loss", ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 12 May 2019 (2019-05-12), pages 1408 - 1412, XP033566331, DOI: 10.1109/ICASSP.2019.8683776 *
ANONYMOUS: "Revision History [alpha]VIL: Learning to Leverage Auxiliary Tasks for Multitask Learning", 28 September 2020 (2020-09-28), XP055814240, Retrieved from the Internet <URL:https://openreview.net/revisions?id=0p-aRvcVs-U> [retrieved on 20210615] *
DASGUPTA RIDDHIMAN ET AL: "Leveraging multiple tasks to regularize fine-grained classification", 2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), IEEE, 4 December 2016 (2016-12-04), pages 3476 - 3481, XP033086160, DOI: 10.1109/ICPR.2016.7900172 *
SAM VERBOVEN ET AL: "HydaLearn: Highly Dynamic Task Weighting for Multi-task Learning with Auxiliary Tasks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 26 August 2020 (2020-08-26), XP081748991 *
ZHAO JIEJIE ET AL: "Multiple Relational Attention Network for Multi-task Learning", PROCEEDINGS OF THE 25TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, KDD '19, ACM PRESS, NEW YORK, NEW YORK, USA, 25 July 2019 (2019-07-25), pages 1123 - 1131, XP058466112, ISBN: 978-1-4503-6201-6, DOI: 10.1145/3292500.3330861 *

Also Published As

Publication number Publication date
CN116057537A (zh) 2023-05-02
CN116057537B (zh) 2025-07-04

Similar Documents

Publication Publication Date Title
KR102071179B1 (ko) Method and apparatus for continual learning of a data set
US11593625B2 (en) Method and apparatus with neural network parameter quantization
US11386256B2 (en) Systems and methods for determining a configuration for a microarchitecture
Lezama et al. Improved masked image generation with token-critic
US11657802B2 (en) Utilizing a dynamic memory network for state tracking
JP2020520516A5 (fr)
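CN111598253B (zh) Training machine learning models using teacher annealing
CN110663049B (zh) Neural network optimizer search
CN108846077A (zh) Semantic matching method, apparatus, medium and electronic device for question-answer text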
US12050979B2 (en) Budgeted neural network architecture search system and method
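WO2021257160A1 (fr) Model selection learning for knowledge distillation
WO2023221043A1 (fr) Training masked autoencoders for image inpainting
WO2022069059A1 (fr) Variable importance learning for multi-task models
CN115713676A (zh) Few-shot object detection model construction method, apparatus, device and storage medium
CN115516466A (zh) Hyperparameter neural network ensembles
CN116910210A (zh) Document-based intelligent question answering model training method, apparatus and application thereof
KR102583943B1 (ko) Neural network device and neural network training method performing continual learning using an inter-task correlation analysis algorithm
CN117999560A (zh) Hardware-aware progressive training of machine learning models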
US20250148752A1 (en) Open vocabulary image segmentation
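CN119558372A (zh) Fine-tuning method, apparatus, device, medium and product for generative models
CN119005290A (zh) Reinforcement learning training method and system based on military documents and answer similarity
EP4466642A2 (fr) Embedding optimization for a machine learning model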
US20250356209A1 (en) Distilling diffusion models using imitation learning
CN116758992B (zh) A multi-omics data analysis method
US12488430B2 (en) Training masked autoencoders for image inpainting

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20789877

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20789877

Country of ref document: EP

Kind code of ref document: A1

WWG Wipo information: grant in national office

Ref document number: 202080103533.6

Country of ref document: CN