WO2024159132A1 - Continual pre-training of mixture-of-experts neural networks - Google Patents
Continual pre-training of mixture-of-experts neural networks
- Publication number: WO2024159132A1 (PCT/US2024/013166)
- Authority: WIPO (PCT)
- Prior art keywords: neural network, MoE, expert, training, layer
- Legal status: Ceased (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING OR CALCULATING; COUNTING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology; G06N3/045—Combinations of networks
- G06N3/08—Learning methods; G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
- G06N3/08—Learning methods; G06N3/084—Backpropagation, e.g. using gradient descent
- G06N3/08—Learning methods; G06N3/088—Non-supervised learning, e.g. competitive learning
Definitions
- This specification relates to performing a machine learning task on a network input using neural networks.
- Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input.
- Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer.
- Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
- This specification describes a system implemented as computer programs on one or more computers in one or more locations that trains a neural network to perform a machine learning task on a network input.
- the neural network is a Mixture of Experts (MoE) neural network that includes a plurality of neural network layers.
- the plurality of neural network layers include a MoE layer arranged between a first neural network layer and a second neural network layer in the neural network.
- the MoE layer includes a plurality of expert neural networks, in which each expert neural network is configured to process a first layer output generated by the first neural network layer in accordance with a respective set of expert parameters of the expert neural network to generate a respective expert output.
- the MoE layer further includes a gating neural network.
- the MoE layer is configured to: select, based on the first layer output, one or more of the expert neural networks and determine a respective weight for each selected expert neural network, provide the first layer output as input to each of the selected expert neural networks, combine the expert outputs generated by the selected expert neural networks in accordance with the weights for the selected expert neural networks to generate an MoE output, and provide the MoE output as input to the second neural network layer.
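- As a short worked illustration of the combination described above (a sketch, using notation introduced here for illustration rather than taken from the patent): if S denotes the set of selected expert neural networks, E_i(x) the expert output of the i-th selected expert neural network for the first layer output x, and w_i the respective weight determined for it, the MoE output y can be written as y = \sum_{i \in S} w_i \, E_i(x), so that each selected expert contributes in proportion to its weight.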
- the subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages.
- the described techniques can implement a continual training framework suitable for training a MoE neural network that has an ever-growing model size on a very large, potentially unbounded, set of training datasets, which may be obtained from different data sources, over different time periods, and so on.
- the MoE neural network that is being trained under the described continual training framework achieves a desirable balance between fitting new data distributions and preserving previous knowledge.
- the continuously pre-trained MoE neural network thus can be more easily adapted, e.g., through one-shot or few-shot learning, to any of a range of downstream tasks. Once adapted, the continuously pre-trained MoE neural network can achieve or even exceed the performance of a conventionally pre-trained neural network on any of the downstream tasks, despite an adaptation process that consumes fewer computing resources, is faster in terms of wall-clock time, or both.
- the continual training framework avoids the need for a full model update when a new functionality needs to be added to the device on which the MoE neural network is deployed. Instead, the new functionality can be added through a small partial update (e.g., partial expansion) to the MoE neural network.
- the single MoE neural network requires fewer total network parameters to be learned.
- the reduced number of parameters can further allow the MoE neural network to be deployed in resource-constrained environments after training.
- the neural network can be deployed on an edge device, e.g., a mobile phone or tablet computer, that has limited computational and memory resources that would otherwise have made it infeasible to deploy multiple different trained neural networks on the edge device.
- FIG. 1 shows an example neural network training system.
- FIG. 2 is a flow diagram of an example process for training a Mixture of Experts (MoE) neural network on multiple training datasets.
- FIG. 3 is a flow diagram of sub-steps of one of the steps of the process of FIG. 2.
- FIG. 4 is an example illustration of training a Mixture of Experts (MoE) neural network on multiple training datasets.
- FIG. 5 shows a quantitative example of the performance gains that can be achieved by a Mixture of Experts (MoE) neural network trained by the neural network training system described in this specification.
- This specification describes a neural network training system implemented as computer programs on one or more computers in one or more locations that implements a continual training framework to train a neural network.
- the neural network training system can train the neural network to receive any kind of digital data input and to perform any kind of machine learning task (e.g., generative task, classification task, or regression task) on the input to generate an output.
- the neural network is a neural network that is configured to perform an image processing task, i.e., receive an input image (which may comprise a plurality of pixel values) and to process the input image to generate a network output for the input image.
- the task may be image classification and the output generated by the neural network for a given image may be scores for each of a set of object categories, with each score representing an estimated likelihood that the image contains an image of an object belonging to the category.
- the task can be image embedding generation and the output generated by the neural network can be a numeric embedding of the input image.
- the task can be object detection and the output generated by the neural network can identify locations in the input image at which particular types of objects are depicted.
- the task can be image segmentation and the output generated by the neural network can assign each pixel of the input image to a category from a set of categories.
- the neural network is a neural network that is configured to perform an image generation task, where the input is a conditioning input and the output is a sequence of intensity values for the pixels of an image.
- the task may be a neural machine translation task.
- the input to the neural network is a sequence of text, e.g., a sequence of words, phrases, characters, or word pieces, in one language.
- the output generated by the neural network may be a translation of the sequence of text into another language, i.e., a sequence of text in the other language that is a translation of the input sequence of text.
- the task may be a multi-lingual machine translation task, where a single neural network is configured to translate between multiple different source language - target language pairs.
- the source language text may be augmented with an identifier that indicates the target language into which the neural network should translate the source language text.
- the task may be an audio processing task.
- when the input is audio data representing a spoken utterance, the output generated by the neural network may be a score for each of a set of pieces of text, each score representing an estimated likelihood that the piece of text is the correct transcript for the utterance.
- the output generated by the neural network can indicate whether a particular word or phrase (“hotword”) was spoken in the utterance.
- the output generated by the neural network can be a classification of the spoken utterance into one of a plurality of categories, for example an identity of the natural language in which the utterance was spoken.
- the task can be a natural language processing or understanding task, e.g., an entailment task, a paraphrase task, a textual similarity task, a sentiment task, a sentence completion task, a grammaticality task, and so on, that operates on a sequence of text in some natural language.
- the task can be a text to speech task, where the input is text in a natural language or features of text in a natural language and the network output is a spectrogram, a waveform, or other data defining audio of the text being spoken in the natural language.
- the task can be a health prediction task, where the input is a sequence derived from electronic health record data for a patient and the output is a prediction that is relevant to the future health of the patient, e.g., a predicted treatment that should be prescribed to the patient, the likelihood that an adverse health event will occur to the patient, or a predicted diagnosis for the patient.
- Such electronic health data may, for example, comprise one or more sequences of physiological data taken from a patient, with the output being a corresponding prediction that relates to those sequences of data.
- physiological data and a corresponding prediction include: blood glucose measurements, with the prediction being a predicted future blood glucose measurement or the prediction of a hyper- or hypo-glycemic event; a heart rate, with the prediction being the presence or absence of a heart condition, or a future cardiac event; blood pressure measurements, with the prediction being the risk of a future heart condition; or the like.
- the task can be a text generation task, where the input is a sequence of text, and the output is another sequence of text, e.g., a completion of the input sequence of text, a response to a question posed in the input sequence, or a sequence of text that is about a topic specified by the first sequence of text.
- the input to the text generation task can be an input other than text, e.g., an image, and the output sequence can be text that describes the input.
- the task can be an agent control task, where the input is a sequence of observations or other data characterizing states of an environment and the output defines an action to be performed by the agent in response to the most recent data in the sequence.
- the agent can be, e.g., a real-world or simulated robot, a control system for an industrial facility, or a control system that controls a different kind of agent.
- the observations may comprise sensor data captured by sensors associated with (e.g., part of) the agent, for example visual data, LIDAR data, sonar data, agent configuration data (e.g., joint angles), agent orientation data, or the like.
- the task can be a genomics task, where the input is a sequence representing a fragment of a DNA sequence or other molecule sequence and the output is either an embedding of the fragment for use in a downstream task, e.g., by making use of an unsupervised learning technique on a data set of DNA sequence fragments, or an output for the downstream task.
- downstream tasks include promoter site prediction, methylation analysis, predicting functional effects of non-coding variants, and so on.
- the machine learning task is a combination of multiple individual machine learning tasks, i.e., the system is configured to perform multiple different individual machine learning tasks, e.g., two or more of the machine learning tasks mentioned above.
- the neural network can be configured to perform multiple individual natural language understanding tasks, with the input including an identifier for the individual natural language understanding task to be performed on the network input.
- the task is a multi-modal task that requires processing both text and image inputs, so that the neural network includes both a computer vision neural network and a text processing neural network. That is, the target output to be generated by the computer vision neural network for a given image depends on one or more outputs generated by the text processing neural network for one or more corresponding text inputs (and vice versa).
- Examples of such tasks include open-vocabulary image classification, open-vocabulary object detection, image captioning, text-based image search, image-based retrieval, and so on.
- FIG. 1 shows an example neural network training system 100.
- the neural network training system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
- the neural network training system 100 includes a Mixture of Experts (MoE) neural network 110.
- the MoE neural network 110 is a neural network that can be configured through training to perform any one of the tasks mentioned above.
- the MoE neural network 110 includes a plurality of neural network layers.
- the plurality of neural network layers include a first neural network layer 120, a second neural network layer 150, and a MoE layer 130 arranged between the first neural network layer 120 and the second neural network layer 150 in the MoE neural network.
- the MoE layer 130 is a layer that performs conditional computation, i.e., performs different operations for different input data.
- the MoE neural network 110 generally includes many other layers, including, for example, embedding layers, an output layer, and other MoE layers.
- the MoE layer 130 includes a plurality of expert neural networks 130A-B, in which each expert neural network is configured to process a first layer output generated by the first neural network layer 120 in accordance with a respective set of parameters (referred to in this specification as “expert parameters”) of the expert neural network to generate a respective expert output.
- Each expert neural network can have any appropriate neural network architecture, e.g., a feed-forward architecture or a recurrent architecture.
- FIG. 1 illustrates only two expert neural networks 130A-B in the MoE layer 130 for convenience; the MoE layer 130 of the MoE neural network 110 can include any number of expert neural networks, e.g., three expert neural networks, five expert neural networks, ten expert neural networks, or another appropriate number.
- the MoE layer 130 further includes a gating neural network 140.
- the gating neural network 140 is a neural network that has a set of parameters (referred to in this specification as “gating parameters”) and that maps the first layer output to the respective gate scores for the plurality of expert neural networks 130A-B in accordance with the gating parameters.
- the gating neural network 140 includes one or more initial layers, followed by an output layer of dimension N, i.e., an output layer with N units, where N represents the number of expert neural networks included in the MoE layer 130, that can generate a respective initial score for each expert neural network by processing the first layer output and then compute the gate scores by applying a softmax function to the initial scores.
- the MoE layer 130 is configured to: select, based on the first layer output, one or more of the plurality of expert neural networks 130A-B based on the respective gate scores for the plurality of expert neural networks 130A-B, provide the first layer output as input to each of the selected expert neural networks, combine the expert outputs generated by the selected expert neural networks in accordance with the gate scores for the selected expert neural networks to generate a combined expert output, and provide the combined expert output, or data derived from the combined expert output, as input to the second neural network layer 150.
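- A minimal sketch, in Python with NumPy, of the gating-and-combination step described above, assuming a single softmax gate over N experts and top-k routing; the names moe_forward, expert_fns, gate_w, and k are illustrative and do not come from the patent:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_forward(x, expert_fns, gate_w, k=2):
    """Route a first-layer output x through the top-k experts and combine their outputs.

    x:          first layer output, shape (d_model,)
    expert_fns: list of N callables, each mapping (d_model,) -> (d_model,)
    gate_w:     gating output-layer weights, shape (d_model, N)
    k:          number of experts selected per input
    """
    gate_scores = softmax(x @ gate_w)            # one gate score per expert
    selected = np.argsort(gate_scores)[-k:]      # indices of the top-k experts
    weights = gate_scores[selected]
    weights = weights / weights.sum()            # renormalize over the selection
    # Weighted combination of the selected expert outputs.
    return sum(w * expert_fns[i](x) for i, w in zip(selected, weights))

# Toy usage with randomly initialized stand-ins for trained expert networks (hypothetical).
rng = np.random.default_rng(0)
d_model, n_experts = 8, 4
experts = [lambda v, W=rng.standard_normal((d_model, d_model)): v @ W for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts))
y = moe_forward(rng.standard_normal(d_model), experts, gate_w, k=2)
```

- Renormalizing the gate scores over the selected experts, as in the sketch, is one common design choice; the combined expert output is then a gate-score-weighted sum of the selected expert outputs.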
- the MoE neural network 110 can be a Mixture of Experts (MoE) attention neural network, and the plurality of neural network layers include one or more attention layers.
- the first neural network layer 120, the second neural network layer 150, or both can be an attention layer.
- An attention layer is a neural network layer that includes an attention mechanism, e.g., a self-attention mechanism, e.g., a multi-head self-attention mechanism.
- each expert neural network 130A, 130B included in the MoE layer 130 can have a respective feed-forward architecture.
- each expert neural network 130A, 130B can be a multi-layer, e.g., two layer or three layer, neural network of fully-connected layers with, e.g., a ReLU or GeLU activation function.
- when the first neural network layer 120 is an attention layer, it receives an input sequence for the layer and applies an attention mechanism on the input sequence for the layer to generate an attended input sequence, i.e., a sequence that includes a respective attended layer input at each of multiple positions.
- the first neural network layer 120 uses one or more attention heads.
- Each attention head generates a set of queries by using one or more query transformation layers, a set of keys by using one or more key transformation layers, and a set of values by using one or more value transformation layers, and then applies any of a variety of variants of query-key-value (QKV) attention using the queries, keys, and values to generate an output.
- when there are multiple attention heads, the first neural network layer 120 then combines the outputs of the multiple attention heads, e.g., by concatenating the outputs and, optionally, processing the concatenated outputs through a linear layer.
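- A minimal NumPy sketch of one common variant of multi-head QKV attention consistent with the description above; the names attention_head and multi_head_attention and the projection matrices wq, wk, wv, wo are illustrative assumptions, and the patent does not prescribe this particular variant:

```python
import numpy as np

def attention_head(x, wq, wk, wv):
    """Scaled dot-product (QKV) attention for one head over an input sequence x.

    x: (seq_len, d_model); wq, wk, wv: (d_model, d_head) projection matrices.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                                        # (seq_len, d_head)

def multi_head_attention(x, heads, wo):
    """Concatenate several head outputs and mix them with a final linear layer wo."""
    concat = np.concatenate([attention_head(x, wq, wk, wv) for (wq, wk, wv) in heads], axis=-1)
    return concat @ wo
```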
- the MoE layer 130 uses a gating neural network 140 to select one or more of the expert neural networks 130A-B and generates a combined expert output from the expert outputs generated by the selected expert neural networks as a result of processing the attended layer inputs included in the attended input sequence.
- the gating neural network 140 computes, for the respective attended layer input at each of multiple positions, the gate scores for the plurality of expert neural networks 130A-B by processing the respective attended layer input using the one or more initial layers and the output layer of dimension N. For each attended layer input, the gating neural network 140 then selects, from the plurality of expert neural networks 130A-B, one or more expert neural networks based at least on the respective gate scores.
- the gating neural network 140 then provides the combined expert outputs generated by the selected expert neural networks for each attended layer input included in the attended input sequence, or data derived from the combined expert outputs, as an input sequence to the second layer 150, which can be another attention layer.
- the neural network training system 100 is a system that implements a continual training framework, sometimes referred to as a “lifelong learning (LLL)” framework, to train the MoE neural network 110 on each of multiple training datasets, e.g., training dataset 102- 1, training dataset 102-2, and so on, until training dataset 102-N.
- the neural network training system 100 trains the MoE neural network 110 on the training datasets 102-1 through 102-N one after another in a sequential order, e.g., in the order in which the training datasets become available.
- the neural network training system 100 can terminate the training process after having iterated through a list of existing training datasets and then proceed to adapt the pre-trained neural network to a specific downstream machine learning task. In some other cases, the neural network training system 100 can continue the pre-training process indefinitely, i.e., resume the training whenever a new training dataset becomes available, in order to reduce generalization errors of the neural network on unseen and possibly out-of-distribution datasets.
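- The sequential training just described can be sketched as a simple loop (a sketch under stated assumptions; continual_pretrain, expand_fn, and train_fn are hypothetical names, not interfaces defined by the patent):

```python
def continual_pretrain(model, dataset_stream, expand_fn, train_fn):
    """Sketch of the continual (lifelong) pre-training loop described above.

    model:          a MoE neural network with an expandable MoE layer
    dataset_stream: iterable yielding training datasets in the order they become available
    expand_fn:      callable that adds expert neural networks (and gating parameters) to the model
    train_fn:       callable that optimizes a pre-training loss for the model on one dataset
    """
    for i, dataset in enumerate(dataset_stream):
        if i > 0:
            # Grow the MoE layer before fitting the next data distribution.
            model = expand_fn(model)
        model = train_fn(model, dataset)
    return model
```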
- the multiple training datasets can include any training dataset that might be accessible to the neural network training system 100.
- the multiple training datasets include labeled training datasets.
- the multiple training datasets include unlabeled training datasets.
- the multiple training datasets include both labeled and unlabeled training datasets.
- a labeled training dataset includes a plurality of training inputs that are each associated with a ground truth output, i.e., an output that should be generated by the neural network from the training input.
- An unlabeled dataset includes a plurality of training inputs, and is referred to as “unlabeled” because, for each training input, information about the ground truth output may not be specified by the unlabeled dataset and thus may not be readily available to the neural network training system 100.
- the multiple training datasets include training data for the same task while in other cases, the multiple training datasets include training data for different tasks, e.g., two or more of the tasks mentioned above.
- the multiple training datasets are obtained from the same source, e.g., received as uploads from a same user, or are obtained from the same network location while in other cases, the multiple training datasets are obtained from different sources, e.g., received as uploads from different users, or are obtained from different network locations.
- the multiple training datasets are part of a larger dataset, e.g., a dataset generated from a continuous data stream.
- the neural network training system 100 can be in data communication with a server that obtains a continuous data stream, e.g., in the form of data feeds or event updates, and that updates a dataset in real-time to include the latest data in continuous data stream.
- the continuous data stream can be generated by hardware devices (e.g., sensors), software algorithms (e.g., generative artificial intelligence models), users of social network platforms, forums, or other content platforms, and the like.
- the MoE neural network 110 that undergoes the training process is progressively expanded as more training datasets are used to train the neural network.
- additional expert neural networks can be generated based on one or more of the plurality of existing expert neural networks in the MoE layer 130, and subsequently added to the MoE neural network 110.
- the model capacity can be gradually improved to better fit the neural network on the data distributions of the new training dataset.
- the framework avoids overfitting the neural network on the multiple training datasets.
- the MoE neural network thus can be more easily adapted, e.g., through one-shot or few-shot learning, to any of a range of downstream tasks. Once adapted, the MoE neural network can achieve or even exceed the performance of a conventionally pre-trained neural network on any of the downstream tasks, despite an adaptation process that consumes fewer computing resources, is faster in terms of wall-clock time, or both.
- the number of expert neural networks that are selected for processing an input may be held fixed. As such, whilst the total number of expert neural networks and model capacity increases, the latency of the MoE neural network remains unchanged.
- FIG. 1 thus illustrates that the neural network training system 100 obtains the first training dataset 102-1, and uses the first training dataset 102-1 to train the MoE neural network 110.
- the first training dataset 102-1 is one of the training datasets mentioned in Nan Du, et al., GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, arXiv: 2112.06905.
- the neural network training system 100 trains the MoE neural network 110 on the first training dataset 102-1 using an appropriate pre-training loss function, e.g., an unsupervised or self-supervised pre-training loss function.
- the pre-training loss function can include one or more terms that measure, for each training input selected from the first training dataset 102-1, the quality of a training output for the training input generated by performing a forward pass through the MoE neural network 110.
- the pre-training loss function includes a cross-entropy loss term that measures, for each training input, a difference between (i) the predicted tokens generated by the MoE neural network 110 from processing the training input and (ii) the ground truth tokens that correspond to the predicted tokens (e.g., and that were masked or otherwise excluded from the training input processed by the MoE neural network 110).
- the tokens can include any of a variety of tokens that represent text symbols or other symbols.
- the tokens can include one or more of characters, sub-words, words, punctuation marks, numbers, or other symbols that appear in a corpus of natural language text.
- the tokens may be derived from image, video and/or audio data.
- the neural network training system 100 can train the MoE neural network 110 on the first training dataset 102-1 based on optimizing a pre-training loss function that includes a BERT loss term, as described in more detail in Devlin, Jacob, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv: 1810.04805.
- the pre-training loss function also includes one or more auxiliary loss terms, e.g., loss terms that encourage load balancing across the plurality of expert neural networks in the MoE layer 130, and so on, to improve the speed, stability, or both of the training.
- the plurality of expert neural networks may be implemented by a plurality of computing units of a parallel processing or distributed system.
- the plurality of computing units may include one or more neural network hardware accelerators such as TPUs, GPUs and ASICs.
- the plurality of expert neural networks may be allocated depending on the number of computing units available, for example, each of the expert neural networks may be implemented by a different computing unit.
- the processing of the expert neural networks may be carried out in parallel on the plurality of computing units.
- An auxiliary loss term may be used to encourage load balancing across the expert neural networks and, correspondingly, across the parallel processing/distributed system.
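- A minimal NumPy sketch of a pre-training loss of this kind: a token-level cross-entropy term plus a load-balancing auxiliary term. The particular auxiliary form used here (fraction of tokens routed to each expert times the mean gate probability per expert, scaled by the number of experts) is a commonly used choice and is an assumption, not necessarily the term used by the described system; all function names are illustrative:

```python
import numpy as np

def cross_entropy(logits, target_ids):
    """Mean cross-entropy of predicted token logits (num_tokens, vocab) against ground-truth ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(target_ids)), target_ids])

def load_balance_aux(gate_probs, expert_assignment, num_experts):
    """A common load-balancing auxiliary term (an assumption, not the patent's exact formula):
    encourages tokens to be spread evenly across experts."""
    frac_tokens = np.bincount(expert_assignment, minlength=num_experts) / len(expert_assignment)
    mean_probs = gate_probs.mean(axis=0)   # mean gate probability per expert, shape (num_experts,)
    return num_experts * float(frac_tokens @ mean_probs)

def pretraining_loss(logits, target_ids, gate_probs, expert_assignment, num_experts, aux_weight=0.01):
    """Cross-entropy term plus a weighted load-balancing auxiliary term."""
    return cross_entropy(logits, target_ids) + aux_weight * load_balance_aux(
        gate_probs, expert_assignment, num_experts)
```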
- by processing training inputs included in the first training dataset 102-1 and optimizing the appropriate unsupervised pre-training loss function, the neural network training system 100 repeatedly updates the values of the parameters of the MoE neural network 110, i.e., the parameters of the first neural network layer 120; the parameters of the second neural network layer 150; the parameters of the MoE layer 130, including the respective sets of expert parameters of the plurality of expert neural networks 130A-B and the gating parameters of the gating neural network 140; and the parameters of the other layers of the neural network 110.
- after training the MoE neural network 110 on the first training dataset 102-1, the neural network training system 100 obtains the second training dataset 102-2.
- the second training dataset 102-2 is one of the training datasets mentioned in Daniel Adiwardana, et al. Towards a human-like open-domain chatbot. CoRR, abs/2001.09977, 2020.
- the neural network training system 100 modifies the architecture of the MoE neural network 110 to generate a modified MoE neural network 111 by expanding the MoE layer 130 to generate an expanded MoE layer 131. This is depicted by the transition from the MoE neural network 110 on the left-hand side to the modified MoE neural network 111 on the right-hand side of FIG. 1, where the arrow on top indicates a temporal evolution.
- the expanded MoE layer 131 includes one or more additional expert neural networks.
- the neural network training system 100 is configured to generate the one or more additional expert neural networks based on one or more of the plurality of existing expert neural networks 130A-B in the MoE layer 130.
- An additional expert neural network will have the same architecture and, upon instantiation, the same parameter values as one of the plurality of existing expert neural networks 130A-B.
- the one or more additional expert neural networks are configured to operate in parallel with the plurality of expert neural networks that were already included in the MoE layer 130. Initialization of an additional expert neural network by copying an existing expert neural network may, for example, provide better performance than a random initialization, since it provides a better starting point for training of the additional expert neural network.
- FIG. 1 thus illustrates that the modified neural network 111 includes an expanded MoE layer 131 that includes an additional expert neural network 130C.
- the expert neural network 130C has the same architecture as either the expert neural network 130A or the expert neural network 130B; upon instantiation, the expert neural network 130C has expert parameters that have the same trained values as either the expert neural network 130A or the expert neural network 130B that have been determined as a result of training on the first training dataset 102-1.
- the expert neural network 130C can be a multi-layer, e.g., two layer or three layer, neural network of fully-connected layers with, e.g., a ReLU or GeLU activation function, and can have expert parameters that have the same trained values as either the expert neural network 130A or the expert neural network 130B.
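- A minimal sketch of this copy-based expansion, assuming experts are represented as dicts of parameter arrays (the representation and the name add_expert_by_copy are illustrative assumptions):

```python
import copy

def add_expert_by_copy(expert_params_list, source_index=0):
    """Append a new expert initialized as a deep copy of an existing, already-trained expert.

    expert_params_list: list of per-expert parameter dicts (e.g., weight and bias arrays)
    source_index:       index of the existing expert whose trained values are copied
    """
    new_expert = copy.deepcopy(expert_params_list[source_index])
    expert_params_list.append(new_expert)
    return expert_params_list
```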
- the expanded MoE layer 131 includes an expanded gating neural network 141.
- the expanded gating neural network 141 includes an additional set of gating parameters that is added by the neural network training system.
- the additional set of gating parameters enables the expanded gating neural network 141 to select, from among all of the expert neural networks 130A-130C included in the expanded MoE layer 131, one or more expert neural networks based on processing a first layer output generated by the first neural network layer 120 (or a part of the first layer output).
- the expanded gating neural network 141 is able to map the first layer output (or a part of the first layer output) to the respective gate scores for the plurality of expert neural networks 130A-C in accordance with (i) the existing set of gating parameters included in the gating neural network 140 and (ii) the additional set of gating parameters.
- the neural network training system 100 modifies the output layer of the gating neural network 140 by adding an additional set of gating parameters to the output layer to increase the dimension of the output layer from N to N+d, i.e., to generate an expanded output layer of dimension N+d, i.e., an output layer with N+d units, where d represents the number of additional expert neural networks that have been added to the expanded MoE layer 131.
- the expanded gating neural network 141 includes one or more initial layers, followed by the expanded output layer of dimension N+d that can generate a respective initial score for each of the plurality of existing expert neural networks and for each of the one or more additional expert neural networks, by processing the first layer output (or a part of the first layer output) and then compute the gate scores by applying a softmax function to the initial scores.
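- A minimal NumPy sketch of expanding the gating output layer from N to N+d units while leaving the existing gating parameters untouched (the function name and the choice of small random initialization for the new parameters are assumptions):

```python
import numpy as np

def expand_gating_output_layer(w, b, d, init_scale=0.01, rng=None):
    """Grow the gating output layer from N to N+d units.

    w: existing output-layer weights, shape (d_model, N); b: existing biases, shape (N,).
    The d new columns/entries are freshly initialized; the first N are kept as-is
    (and would be held fixed during subsequent training).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    d_model, n = w.shape
    w_new = np.concatenate([w, init_scale * rng.standard_normal((d_model, d))], axis=1)
    b_new = np.concatenate([b, np.zeros(d)])
    return w_new, b_new
```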
- the neural network training system 100 proceeds to train the modified neural network 111 on the second training dataset 102-2 using an appropriate loss function.
- the loss function can be the same as the unsupervised or self-supervised pre-training loss function that was used when training the MoE neural network 110 on the first training dataset 102-1, or can alternatively be a different loss function, e.g., a different unsupervised pre-training loss function, a different self-supervised pre-training loss function, or a different supervised loss function.
- the loss function includes the cross-entropy loss term and one or more auxiliary loss terms.
- the loss function includes an output regularization auxiliary term that is dependent on, for each training input, a difference between (i) a first output generated by the MoE neural network 110 from processing the training input and (ii) a second output generated by the modified MoE neural network 111 from processing the training input.
- an output regularization auxiliary term regularizes the newly added parameters through distillation from existing expert parameters and existing gating parameters. In particular, it prevents the newly added parameters from being updated too far from already trained values of the existing expert parameters and existing gating parameters.
- the output regularization auxiliary term helps the modified MoE neural network to retain the knowledge learned from previous training datasets and helps to avoid forgetting when training on a new dataset.
- the neural network training system 100 can train the modified neural network 111 on the second training dataset 102-2 by optimizing a loss function in which the difference is computed as a Kullback-Leibler divergence and the cross-entropy loss term is computed as the perplexity loss; the symbols used in that loss function are defined as follows (one possible form of the full expression is sketched after this list):
- the output regularization auxiliary term is computed as the Kullback-Leibler divergence multiplied by a scaling factor λ.
- M represents the MoE neural network
- θ_{0:t-1} represents the existing expert parameters and existing gating parameters included in the MoE layer
- θ_t represents the newly added expert parameters and the newly added gating parameters included in the MoE layer
- θ_d represents the parameters of other layers of the MoE neural network
- x is the embedding of the current token
- X represents the entire training dataset.
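- The loss function itself is not reproduced above; one plausible form consistent with the symbols listed, offered only as an assumption about the exact expression, is

  \mathcal{L}(\theta_t, \theta_d) = \mathbb{E}_{x \in X}\left[ -\log p_M\!\left(x \mid \theta_{0:t-1}, \theta_t, \theta_d\right) + \lambda \, \mathrm{KL}\!\left( M(x; \theta_{0:t-1}) \,\|\, M(x; \theta_{0:t-1}, \theta_t, \theta_d) \right) \right],

  where the first term is the cross-entropy (perplexity) loss of the modified MoE neural network 111 and the second term is the output regularization auxiliary term, i.e., the Kullback-Leibler divergence between the output of the MoE neural network 110 and the output of the modified MoE neural network 111, scaled by λ.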
- when training the modified MoE neural network 111 on the second training dataset 102-2, the neural network training system 100 updates the values of the set of expert parameters of the expert neural network 130C, starting from the trained values that are the same as those of either the expert neural network 130A or the expert neural network 130B.
- the neural network training system 100 also updates the values of the additional set of gating parameters of the expanded gating neural network 141.
- the neural network training system 100 further updates the parameter values of the first layer 120 and the second layer 150.
- the neural network training system 100 holds the trained values of the respective sets of expert parameters of the expert neural networks 130A-B (that have been determined as a result of the training on the first training dataset 102-1) fixed.
- the neural network training system 100 also holds the trained values of the existing set of gating parameters of the expanded gating neural network 141 — or, put another way, the gating parameters included in the gating neural network 140 (that have been determined as a result of the training on the first training dataset 102-1) — fixed.
- the values of the gating parameters of the one or more initial layers included in the gating neural network 140 are held fixed (but the values of the additional set of gating parameters added to the output layer of the gating neural network 140 will be updated).
- the neural network training system 100 can obtain a third training dataset, and modify the modified MoE neural network 111 to generate a further modified MoE neural network by generating a further expanded MoE layer.
- the further modified MoE neural network will include one or more additional expert neural networks, and a further expanded gating neural network that includes a further additional set of gating parameters.
- the neural network training system 100 then trains the further modified MoE neural network on the third training dataset.
- in this manner, the neural network training system 100 obtains multiple training datasets and progressively expands the MoE neural network 110, e.g., increases the number of expert neural networks (and, correspondingly, parameters) included in the MoE layer 130, as it trains the MoE neural network 110 on each of the multiple training datasets.
- the neural network training system 100 or a different inference system 150 deploys the trained MoE neural network on one or more computing devices to perform inference, i.e., to generate new network outputs for any one of the tasks mentioned above for new network inputs.
- the neural network training system 100 or another system can adapt the trained MoE neural network to a downstream task by further fine-tuning some or all of the parameter values of the trained MoE neural network, e.g., using a different dataset, e.g., a labeled dataset that is specific to the downstream task, or on a different loss function, e.g., a supervised loss function that is specific to the downstream task, or both.
- FIG. 2 is a flow diagram of an example process 200 for training a Mixture of Experts (MoE) neural network on multiple training datasets.
- the process 200 will be described as being performed by a system of one or more computers located in one or more locations.
- a neural network training system e.g., the neural network training system 100 depicted in FIG. 1, appropriately programmed in accordance with this specification, can perform the process 200.
- the MoE neural network includes a plurality of layers.
- the plurality of layers include a MoE layer.
- the MoE layer includes a plurality of expert neural networks that each has a respective set of expert parameters.
- the MoE layer also includes a gating neural network that has a set of gating parameters.
- the system can repeatedly perform multiple iterations of process 200 for multiple training datasets. For example, the system can begin from the first training dataset, and iterate through all training datasets one after another. By repeatedly performing iterations of process 200, the system can generate a trained MoE neural network that can perform a machine learning task.
- the system obtains a new training dataset for the MoE neural network (step 202).
- the new training dataset includes a plurality of training inputs.
- the system generates one or more additional expert neural networks based on one or more of the plurality of expert neural networks (step 204).
- the system generates a fixed number of additional expert neural networks, e.g., one, two, three, or more additional expert neural networks, for every new training dataset.
- for example, when trained on the first training dataset, the MoE layer includes four expert neural networks; when trained on the second training dataset, the MoE layer includes seven expert neural networks (with three additional expert neural networks added); when trained on the third training dataset, the MoE layer includes ten expert neural networks (with three further additional expert neural networks added); and so on.
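- For the fixed-number case, this growth can be summarized (using notation introduced here for illustration) as E_t = E_1 + d\,(t-1), where E_1 is the number of expert neural networks when training on the first training dataset, d is the number of additional expert neural networks added per new training dataset, and t indexes the training datasets; with E_1 = 4 and d = 3 this gives 4, 7, 10, ... expert neural networks, matching the example above.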
- alternatively, the system generates a varying number of additional expert neural networks for every new training dataset, e.g., generates no additional expert neural networks for certain training datasets.
- FIG. 3 shows sub-steps 302-304 corresponding to step 204.
- the system can repeatedly perform iterations of sub-steps 302-304 to generate multiple additional expert neural networks.
- the system identifies an expert neural network from among the plurality of expert neural networks included in the MoE layer (step 302).
- any one of the plurality of expert neural networks included in the MoE layer can be identified.
- the system can identify different expert neural networks at different iterations of sub-steps 302- 304, e.g., by iterating through the plurality of expert neural networks included in the MoE layer or by sampling with some measure of randomness. In other cases, the system can identify the same expert neural network at each iteration of sub-steps 302-304.
- the system generates an additional expert neural network that has the same architecture, and that has a set of expert parameters that have the same trained values, as the identified expert neural network (step 304).
- the trained values of the set of expert parameters of the identified expert neural network are determined as a result of training on a previous training dataset.
- the system generates a modified MoE neural network based on modifying the MoE layer to generate an expanded MoE layer (step 206).
- the modification involves adding the one or more additional expert neural networks to the MoE layer to generate an expanded MoE layer.
- the expanded MoE layer thus includes the plurality of existing expert neural networks and the one or more additional expert neural networks.
- the modification also involves adding a set of gating parameters to the gating neural network included in the MoE layer to generate an expanded gating neural network.
- the expanded gating neural network thus includes an additional set of gating parameters, i.e., in addition to the existing set of gating parameters of the gating neural network.
- the system adds an additional set of gating parameters to an output layer of the gating neural network to increase the dimension of the output layer to match the total number of the existing and additional expert neural networks included in the expanded MoE layer, i.e., such that the expanded gating neural network is able to generate a respective gate score for each of the plurality of expert neural networks and for each of the one or more additional expert neural networks.
- the system trains the modified MoE neural network on the new training dataset (step 208).
- the system updates the values of the respective sets of expert parameters of the one or more additional expert neural networks.
- the system also updates the values of the additional set of gating parameters of the expanded gating neural network.
- the system further updates the values of the parameters of other layers, e.g., the embedding layers, the output layer, and the attention layers, of the neural network.
- the system can update these parameters of the modified MoE neural network by performing a forward pass through the modified MoE neural network using the training inputs obtained from the new training dataset and then performing a backward pass through the neural network to compute the respective gradients of a pre-training loss function by backpropagating through the appropriate parameters.
- the pre-training loss function can include the cross-entropy loss term and, optionally, the auxiliary loss term mentioned above.
- the system can then determine the updates by applying an update rule, e.g., an Adam update rule, an Adafactor update rule, an Rmsprop update rule, or a stochastic gradient descent (SGD) update rule, to the respective gradients.
- the system holds the trained values of the respective sets of expert parameters of the plurality of expert neural networks included in the MoE layer (that have been determined as a result of the training on a previous training dataset) fixed.
- the system also holds the trained values of the existing set of gating parameters of the gating neural network (that have been determined as a result of the training on the previous training dataset) fixed.
- the system can do this, e.g., by applying a stop gradient operator to these parameters.
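- A minimal sketch of one training step that realizes this partition (assuming parameters stored as dicts of NumPy arrays, a plain SGD update standing in for any of the update rules listed above, and a user-supplied grad_fn; all names are illustrative):

```python
def training_step(trainable_params, frozen_params, grad_fn, learning_rate=1e-3):
    """One update step for the modified MoE neural network.

    Gradients are taken with respect to the trainable parameters only (newly added expert
    parameters, newly added gating parameters, and other-layer parameters); the frozen,
    already-trained expert and gating parameters are passed through unchanged, which has
    the same effect as applying a stop gradient operator to them.

    grad_fn: callable(trainable_params, frozen_params) -> dict of gradients keyed like
             trainable_params (e.g., produced by an autodiff framework).
    """
    grads = grad_fn(trainable_params, frozen_params)
    updated = {name: value - learning_rate * grads[name]
               for name, value in trainable_params.items()}
    return updated, frozen_params
```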
- FIG. 4 is an example illustration of training a Mixture of Experts (MoE) neural network on multiple training datasets.
- x_t represents a training dataset (that has a corresponding data distribution), where t indexes the training datasets.
- the number of expert neural networks included in the MoE layer of M is increased when x_t is to be used to train the MoE neural network, i.e., with Expert_{E(t-1)+1} through Expert_{E(t)} added after the MoE neural network has been trained on x_{t-1} and prior to being trained on x_t.
- the values of the respective sets of expert parameters of the existing expert neural networks Expert_1 through Expert_{E(t-1)} will be frozen.
- the existing set of gating parameters of the gating neural network will be frozen (and, correspondingly, the routing to the existing expert neural networks, as illustrated by the curved arrows, could be maintained).
- FIG. 5 shows a quantitative example of the performance gains that can be achieved by a Mixture of Experts (MoE) neural network trained by the neural network training system described in this specification in comparison with MoE neural networks (Dense + Online L2 Reg., Dense + Memory Replay, Dense Oracle) trained using a GShard system (described in Lepikhin, D., et al. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021), a GLaM system (described in Du, N., et al. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp. 5547-5569. PMLR, 2022), and a conventional lifelong learning system.
- Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus.
- the computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
- the term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- the apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a program may, but need not, correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code.
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
- the term “database” is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations.
- the index database can include multiple collections of data, each of which may be organized and accessed differently.
- the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions.
- an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
- the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
- Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit.
- a central processing unit will receive instructions and data from a read only memory or a random access memory or both.
- the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
- the central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
- Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser.
- a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
- Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
- Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework or a JAX framework.
- Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client.
- Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.
- Clause 1. A method of training a Mixture of Experts (MoE) neural network on multiple training datasets to perform a machine learning task, the MoE neural network comprising a plurality of layers, the plurality of layers comprising a MoE layer that comprises a plurality of expert neural networks, the method comprising: obtaining a new training dataset for the MoE neural network; generating one or more additional expert neural networks based on one or more of the plurality of expert neural networks; generating a modified MoE neural network based on modifying the MoE layer to include the one or more additional expert neural networks; and training the modified MoE neural network on the new training dataset to update parameter values of the one or more additional expert neural networks while holding parameter values of the plurality of expert neural networks fixed. (A minimal illustrative sketch of this expand-and-freeze procedure follows the clause list below.)
- Clause 3. The method of any one of clauses 1-2, wherein generating the one or more additional expert neural networks based on one or more of the plurality of expert neural networks comprises: generating a fixed number of additional expert neural networks for every new training dataset.
- Clause 4. The method of any one of clauses 1-3, wherein the one or more additional expert neural networks are configured to operate in parallel with the plurality of expert neural networks in the MoE layer.
- Clause 5. The method of any one of clauses 1-4, wherein generating the modified MoE neural network comprises: modifying one or more output layers of a gating neural network included in the MoE layer such that the gating neural network is configured to generate a respective gate score for each of the plurality of expert neural networks and for each of the one or more additional expert neural networks.
- Clause 6. The method of clause 5, wherein the one or more output layers comprises a softmax layer, and wherein training the modified MoE neural network on the new training dataset comprises: holding parameter values of other layers included in the gating neural network fixed.
- Clause 7. The method of any one of clauses 1-6, wherein training the modified MoE neural network on the new training dataset comprises: updating parameter values of one or more other layers included in the MoE layer.
- Clause 8. The method of any one of clauses 1-7, wherein training the modified MoE neural network on the new training dataset comprises: determining an update to the parameter values of the one or more additional expert neural networks based on optimizing a self-supervised loss function that includes a main term dependent on a performance of the MoE neural network on the machine learning task. (The second sketch following the clause list illustrates this loss.)
- Clause 9. The method of clause 8, wherein the self-supervised loss function also includes an auxiliary term that is dependent on, for a training input in the new training dataset, a difference between (i) a first output generated by the MoE neural network from processing the training input and (ii) a second output generated by the modified MoE neural network from processing the training input.
- Clause 10. The method of clause 9, wherein the difference comprises a Kullback-Leibler divergence, and wherein the auxiliary term is computed as the Kullback-Leibler divergence multiplied by a scaling factor.
- Clause 11. The method of any one of clauses 1-10, wherein the machine learning task comprises a masked token prediction task.
- Clause 12. The method of any one of clauses 1-11, wherein the MoE neural network is a MoE attention neural network, and wherein the MoE layer is a MoE attention layer that comprises an attention sub-layer and a feed-forward sub-layer that comprises the plurality of expert neural networks.
- Clause 13. The method of clause 12, wherein the attention sub-layer comprises one or more linear projection layers.
- Clause 14. The method of any one of clauses 1-13, further comprising adapting the trained MoE neural network to a downstream task that is different from the machine learning task, including fine-tuning the trained parameter values of the trained MoE neural network by using a different loss function, one or more different training datasets, or both.
- Clause 15. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method of any one of clauses 1-14.
- Clause 16. One or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective method of any one of clauses 1-14.
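The clauses above describe the core continual pre-training procedure in prose. The following is a minimal, illustrative JAX sketch of the expand-and-freeze steps of clauses 1, 5, and 7, not the implementation from the specification: the layer structure, the function names (init_moe_layer, expand_moe_layer, moe_forward, train_step), the dense all-experts routing, the placeholder regression loss, and the plain gradient-descent update are all simplifying assumptions.

```python
import jax
import jax.numpy as jnp


def init_expert(key, d_model, d_hidden):
    # One expert: a two-layer feed-forward block.
    k1, k2 = jax.random.split(key)
    return {"w_in": jax.random.normal(k1, (d_model, d_hidden)) * 0.02,
            "w_out": jax.random.normal(k2, (d_hidden, d_model)) * 0.02}


def init_moe_layer(key, num_experts, d_model, d_hidden):
    # An MoE layer: a gating projection plus a list of experts.
    keys = jax.random.split(key, num_experts + 1)
    return {"experts": [init_expert(k, d_model, d_hidden) for k in keys[:-1]],
            "gate_w": jax.random.normal(keys[-1], (d_model, num_experts)) * 0.02}


def expand_moe_layer(params, key, num_new_experts):
    """Add new experts based on the existing ones and widen the gating output."""
    old_experts = params["experts"]
    # Each new expert starts as a copy of one of the existing experts; copying
    # plus small noise, or other schemes, would fit the clauses equally well.
    new_experts = [jax.tree_util.tree_map(jnp.copy, old_experts[i % len(old_experts)])
                   for i in range(num_new_experts)]
    # Widen the gating projection so the softmax yields one score per expert,
    # old and new alike (clause 5); the rest of the gate is left untouched.
    d_model = params["gate_w"].shape[0]
    extra_cols = jax.random.normal(key, (d_model, num_new_experts)) * 0.02
    return ({"experts": old_experts + new_experts,
             "gate_w": jnp.concatenate([params["gate_w"], extra_cols], axis=1)},
            len(old_experts))  # number of original (to-be-frozen) experts


def moe_forward(params, x):
    # x: [batch, d_model]. For brevity every expert processes every input and
    # the outputs are mixed by the softmax gate scores; a sparse top-k router
    # would only change how the combination weights are computed.
    gate_scores = jax.nn.softmax(x @ params["gate_w"], axis=-1)        # [batch, E]
    expert_outs = jnp.stack([jax.nn.relu(x @ e["w_in"]) @ e["w_out"]
                             for e in params["experts"]], axis=1)      # [batch, E, d]
    return jnp.einsum("be,bed->bd", gate_scores, expert_outs)


def train_step(params, x, y, num_frozen_experts, lr=1e-3):
    def loss_fn(p):
        # Placeholder squared-error loss standing in for the task loss.
        return jnp.mean((moe_forward(p, x) - y) ** 2)

    grads = jax.grad(loss_fn)(params)
    # Hold the original experts fixed by zeroing their gradients; only the new
    # experts and the widened gating projection receive updates.
    for i in range(num_frozen_experts):
        grads["experts"][i] = jax.tree_util.tree_map(jnp.zeros_like,
                                                     grads["experts"][i])
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)


# Toy usage: expand the layer with 2 new experts, then take one training step.
params = init_moe_layer(jax.random.PRNGKey(0), num_experts=4, d_model=16, d_hidden=32)
params, num_frozen = expand_moe_layer(params, jax.random.PRNGKey(1), num_new_experts=2)
x, y = jnp.ones((8, 16)), jnp.zeros((8, 16))
params = train_step(params, x, y, num_frozen)
```

In practice the same pattern carries over to sparse top-k routing and to optimizer libraries that support per-parameter masking, which avoids computing updates for the frozen experts at all.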
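A second minimal sketch, this time of the training objective in clauses 8-10: a main self-supervised term (masked token prediction, per clause 11) plus an auxiliary Kullback-Leibler term between the output of the original MoE network and the output of the modified network, multiplied by a scaling factor. The function names, the KL direction, and the value of the scaling factor alpha are illustrative assumptions, not taken from the specification.

```python
import jax
import jax.numpy as jnp


def kl_divergence(p_logits, q_logits):
    # KL(P || Q) over the vocabulary dimension, averaged over all positions.
    p = jax.nn.softmax(p_logits, axis=-1)
    log_p = jax.nn.log_softmax(p_logits, axis=-1)
    log_q = jax.nn.log_softmax(q_logits, axis=-1)
    return jnp.mean(jnp.sum(p * (log_p - log_q), axis=-1))


def continual_pretraining_loss(new_logits, old_logits, target_ids, mask, alpha=0.1):
    """Main masked-token term plus an alpha-scaled KL auxiliary term.

    new_logits / old_logits: [batch, seq, vocab] from the modified and the
    original network on the same training input; target_ids: [batch, seq];
    mask: [batch, seq] with 1.0 at masked positions.
    """
    # Main term: cross-entropy of the modified network at the masked positions.
    log_probs = jax.nn.log_softmax(new_logits, axis=-1)
    token_ll = jnp.take_along_axis(log_probs, target_ids[..., None], axis=-1)[..., 0]
    main_term = -jnp.sum(token_ll * mask) / jnp.maximum(jnp.sum(mask), 1.0)
    # Auxiliary term: keep the modified network close to the original network's
    # predictive distribution; the original network contributes no gradients.
    aux_term = kl_divergence(jax.lax.stop_gradient(old_logits), new_logits)
    return main_term + alpha * aux_term
```

Here old_logits would come from a forward pass of the pre-expansion network (held fixed), so the auxiliary term only regularizes the newly added parameters toward behaviour consistent with what the network had already learned.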
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention concerns methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a Mixture of Experts (MoE) neural network on multiple training datasets. One of the methods consists of obtaining a new training dataset; generating one or more additional expert neural networks based on one or more of the plurality of expert neural networks; generating a modified MoE neural network based on modifying the MoE layer to include the one or more additional expert neural networks; and training the modified MoE neural network on the new training dataset to update parameter values of the one or more additional expert neural networks while holding parameter values of the plurality of expert neural networks fixed.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202480009085.1A CN120883219A (zh) | 2023-01-26 | 2024-01-26 | 专家混合神经网络的终身预训练 |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363441414P | 2023-01-26 | 2023-01-26 | |
| US63/441,414 | 2023-01-26 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024159132A1 (fr) | 2024-08-02 |
Family
ID=90361904
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/013166 Ceased WO2024159132A1 (fr) | 2023-01-26 | 2024-01-26 | Pré-apprentissage continu de réseaux neuronaux de mélange d'experts |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN120883219A (fr) |
| WO (1) | WO2024159132A1 (fr) |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118966275A (zh) * | 2024-10-21 | 2024-11-15 | 沐曦集成电路(上海)有限公司 | 基于moe场景的数据均衡分配方法、电子设备及存储介质 |
| CN119152341A (zh) * | 2024-08-23 | 2024-12-17 | 南京理工大学 | 一种基于多专家协作机制的自适应深度学习条纹分析方法 |
| CN119398121A (zh) * | 2024-11-18 | 2025-02-07 | 西北工业大学 | 基于Multi-A和Multi-B专家杂化混合专家的大模型微调方法 |
| CN119761407A (zh) * | 2024-12-11 | 2025-04-04 | 白熊智数(北京)科技有限公司 | 一种多智能体系统的自适应混合专家模型训练框架 |
| CN120012727A (zh) * | 2025-04-16 | 2025-05-16 | 北京飞书科技有限公司 | 文本生成方法及电子设备 |
| CN120373423A (zh) * | 2025-06-27 | 2025-07-25 | 山东海量信息技术研究院 | 专家并行训练耗时预测方法、装置、设备、介质及产品 |
2024
- 2024-01-26 WO PCT/US2024/013166 patent/WO2024159132A1/fr not_active Ceased
- 2024-01-26 CN CN202480009085.1A patent/CN120883219A/zh active Pending
Non-Patent Citations (16)
| Title |
|---|
| ASHISH VASWANI ET AL.: "Attention is all you need", 31ST CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS (NIPS 2017)
| BOJAR, O ET AL.: "Findings of the 2016 conference on machine translation", PROCEEDINGS OF THE FIRST CONFERENCE ON MACHINE TRANSLATION, vol. 2, 2016, pages 131 - 198 |
| COLIN RAFFEL ET AL.: "Exploring the limits of transfer learning with a unified text-to-text transformer", ARXIV:1910.10683, 2019 |
| DANIEL ADIWARDANA ET AL.: "Towards a human-like open-domain chatbot", CORR, ABS/2001.09977, 2020 |
| DEVLIN, JACOB: "Bert: Pre-training of deep bidirectional transformers for language understanding", ARXIV: 1810.04805 |
| DU, N ET AL.: "Glam: Efficient scaling of language models with mixture-of-experts", INTERNATIONAL CONFERENCE ON MACHINE LEARNING, 2022, pages 5547 - 5569
| HEINKE HIHN ET AL: "Mixture-of-Variational-Experts for Continual Learning", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 December 2021 (2021-12-01), XP091111892 * |
| JIANZHAO LIU ET AL: "LIRA: Lifelong Image Restoration from Unknown Blended Distortions", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 19 August 2020 (2020-08-19), XP081743932 * |
| JOSHI, M ET AL.: "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension", PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, vol. 1, 2017, pages 1601 - 1611 |
| KUDUGUNTA, S. ET AL., EMNLP 2021, 2021, pages 3577 - 3599 |
| LEPIKHIN, D ET AL.: "GShard: Scaling giant models with conditional computation and automatic sharding", IN INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2021 |
| NAN DU ET AL.: "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts", ARXIV: 2112.06905 |
| WILLIAM FEDUS ET AL.: "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity", ARXIV:2101.03961
| RAHAF ALJUNDI ET AL: "Expert Gate: Lifelong Learning with a Network of Experts", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 November 2016 (2016-11-18), XP080733092, DOI: 10.1109/CVPR.2017.753 * |
| SOOCHAN LEE ET AL: "A Neural Dirichlet Process Mixture Model for Task-Free Continual Learning", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 3 January 2020 (2020-01-03), XP081571185 * |
| TOM B BROWN ET AL.: "Language models are few-shot learners", ARXIV:2005.14165, 2020 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN120883219A (zh) | 2025-10-31 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US11934791B2 (en) | On-device projection neural networks for natural language understanding | |
| CN112487182B (zh) | 文本处理模型的训练方法、文本处理方法及装置 | |
| US20250315622A1 (en) | Performing machine learning tasks using instruction-tuned neural networks | |
| US10643120B2 (en) | Joint learning of local and global features for entity linking via neural networks | |
| US20220188636A1 (en) | Meta pseudo-labels | |
| WO2024159132A1 (fr) | Pré-apprentissage continu de réseaux neuronaux de mélange d'experts | |
| US12050983B2 (en) | Attention neural networks with parallel attention and feed-forward layers | |
| EP3411835B1 (fr) | Augmentation des réseaux neuronals par mémoire hiérarchique externe | |
| US20230029590A1 (en) | Evaluating output sequences using an auto-regressive language model neural network | |
| US12393840B2 (en) | Granular neural network architecture search over low-level primitives | |
| JP2025517085A (ja) | 対照キャプションニューラルネットワーク | |
| EP3586276A1 (fr) | Traitement de séquence à l'aide d'une attention en ligne | |
| EP4121909A1 (fr) | Réseaux neuronaux d'attention avec calcul conditionnel | |
| US20240378441A1 (en) | Training of large neural networks | |
| US20240005131A1 (en) | Attention neural networks with tree attention mechanisms | |
| WO2021234610A1 (fr) | Procédé et système d'entraînement d'un algorithme d'apprentissage automatique pour générer un résumé de texte | |
| US20240256865A1 (en) | Training neural networks using learned optimizers | |
| US12423518B2 (en) | Attention neural networks with N-grammer layers | |
| US20230359895A1 (en) | Training neural networks using sign and momentum based optimizers | |
| EP4490663A1 (fr) | Modèles d'apprentissage machine épistémique | |
| US20240289619A1 (en) | Gradient-free structured pruning of neural networks | |
| US20250245502A1 (en) | Training neural networks using weight norm regularizations | |
| US20250371320A1 (en) | Neural networks with learned augmented residual layers | |
| US20250013915A1 (en) | Reinforcement Learning with Information Retrieval Feedback |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24709921; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 202480009085.1; Country of ref document: CN |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWP | Wipo information: published in national office | Ref document number: 202480009085.1; Country of ref document: CN |