WO2021002968A1 - Model generation based on model compression
- Publication number
- WO2021002968A1 (PCT/US2020/033891; US2020033891W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- training
- target
- sample
- models
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
Definitions
- Deep learning models have been continuously developed and have performed well in areas such as natural language processing and computer vision.
- Deep learning models such as the Bidirectional Encoder Representations from Transformers (BERT) model and the Generative Pre-Training Transformer (GPT) model have been shown to achieve good results.
- Such deep learning models tend to rely on complex architectures with deep networks and huge numbers of parameters; for example, the BERT model may contain 24 transformer layers with a total of 340 million parameters, and the GPT model may contain 48 transformer layers with a total of 1.5 billion parameters. Training such complex models and performing inference with them are time-consuming, which makes it difficult to apply them to actual business scenarios. Model compression methods are often used to obtain simple models that have fewer parameters than the complex models and can be deployed.
- Embodiments of the present disclosure provide a method and apparatus for model generation.
- a pre-training dataset can be scored through a plurality of pre-training models, the plurality of pre-training models performing a first task.
- An initial model can be pre-trained with the scored pre-training dataset.
- the initial model can be updated based on a plurality of reference models to obtain a target model, the plurality of reference models performing a second task.
- a reference dataset can be scored through the plurality of reference models.
- the target model can be trained with the scored reference dataset.
- FIG.1 illustrates an exemplary process for model generation based on model compression according to an embodiment of the present disclosure.
- FIG.2 illustrates a schematic diagram for a student model according to an embodiment of the present disclosure.
- FIG.3 illustrates a schematic diagram for an initial model according to an embodiment of the present disclosure.
- FIG.4 illustrates an exemplary process for pre-training an initial model according to an embodiment of the present disclosure.
- FIG.5 illustrates a specific example for pre-training an initial model according to an embodiment of the present disclosure.
- FIG.6 illustrates a schematic diagram for a target model according to an embodiment of the present disclosure.
- FIG.7 illustrates an exemplary process for training a target model according to an embodiment of the present disclosure.
- FIG.8 illustrates a specific example for training a target model according to an embodiment of the present disclosure.
- FIG.9 illustrates an exemplary process for performing tasks during a deployment phase according to an embodiment of the present disclosure.
- FIG.10 illustrates a specific example for performing tasks during a deployment phase according to an embodiment of the present disclosure.
- FIG.11 is a flowchart of an exemplary method for model generation according to an embodiment of the present disclosure.
- FIG.12 illustrates an exemplary apparatus for model generation according to an embodiment of the present disclosure.
- FIG.13 illustrates an exemplary apparatus for model generation according to an embodiment of the present disclosure.
- a commonly used model compression method can be based on Knowledge Distillation. This method usually migrates knowledge from a complex model to a simple model by having the simple model learn the output distribution of the complex model. The knowledge can be considered as the parameters of the complex model and as the input-to-output mapping implemented by the complex model.
- the method is based on a teacher-student architecture in which a model that provides knowledge can be considered a teacher model and a model that learns knowledge can be considered a student model. Specifically, when training the student model, not only training data with human-provided labels but also training data with scores output by the teacher model are provided to the student model. Therefore, the student model can optimize its model parameters by learning the scores output by the teacher model in an attempt to achieve the effect of the teacher model.
- the one-to-one method refers to one teacher model providing knowledge to one student model
- the many-to-many method refers to a plurality of teacher models providing knowledge to a plurality of student models and combining the plurality of student models into a student model set during application.
- Parameters of the student model obtained through the one-to-one method are greatly reduced compared with the teacher model, but due to the information loss during the knowledge distillation, there is a big gap between the effect of the student model and that of the teacher model.
- the effect of the student model set obtained through the many-to-many method is better than that of the student model obtained through the one-to-one method, but since the method performs model integration after over-fitting deviation from each teacher model has been passed to student models, there is still a gap between the effect of the student model and that of the teacher model. In addition, a large amount of memory resources are consumed to host the student model set.
- Embodiments of the present disclosure propose a model generation method based on model compression. According to the embodiments of the present disclosure, a student model with a simple structure that is able to be deployed can be generated and trained to apply to various tasks, such as tasks related to natural language processing.
- the embodiments of the present disclosure propose a student model with a novel structure.
- the student model can learn knowledge from a plurality of teacher models at the same time.
- the model generation method involved in the embodiments of the present disclosure may also be referred to as a many-to-one method. Since the student model constructed according to the embodiments of the present disclosure can learn knowledge from the plurality of teacher models at the same time, over-fitting deviation caused by the plurality of teacher models can be "early calibrated" compared to the many-to-many method, and the student model can obtain more general knowledge from the plurality of teacher models.
- the student model can be applied to a wide variety of tasks, such as tasks related to natural language processing.
- the embodiments of the present disclosure can train a student model through two phases.
- the first phase can be referred to as a pre-training phase, where the student model can be pre-trained with a large number of unlabeled datasets.
- the second phase can be referred to as a training phase, where the student model can be trained for a task to be performed by the student model at the time of actual deployment and with corresponding labeled datasets.
- a large number of datasets without labels can be scored using a teacher model to obtain a large amount of training data for pre-training a student model.
- an unlabeled dataset can comprise a huge amount of data collected by a search engine. Since the teacher model can be a model with better effects, its scoring on samples in the dataset will have higher accuracy and can better approximate human labeling.
- the student model can be pre-trained using the large amount of training data obtained. Since the amount of such training data greatly exceeds the available human-labeled training data, and the teacher model's scoring of the training data has higher accuracy, this helps to pre-train a student model with better effects.
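- The scoring step can be pictured with a minimal Python sketch (illustrative names only, not the patented implementation): every unlabeled sample is passed through each teacher model, and the resulting per-class logits are stored next to the sample to form the scored pre-training dataset. The toy teacher callables below are assumptions standing in for, e.g., BERT-, GPT- or ELMo-based scorers.

```python
# A minimal sketch (illustrative names, not the patented implementation) of
# building the scored pre-training dataset: each unlabeled sample is scored by
# every teacher (pre-training) model, and the raw per-class logits are stored
# alongside the sample.
from typing import Callable, Dict, List, Sequence

Scorer = Callable[[str], List[float]]  # maps a sample to per-class logits

def score_pretraining_dataset(
    samples: Sequence[str],
    pretraining_models: Sequence[Scorer],
) -> List[Dict]:
    scored = []
    for sample in samples:
        # one target probability distribution (raw logits) per teacher model
        target_logits = [teacher(sample) for teacher in pretraining_models]
        scored.append({"sample": sample, "target_logits": target_logits})
    return scored

if __name__ == "__main__":
    # Toy stand-ins for, e.g., BERT-, GPT- or ELMo-based teacher scorers.
    teachers = [
        lambda s: [1.5, -0.3, 0.1],
        lambda s: [1.1, 0.2, -0.4],
        lambda s: [0.9, -0.1, 0.0],
    ]
    data = score_pretraining_dataset(["how tall is mount everest?"], teachers)
    print(data[0]["target_logits"])
```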
- FIG.1 illustrates an exemplary process 100 for model generation based on model compression according to an embodiment of the present disclosure.
- a student model can be generated by using a plurality of teacher models.
- the teacher model can be a model with a complex structure and the student model can be a lightweight model.
- a pre-training dataset 102 and a plurality of pre-training models 104 can be obtained.
- the pre-training dataset refers to a dataset for pre-training a model
- the pre-training model refers to a teacher model for pre-training a student model.
- the pre-training dataset 102 can be, for example, a dataset without labels.
- the pre-training dataset 102 can comprise a plurality of samples collected by a search engine.
- the plurality of pre-training models 104 can perform, for example, a first task related to natural language processing, such as, a matching task, a classification task, a generation task, a language understanding task and a named entity recognition task, etc.
- the matching task can be used to determine relevance between inputs.
- a question answering task in the matching task can be used to determine relevance between an input question and a text paragraph.
- the classification task can be used to classify inputs.
- a sentence classification task may classify an input sentence into one of a plurality of classes, such as a food class, an entertainment class, a music class, a learning class, a tourism class, etc.
- the generation task can generate text that can be understood by humans based on an input.
- a text summary generation task can output a summary by automatically analyzing a given document or a set of documents and extracting key points therein.
- the language understanding task can obtain key information in user intent and a user input by analyzing the user input.
- the named entity recognition task can be used to identify entities with specific meaning in a text, such as person names, place names, institution names, proper nouns, and so on.
- the plurality of pre-training models 104 can have different model structures.
- a pre-training model 104-1 may be a BERT model
- a pre-training model 104-2 may be a GPT model
- a pre-training model 104-3 may be an Embeddings from Language Models (ELMo) model, and the like.
- the plurality of pre-training models 104 may have different hyper-parameters.
- the plurality of pre-training models 104 may all be BERT models, but with different numbers of transformer layers, different hidden sizes, different numbers of attention heads, etc.
- m pre-training models 104 are shown in FIG. 1, including pre-training models 104-1, 104-2, ..., 104-m.
- the m pre-training models 104-1, 104-2, ..., 104-m can score the pre-training dataset 102 to obtain a scored pre-training dataset 106.
- the scored pre-training dataset 106 can be used to pre-train a student model.
- FIG.2 illustrates a schematic diagram for a student model 200 according to an embodiment of the present disclosure.
- the student model 200 can include an encoder layer 210 that converts an input into continuous vector representations.
- the student model 200 can also include a presentation layer 220 that converts the continuous vector representations into contextual continuous vector representations.
- the presentation layer 220 can be implemented using a bidirectional transformer encoder.
- the presentation layer 220 can be implemented using bidirectional Long Short-Term Memory (LSTM).
- the student model 200 can also include a multi-decoder layer 230 that includes a plurality of decoders corresponding to teacher models. Each decoder scores vectors output by the presentation layer 220. These decoders can provide different outputs for different tasks performed by the student model 200.
- For example, for a question answering task in the matching task, the decoder can provide an output indicating whether an input question and an input paragraph are relevant, and for a named entity recognition task, the decoder can provide an output labeling the start and the end of an entity.
- the student model 200 can also include an aggregation layer 240 that aggregates outputs of the multi-decoder layer to provide a final result. It should be appreciated that the model shown in FIG. 2 is only one example of the student model 200. The student model 200 can have any other structure and can include more or fewer layers depending on actual application requirements.
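- As a concrete, hedged illustration of this layered structure, the PyTorch sketch below assembles an encoder layer, a BiLSTM presentation layer, a multi-decoder layer with one linear head per teacher model, and a simple averaging aggregation layer. The layer sizes, the embedding-based encoder and the mean aggregation are illustrative assumptions not fixed by the disclosure.

```python
# A hedged PyTorch sketch of the layered student model of FIG. 2. Only the
# four-layer structure (encoder, presentation, multi-decoder, aggregation) and
# the BiLSTM option for the presentation layer come from the disclosure; the
# layer sizes, the embedding-based encoder and the mean aggregation are
# illustrative assumptions.
import torch
import torch.nn as nn

class StudentModel(nn.Module):
    def __init__(self, vocab_size=30522, embed_dim=128, hidden_dim=256,
                 num_classes=3, num_decoders=3):
        super().__init__()
        # Encoder layer 210: token ids -> continuous vector representations.
        self.encoder = nn.Embedding(vocab_size, embed_dim)
        # Presentation layer 220: contextual representations via a BiLSTM.
        self.presentation = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                                    bidirectional=True)
        # Multi-decoder layer 230: one decoder (linear head) per teacher model.
        self.decoders = nn.ModuleList(
            [nn.Linear(2 * hidden_dim, num_classes) for _ in range(num_decoders)]
        )

    def forward(self, token_ids: torch.Tensor):
        x = self.encoder(token_ids)                 # (batch, seq_len, embed_dim)
        h, _ = self.presentation(x)                 # (batch, seq_len, 2 * hidden_dim)
        pooled = h[:, 0, :]                         # first-token hidden state as global representation
        decoder_logits = [dec(pooled) for dec in self.decoders]  # raw logits per decoder
        # Aggregation layer 240: here, the mean of the per-decoder softmax outputs.
        aggregated = torch.stack([l.softmax(dim=-1) for l in decoder_logits]).mean(dim=0)
        return decoder_logits, aggregated

if __name__ == "__main__":
    model = StudentModel()
    token_ids = torch.randint(0, 30522, (2, 16))    # a toy batch of 2 token sequences
    logits, probs = model(token_ids)
    print(len(logits), probs.shape)                 # 3 torch.Size([2, 3])
```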
- the student model may be referred to as an initial model, such as the initial model 108 shown in FIG. 1.
- the initial model is relative to a target model mentioned below that will be actually deployed. A specific structure of the initial model 108 will be explained later in conjunction with FIG. 3 and details about pre-training the initial model will be explained in conjunction with FIGS. 4 and 5.
- the task can be, for example, a second task related to natural language processing, such as the aforementioned matching task, classification task, generation task, language understanding task, named entity recognition task, etc.
- the second task may be the same as or different from the first task performed by the plurality of pre-training models 104.
- a reference dataset 110 associated with the second task and a plurality of reference models 112 performing the second task may be obtained.
- the reference dataset refers to a dataset for training a student model
- the reference model refers to a teacher model for training the student model.
- the reference dataset 110 may be, for example, a dataset with an initial label.
- the initial label may be provided by humans or in any other way.
- the plurality of reference models 112 can have different model structures.
- a reference model 112-1 may be a BERT model
- a reference model 112-2 may be a GPT model
- a reference model 112-3 may be an ELMo model, etc.
- the plurality of reference models 112 may have different hyper-parameters.
- the plurality of reference models 112 may all be BERT models, but with different numbers of transformer layers, different hidden sizes, different numbers of attention heads, etc.
- n reference models 112 are shown in FIG. 1, including reference models 112-1, 112-2, ..., 112-n.
- the initial model 108 can be updated based on the n reference models 112-1, 112-2, ..., 112-n to obtain a target model 116.
- the target model 116 can be obtained by updating a multi-decoder layer of the initial model 108 to include a first target decoder corresponding to an initial label in the reference dataset 110 and n second target decoders corresponding to the n reference models 112-1, 112-2, ..., 112-n.
- a specific structure of the target model 116 will be explained later in conjunction with FIG.6.
- the n reference models 112-1, 112-2, ..., 112-n may score the reference dataset 110 to obtain a scored reference dataset 118.
- the scored reference dataset 118 can be used to train the target model 116. Details about training the target model will be explained later in conjunction with FIGS. 7 and 8.
- the target model 116 after being trained, can be used to perform, for example, the same second task as the task performed by the plurality of reference models 112.
- the process 100 shown in FIG.1 may include a pre-training phase and a training phase.
- In the pre-training phase, the initial model may be pre-trained through the plurality of pre-training models with the pre-training dataset, as shown by 102-108 in FIG. 1.
- In the training phase, the initial model may be updated based on the plurality of reference models to obtain the target model, and the target model may be trained through the plurality of reference models with the reference dataset having initial labels, as shown by 110-118 in FIG. 1.
- FIG.3 illustrates a schematic diagram for an initial model 300 according to an embodiment of the present disclosure.
- the initial model 300 can be a student model that is pre-trained during the pre-training phase.
- the initial model 300 may include an encoder layer 310, a presentation layer 320, a multi-decoder layer 330, and an aggregation layer 340, which may correspond to the encoder layer 210, the presentation layer 220, the multi-decoder layer 230 and the aggregation layer 240 in FIG. 2, respectively.
- the multi-decoder layer 330 in the initial model 300 may include initial decoders corresponding to teacher models used to pre-train the initial model 300, i.e., pre-training models.
- each initial decoder is intended to learn a corresponding pre-training model.
- the multi-decoder layer 330 may include m initial decoders corresponding to m pre-training models, such as initial decoders 332-1, 332-2, ..., 332-m.
- FIG.4 illustrates an exemplary process 400 for pre-training an initial model according to an embodiment of the present disclosure.
- the initial model can be pre-trained through a plurality of pre-training models with a pre-training dataset.
- the plurality of pre-training models can perform a first task.
- the pre-training dataset can be scored through the plurality of pre training models.
- the pre-training dataset can include a plurality of samples. Each sample i in the pre-training dataset may be scored through the plurality of pre-training models to obtain a plurality of target probability distributions for the sample i.
- the target probability distribution can comprise a set of probabilities that the sample i is labeled, by a pre-training model, as each class in a class set related to the sample i.
- For example, if the probability that the sample i is labeled as $c_1$ by the pre-training model is $R_{t1}$ and the probability that the sample i is labeled as $c_2$ by the pre-training model is $R_{t2}$, where $R_{t1}$ and $R_{t2}$ are both original logits before normalization by the softmax function, the target probability distribution for the sample i produced by the pre-training model is $\{R_{t1}(c_1), R_{t2}(c_2), \ldots\}$.
- the sample i can be scored through the initial model, and more specifically through a plurality of initial decoders corresponding to the plurality of pre-training models in the initial model, to obtain a plurality of predicted probability distributions for the sample i.
- the predicted probability distribution can comprise a set of probabilities that the sample i is labeled, by an initial decoder, as each class in a class set related to the sample i. For example, if the probability that the sample i is labeled as $c_1$ by the initial decoder is $R_{s1}$
- and the probability that the sample i is labeled as $c_2$ by the initial decoder is $R_{s2}$, where $R_{s1}$ and $R_{s2}$ are both original logits before normalization by the softmax function, the predicted probability distribution for the sample i produced by the initial decoder is $\{R_{s1}(c_1), R_{s2}(c_2), \ldots\}$.
- a plurality of prediction losses corresponding to the plurality of predicted probability distributions produced by the plurality of initial decoders may be calculated.
- an initial decoder that produces the predicted probability distribution j can be identified, and a pre-training model corresponding to the initial decoder can be further determined.
- a target probability distribution produced by the pre-training model can be extracted from the plurality of target probability distributions of the sample i.
- a prediction loss corresponding to the predicted probability distribution j can be calculated based on the extracted target probability distribution and the predicted probability distribution j.
- the prediction loss may be calculated through a mean square error loss function based on the difference between the extracted target probability distribution and the predicted probability distribution j.
- the steps 406-410 may be performed for each predicted probability distribution to obtain a plurality of prediction losses corresponding to the plurality of predicted probability distributions.
- a comprehensive prediction loss corresponding to the sample i can be calculated based on the plurality of prediction losses.
- the comprehensive prediction loss may be calculated by calculating an arithmetic mean of the plurality of prediction losses.
- the initial model can be optimized by minimizing the comprehensive prediction loss.
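- The per-sample pre-training objective described in steps 404-414 can be sketched as follows, assuming raw logits on both sides; the mean-square-error pairing of each initial decoder with its own pre-training model and the arithmetic-mean combination follow the description, while the tensor shapes are illustrative.

```python
# A minimal sketch of the per-sample pre-training objective of FIG. 4: each
# initial decoder is matched to its own pre-training model with a mean square
# error loss over raw logits, and the per-decoder losses are averaged into the
# comprehensive prediction loss that is minimized. Tensor shapes are illustrative.
import torch
import torch.nn.functional as F

def pretraining_loss(predicted_logits, target_logits):
    """predicted_logits[i]: raw logits from initial decoder i (one value per class);
    target_logits[i]:    raw logits from pre-training model i (same shape)."""
    per_decoder_losses = [
        F.mse_loss(s_i, t_i)                      # prediction loss for decoder i
        for s_i, t_i in zip(predicted_logits, target_logits)
    ]
    return torch.stack(per_decoder_losses).mean()  # comprehensive prediction loss

if __name__ == "__main__":
    s = [torch.tensor([1.0, -0.2, 0.3]), torch.tensor([0.8, 0.1, -0.1])]
    t = [torch.tensor([1.4, -0.5, 0.2]), torch.tensor([0.9, 0.0, -0.3])]
    print(pretraining_loss(s, t))                  # scalar loss to back-propagate
```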
- FIG.5 illustrates a specific example 500 for pre-training an initial model according to an embodiment of the present disclosure.
- a plurality of pre-training models 520 can perform a question answering task, and a pre-training dataset for pre-training the initial model can be for the question answering task accordingly.
- all samples in the pre-training dataset may be represented as X, where each sample 510 <Q, P> includes a question "Q" and a paragraph "P".
- a class set $C = \{c_1, c_2, c_3\}$ may indicate that the relevance between the question "Q" and the paragraph "P" in the sample 510 <Q, P> can be labeled as one of the following three classes: $c_1$, $c_2$, and $c_3$.
- the relevance between the question "Q" and the paragraph "P" in the sample 510 <Q, P> can be scored through each of the m pre-training models 520-1, 520-2, ..., 520-m to obtain a target probability distribution over the classes $c_j$ in the class set C.
- a target probability distribution $t_i$ produced by the pre-training model 520-i can be represented, for example, by the following formula: $t_i = \{R_{t_i}(c_j \mid \langle Q, P\rangle)\}_{j=1}^{|C|}$
- where $R_{t_i}(c_j \mid \langle Q, P\rangle)$ represents the probability that the sample 510 <Q, P> is labeled as the class $c_j$ by the pre-training model 520-i, which is an original logit before normalization by the softmax function.
- the sample 510 ⁇ Q, P> may also be provided to an initial model 530.
- the initial model 530 may include an encoder layer 540, a presentation layer 550, a multi-decoder layer 560, and an aggregation layer 570.
- the encoder layer 540 can convert the input into continuous vector representations.
- the presentation layer 550 can convert the continuous vector representations into contextual continuous vector representations.
- the multi-decoder layer 560 may include m initial decoders 562-1, 562-2, ..., 562-m corresponding to the m pre-training models 520-1, 520-2, ..., 520-m, respectively, where the initial decoder 562-i corresponds to the pre-training model 520-i. The relevance between the question "Q" and the paragraph "P" in the sample 510 <Q, P> can be scored through each of the m initial decoders 562-1, 562-2, ..., 562-m to obtain a predicted probability distribution over the classes $c_j$ in the class set C.
- the predicted probability distribution $s_i$ produced by the initial decoder 562-i can be represented, for example, by the following formula: $s_i = \{R_{s_i}(c_j \mid \langle Q, P\rangle)\}_{j=1}^{|C|}$
- where $R_{s_i}(c_j \mid \langle Q, P\rangle)$ represents the probability that the sample 510 <Q, P> is labeled as the class $c_j$ by the initial decoder 562-i, and can be calculated, for example, as $R_{s_i}(c_j \mid \langle Q, P\rangle) = \big(W_{s_i}^{T} h\big)_j$
- where $W_{s_i}$ is a learnable parameter matrix related to parameters in the initial model 530, and $h$ is the hidden state of the first token, used as a global representation of the sample 510 <Q, P>.
- $R_{s_i}(c_j \mid \langle Q, P\rangle)$ is an original logit before normalization by the softmax function.
- a prediction loss corresponding to the predicted probability distribution $s_i$ can be calculated based on the target probability distribution $t_i$ and the predicted probability distribution $s_i$.
- the prediction loss may be calculated by a mean square error loss function, as shown by the following formula: $l_i = \frac{1}{|C|}\sum_{j=1}^{|C|}\big(R_{t_i}(c_j \mid \langle Q, P\rangle) - R_{s_i}(c_j \mid \langle Q, P\rangle)\big)^2$
- the above steps can be performed for a predicted probability distribution produced by each of the plurality of initial decoders, thereby obtaining a plurality of prediction losses corresponding to the plurality of predicted probability distributions, and a comprehensive prediction loss corresponding to the sample can be calculated based on the plurality of prediction losses.
- a comprehensive prediction loss $l$ corresponding to the sample 510 <Q, P> can be calculated.
- the comprehensive prediction loss $l$ may be calculated by calculating an arithmetic mean of the plurality of prediction losses, as shown by the following formula: $l = \frac{1}{m}\sum_{i=1}^{m} l_i$
- the initial model 530 can be optimized by minimizing the comprehensive prediction loss $l$.
- the example 500 involves pre-training the initial model for the question answering task
- the embodiments of the present disclosure are not limited thereto, but the initial model may be pre-trained for other tasks related to natural language processing in a similar manner.
- the pre-training phase can be used to tune parameters in the initial model, such as parameters of the encoder layer and the presentation layer. These parameters can be saved for the subsequent training phase.
- the initial model can be further trained for a specific task to be performed during actual deployment. In the case that this specific task is different from the task in the pre-training phase, the multi-decoder layer in the initial model can be redesigned for the specific task to obtain the target model.
- the target model can be trained through a plurality of reference models with a training dataset related to the specific task, i.e., a reference dataset.
- the reference dataset may be, for example, a dataset with initial labels.
- the initial labels may be provided by humans or in any other way.
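- A hedged sketch of this update step is given below: the encoder and presentation layers of the pre-trained initial model are kept, and the multi-decoder layer is replaced by a first target decoder for the initial labels plus one second target decoder per reference model. The module names and linear heads are illustrative assumptions, not the patented implementation.

```python
# A hedged sketch of updating the initial model into the target model of
# FIG. 6: re-use the pre-trained encoder and presentation layers, and attach a
# new multi-decoder layer with a first target decoder plus one second target
# decoder per reference model.
import torch.nn as nn

def build_target_model(encoder: nn.Module, presentation: nn.Module,
                       hidden_dim: int, num_classes: int,
                       num_reference_models: int) -> nn.ModuleDict:
    return nn.ModuleDict({
        "encoder": encoder,                   # re-used from the pre-trained initial model
        "presentation": presentation,         # re-used from the pre-trained initial model
        "first_target_decoder": nn.Linear(2 * hidden_dim, num_classes),
        "second_target_decoders": nn.ModuleList(
            [nn.Linear(2 * hidden_dim, num_classes)
             for _ in range(num_reference_models)]
        ),
    })

if __name__ == "__main__":
    # Toy stand-ins for the pre-trained encoder and presentation layers.
    encoder = nn.Embedding(30522, 128)
    presentation = nn.LSTM(128, 256, batch_first=True, bidirectional=True)
    target_model = build_target_model(encoder, presentation, hidden_dim=256,
                                      num_classes=5, num_reference_models=4)
    print(target_model)
```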
- FIG.6 illustrates a schematic diagram for a target model 600 according to an embodiment of the present disclosure.
- the target model 600 may include an encoder layer 610, a presentation layer 620, a multi-decoder layer 630, and an aggregation layer 640, which may correspond to the encoder layer 210, the presentation layer 220, the multi-decoder layer 230 and the aggregation layer 240 in FIG. 2, respectively.
- the multi-decoder layer 630 in the target model 600 may include a first target decoder 632 corresponding to an initial label in the reference dataset. The first target decoder 632 intends to learn the initial label in the reference dataset.
- the multi-decoder layer 630 may further include a plurality of second target decoders corresponding to a plurality of teacher models for training the target model 600, i.e., a plurality of reference models. Each of the plurality of second target decoders intends to learn a reference model corresponding to the second target decoder.
- the multi-decoder layer 630 may include n second target decoders respectively corresponding to the n reference models, i.e., second target decoders 634-1, 634-2, ..., 634-n.
- FIG.7 illustrates an exemplary process 700 for training a target model according to an embodiment of the present disclosure.
- a target model may be trained through a plurality of reference models with a reference dataset.
- the plurality of reference models may perform a second task.
- the reference dataset may be scored through the plurality of reference models.
- the reference dataset may include a plurality of samples, each sample i having an initial label.
- the initial label may be provided by humans or in any other way.
- Each sample i in the reference dataset may be scored through the plurality of reference models to obtain a plurality of target probability distributions for the sample i.
- For example, the target probability distribution for the sample i produced by the reference model can be denoted as $\{R'_{t1}(c_1), R'_{t2}(c_2), \ldots\}$, where each entry is an original logit before normalization by the softmax function.
- the sample i can be scored through the target model, and more specifically through a first target decoder corresponding to the initial label in the target model, to obtain a first predicted probability distribution for the sample i.
- the first predicted probability distribution is a probability that the sample i is labeled by the first target decoder as the initial label corresponding to the sample i. For example, if the initial label corresponding to the sample i is $t_g$, the first predicted probability distribution of the sample i may be obtained by calculating the probability that the sample i is labeled as the initial label $t_g$ by the first target decoder.
- a first prediction loss corresponding to the first predicted probability distribution may be calculated based on the initial label and the first predicted probability distribution produced by the first target decoder.
- the first prediction loss can be calculated by a cross-entropy function.
- the sample i can be scored through the target model, and more specifically through a plurality of second target decoders corresponding to the plurality of reference models in the target model, to obtain a plurality of second predicted probability distributions for the sample i.
- For example, the second predicted probability distribution for the sample i produced by the second target decoder can be denoted as $\{R'_{s1}(c_1), R'_{s2}(c_2), \ldots\}$.
- a plurality of second prediction losses corresponding to the plurality of second predicted probability distributions produced by the plurality of second target decoders may be calculated.
- a second target decoder that produces the second predicted probability distribution j can be identified, and a reference model corresponding to the second target decoder can be further determined.
- a target probability distribution produced by the reference model can be extracted from the plurality of target probability distributions of the sample i.
- a second prediction loss corresponding to the second predicted probability distribution j can be calculated based on the extracted target probability distribution and the second predicted probability distribution j.
- the second prediction loss may be calculated by a mean square error loss function based on difference between the extracted target probability distribution and the second predicted probability distribution j.
- the steps 710-714 may be performed for each second predicted probability distribution to obtain a plurality of second prediction losses corresponding to the plurality of second predicted probability distributions.
- a comprehensive prediction loss for the sample i can be calculated based on the first prediction loss and the plurality of second prediction losses.
- the comprehensive prediction loss may be calculated by calculating an arithmetic mean of the plurality of second prediction losses and performing a weighted summation of the arithmetic mean with the first prediction loss.
- the target model may be optimized by minimizing the comprehensive prediction loss.
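- The per-sample training objective of process 700 can be sketched as follows. The cross-entropy term for the first target decoder and the mean-square-error terms for the second target decoders follow the description; the exact form of the weighted summation (one factor alpha blending the first prediction loss with the arithmetic mean of the second prediction losses) is an assumption.

```python
# A hedged sketch of the per-sample training objective of FIG. 7. The
# cross-entropy term for the first target decoder and the mean-square-error
# terms for the second target decoders follow the description; the exact form
# of the weighted summation (one factor alpha blending the first prediction
# loss with the arithmetic mean of the second prediction losses) is an assumption.
import torch
import torch.nn.functional as F

def training_loss(first_logits, initial_label, second_logits, target_logits,
                  alpha=0.5):
    # First prediction loss: cross-entropy against the initial label.
    l_g = F.cross_entropy(first_logits.unsqueeze(0),
                          torch.tensor([initial_label]))
    # Second prediction losses: MSE against each reference model's raw logits.
    l_refs = [F.mse_loss(s_i, t_i)
              for s_i, t_i in zip(second_logits, target_logits)]
    l_ref_mean = torch.stack(l_refs).mean()
    # Comprehensive prediction loss: weighted summation of the two terms.
    return alpha * l_g + (1.0 - alpha) * l_ref_mean

if __name__ == "__main__":
    first = torch.tensor([2.0, -1.0, 0.3])               # first target decoder logits
    seconds = [torch.tensor([1.5, -0.8, 0.2]), torch.tensor([1.8, -1.2, 0.1])]
    targets = [torch.tensor([1.6, -0.9, 0.0]), torch.tensor([1.7, -1.0, 0.2])]
    print(training_loss(first, 0, seconds, targets))
```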
- FIG.8 illustrates a specific example 800 for training a target model according to an embodiment of the present disclosure.
- a plurality of reference models may perform a classification task, and a reference dataset for training the target model may be for the classification task accordingly.
- each sample 810 <Q> includes a query "Q" and has an initial label $t_g$ indicating the class of the query "Q", where $t_g \in C'$ and $C' = \{c'_1, c'_2, \ldots, c'_{|C'|}\}$ is a class set related to the classification task.
- the initial label $t_g$ may be provided by humans or in any other way.
- the query "Q" in the sample 810 <Q> can be scored through n reference models 820-1, 820-2, ..., 820-n to obtain a target probability distribution over the classes $c'_j$ in the class set $C'$.
- the target probability distribution $t'_i$ produced by the reference model 820-i may be represented, for example, by the following formula: $t'_i = \{R'_{t_i}(c'_j \mid \langle Q\rangle)\}_{j=1}^{|C'|}$
- where $R'_{t_i}(c'_j \mid \langle Q\rangle)$ represents the probability that the sample 810 <Q> is labeled as the class $c'_j$ by the reference model 820-i, which is an original logit before normalization by the softmax function.
- the sample 810 ⁇ Q > may also be provided to a target model 830.
- the target model 830 can include an encoder layer 840, a presentation layer 850, a multi-decoder layer 860, and an aggregation layer 870.
- the encoder layer 840 can convert an input into continuous vector representations in a manner similar to the encoder layer 540 in the initial model 530.
- the presentation layer 850 can convert the continuous vector representations into contextual continuous vector representations in a manner similar to the presentation layer 550 in the initial model 530.
- the multi-decoder layer 860 may include a first target decoder 862 corresponding to the initial label $t_g$ and n second target decoders 864-1, 864-2, ..., 864-n corresponding to the n reference models 820-1, 820-2, ..., 820-n, respectively, where a second target decoder 864-i corresponds to a reference model 820-i.
- the query "Q" in the sample 810 ⁇ Q > can be scored through the first target decoder 862 to obtain a first predicted probability distribution s g that the sample is labeled as the initial label t g .
- the first predicted probability distribution s g can be calculated, for example, by the following formula:
- W is a leamable parameter matrix related to parameters in the target model 830
- h is a hidden state of the first token as a global representation of the sample 810 ⁇ Q >.
- a first prediction loss corresponding to the first predicted probability distribution s g may be calculated based on the initial label t g and the first predicted probability distribution s g .
- the first prediction loss l g corresponding to the sample 810 ⁇ Q > can be calculated through a cross-entropy function, as shown by the following formula:
- the query "Q" in the sample 810 ⁇ Q > may be scored through each of the second target decoders 884-1, 864-2, ..., 864 -n to obtain a second predicted probability distribution that the sample 810 ⁇ Q > is labeled as respective class c' j in the class set C .
- a probability distribution s’ L produced by a second target decoder 864-/ can be represented, for example, by the following formula:
- R' s (c' j ⁇ ⁇ Q >) represents the probability that the sample 810 ⁇ Q > is labeled as the class c' j by the second target decoder 864-/.
- R' s (c' j ⁇ ⁇ Q >) can be calculated, for example, by the following formula:
- W s Tl is a learnable parameter matrix related to parameters in the target model 830
- h is a hidden state of the first token as a global representation of the sample 810 ⁇ Q >.
- R' s (c' j ⁇ ⁇ Q >) is original logits before normalization by the softmax function.
- a prediction loss l for the second predicted probability distribution s can be calculated based on the target probability distribution t and the second predicted probability distribution s .
- the prediction loss may be calculated by a mean square error loss function, as shown by the following formula:
- the above steps can be performed for a second predicted probability distribution produced by each of the plurality of second target decoders, thereby obtaining a plurality of second prediction losses corresponding to the plurality of second predicted probability distributions, and a comprehensive prediction loss corresponding to the sample can be calculated based on the plurality of second prediction losses.
- the comprehensive prediction loss $l'$ corresponding to the sample 810 <Q> can be calculated.
- the comprehensive prediction loss $l'$ can be calculated, for example, by the following formula: $l' = \alpha\, l_g + (1 - \alpha)\cdot\frac{1}{n}\sum_{i=1}^{n} l'_i$, where $\alpha$ is a weighting factor set by the system.
- the target model 830 can be optimized by minimizing the comprehensive prediction loss $l'$.
- example 800 involves training the target model for the classification task
- the embodiments of the present disclosure are not limited thereto, but the target model may be trained for other tasks related to natural language processing in a similar manner.
- the target model after being trained, can be used to perform, during a deployment phase, the same task as the task performed during the training phase, i.e., the second task.
- FIG.9 illustrates an exemplary process 900 for performing a task during a deployment phase according to an embodiment of the present disclosure.
- an input related to the second task can be received by the target model.
- the input can be scored through a plurality of target decoders in the target model to obtain a plurality of predicted probability distributions of the input.
- the input can be scored respectively through the first target decoder 632 and the n second target decoders 634-1, 634-2, ..., 634-n in the multi-decoder layer 630 of the target model 600 shown in FIG. 6.
- a class set $C' = \{c'_1, c'_2, \ldots, c'_{|C'|}\}$ related to the input may be determined first.
- for each class $c'_j$, a predicted probability that each of the plurality of target decoders labels the input as the class $c'_j$ can be determined.
- the predicted probability that the input is labeled as the class $c'_j$ by the first target decoder 632 can be represented as, for example, $P(c'_j)$.
- the predicted probability that the input is labeled as the class $c'_j$ by the second target decoder 634-i can be represented as, for example, $P'_i(c'_j)$.
- a predicted result of the input can be determined based on a plurality of predicted probability distributions produced by the plurality of target decoders.
- the predicted result of the input can be determined through the aggregation layer 640 in the target model 600 shown in FIG. 6.
- a comprehensive predicted probability $O(c'_j)$ corresponding to each class $c'_j$ can be calculated.
- the comprehensive predicted probability $O(c'_j)$ can be calculated, for example, by combining (e.g., averaging or weighted-summing) the predicted probabilities $P(c'_j)$ and $P'_1(c'_j), \ldots, P'_n(c'_j)$ produced by the plurality of target decoders for the class $c'_j$.
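- A minimal sketch of this deployment-phase aggregation is given below, under the assumption that the aggregation layer averages the per-decoder class probabilities and selects the class with the maximum comprehensive predicted probability; the disclosure leaves the exact combination formula open.

```python
# A minimal sketch of the deployment-phase aggregation of FIG. 9, under the
# assumption that the aggregation layer averages the per-decoder class
# probabilities and selects the class with the maximum comprehensive predicted
# probability; the exact combination formula is not fixed by the disclosure.
import torch

def aggregate_and_predict(decoder_logits):
    """decoder_logits: per-class logit tensors, one per target decoder
    (the first target decoder plus the n second target decoders)."""
    probs = torch.stack([l.softmax(dim=-1) for l in decoder_logits])  # (n + 1, |C'|)
    comprehensive = probs.mean(dim=0)        # O(c'_j): comprehensive probability per class
    return comprehensive, int(comprehensive.argmax())

if __name__ == "__main__":
    logits = [torch.tensor([2.0, 0.1, -1.0]),   # first target decoder
              torch.tensor([1.5, 0.3, -0.5]),   # second target decoder 1
              torch.tensor([1.8, -0.2, -0.8])]  # second target decoder 2
    dist, predicted_class = aggregate_and_predict(logits)
    print(dist, predicted_class)
```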
- FIG.10 illustrates a specific example 1000 for performing a task during a deployment phase according to an embodiment of the present disclosure. Continuing with the classification task as an example:
- a target model 1020 can obtain an input related to the classification task, such as a sample 1010 <Q>, where "Q" represents a query. Assume that a class set related to the sample is represented as $C'$, where the number of classes can be represented as $|C'|$.
- the target model 1020 can include an encoder layer 1030, a presentation layer 1040, a multi-decoder layer 1050, and an aggregation layer 1060.
- the encoder layer 1030 can convert the input into continuous vector representations.
- the presentation layer 1040 can convert the continuous vector representations into contextual continuous vector representations.
- the multi-decoder layer 1050 may include a plurality of target decoders, such as a first target decoder 1052 and n second target decoders 1054-1, 1054-2, ..., 1054-n.
- a predicted probability that the sample 1010 <Q> is labeled as the class $c'_j$ by each of the plurality of target decoders can be determined.
- the predicted probability that the sample 1010 <Q> is labeled as the class $c'_j$ by the first target decoder 1052 can be represented as $P(c'_j \mid \langle Q\rangle)$.
- the predicted probability that the sample 1010 <Q> is labeled as the class $c'_j$ by the second target decoder 1054-i can be represented as $P'_i(c'_j \mid \langle Q\rangle)$.
- the predicted probabilities $P(c'_j \mid \langle Q\rangle)$ and $P'_i(c'_j \mid \langle Q\rangle)$ for the class $c'_j$ can be provided to the aggregation layer 1060.
- the comprehensive predicted probability $O(c'_j \mid \langle Q\rangle)$ for the class $c'_j$ can be calculated, for example, in the manner described above in connection with FIG. 9. Subsequently, for example, a class whose comprehensive predicted probability is the maximum value may be selected as the final predicted result S output by the target model 1020.
- FIG.11 is a flowchart of an exemplary method 1100 for model generation according to an embodiment of the present disclosure.
- a pre-training dataset can be scored through a plurality of pre-training models, the plurality of pre-training models performing a first task.
- an initial model can be pre-trained with the scored pre-training dataset.
- the initial model can be updated based on a plurality of reference models to obtain a target model, the plurality of reference models performing a second task.
- a reference dataset can be scored through the plurality of reference models.
- the target model can be trained with the scored reference dataset.
- the pre-training dataset comprises a plurality of samples
- the scoring the pre-training dataset comprises, for each sample of the plurality of samples: scoring the sample through the plurality of pre-training models, respectively, to obtain a plurality of target probability distributions of the sample.
- the initial model comprises a multi-decoder layer, the multi-decoder layer comprising a plurality of initial decoders corresponding to the plurality of pre-training models.
- the scored pre-training dataset comprises a plurality of samples, each sample having a plurality of target probability distributions produced by the plurality of pre-training models
- the pre-training the initial model includes, for each sample: scoring the sample through the plurality of initial decoders, respectively, to obtain a plurality of predicted probability distributions of the sample; calculating a plurality of prediction losses corresponding to the plurality of predicted probability distributions; calculating a comprehensive prediction loss corresponding to the sample based on the plurality of prediction losses; and optimizing the initial model by minimizing the comprehensive prediction loss.
- the calculating the plurality of prediction losses corresponding to the plurality of predicted probability distributions comprises, for each predicted probability distribution of the plurality of predicted probability distributions: determining a pre-training model corresponding to an initial decoder that produces the predicted probability distribution; extracting a target probability distribution produced by the pre-training model from the plurality of target probability distributions of the sample; and calculating a prediction loss corresponding to the predicted probability distribution based on the extracted target probability distribution and the predicted probability distribution.
- the reference dataset comprises a plurality of samples
- the scoring the reference dataset comprises, for each sample of the plurality of samples: scoring the sample through the plurality of reference models, respectively, to obtain a plurality of target probability distributions of the sample.
- the updating comprises: updating a multi-decoder layer of the initial model to include a first target decoder and a plurality of second target decoders corresponding to the plurality of reference models.
- the scored reference dataset comprises a plurality of samples, each sample having an initial label and a plurality of target probability distributions produced by the plurality of reference models
- the training the target model includes, for each sample: scoring the sample through the first target decoder to obtain a first predicted probability distribution of the sample; calculating a first prediction loss corresponding to the first predicted probability distribution based on the initial label and the first predicted probability distribution; scoring the sample through the plurality of second target decoders to obtain a plurality of second predicted probability distributions of the sample; calculating a plurality of second prediction losses corresponding to the plurality of second predicted probability distributions; calculating a comprehensive prediction loss corresponding to the sample based on the first prediction loss and the plurality of second prediction losses; and optimizing the target model by minimizing the comprehensive prediction loss.
- the calculating the plurality of second prediction losses corresponding to the plurality of second predicted probability distributions comprises, for each second predicted probability distribution of the plurality of second predicted probability distributions: determining a reference model corresponding to a second target decoder that produces the second predicted probability distribution; extracting a target probability distribution produced by the reference model from the plurality of target probability distributions of the sample; calculating a second prediction loss corresponding to the second predicted probability distribution based on the extracted target probability distribution and the second predicted probability distribution.
- the target model is trained to perform the second task.
- the method 1100 further comprises: receiving an input related to the second task; scoring the input through a plurality of decoders in the target model to obtain a plurality of predicted probability distributions of the input; and determining a predicted result of the input based on the plurality of predicted probability distributions.
- the first task or the second task comprises one or more of a matching task, a classification task, a generation task, a language understanding task, and a named entity recognition task.
- the plurality of pre-training models have different model structures or have different hyper-parameters
- the plurality of reference models have different model structures or have different hyper-parameters
- the plurality of pre-training models or the plurality of reference models are models with complex structures, and the target model is a lightweight model.
- the pre-training dataset comprises a plurality of samples collected by a search engine.
- the method 1100 may further comprise any steps/processes for model generation according to the embodiments of the present disclosure as mentioned above.
- FIG.12 illustrates an exemplary apparatus 1200 for model generation according to an embodiment of the present disclosure.
- the apparatus 1200 may comprise a first scoring module, for scoring a pre-training dataset through a plurality of pre-training models, the plurality of pre-training models performing a first task; a pre-training module, for pre-training an initial model with the scored pre-training dataset; an updating module, for updating the initial model based on a plurality of reference models to obtain a target model, the plurality of reference models performing a second task; a second scoring module, for scoring a reference dataset through the plurality of reference models; and a training module, for training the target model with the scored reference dataset.
- the initial model comprises a multi-decoder layer, the multi-decoder layer comprising a plurality of initial decoders corresponding to the plurality of pre-training models.
- the updating module is further configured for: updating a multi-decoder layer of the initial model to include a first target decoder and a plurality of second target decoders corresponding to the plurality of reference models.
- the scored reference dataset comprises a plurality of samples, each sample having an initial label and a plurality of target probability distributions produced by the plurality of reference models
- the training module is further configured for: scoring the sample through the first target decoder to obtain a first predicted probability distribution of the sample; calculating a first prediction loss corresponding to the first predicted probability distribution based on the initial label and the first predicted probability distribution; scoring the sample through the plurality of second target decoders to obtain a plurality of second predicted probability distributions of the sample; calculating a plurality of second prediction losses corresponding to the plurality of second predicted probability distributions; calculating a comprehensive prediction loss corresponding to the sample based on the first prediction loss and the plurality of second prediction losses; and optimizing the target model by minimizing the comprehensive prediction loss.
- the apparatus 1200 may further comprise any other modules configured for model generation according to the embodiments of the present disclosure as mentioned above.
- FIG.13 illustrates an exemplary apparatus 1300 for model generation according to an embodiment of the present disclosure.
- the apparatus 1300 may comprise at least one processor 1310.
- the apparatus 1300 may further comprise a memory 1320 connected with the processor 1310.
- the memory 1320 may store computer-executable instructions that, when executed, cause the processor 1310 to perform any operations of the methods for model generation according to the embodiments of the present disclosure as mentioned above.
- the embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium.
- the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for model generation according to the embodiments of the present disclosure as mentioned above.
- modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
- processors are described in connection with various apparatus and methods. These processors can be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system.
- a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), state machine, gate logic, discrete hardware circuitry, and other suitable processing components configured to perform the various functions described in this disclosure.
- the functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a microcontroller, a DSP, or another suitable platform.
- Software should be considered broadly to represent instructions, instruction sets, code, code segments, program code, programs, subroutines, software modules, applications, software applications, software packages, routines, subroutines, objects, running threads, processes, functions, and the like. Software can reside on computer readable medium.
- Computer readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, a random access memory (RAM), a read only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk.
- Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, the memory may also be internal to the processor (e.g., a cache or a register).
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- Machine Translation (AREA)
- Feedback Control In General (AREA)
- Stored Programmes (AREA)
- Testing And Monitoring For Control Systems (AREA)
Abstract
The present disclosure provides a method and apparatus for model generation. A pre-training dataset can be scored through a plurality of pre-training models, the plurality of pre-training models performing a first task. An initial model can be pre-trained with the scored pre-training dataset. The initial model can be updated based on a plurality of reference models to obtain a target model, the plurality of reference models performing a second task. A reference dataset can be scored through the plurality of reference models. The target model can be trained with the scored reference dataset.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910588384.6 | 2019-07-02 | | |
| CN201910588384.6A CN112257860B (zh) | 2019-07-02 | 2019-07-02 | Model generation based on model compression |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021002968A1 (fr) | 2021-01-07 |
Family
ID=71016708
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2020/033891 Ceased WO2021002968A1 (fr) | Model generation based on model compression | 2019-07-02 | 2020-05-21 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN112257860B (fr) |
| WO (1) | WO2021002968A1 (fr) |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113642605A (zh) * | 2021-07-09 | 2021-11-12 | 北京百度网讯科技有限公司 | 模型蒸馏方法、装置、电子设备及存储介质 |
| CN114169501A (zh) * | 2021-12-02 | 2022-03-11 | 深圳市华尊科技股份有限公司 | 神经网络压缩方法及相关设备 |
| CN114357152A (zh) * | 2021-09-03 | 2022-04-15 | 北京大学 | 信息处理方法、装置、计算机可读存储介质和计算机设备 |
| WO2022178948A1 (fr) * | 2021-02-26 | 2022-09-01 | 平安科技(深圳)有限公司 | Procédé et appareil de distillation de modèle, dispositif, et support d'enregistrement |
| CN116453139A (zh) * | 2023-04-19 | 2023-07-18 | 科大讯飞股份有限公司 | 一种预训练方法及相关方法和设备 |
| CN116484005A (zh) * | 2023-06-25 | 2023-07-25 | 北京中关村科金技术有限公司 | 一种分类模型构建方法、装置及存储介质 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115130465A (zh) * | 2022-07-18 | 2022-09-30 | 浙大城市学院 | 文献数据集上知识图谱实体标注错误识别方法和系统 |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2018126213A1 (fr) * | 2016-12-30 | 2018-07-05 | Google Llc | Multi-task learning using knowledge distillation |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11074470B2 (en) * | 2017-12-06 | 2021-07-27 | La Société Dataperformers Inc. | System and method for automatically improving gathering of data using a data gathering device |
| CN109145828B (zh) * | 2018-08-24 | 2020-12-25 | 北京字节跳动网络技术有限公司 | 用于生成视频类别检测模型的方法和装置 |
| CN109191453A (zh) * | 2018-09-14 | 2019-01-11 | 北京字节跳动网络技术有限公司 | 用于生成图像类别检测模型的方法和装置 |
| CN109214700A (zh) * | 2018-09-20 | 2019-01-15 | 北京绿橙天下信息技术有限公司 | 一种基于进阶式能力轴的学生能力分析方法 |
- 2019
  - 2019-07-02 CN CN201910588384.6A patent/CN112257860B/zh active Active
- 2020
  - 2020-05-21 WO PCT/US2020/033891 patent/WO2021002968A1/fr not_active Ceased
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2018126213A1 (fr) * | 2016-12-30 | 2018-07-05 | Google Llc | Multi-task learning using knowledge distillation |
Non-Patent Citations (4)
| Title |
|---|
| XIAODONG LIU ET AL: "Multi-Task Deep Neural Networks for Natural Language Understanding", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 31 January 2019 (2019-01-31), XP081365050 * |
| XIAODONG LIU ET AL: "The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 15 May 2020 (2020-05-15), XP081660801 * |
| ZE YANG ET AL: "Model Compression with Multi-Task Knowledge Distillation for Web-scale Question Answering System", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 21 April 2019 (2019-04-21), XP081171965 * |
| ZE YANG ET AL: "Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 October 2019 (2019-10-18), XP081517466 * |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022178948A1 (fr) * | 2021-02-26 | 2022-09-01 | 平安科技(深圳)有限公司 | Procédé et appareil de distillation de modèle, dispositif, et support d'enregistrement |
| CN113642605A (zh) * | 2021-07-09 | 2021-11-12 | 北京百度网讯科技有限公司 | 模型蒸馏方法、装置、电子设备及存储介质 |
| CN114357152A (zh) * | 2021-09-03 | 2022-04-15 | 北京大学 | 信息处理方法、装置、计算机可读存储介质和计算机设备 |
| CN114169501A (zh) * | 2021-12-02 | 2022-03-11 | 深圳市华尊科技股份有限公司 | 神经网络压缩方法及相关设备 |
| CN116453139A (zh) * | 2023-04-19 | 2023-07-18 | 科大讯飞股份有限公司 | 一种预训练方法及相关方法和设备 |
| CN116484005A (zh) * | 2023-06-25 | 2023-07-25 | 北京中关村科金技术有限公司 | 一种分类模型构建方法、装置及存储介质 |
| CN116484005B (zh) * | 2023-06-25 | 2023-09-08 | 北京中关村科金技术有限公司 | 一种分类模型构建方法、装置及存储介质 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112257860A (zh) | 2021-01-22 |
| CN112257860B (zh) | 2025-03-28 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10635977B2 (en) | Multi-task learning using knowledge distillation | |
| WO2021002968A1 (fr) | Génération de modèle basée sur une compression de modèle | |
| US12008473B2 (en) | Augmenting machine learning language models using search engine results | |
| US11113479B2 (en) | Utilizing a gated self-attention memory network model for predicting a candidate answer match to a query | |
| KR102342066B1 (ko) | 뉴럴 네트워크 모델을 이용한 기계 번역 방법, 장치 및 그 장치를 학습시키기 위한 방법 | |
| CN111832290B (zh) | 用于确定文本相关度的模型训练方法、装置、电子设备及可读存储介质 | |
| US11675975B2 (en) | Word classification based on phonetic features | |
| CN114911892A (zh) | 用于搜索、检索和排序的交互层神经网络 | |
| US20130103382A1 (en) | Method and apparatus for searching similar sentences | |
| WO2021257160A1 (fr) | Apprentissage de sélection de modèle pour une distillation de connaissances | |
| KR20220059288A (ko) | 지식 증류 기반의 멀티모달 매핑 정보를 활용한 이미지 생성 기법 | |
| CN118227850B (zh) | 一种试题查重方法及系统 | |
| AU2023236937A1 (en) | Generating output sequences with inline evidence using language model neural networks | |
| US20230177097A1 (en) | Multi-phase training of machine learning models for search ranking | |
| US10657203B2 (en) | Predicting probability of occurrence of a string using sequence of vectors | |
| CN116340519A (zh) | 文本分类模型训练方法、文本分类方法及相关装置 | |
| US12197535B2 (en) | Determining a denoised named entity recognition model and a denoised relation extraction model | |
| US20220366317A1 (en) | Systems and methods for field extraction from unlabeled data | |
| US20230281392A1 (en) | Computer-readable recording medium storing computer program, machine learning method, and natural language processing apparatus | |
| WO2022203840A1 (fr) | Apprentissage de représentation de textes dans plusieurs langues | |
| CN120087368A (zh) | 一种基于模板对比学习的新词命名实体识别系统 | |
| CN117172323B (zh) | 一种基于特征对齐的专利多领域知识抽取方法及系统 | |
| US20250181620A1 (en) | Fine-grained attribution for document question answering | |
| CN113536790A (zh) | 基于自然语言处理的模型训练方法及装置 | |
| CN118503821A (zh) | 一种创新的跨领域自适应提示学习方法 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20731352; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20731352; Country of ref document: EP; Kind code of ref document: A1 |