WO2022143723A1 - Voice recognition model training method, voice recognition method, and corresponding device - Google Patents
- Publication number
- WO2022143723A1 (PCT/CN2021/142307)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech recognition
- model
- training
- recognition model
- feature data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition; G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice; G10L15/063—Training
- G10L15/08—Speech classification or search; G10L15/16—Speech classification or search using artificial neural networks
Definitions
- the present disclosure relates to the technical field of artificial intelligence, and in particular, to a speech recognition model training method, a speech recognition method and a corresponding device.
- the traditional machine learning training process requires a large amount of training data, and it is difficult to obtain sufficient training data in some sub-fields (such as announcement tones). Even when a large amount of training data is obtained, considerable manpower and material resources are needed to label the data.
- announcement-tone data differs from scenario to scenario, which causes the existing model to have a low recognition rate in new scenarios.
- the previous practice was to add the new announcement tones to the earlier training data and start training from scratch. Not only did training take a long time, but because the newly added data had less pronounced characteristics than the earlier massive data, the recognition effect was mediocre.
- although the training data of the new scenario differs more or less from the source data, some instances in the source training data are still suitable for the new application scenario. Directly discarding the existing trained model and starting training from scratch results in duplicated training of some data and a huge waste of time and resources.
- the purpose of one or more embodiments of the present disclosure is to provide a speech recognition model training method, speech recognition method and corresponding apparatus, which can reduce the training cost and training time of the speech recognition model, and improve the training efficiency and model recognition rate.
- one or more embodiments of the present disclosure include the following first to eighth aspects.
- a first aspect provides a speech recognition model training method, comprising: creating a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, wherein the output layer of the migration model is different from the output layer of the original speech recognition model; extracting a feature data set from the sample data of the target domain, wherein each feature data in the feature data set carries an environmental factor that can reflect the similar possibility of different sample data in terms of the same physical properties of speech; and substituting the training set in the feature data set into the migration model as input parameters for iterative training to obtain the target speech recognition model; wherein, during the iterative training process, the weights of the output layer of the migration model are adjusted randomly, and the weights of the other layers of the migration model remain unchanged.
- a speech recognition method including: determining speech data to be recognized; extracting characteristic data from the speech data, wherein the characteristic data carries an environmental factor that can reflect the similar possibility of different sample data in terms of the same physical properties of speech; and substituting the feature data into the target speech recognition model for speech recognition, the target speech recognition model being trained based on the speech recognition model training method of the first aspect.
- a speech recognition model training device including: a creation module for creating a migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, wherein the output layer of the migration model is different from the output layer of the original speech recognition model; and an extraction module for extracting a feature data set from the sample data of the target domain, wherein each feature data in the feature data set carries an environmental factor that can reflect the similar possibility of different sample data in terms of the same physical properties of speech.
- the training module is used to substitute the training set in the feature data set as input parameters into the migration model for iterative training to obtain the target speech recognition model; wherein, in the iterative training process, the weights of the output layer of the migration model are adjusted randomly, and the weights of the other layers of the migration model remain unchanged.
- a speech recognition device comprising: a determination module for determining speech data to be recognized; an extraction module for extracting characteristic data from the speech data, wherein the characteristic data carries an environmental factor that can reflect the similar possibility of different sample data in terms of the same physical properties of speech; and a recognition module for substituting the feature data into the target speech recognition model for speech recognition, the target speech recognition model being trained based on the speech recognition model training method of the first aspect.
- an electronic device comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program, when executed by the processor: creates a migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, where the output layer of the migration model is different from the output layer of the original speech recognition model; extracts a feature data set from the sample data in the target domain, where each feature data in the feature data set carries an environmental factor that can reflect the similar possibility of different sample data in terms of the same physical properties of speech; and substitutes the training set in the feature data set into the migration model as input parameters for iterative training to obtain the target speech recognition model; wherein, in the iterative training process, the weights of the output layer of the migration model are adjusted randomly, and the weights of the other layers of the migration model remain unchanged.
- a computer-readable storage medium stores one or more programs, and the one or more programs, when executed by a server including a plurality of application programs, cause the server to perform the following operations: create a migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, where the output layer of the migration model is different from the output layer of the original speech recognition model; extract a feature data set from the sample data in the target domain, where each feature data in the feature data set carries an environmental factor that can reflect the similar possibility of different sample data in terms of the same physical properties of speech; and substitute the training set in the feature data set into the migration model as input parameters for iterative training to obtain the target speech recognition model; wherein, in the iterative training process, the weights of the output layer of the migration model are adjusted randomly, and the weights of the other layers of the migration model remain unchanged.
- an electronic device comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program, when executed by the processor: determines speech data to be recognized; extracts feature data from the speech data, the feature data carrying an environmental factor that can reflect the similar possibility of different sample data in terms of the same physical properties of speech; and substitutes the feature data into the target speech recognition model for speech recognition, the target speech recognition model being trained based on the speech recognition model training method of the first aspect.
- a computer-readable storage medium stores one or more programs, and the one or more programs, when executed by a server including a plurality of application programs, cause the server to perform the following operations: determine the voice data to be recognized; extract feature data from the voice data, the feature data carrying an environmental factor that can reflect the similar possibility of different sample data in terms of the same physical properties of speech; and substitute the feature data into the target speech recognition model for speech recognition, the target speech recognition model being obtained by training based on the speech recognition model training method of the first aspect.
- FIG. 1 is a schematic diagram of steps of a speech recognition model training method provided by the present disclosure.
- FIG. 2 is a schematic diagram of steps of a speech recognition method provided by the present disclosure.
- FIGS. 3a and 3b are schematic diagrams of recognition results before and after training, respectively, provided by the present disclosure.
- FIG. 4 is a flowchart of the training and application of the speech recognition model provided by the present disclosure by taking the announcement tone as an example.
- FIG. 5 is a schematic structural diagram of a speech recognition model training apparatus provided by the present disclosure.
- FIG. 6 is a schematic structural diagram of a speech recognition device provided by the present disclosure.
- FIG. 7 is a schematic structural diagram of an electronic device provided by the present disclosure.
- Frozen layers: layers whose parameters remain fixed and unchanged during training.
- Retrain layers: the last few layers of the neural network that are removed and re-trained on the new dataset.
- Learning rate: controls the speed of parameter updates during gradient descent. If the learning rate is too small, convergence is guaranteed but the optimization speed of the model is greatly reduced; if it is too large, training may overshoot the global optimum, causing the parameters to "oscillate".
- Announcement tone: the prompt tone that plays during a phone call.
- the main purpose of the present disclosure is to provide a solution that, in the case of insufficient training data for different application scenarios, uses transfer learning to speed up the generation of an effective training model, reduce costs, and improve training efficiency. At the same time, the feature data used in the retraining layer carries environmental factors that can affect the physical properties of speech, so the target speech recognition model obtained by training has good scalability and high recognition accuracy.
- training and recognition scheme of the speech recognition model involved in the present disclosure is not limited to the recognition scenario of the announcement sound, but can also be applied to other speech recognition scenarios with differences in the physical properties of speech.
- the method may include the following steps 102 to 106 .
- Step 102 Create a migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, wherein the output layer of the migration model is different from the output layer of the original speech recognition model.
- the source domain can be understood as the feature range formed by the sample data on which the original speech recognition model was trained.
- the sample data in the range of these features is known and has a sufficient number to be well trained to generate a high-quality model.
- the target domain can be understood as the domain to be learned.
- the number of sample data in this domain is limited or relatively small, and it is impossible to train a high-quality model alone.
- step 102 uses the original speech recognition model trained in the source domain to perform transfer learning, so that a small amount of sample data in the target domain can be trained based on the transferred model parameters.
- in step 102, when creating a migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, either of the following two methods can be used.
- the speech recognition model involved in the present disclosure is a deep neural network model.
- the difference from the first method is that a speech recognition model is not newly created; instead, the obtained original speech recognition model is processed directly, and the original speech recognition model with the output-layer weights removed is used as the migration model to be trained.
- the output layer is equivalent to a retraining layer, which is mainly based on a small number of speech samples in the target domain for migration training.
- the other fully connected layers act as frozen layers, and no parameter changes occur during model training.
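The frozen-layer/retrain-layer split described above can be sketched as follows. The dict-of-arrays model representation, layer names, and initialization scheme are illustrative assumptions for this sketch, not the patent's actual implementation:

```python
import numpy as np

def create_migration_model(original_model, n_target_classes, rng=None):
    """Build a target-domain migration model from a trained source-domain model.

    All hidden-layer weights are copied over and kept fixed (frozen layers);
    only the output layer is re-initialised randomly for the new label set
    (the retrain layer).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Transfer every layer except the output layer unchanged.
    migration = {name: w.copy() for name, w in original_model.items()
                 if name != "output"}
    hidden_dim = original_model["output"].shape[0]
    # New output layer: small random weights, sized for the target labels.
    migration["output"] = rng.normal(0.0, 0.01, (hidden_dim, n_target_classes))
    return migration

# Hypothetical source model: two hidden layers and a 40-class output layer.
source = {
    "hidden1": np.ones((16, 32)),
    "hidden2": np.ones((32, 8)),
    "output": np.ones((8, 40)),
}
target = create_migration_model(source, n_target_classes=10)
```

During training, gradient updates would then be applied only to `target["output"]`, leaving the migrated layers untouched.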
- Step 104 Extract a feature data set from the sample data in the target domain, wherein each feature data in the feature data set carries an environmental factor that can reflect the similar possibility of different sample data in terms of the same physical properties of speech.
- the physical attributes of speech mainly include four types of attribute elements: pitch, intensity, length, and timbre. Generally speaking, it can be understood as intonation, accent, etc.
- when step 104 extracts the feature data set from the sample data of the target domain, it includes:
- the first step is to preprocess the sample data of the target domain separately to obtain the MFCC.
- Step 1 Prepare the translation file translation.txt of the original announcement sound and the corresponding dictionary lexicon.txt.
- the translation file needs to be segmented.
- Step 2 Pre-emphasis, framing and windowing are performed on the announcement tone data.
- Step 3 Perform discrete Fourier transform on the original announcement sound data, convert the time domain sequence of the voice stream into a frequency spectrum sequence, and extract the sound spectrum information.
- Step 4 Configure a triangular filter bank and calculate the output after filtering the signal amplitude spectrum by each triangular filter.
- Step 5 Perform a logarithmic operation on the outputs of all filters, then apply a discrete cosine transform to obtain the MFCC (Mel-frequency cepstral coefficients).
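Steps 2 to 5 above can be sketched in NumPy roughly as follows. The sample rate, frame size, filter count, and coefficient count are illustrative assumptions; a production system would use a tested signal-processing library:

```python
import numpy as np

def mfcc(signal, sr=8000, frame_len=256, hop=128, n_filters=20, n_ceps=12):
    """Minimal MFCC sketch following steps 2-5 (illustrative only)."""
    # Step 2: pre-emphasis, framing, and Hamming windowing.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # Step 3: discrete Fourier transform -> magnitude spectrum.
    mag = np.abs(np.fft.rfft(frames, frame_len))
    # Step 4: triangular mel filter bank applied to the amplitude spectrum.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, mag.shape[1]))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    energies = np.maximum(mag @ fbank.T, 1e-10)
    # Step 5: log of filter outputs, then DCT-II -> cepstral coefficients.
    log_e = np.log(energies)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return log_e @ dct.T

# One second of a 440 Hz tone as a stand-in for announcement-tone audio.
sig = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
coeffs = mfcc(sig)  # one row of cepstral coefficients per frame
```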
- MFCC-p: an environmental factor based on the MFCC.
- the second step is to determine the environmental factor of the MFCC based on the MFCC and the likelihood of the MFCC.
- the likelihood of the MFCC may be determined according to the MFCC and the likelihood function; and then the set of corresponding weights when the likelihood converges to an optimal value is determined as the environmental factor of the MFCC.
- the likelihood takes the form of a Gaussian mixture, p(x | λ) = Σ_{k=1..K} p_k N(x; u_k, σ_k), where:
- p(x | λ) is the likelihood value of the MFCC feature x;
- w_k is the weight of the k-th feature component;
- p_k is the weight of each Gaussian in the Gaussian mixture model;
- u_k and σ_k are the mean and variance of each Gaussian component, respectively.
- the third step is to combine MFCC and its corresponding environmental factor MFCC-p into feature data.
- the characteristic data of all sample data is composed to obtain a characteristic data set.
- regarding the second to fourth steps above:
- GMM: Gaussian mixture model.
- the MFCC feature achieves its recognition effect based on the difference in the human ear's sensitivity to low-frequency and high-frequency sounds. Moreover, the same sentence, even when spoken by the same person, has different characteristics in different situations; for example, the intonation, speed, and accent of a dialect spoken by different people may differ. To recognize speech more precisely, the likelihood function can be combined: through accurate analysis of attributes such as intonation, speech speed, and accent, environmental factors are added and feature correction is performed on the MFCC. Compared with the plain MFCC, the MFCC-p feature can reflect exactly these effects; combined with the original MFCC feature, it can recognize speech more accurately and in detail, especially announcement tones and other similar speech.
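The Gaussian mixture likelihood behind the MFCC-p factor can be evaluated as below. The diagonal-covariance form is an assumption consistent with the parameter list above (p_k, u_k, σ_k); fitting the weights to convergence (e.g. by EM) is omitted from this sketch:

```python
import numpy as np

def gmm_likelihood(x, weights, means, variances):
    """Evaluate p(x | lambda) = sum_k p_k * N(x; u_k, sigma_k^2) for a
    diagonal-covariance Gaussian mixture. The weight set obtained when this
    likelihood converges to its optimum is what the text calls the
    environmental factor (MFCC-p)."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    total = 0.0
    for p_k, u_k, v_k in zip(weights, means, variances):
        u_k = np.asarray(u_k, dtype=float)
        v_k = np.asarray(v_k, dtype=float)
        # Normalisation constant of a d-dimensional diagonal Gaussian.
        norm = (2.0 * np.pi) ** (-d / 2.0) / np.sqrt(np.prod(v_k))
        total += p_k * norm * np.exp(-0.5 * np.sum((x - u_k) ** 2 / v_k))
    return total

# Sanity check: one unit-variance Gaussian evaluated at its own mean gives
# the peak density (2*pi)^(-d/2), i.e. 1/(2*pi) for d = 2.
val = gmm_likelihood([0.0, 0.0], [1.0], [[0.0, 0.0]], [[1.0, 1.0]])
```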
- Step 106 Substitute the training set in the feature data set as an input parameter into the migration model for iterative training to obtain a target speech recognition model.
- the weights of the output layer of the migration model are adjusted randomly, and the weights of other layers of the migration model remain unchanged.
- the error rate of the model test after this iteration can be compared with the error rate of the model test after the previous iteration; if the error rate of this iteration is greater than the previous error rate, the learning rate is decreased by the preset amplitude; otherwise, the learning rate is increased by the preset amplitude.
- the preset amplitude may be about 5%.
- the recognition error rate is calculated continuously; the training is then terminated and a new training model is obtained.
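The error-rate-driven schedule from the preceding paragraphs can be sketched as follows. The multiplicative form of the adjustment and the 5% step are assumptions consistent with the "preset amplitude" mentioned above:

```python
def adjust_learning_rate(lr, error_rate, prev_error_rate, amplitude=0.05):
    """Decrease the learning rate by the preset amplitude (~5%) when this
    iteration's test error rate exceeds the previous one; otherwise
    increase it by the same amplitude."""
    if error_rate > prev_error_rate:
        return lr * (1.0 - amplitude)
    return lr * (1.0 + amplitude)

lr = 0.01
lr = adjust_learning_rate(lr, error_rate=0.30, prev_error_rate=0.25)  # worse -> shrink
lr = adjust_learning_rate(lr, error_rate=0.20, prev_error_rate=0.30)  # better -> grow
```

This keeps the output-layer updates responsive: the rate backs off when an iteration overshoots and speeds up again while the test error keeps falling.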
- a migration model to be trained in the target domain is created based on the original speech recognition model trained in the source domain, and a feature data set is extracted from the sample data in the target domain.
- the output layer of the model adjusts the weights randomly, and the weights of the other layers of the transfer model remain unchanged. Therefore, in the case of insufficient training data in the new scene, the use of transfer learning can speed up the generation of an effective training model, reduce costs, and improve training efficiency.
- the feature data used by the retraining layer carries environmental factors that can affect the physical properties of speech, so the target speech recognition model obtained by training has good scalability and high recognition accuracy.
- the method may include:
- Step 202 determine the voice data to be recognized
- Step 204 extracting characteristic data from the speech data, and the characteristic data carries environmental factors that can reflect the similar possibility of different sample data in terms of the same physical property of speech;
- Step 206 Substitute the feature data into the target speech recognition model to perform speech recognition, and the target speech recognition model is obtained by training based on the method of the first embodiment.
- the method of extracting characteristic data in step 204 may refer to the description in Embodiment 1 of extracting characteristic data from sample data; the details are not repeated here.
- this solution uses the language model generated by training to adjust the hyperparameters of the speech recognition model.
- the language model can be used to implement functions such as correcting homophone typos in the recognition result, which greatly improves the coupling between the speech recognition model and the announcement-tone field.
- the present disclosure allows engineers to spend less time to obtain a training model with improved recognition rate when the existing announcement sound data is insufficient and the existing model has a low recognition rate for the announcement sound.
- there is no need to spend manpower labeling a large amount of training data, no need to relabel expired data, and no need to train from scratch on already-trained data.
- Engineers only need to perform feature extraction on the newly added announcement sound data, and on the basis of the original model, train on the announcement sound feature data to obtain a better training model. Since the output layer is only trained for the newly added data, the training time can be effectively shortened.
- this method is also very flexible compared to traditional methods.
- the traditional method is to add the data of the specific field on the basis of the original data, and retrain to obtain the model.
- the results obtained after spending a lot of time training are not necessarily better.
- with the method provided by the present disclosure, not only can a model be obtained quickly, but a better recognition result can also be obtained by flexibly adjusting the initial weights of the output layer. From the perspective of engineers, the training process is simplified and training efficiency is improved; from the perspective of users, when the existing solution is not perfect, a mature solution can be obtained in less time, which greatly improves the user experience.
- the obtained target speech recognition model can be sent to the model application link.
- feature extraction is performed on the speech samples of the original model and the announcement-tone recognition model, and these feature data are mapped so that the features of the original speech and the announcement tones can be projected into a new common feature space, with the dimensionality of the data kept within a certain range (to prevent the curse of dimensionality); the pseudo-labels of the announcement-domain samples are then iteratively optimized within this new feature space until convergence.
- This pre-training model is based on a large data set.
- with the feature extractor fixed, a small learning rate is set for the migrated layers, while the newly added output layer is updated with a normal learning rate and applied to the new dataset for prediction and scoring to obtain the final model. For this newly generated model, if the recognition result is not ideal, the weights of some initial layers of the original model can also be kept unchanged while the subsequent layers are retrained to obtain new weights. In this process, the user can make multiple attempts to obtain the best training model.
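The per-layer learning-rate scheme (a small rate for migrated layers, a normal rate for the new output layer) can be sketched as a plain SGD step. The layer names and rate values here are illustrative assumptions:

```python
import numpy as np

def sgd_step(weights, grads, lr_by_layer):
    """One SGD update with a per-layer learning rate: migrated layers get a
    small rate (effectively near-frozen), the new output layer a normal
    rate, so retraining concentrates on the output layer."""
    return {name: w - lr_by_layer[name] * grads[name]
            for name, w in weights.items()}

weights = {"migrated": np.array([1.0, 1.0]), "output": np.array([1.0, 1.0])}
grads = {"migrated": np.array([1.0, 1.0]), "output": np.array([1.0, 1.0])}
lrs = {"migrated": 1e-4, "output": 1e-2}   # small vs. normal learning rate
updated = sgd_step(weights, grads, lrs)
```

With equal gradients, the migrated layer moves two orders of magnitude less than the output layer per step, which approximates the frozen-layer behaviour while still permitting fine adjustment.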
- the speech recognition model training apparatus 500 may include: a creation module 502 , an extraction module 504 and a training module 506 .
- the creation module 502 is configured to create a migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, wherein the output layer of the migration model is different from the output layer of the original speech recognition model.
- the extraction module 504 is configured to extract a feature data set from the sample data of the target domain, wherein each feature data in the feature data set carries an environmental factor that can reflect the similar possibility of different sample data in terms of the same physical properties of speech.
- the training module 506 is used for substituting the training set in the feature data set as an input parameter into the migration model for iterative training to obtain the target speech recognition model;
- the weights of the output layer of the migration model are adjusted randomly, and the weights of other layers of the migration model remain unchanged.
- the creation module 502, when creating the migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, is used to extract the parameters of the original speech recognition model trained in the source domain, where the parameters include at least the weight of each layer in the neural network model; create a new speech recognition model based on the neural network; and transfer the weights of the layers other than the output layer among the extracted parameters to the new speech recognition model as the migration model to be trained in the target domain.
- alternatively, the creation module 502, when creating the migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, is used to obtain the original speech recognition model trained in the source domain and remove the weights of its output layer to obtain the migration model to be trained in the target domain.
- the extraction module 504, when extracting the feature data set from the sample data of the target domain, is used to preprocess the sample data of the target domain respectively to obtain the Mel-frequency cepstral coefficients (MFCC); determine the environmental factor of the MFCC based on the MFCC and its likelihood; combine the MFCC and its corresponding environmental factor into feature data; and compose the feature data of all sample data into the feature data set.
- the extraction module 504, when determining the environmental factor of the MFCC based on the likelihood of the MFCC, is used to determine the likelihood of the MFCC according to the MFCC and the likelihood function, and to determine the set of corresponding weights when the likelihood converges to the optimal value as the environmental factor of the MFCC.
- the likelihood takes the form of a Gaussian mixture, p(x | λ) = Σ_{k=1..K} p_k N(x; u_k, σ_k), where:
- p(x | λ) is the likelihood value of the MFCC feature x;
- w_k is the weight of the k-th feature component;
- p_k is the weight of each Gaussian in the Gaussian mixture model;
- u_k and σ_k are the mean and variance of each Gaussian component, respectively.
- the learning rate set for the output layer is lower than the learning rate set for other layers.
- a comparison module configured to compare the error rate of the model test after this iteration with the error rate of the model test after the previous iteration; if the error rate of this iteration is greater than the previous error rate, the learning rate is decreased by the preset amplitude; otherwise, the learning rate is increased by the preset amplitude.
- a migration model to be trained in the target domain is created based on the original speech recognition model trained in the source domain, and a feature data set is extracted from the sample data in the target domain.
- the output layer of the model adjusts the weights randomly, and the weights of the other layers of the transfer model remain unchanged. Therefore, in the case of insufficient training data in the new scene, the use of transfer learning can speed up the generation of an effective training model, reduce costs, and improve training efficiency.
- the feature data used by the retraining layer carries environmental factors that can affect the physical properties of speech, so the target speech recognition model obtained by training has good scalability and high recognition accuracy.
- the apparatus 600 may include: a determination module 602 , an extraction module 604 , and an identification module 606 .
- the determining module 602 is used for determining the speech data to be recognized.
- the extraction module 604 is used for extracting characteristic data from the speech data, where the characteristic data carries environmental factors that can reflect the similar possibility of different sample data in terms of the same speech physical property.
- the recognition module 606 is used for substituting the feature data into the target speech recognition model for speech recognition, and the target speech recognition model is obtained by training based on the method of the first embodiment.
- a migration model to be trained in the target domain is created based on the original speech recognition model trained in the source domain, and a feature data set is extracted from the sample data in the target domain.
- the output layer of the model adjusts the weights randomly, and the weights of the other layers of the transfer model remain unchanged. Therefore, in the case of insufficient training data in the new scene, the use of transfer learning can speed up the generation of an effective training model, reduce costs, and improve training efficiency.
- the feature data used by the retraining layer carries environmental factors that can affect the physical properties of speech, so the target speech recognition model obtained by training has good scalability and high recognition accuracy.
- the present disclosure also provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor: creates a migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, where the output layer of the migration model is different from the output layer of the original speech recognition model; extracts a feature data set from the sample data in the target domain, where each feature data in the feature data set carries an environmental factor that can reflect the similar possibility of different sample data in terms of the same physical properties of speech; and substitutes the training set in the feature data set into the migration model as input parameters for iterative training to obtain the target speech recognition model; wherein, in the iterative training process, the weights of the output layer of the migration model are adjusted randomly, and the weights of the other layers of the migration model remain unchanged.
- a computer-readable storage medium stores one or more programs, and the one or more programs, when executed by a server including a plurality of application programs, cause the server to perform the following operations:
- the original speech recognition model obtained by source-domain training is used to create a migration model to be trained in the target domain, wherein the output layer of the migration model is different from the output layer of the original speech recognition model;
- the feature dataset is extracted from the sample data of the target domain, wherein,
- Each feature data in the feature data set carries environmental factors that can reflect the similar possibility of different sample data in terms of the same physical properties of speech;
- the training set in the feature data set is substituted into the migration model as input parameters for iterative training, and the target speech recognition model is obtained;
- the output layer of the migration model adjusts the weights randomly, and the weights of other layers of the migration model remain unchanged.
- A migration model to be trained in the target domain is created based on the original speech recognition model trained in the source domain, and a feature data set is extracted from the sample data in the target domain.
- During iterative training, the output layer of the migration model adjusts its weights randomly while the weights of the other layers of the migration model remain unchanged. Therefore, when training data in a new scenario is insufficient, transfer learning can speed up the generation of an effective training model, reduce costs, and improve training efficiency.
- The feature data used by the retrain layer carries environmental factors that can affect the physical properties of speech, so the target speech recognition model obtained by training has good scalability and high recognition accuracy.
- The present disclosure also provides another electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program, when executed by the processor, performing the following operations: determining speech data to be recognized; extracting feature data from the speech data, the feature data carrying an environmental factor that can reflect how likely different sample data are to be similar in terms of the same physical properties of speech; and substituting the feature data into a target speech recognition model for speech recognition, the target speech recognition model being trained by the method of the first embodiment.
- The computer-readable storage medium stores one or more programs, and the one or more programs, when executed by a server including a plurality of application programs, cause the server to perform the following operations: determining speech data to be recognized; extracting feature data from the speech data, the feature data carrying an environmental factor that can reflect how likely different sample data are to be similar in terms of the same physical properties of speech; and substituting the feature data into a target speech recognition model for speech recognition, the target speech recognition model being trained by the method of the first embodiment.
- A migration model to be trained in the target domain is created based on the original speech recognition model trained in the source domain, and a feature data set is extracted from the sample data in the target domain.
- During iterative training, the output layer of the migration model adjusts its weights randomly while the weights of the other layers of the migration model remain unchanged. Therefore, when training data in a new scenario is insufficient, transfer learning can speed up the generation of an effective training model, reduce costs, and improve training efficiency.
- The feature data used by the retrain layer carries environmental factors that can affect the physical properties of speech, so the target speech recognition model obtained by training has good scalability and high recognition accuracy.
- A typical implementation device is a computer.
- The computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
- Computer-readable storage media include both persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology.
- Information may be computer-readable instructions, data structures, program modules, or other data.
- Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
- computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
The present disclosure claims priority to Chinese patent application CN202011629211.3, filed on December 31, 2020 and entitled "Speech Recognition Model Training Method, Speech Recognition Method and Corresponding Device", the entire contents of which are incorporated into this disclosure by reference.
The present disclosure relates to the technical field of artificial intelligence, and in particular, to a speech recognition model training method, a speech recognition method and corresponding devices.
With the advent of the 5G era of large bandwidth and low latency, applications of real-time media such as audio and video are bound to become mainstream, and the massive data they generate is of great help in analyzing user habits.
Under the traditional machine learning framework, the basic process is to learn a classification model from sufficient training data and then use this model to classify and predict test data. However, with the advent of the 5G era, various applications related to real-time media have exploded, bringing with them a refinement into many sub-fields, and large amounts of training data for these newly divided fields are very difficult to obtain. How to use this limited sub-field data to effectively analyze user habits and make targeted optimizations to improve the user experience is a huge challenge for contact center products.
At present, the traditional training process of machine learning requires a large amount of training data, and for some sub-fields (such as announcement tones) it is difficult to obtain sufficient training data. Even if a large amount of training data is obtained, a lot of manpower and material resources are needed to label it. A contact center has many offices, and the announcement tone data differs between offices, so an existing model has a low recognition rate in new scenarios. The previous practice was to train from scratch after adding new announcement tones to the previous training data; this not only takes a long time, but also yields mediocre recognition because the newly added data has less prominent characteristics relative to the previous massive data. Although the training data of a new scenario differs more or less from the source data, some instances in the source training data are still suitable for the new application scenario, so directly discarding the existing training model and retraining from scratch causes repeated training of some data and a great waste of resources.
Therefore, how to improve the training efficiency of speech recognition models and the efficiency of speech recognition has become an urgent problem to be solved.
SUMMARY OF THE INVENTION
The purpose of one or more embodiments of the present disclosure is to provide a speech recognition model training method, a speech recognition method and corresponding devices, which can reduce the training cost and training time of a speech recognition model and improve training efficiency and the model recognition rate.
To solve the above technical problems, one or more embodiments of the present disclosure include the following first to eighth aspects.
In a first aspect, a speech recognition model training method is provided, including: creating a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, where the output layer of the migration model is different from the output layer of the original speech recognition model; extracting a feature data set from sample data in the target domain, where each feature data in the feature data set carries an environmental factor that can reflect how likely different sample data are to be similar in terms of the same physical properties of speech; and substituting the training set in the feature data set into the migration model as input parameters for iterative training to obtain a target speech recognition model, where, during the iterative training, the output layer of the migration model adjusts its weights randomly while the weights of the other layers of the migration model remain unchanged.
In a second aspect, a speech recognition method is proposed, including: determining speech data to be recognized; extracting feature data from the speech data, the feature data carrying an environmental factor that can reflect how likely different sample data are to be similar in terms of the same physical properties of speech; and substituting the feature data into a target speech recognition model for speech recognition, the target speech recognition model being trained by the speech recognition model training method of the first aspect.
In a third aspect, a speech recognition model training apparatus is proposed, including: a creation module configured to create a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, where the output layer of the migration model is different from the output layer of the original speech recognition model; an extraction module configured to extract a feature data set from sample data in the target domain, where each feature data in the feature data set carries an environmental factor that can reflect how likely different sample data are to be similar in terms of the same physical properties of speech; and a training module configured to substitute the training set in the feature data set into the migration model as input parameters for iterative training to obtain a target speech recognition model, where, during the iterative training, the output layer of the migration model adjusts its weights randomly while the weights of the other layers of the migration model remain unchanged.
In a fourth aspect, a speech recognition apparatus is proposed, including: a determination module configured to determine speech data to be recognized; an extraction module configured to extract feature data from the speech data, the feature data carrying an environmental factor that can reflect how likely different sample data are to be similar in terms of the same physical properties of speech; and a recognition module configured to substitute the feature data into a target speech recognition model for speech recognition, the target speech recognition model being trained by the speech recognition model training method of the first aspect.
In a fifth aspect, an electronic device is proposed, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program, when executed by the processor, performing the following operations: creating a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, where the output layer of the migration model is different from the output layer of the original speech recognition model; extracting a feature data set from sample data in the target domain, where each feature data in the feature data set carries an environmental factor that can reflect how likely different sample data are to be similar in terms of the same physical properties of speech; and substituting the training set in the feature data set into the migration model as input parameters for iterative training to obtain a target speech recognition model, where, during the iterative training, the output layer of the migration model adjusts its weights randomly while the weights of the other layers of the migration model remain unchanged.
In a sixth aspect, a computer-readable storage medium is proposed, the computer-readable storage medium storing one or more programs which, when executed by a server including a plurality of application programs, cause the server to perform the following operations: creating a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, where the output layer of the migration model is different from the output layer of the original speech recognition model; extracting a feature data set from sample data in the target domain, where each feature data in the feature data set carries an environmental factor that can reflect how likely different sample data are to be similar in terms of the same physical properties of speech; and substituting the training set in the feature data set into the migration model as input parameters for iterative training to obtain a target speech recognition model, where, during the iterative training, the output layer of the migration model adjusts its weights randomly while the weights of the other layers of the migration model remain unchanged.
In a seventh aspect, an electronic device is proposed, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program, when executed by the processor, performing the following operations: determining speech data to be recognized; extracting feature data from the speech data, the feature data carrying an environmental factor that can reflect how likely different sample data are to be similar in terms of the same physical properties of speech; and substituting the feature data into a target speech recognition model for speech recognition, the target speech recognition model being trained by the speech recognition model training method of the first aspect.
In an eighth aspect, a computer-readable storage medium is proposed, the computer-readable storage medium storing one or more programs which, when executed by a server including a plurality of application programs, cause the server to perform the following operations: determining speech data to be recognized; extracting feature data from the speech data, the feature data carrying an environmental factor that can reflect how likely different sample data are to be similar in terms of the same physical properties of speech; and substituting the feature data into a target speech recognition model for speech recognition, the target speech recognition model being trained by the speech recognition model training method of the first aspect.
In order to more clearly illustrate the technical solutions in one or more embodiments or some situations of the present disclosure, the accompanying drawings needed in the description of the one or more embodiments or some situations are briefly introduced below. Obviously, the drawings in the following description are only some embodiments described in the present disclosure, and those of ordinary skill in the art can also obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of the steps of a speech recognition model training method provided by the present disclosure.
FIG. 2 is a schematic diagram of the steps of a speech recognition method provided by the present disclosure.
FIG. 3a and FIG. 3b are schematic diagrams of recognition results before and after training, respectively, provided by the present disclosure.
FIG. 4 is a flowchart of speech recognition model training and application provided by the present disclosure, taking announcement tones as an example.
FIG. 5 is a schematic structural diagram of a speech recognition model training apparatus provided by the present disclosure.
FIG. 6 is a schematic structural diagram of a speech recognition apparatus provided by the present disclosure.
FIG. 7 is a schematic structural diagram of an electronic device provided by the present disclosure.
In order to enable those skilled in the art to better understand the technical solutions in the present disclosure, the technical solutions in one or more embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings in one or more embodiments of the present disclosure. Obviously, the described embodiment or embodiments are only some, but not all, of the embodiments of the present disclosure. Based on one or more embodiments in the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of this document.
Before that, technical terms that may be involved in the present disclosure are briefly introduced.
Frozen layers: layers whose parameters are fixed and remain unchanged during training.
Retrain layers: the last few layers of the neural network are removed so that it can be trained on a new data set.
Learning rate: used to control the speed of parameter updates during gradient descent. If the learning rate is too small, convergence can be guaranteed, but the optimization speed of the model is greatly reduced; if the learning rate is too large, the global optimum may be overshot, causing the parameters to "oscillate".
Announcement tone: a prompt tone that appears during a telephone call.
The main purpose of the present disclosure is to provide a solution for different application scenarios that, when training data in a new scenario is insufficient, uses transfer learning to speed up the generation of an effective training model, reducing costs and improving training efficiency. At the same time, during training, the feature data used by the retrain layer carries environmental factors that can affect the physical properties of speech, so the target speech recognition model obtained by training has good scalability and high recognition accuracy.
It should be noted that the training and recognition schemes of the speech recognition model involved in the present disclosure are not limited to announcement-tone recognition scenarios, and can also be applied to other speech recognition scenarios in which the physical properties of speech differ.
Embodiment 1
Referring to FIG. 1, a schematic diagram of the steps of a speech recognition model training method provided by the present disclosure, the method may include the following steps 102 to 106.
Step 102: create a migration model to be trained in the target domain based on an original speech recognition model trained in the source domain, where the output layer of the migration model is different from the output layer of the original speech recognition model.
The source domain may be the feature range formed by the sample data on which the original speech recognition model was trained. The sample data within this feature range is known and is available in sufficient quantity to train a high-quality model.
The target domain can be understood as the domain to be learned; the amount of sample data in this domain is limited, or relatively small, so a high-quality model cannot be trained from it alone.
Here, step 102 uses the original speech recognition model trained in the source domain for transfer learning, so that the small amount of sample data in the target domain can be trained based on the transferred model parameters.
In an exemplary embodiment, when creating the migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, step 102 may be implemented in the following two ways.
Method one
1.1. Extract the parameters of the original speech recognition model trained in the source domain, where the parameters include at least the weight of each layer of the neural network model.
1.2. Create a new speech recognition model based on the neural network, and transfer the weights of the fully connected layers other than the output layer among the extracted parameters to the new speech recognition model, which serves as the migration model to be trained in the target domain.
It should be understood that the speech recognition model involved in the present disclosure is a deep neural network model. In implementation, the parameters of the original speech recognition model can first be obtained and the model loaded, mainly the weight matrix of each layer of the neural network. Then, a new model is created by re-initialization, the weights obtained above are placed correspondingly into the new model, and the parameters of the last layer are removed.
Method two
2.1. Obtain the original speech recognition model trained in the source domain.
2.2. Remove the weights of the output layer from the original speech recognition model to obtain the migration model to be trained in the target domain.
The difference from method one is that no new speech recognition model is created; instead, the obtained original speech recognition model is processed directly, and the original speech recognition model with the output layer weights removed serves as the migration model to be trained in the target domain.
It should be understood that in the migration model obtained by either method one or method two, the output layer is equivalent to a retrain layer, which is trained by transfer mainly on the small number of speech samples in the target domain, while the other fully connected layers serve as frozen layers whose parameters do not change during model training.
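As a concrete illustration of the two methods above, the weight-transfer step can be sketched with plain NumPy arrays. This is a minimal sketch under assumed conventions (a model represented as an ordered dict of (W, b) pairs whose last entry is the output layer; the function name `create_migration_model` is our own), not the patent's actual implementation:

```python
import numpy as np

def create_migration_model(source_weights, num_target_classes, seed=0):
    """Build the weight set of a migration model from a source-domain model.

    source_weights: ordered dict {layer_name: (W, b)} for the source model.
    The last entry is assumed to be the output layer. Its weights are dropped
    and re-initialized for the target domain; all other layers are copied
    unchanged (they will act as frozen layers during training).
    """
    rng = np.random.default_rng(seed)
    names = list(source_weights)
    hidden_names, out_name = names[:-1], names[-1]

    # Copy every fully connected layer except the output layer as-is.
    target_weights = {name: source_weights[name] for name in hidden_names}

    # Re-initialize the output layer: keep the input width of the old
    # output layer, but use the class count of the target domain.
    old_W, _ = source_weights[out_name]
    in_dim = old_W.shape[0]
    target_weights[out_name] = (
        rng.standard_normal((in_dim, num_target_classes)) * 0.01,
        np.zeros(num_target_classes),
    )
    frozen = set(hidden_names)  # layers whose weights stay fixed
    return target_weights, frozen
```

During subsequent training, only the re-initialized output-layer weights would be updated; the names in `frozen` mark the layers to be held fixed.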
Step 104: extract a feature data set from the sample data of the target domain, where each feature data in the feature data set carries an environmental factor that can reflect how likely different sample data are to be similar in terms of the same physical properties of speech.
The physical properties of speech mainly include four types of attribute elements: pitch, intensity, duration and timbre; colloquially, they can be understood as intonation, accent and so on.
In the present disclosure, extracting the feature data set from the sample data of the target domain in step 104 includes the following.
In the first step, the sample data of the target domain is preprocessed to obtain MFCC features.
The preprocessing of the sample data can be implemented according to the following steps 1 to 5.
Step 1: prepare the translation file translation.txt of the original announcement tones and the corresponding dictionary lexicon.txt; the translation file needs to be segmented into words.
Step 2: perform pre-emphasis, framing and windowing on the announcement tone data.
Step 3: perform a discrete Fourier transform on the original announcement tone data to convert the time-domain sequence of the voice stream into a spectral sequence and extract the sound spectrum information.
Step 4: configure a triangular filter bank and calculate the output of each triangular filter after filtering the signal amplitude spectrum.
Step 5: perform a logarithmic operation on the outputs of all filters, and further apply a discrete cosine transform to obtain the MFCC (Mel Frequency Cepstrum Coefficients).
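Steps 2 to 5 above can be sketched as a condensed NumPy pipeline. All concrete values here (the 0.97 pre-emphasis coefficient, frame sizes, the 256-point FFT, and the filter counts) are illustrative assumptions rather than values fixed by the disclosure:

```python
import numpy as np

def mfcc(signal, sr=8000, frame_len=200, hop=80, n_mels=26, n_ceps=13):
    """Condensed MFCC pipeline following steps 2-5 above:
    pre-emphasis -> framing + Hamming window -> DFT power spectrum
    -> triangular mel filter bank -> log -> DCT."""
    # Step 2: pre-emphasis, framing, windowing
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)

    # Step 3: discrete Fourier transform -> power spectrum
    n_fft = 256
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Step 4: triangular mel filter bank applied to the spectrum
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    filtered = power @ fbank.T

    # Step 5: log of the filter outputs, then a type-II DCT
    log_e = np.log(filtered + 1e-10)
    n = np.arange(n_mels)
    dct_mat = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                      (2 * n + 1) / (2 * n_mels)))
    return log_e @ dct_mat.T  # shape: (n_frames, n_ceps)
```

The DCT is written out directly so the sketch depends only on NumPy; a library implementation (e.g. a dedicated audio toolkit) would normally be used in practice.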
Generally speaking, MFCC feature extraction for speech recognition is basically complete at this point, but in order to minimize the influence of external conditions on the recognition results, we propose an environmental factor based on the MFCC, tentatively named MFCC-p.
In the second step, the environmental factor of the MFCC is determined based on the MFCC and its likelihood.
In an exemplary embodiment, the likelihood of the MFCC may be determined according to the MFCC and a likelihood function; the set of weights at which the likelihood converges to its best value is then determined as the environmental factor of the MFCC.
The MFCC of any sample data can be denoted as X, where X = {x_1, x_2, ..., x_T} contains T feature components. The likelihood of the MFCC is calculated according to the following likelihood function:
p(x_t | λ) = Σ_{k=1}^{K} w_k N(x_t; u_k, Σ_k)
where p(x_t | λ) is the likelihood value of the MFCC, w_k is the weight of the k-th Gaussian component in the Gaussian mixture model, λ = {w_k, u_k, Σ_k} is the parameter set of the Gaussian model, and u_k and Σ_k are the mean and covariance of each Gaussian component, respectively.
In the third step, the MFCC and its corresponding environmental factor MFCC-p are combined into feature data; when combining, the two feature dimensions can be joined together.
In the fourth step, the feature data of all sample data together form the feature data set.
In fact, the second to fourth steps may include the following.
The MFCC feature corresponding to an extracted piece of speech data is X, where X = {x_1, x_2, ..., x_T}, and assuming the dimension is D, its corresponding likelihood can be calculated by the likelihood function above.
From the GMM (Gaussian mixture model) it is known that the extracted features can be obtained by weighting k Gaussian density functions, where the mean u_k of each Gaussian component has size 1×D, its covariance Σ_k has size D×D, and the set of weights is λ = {w_k, u_k, Σ_k}. Given an initial value set for λ and a basic learning rate, a neural network is used to continuously reduce the Euclidean distance between the weighted result of the Gaussian components and the original features, and finally the new feature MFCC-p is obtained. Adding our newly extracted influence-factor feature MFCC-p to the original MFCC gives the features needed for the training below.
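The per-frame GMM likelihood p(x_t | λ) described above can be evaluated as follows. This is a sketch that assumes diagonal covariances for simplicity (the disclosure's Σ_k is a full D×D matrix); the function name and array layout are our own:

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Per-frame log-likelihood log p(x_t | lambda) under a GMM with
    diagonal covariances: p(x_t | lambda) = sum_k w_k N(x_t; u_k, var_k).

    X: (T, D) MFCC frames; weights: (K,); means, variances: (K, D)."""
    T, D = X.shape
    # log N(x_t; u_k, diag(var_k)) for every frame/component pair -> (T, K)
    diff = X[:, None, :] - means[None, :, :]               # (T, K, D)
    log_norm = -0.5 * (
        D * np.log(2 * np.pi)
        + np.sum(np.log(variances), axis=1)[None, :]
        + np.sum(diff ** 2 / variances[None, :, :], axis=2)
    )
    # log sum_k w_k N(...) via log-sum-exp for numerical stability
    a = np.log(weights)[None, :] + log_norm
    m = a.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=1, keepdims=True))).ravel()
```

Summing these per-frame values over t gives the total log-likelihood of the utterance, which is the quantity driven toward its best value when fitting the weight set λ.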
MFCC features achieve recognition based on the fact that the human ear has different sensitivity to low-frequency and high-frequency sounds. However, the same sentence, even spoken by the same person, has different characteristics in different situations; for example, the intonation, speaking rate and accent with which different people speak a dialect may all differ. In order to recognize speech in a more refined and accurate way, the likelihood function can be combined: through accurate analysis of attributes such as intonation, speaking rate and accent, an environmental factor is added to correct the MFCC features. The MFCC-p feature precisely reflects these influences, and combined with the original MFCC features it can recognize speech, especially announcement tones and similar speech, more accurately and in finer detail.
Step 106: substitute the training set of the feature data set into the migration model as input parameters for iterative training, obtaining the target speech recognition model.

During the iterative training, the weights of the migration model's output layer are adjusted randomly, while the weights of the other layers of the migration model remain unchanged.

Different learning rates are set for data sets of different sizes. Since the frozen layers were obtained from a huge amount of speech data, a normal learning rate is set for them, whereas the newly added retrain layers are trained on the much smaller announcement-tone data and need a smaller learning rate. Therefore, when the migration model is trained iteratively, the learning rate set for the output layer is lower than the learning rate set for the other layers.
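The per-layer learning-rate assignment above can be sketched as follows; the function, its name and the 0.1 scale factor are illustrative assumptions rather than values from the patent, and `freeze_transferred` covers the case where the transferred layers' weights are held fixed entirely.

```python
def layer_learning_rates(layer_names, base_lr, output_scale=0.1,
                         freeze_transferred=False):
    # Assign one learning rate per layer: the retrained output layer
    # gets a lower rate than the transferred layers, following the
    # scheme described above.
    rates = {}
    for name in layer_names:
        if name == "output":
            rates[name] = base_lr * output_scale   # smaller rate for retraining
        elif freeze_transferred:
            rates[name] = 0.0                      # weights held fixed
        else:
            rates[name] = base_lr                  # normal rate
    return rates
```

In a framework such as PyTorch the same idea would be expressed with per-parameter-group learning rates in the optimizer.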
It should be understood that the learning rate is adjusted in each iteration. Throughout the iterative training, after each iteration completes, the error rate of the model test after this iteration can be compared with the error rate of the model test after the previous iteration: if the current error rate is greater than the previous one, the learning rate is lowered by a preset amount; if the current error rate is less than or equal to the previous one, the learning rate is raised by the preset amount. The preset amount may be about 5%.

The recognition error rate is computed continuously during the iterations; once the error rate falls below the preset accuracy, training is terminated and the new training model is obtained.
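The adjustment rule and stopping criterion above can be sketched as follows, assuming an `eval_error` callback that runs one training iteration at the given learning rate and returns the test error rate; all names and default values are illustrative assumptions.

```python
def train_with_adaptive_lr(eval_error, lr=0.01, step=0.05,
                           target=0.05, max_iters=100):
    # step is the ~5% adjustment amount described above; target is the
    # preset accuracy that terminates training.
    prev_err = float("inf")
    for _ in range(max_iters):
        err = eval_error(lr)
        if err < target:
            break                # error below preset accuracy: stop
        if err > prev_err:
            lr *= 1 - step       # error went up: lower the learning rate
        else:
            lr *= 1 + step       # error did not increase: raise it
        prev_err = err
    return lr, err
```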
As can be seen from the above technical solution, a migration model to be trained in the target domain is created based on the original speech recognition model trained in the source domain, and a feature data set is extracted from the sample data of the target domain, each feature data item in the set carrying an environmental factor that reflects how likely different sample data are to be similar in the same physical properties of speech; the training set of the feature data set is substituted into the migration model as input parameters for iterative training to obtain the target speech recognition model, wherein during the iterative training the weights of the migration model's output layer are adjusted randomly while the weights of the other layers remain unchanged. Thus, when training data for a new scenario are insufficient, transfer learning offers a scheme that speeds up the generation of an effective training model, reduces cost and improves training efficiency; at the same time, because the feature data used by the retrained layer during training carry environmental factors that can affect the physical properties of speech, the trained target speech recognition model has good extensibility and high recognition accuracy.
Embodiment 2

Referring to FIG. 2, a schematic diagram of the steps of a speech recognition method provided by the present disclosure, the method may include:

Step 202: determining the speech data to be recognized;

Step 204: extracting feature data from the speech data, the feature data carrying an environmental factor that reflects how likely different sample data are to be similar in the same physical properties of speech;

Step 206: substituting the feature data into the target speech recognition model for speech recognition, the target speech recognition model being trained by the method of Embodiment 1.
For the way feature data are extracted in step 204, refer to the description of extracting feature data from sample data in Embodiment 1; it is not repeated here.

Compared with the traditional transfer-learning process, in order to improve the recognition rate of announcement tones, this solution uses the language model generated by training to adjust the hyperparameters of the speech recognition model. Typically, the language model can implement functions such as correcting homophone typos, for example correcting "在拨" in a recognition result to "再拨" ("dial again"), which greatly improves the coupling between the speech recognition model and the announcement-tone field.
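A minimal sketch of such a correction step is shown below. A real system would rescore recognition hypotheses with the trained language model; the direct substitution table here is an illustrative stand-in, and the mapping entries are assumptions drawn from the example above.

```python
def correct_homophones(text, corrections):
    # Post-process a recognition result with domain-specific homophone
    # corrections (wrong form -> preferred form).
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text
```

For example, `correct_homophones("请稍后在拨", {"在拨": "再拨"})` returns the corrected announcement text "请稍后再拨".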
The recognition results before and after training are shown in FIG. 3a and FIG. 3b: of 40 announcement tones sent back from the field that the original training model could not recognize, the new training model recognizes 33 of them, a significant improvement in speech recognition efficiency.
To sum up, when the existing announcement-tone data are insufficient and the existing model has a low recognition rate for announcement tones, the present disclosure lets engineers spend comparatively little time obtaining a training model with an improved recognition rate. There is no need to spend manpower labelling large amounts of training data, to relabel expired data, or to retrain already-trained data from scratch. Engineers only need to extract features from the newly added announcement-tone data and, on the basis of the original model, train on the announcement-tone feature data to obtain a better training model. Because only the output layer is trained on the newly added data, the training time is effectively shortened. Moreover, this method is far more flexible than traditional methods. When the current training model performs poorly in a specific field, the traditional approach is to add data of that field to the original data and retrain to obtain a model; the result of all that training time is not necessarily better. With the method provided by the present disclosure, not only can a model be obtained quickly, but a better recognition result can also be obtained by flexibly adjusting the initial weights of the output layer.

From the perspective of engineers, the training process is simplified and training efficiency is improved; from the perspective of users, a mature solution can be obtained in less time even when the existing solution is imperfect, greatly improving the user experience.
Embodiment 3

Referring to FIG. 4, the overall flow of model training and application is introduced, taking announcement tones as an example.

Model training

-- Extract MFCC+MFCC-p from the announcement-tone sample data; for the extraction method, refer to the description in Embodiment 1.

-- Remove the last layer of the fully connected layers of the original model to create the migration model.

In fact, the extraction of MFCC+MFCC-p and the creation of the migration model can be performed in either order, or simultaneously, without interfering with each other.
-- Keep the weights of the frozen layers unchanged and adjust the weights of the retrain layers randomly.

-- Train the migration model with the training set of the extracted MFCC+MFCC-p features.

-- After each iteration, judge whether the model-test error rate is below the threshold; if so, end training and obtain the target speech recognition model; otherwise, continue iterative training.
The obtained target speech recognition model can then be passed to the model application stage.

Model application

-- Extract MFCC+MFCC-p from the announcement-tone data to be checked.

-- Feed the extracted MFCC+MFCC-p into the target speech recognition model trained in the stage above for speech recognition.
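The two stages above can be sketched end to end as follows. Every name and value here is an illustrative assumption: models are reduced to dicts of layer name to a single weight, `error_of` stands in for one train-and-test iteration, and the threshold is an assumed preset accuracy.

```python
def train_target_model(original_model, error_of):
    # Keep every transferred layer, re-initialise and retrain only the
    # output layer until the test error rate drops below the threshold.
    model = {name: w for name, w in original_model.items() if name != "output"}
    model["output"] = 0.0          # random initialisation in practice
    threshold = 0.1                # preset accuracy (assumed value)
    while error_of(model) >= threshold:
        model["output"] += 0.1     # stands in for one weight update
    return model

def recognize(model, features):
    # Model application: feed the MFCC+MFCC-p features of the data to
    # be checked into the trained target model.
    return [f * model["output"] for f in features]
```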
In the training phase, features are extracted from the speech samples of the original model and of the announcement-tone recognition model, and these feature data are mapped so that the features of the original speech and of the announcement tones can be projected into a new common feature space while the dimensionality of the data is kept within a certain range (to avoid the curse of dimensionality); the pseudo-labels of the announcement-tone-domain samples are then optimized iteratively in this new feature space until convergence. The pre-trained model was obtained on a large data set, and because ordinary speech and announcement tones are highly similar in their features, its structure and weights can be used directly: the DNN with the output layer removed serves as a fixed feature extractor, a small learning rate is set for the transferred layers, and the newly added output layer is updated at a normal learning rate and applied to the new data set; the final model is obtained by predicting and scoring on the new announcement-tone data set. If the recognition result of this newly generated model is not ideal, the weights of some initial layers of the original model can also be kept unchanged while the later layers are retrained to obtain new weights. The user can make several attempts at this process so as to arrive at the best training model.
Embodiment 4

Referring to FIG. 5, a schematic structural diagram of a speech recognition model training apparatus provided by the present disclosure, the speech recognition model training apparatus 500 may include: a creation module 502, an extraction module 504 and a training module 506.

The creation module 502 is configured to create, based on the original speech recognition model trained in the source domain, the migration model to be trained in the target domain, wherein the output layer of the migration model differs from the output layer of the original speech recognition model.

The extraction module 504 is configured to extract a feature data set from the sample data of the target domain, wherein each feature data item in the feature data set carries an environmental factor that reflects how likely different sample data are to be similar in the same physical properties of speech.

The training module 506 is configured to substitute the training set of the feature data set into the migration model as input parameters for iterative training to obtain the target speech recognition model;

wherein, during the iterative training, the weights of the migration model's output layer are adjusted randomly while the weights of the other layers of the migration model remain unchanged.
In one implementable solution, when creating the migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, the creation module 502 is configured to extract the parameters of the original speech recognition model trained in the source domain, the parameters including at least the weight of every layer of the neural network model; to create a new speech recognition model based on the neural network; and to migrate, from the extracted parameters, the weights of all layers except the output layer to the new speech recognition model, which serves as the migration model to be trained in the target domain.

In another implementable solution, when creating the migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, the creation module 502 is configured to obtain the original speech recognition model trained in the source domain and to remove the weights of its output layer, thereby obtaining the migration model to be trained in the target domain.
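This second variant can be sketched in a few lines, with the model's parameters represented as a dict of layer name to weights; the representation and the layer name "output" are illustrative assumptions.

```python
def create_migration_model(original_params):
    # Drop the output layer's weights from the source-domain model's
    # parameters; the remaining layers form the migration model to be
    # trained in the target domain.
    return {name: w for name, w in original_params.items() if name != "output"}
```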
In yet another implementable solution of the present disclosure, when extracting the feature data set from the sample data of the target domain, the extraction module 504 is configured to preprocess the sample data of the target domain to obtain Mel-frequency cepstral coefficients (MFCC); to determine the environmental factor of the MFCC based on the MFCC and its likelihood; and to combine the MFCC and its corresponding environmental factor into feature data, the feature data of all sample data making up the feature data set.

In yet another implementable solution of the present disclosure, when determining the environmental factor of the MFCC based on its likelihood, the extraction module 504 is configured to determine the likelihood of the MFCC according to the MFCC and the likelihood function, and to take the weight set at which the likelihood converges to its optimal value as the environmental factor of the MFCC.
In yet another implementable solution of the present disclosure, when determining the likelihood of the MFCC according to the MFCC and the likelihood function, the extraction module 504 is configured to determine the MFCC of any sample data as X, where X = {x_1, x_2, ... x_T} and X contains T feature components, and to compute the likelihood of the MFCC according to the following likelihood function:

p(x_t|λ) = Σ_{k=1}^{K} w_k p_k(x_t), t = 1, 2, ..., T

where p(x_t|λ) is the likelihood value of the MFCC, w_k is the weight of the k-th Gaussian component, p_k is the k-th Gaussian density in the Gaussian mixture model, and u_k and Σ_k are the mean and covariance of the k-th Gaussian component, respectively.
In yet another implementable solution of the present disclosure, when the migration model is trained iteratively, the learning rate set for the output layer is lower than the learning rate set for the other layers.

In yet another implementable solution of the present disclosure, after each iteration of the iterative training completes, the apparatus further includes a comparison module configured to compare the error rate of the model test after this iteration with the error rate of the model test after the previous iteration: if the current error rate is greater than the previous one, the learning rate is lowered by a preset amount; otherwise, the learning rate is raised by the preset amount.

In yet another implementable solution of the present disclosure, during the iterative training, when the error rate falls below the threshold, training ends after the current iteration completes.
As can be seen from the above technical solution, a migration model to be trained in the target domain is created based on the original speech recognition model trained in the source domain, and a feature data set is extracted from the sample data of the target domain, each feature data item in the set carrying an environmental factor that reflects how likely different sample data are to be similar in the same physical properties of speech; the training set of the feature data set is substituted into the migration model as input parameters for iterative training to obtain the target speech recognition model, wherein during the iterative training the weights of the migration model's output layer are adjusted randomly while the weights of the other layers remain unchanged. Thus, when training data for a new scenario are insufficient, transfer learning offers a scheme that speeds up the generation of an effective training model, reduces cost and improves training efficiency; at the same time, because the feature data used by the retrained layer during training carry environmental factors that can affect the physical properties of speech, the trained target speech recognition model has good extensibility and high recognition accuracy.
Embodiment 5

Referring to FIG. 6, a schematic structural diagram of a speech recognition apparatus provided by the present disclosure, the apparatus 600 may include: a determination module 602, an extraction module 604 and a recognition module 606.

The determination module 602 is configured to determine the speech data to be recognized.

The extraction module 604 is configured to extract feature data from the speech data, the feature data carrying an environmental factor that reflects how likely different sample data are to be similar in the same physical properties of speech.

The recognition module 606 is configured to substitute the feature data into the target speech recognition model for speech recognition, the target speech recognition model being trained by the method of Embodiment 1.
As can be seen from the above technical solution, a migration model to be trained in the target domain is created based on the original speech recognition model trained in the source domain, and a feature data set is extracted from the sample data of the target domain, each feature data item in the set carrying an environmental factor that reflects how likely different sample data are to be similar in the same physical properties of speech; the training set of the feature data set is substituted into the migration model as input parameters for iterative training to obtain the target speech recognition model, wherein during the iterative training the weights of the migration model's output layer are adjusted randomly while the weights of the other layers remain unchanged. Thus, when training data for a new scenario are insufficient, transfer learning offers a scheme that speeds up the generation of an effective training model, reduces cost and improves training efficiency; at the same time, because the feature data used by the retrained layer during training carry environmental factors that can affect the physical properties of speech, the trained target speech recognition model has good extensibility and high recognition accuracy.
Embodiment 6

The present disclosure further provides an electronic device, including a memory, a processor and a computer program stored in the memory and executable on the processor, the computer program being executed by the processor to: create, based on the original speech recognition model trained in the source domain, a migration model to be trained in the target domain, the output layer of the migration model differing from the output layer of the original speech recognition model; extract a feature data set from the sample data of the target domain, each feature data item in the set carrying an environmental factor that reflects how likely different sample data are to be similar in the same physical properties of speech; and substitute the training set of the feature data set into the migration model as input parameters for iterative training to obtain the target speech recognition model, wherein during the iterative training the weights of the migration model's output layer are adjusted randomly while the weights of the other layers remain unchanged.

A computer-readable storage medium is also provided, storing one or more programs which, when executed by a server including a plurality of application programs, cause the server to perform the following operations: creating, based on the original speech recognition model trained in the source domain, a migration model to be trained in the target domain, the output layer of the migration model differing from the output layer of the original speech recognition model; extracting a feature data set from the sample data of the target domain, each feature data item in the set carrying an environmental factor that reflects how likely different sample data are to be similar in the same physical properties of speech; and substituting the training set of the feature data set into the migration model as input parameters for iterative training to obtain the target speech recognition model, wherein during the iterative training the weights of the migration model's output layer are adjusted randomly while the weights of the other layers remain unchanged.
As can be seen from the above technical solution, a migration model to be trained in the target domain is created based on the original speech recognition model trained in the source domain, and a feature data set is extracted from the sample data of the target domain, each feature data item in the set carrying an environmental factor that reflects how likely different sample data are to be similar in the same physical properties of speech; the training set of the feature data set is substituted into the migration model as input parameters for iterative training to obtain the target speech recognition model, wherein during the iterative training the weights of the migration model's output layer are adjusted randomly while the weights of the other layers remain unchanged. Thus, when training data for a new scenario are insufficient, transfer learning offers a scheme that speeds up the generation of an effective training model, reduces cost and improves training efficiency; at the same time, because the feature data used by the retrained layer during training carry environmental factors that can affect the physical properties of speech, the trained target speech recognition model has good extensibility and high recognition accuracy.
Embodiment 7

The present disclosure further provides another electronic device, including a memory, a processor and a computer program stored in the memory and executable on the processor, the computer program being executed by the processor to: determine the speech data to be recognized; extract feature data from the speech data, the feature data carrying an environmental factor that reflects how likely different sample data are to be similar in the same physical properties of speech; and substitute the feature data into the target speech recognition model for speech recognition, the target speech recognition model being trained by the method of Embodiment 1.

Another computer-readable storage medium is also provided, storing one or more programs which, when executed by a server including a plurality of application programs, cause the server to perform the following operations: determining the speech data to be recognized; extracting feature data from the speech data, the feature data carrying an environmental factor that reflects how likely different sample data are to be similar in the same physical properties of speech; and substituting the feature data into the target speech recognition model for speech recognition, the target speech recognition model being trained by the method of Embodiment 1.
As can be seen from the above technical solution, a migration model to be trained in the target domain is created based on the original speech recognition model trained in the source domain, and a feature data set is extracted from the sample data of the target domain, each feature data item in the set carrying an environmental factor that reflects how likely different sample data are to be similar in the same physical properties of speech; the training set of the feature data set is substituted into the migration model as input parameters for iterative training to obtain the target speech recognition model, wherein during the iterative training the weights of the migration model's output layer are adjusted randomly while the weights of the other layers remain unchanged. Thus, when training data for a new scenario are insufficient, transfer learning offers a scheme that speeds up the generation of an effective training model, reduces cost and improves training efficiency; at the same time, because the feature data used by the retrained layer during training carry environmental factors that can affect the physical properties of speech, the trained target speech recognition model has good extensibility and high recognition accuracy.
For the electronic devices involved in the above embodiments, reference may be made to the schematic structural diagram shown in FIG. 7.

In conclusion, the above are only preferred embodiments of the present disclosure and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.
The systems, apparatuses, modules or units set forth in one or more of the above embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer; in an exemplary embodiment, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or any combination of these devices.

Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic-tape or magnetic-disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprising", "including" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article or device that comprises the element.

The embodiments in the present disclosure are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are basically similar to the method embodiments, their description is relatively brief, and the relevant parts of the description of the method embodiments may be consulted.

The foregoing describes specific embodiments of the present disclosure. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.
Claims (16)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011629211.3 | 2020-12-31 | ||
| CN202011629211.3A CN114708857A (en) | 2020-12-31 | 2020-12-31 | Speech recognition model training method, speech recognition method and corresponding device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022143723A1 true WO2022143723A1 (en) | 2022-07-07 |
Family
ID=82167415
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/142307 Ceased WO2022143723A1 (en) | 2020-12-31 | 2021-12-29 | Voice recognition model training method, voice recognition method, and corresponding device |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN114708857A (en) |
| WO (1) | WO2022143723A1 (en) |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115273856A (en) * | 2022-07-29 | 2022-11-01 | Tencent Technology (Shenzhen) Co., Ltd. | Speech recognition method, device, electronic device and storage medium |
| CN115497461A (en) * | 2022-09-09 | 2022-12-20 | Chengdu Lianzhou International Technology Co., Ltd. | Audio recognition model training method and audio recognition method |
| CN116013300A (en) * | 2022-12-09 | 2023-04-25 | Lenovo (Beijing) Co., Ltd. | Data processing method, device and medium used in vehicle environment |
| CN116503679A (en) * | 2023-06-28 | 2023-07-28 | Zhejiang Lab | Image classification method, device, equipment and medium based on migration map |
| CN116503679B (en) * | 2023-06-28 | 2023-09-05 | Zhejiang Lab | Image classification method, device, equipment and medium based on migration map |
| CN116597587A (en) * | 2023-05-31 | 2023-08-15 | Henan Longyu Energy Co., Ltd. | Intrusion early warning method for high-risk areas of underground operating equipment based on audio-visual collaborative recognition |
| CN118102217A (en) * | 2024-01-31 | 2024-05-28 | Hainan Keli Qianfang Technology Co., Ltd. | Artificial intelligence broadcasting system based on Beidou satellite positioning and TTS framework |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115188371B (en) * | 2022-07-13 | 2025-01-24 | Hefei iFLYTEK Digital Technology Co., Ltd. | A speech recognition model training method, speech recognition method and related equipment |
| CN116028821B (en) * | 2023-03-29 | 2023-06-13 | CETC Big Data Research Institute Co., Ltd. | Pre-training model training method and data processing method integrating domain knowledge |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2003255984A (en) * | 2002-03-06 | 2003-09-10 | Asahi Kasei Corp | Wild bird call recognition device and its recognition method |
| CN104036774A (en) * | 2014-06-20 | 2014-09-10 | National Computer Network and Information Security Management Center | Method and system for recognizing Tibetan dialects |
| CN107610709A (en) * | 2017-08-01 | 2018-01-19 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and system for training a voiceprint recognition model |
| CN111009237A (en) * | 2019-12-12 | 2020-04-14 | Beijing Dajia Internet Information Technology Co., Ltd. | Voice recognition method and device, electronic equipment and storage medium |
- 2020-12-31: CN application CN202011629211.3A filed; published as CN114708857A (active, Pending)
- 2021-12-29: PCT application PCT/CN2021/142307 filed; published as WO2022143723A1 (not active, Ceased)
Also Published As
| Publication number | Publication date |
|---|---|
| CN114708857A (en) | 2022-07-05 |
Similar Documents
| Publication | Title |
|---|---|
| WO2022143723A1 (en) | Voice recognition model training method, voice recognition method, and corresponding device |
| CN111312245B (en) | Voice response method, device and storage medium |
| CN107633842B (en) | Audio recognition method, device, computer equipment and storage medium |
| Cui et al. | Data augmentation for deep neural network acoustic modeling |
| CN109767778B (en) | A speech conversion method fusing Bi-LSTM and WaveNet |
| TW201935464A (en) | Method and device for voiceprint recognition based on memorability bottleneck features |
| US12361923B2 (en) | Transforming text data into acoustic feature |
| WO2019237517A1 (en) | Speaker clustering method and apparatus, and computer device and storage medium |
| CN107993663A (en) | A voiceprint recognition method based on Android |
| CN111771213A (en) | Speech style migration |
| WO2019227574A1 (en) | Voice model training method, voice recognition method, device and equipment, and medium |
| KR102765841B1 (en) | WaveNet self-training for text-to-speech |
| WO2019237518A1 (en) | Model library establishment method, voice recognition method and apparatus, and device and medium |
| CN112967732B (en) | Method, apparatus, device and computer readable storage medium for adjusting equalizer |
| JP2020020872A (en) | Discriminator, learned model, and learning method |
| Devi et al. | RETRACTED ARTICLE: Automatic speaker recognition with enhanced swallow swarm optimization and ensemble classification model from speech signals |
| WO2023102932A1 (en) | Audio conversion method, electronic device, program product, and storage medium |
| CN115240645B (en) | Streaming speech recognition method based on attention re-scoring |
| CN116863937A (en) | Far-field speaker confirmation method based on self-distillation pre-training and meta-learning fine-tuning |
| CN111755012A (en) | A robust speaker recognition method based on deep and shallow feature fusion |
| Li et al. | Bidirectional LSTM network with ordered neurons for speech enhancement |
| CN118588087A (en) | A speaker recognition method based on Fca-ProRes2Net with fused feature dimensionality reduction |
| CN116758904A (en) | Improved pre-training method, electronic equipment and storage medium |
| JP2021157145A (en) | Inference device and learning method of inference device |
| CN113192495A (en) | Voice recognition method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21914457; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14.11.2023) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21914457; Country of ref document: EP; Kind code of ref document: A1 |