WO2022121155A1 - Meta learning-based adaptive speech recognition method and apparatus, device and medium - Google Patents
- Publication number
- WO2022121155A1 (PCT/CN2021/083002)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- model
- meta
- speech recognition
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Definitions
- the present application relates to the technical field of artificial intelligence, and in particular, to a meta-learning-based adaptive speech recognition method, apparatus, device and medium.
- an effective speaker adaptation method relies on the selection of suitable acoustic model parameters and suitable parameter update rules to avoid overfitting on less training data.
- in the related art, adjustment criteria such as the number of adjustment steps and the learning rate are designed manually.
- this adjustment method requires adjustment criteria to be manually designed in advance for different speaker types; the design process is cumbersome, the workload is large, and it cannot cover all parameter-adjustment situations, which easily leads to poor speech recognition results.
- the present application provides a meta-learning-based adaptive speech recognition method, apparatus, device and medium, which mainly solve the problem that, when a speech recognition model is adaptively adjusted for a speaker, adjustment criteria must be manually designed in advance for different speaker types; the design process is cumbersome, the workload is large, and it cannot cover all parameter-adjustment situations, which leads to poor speech recognition results.
- a meta-learning-based adaptive speech recognition method comprising:
- the target speech under the target speech type is recognized by using the speech recognition model configured with the target model parameters.
- a meta-learning-based adaptive speech recognition device comprising:
- the training module is used to train the speech recognition model and the meta-learning adaptation model by using the preprocessed sample speech data
- an adjustment module for adjusting the initial model parameters of the speech recognition model to target model parameters matching the target speech type based on the meta-learning adaptation model
- the recognition module is used for recognizing the target speech under the target speech type by using the speech recognition model configured with the target model parameter.
- a storage medium on which a computer program is stored, and when the program is executed by a processor, the following methods are implemented:
- the target speech under the target speech type is recognized by using the speech recognition model configured with the target model parameters.
- a computer device comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the following method when executing the program:
- the target speech under the target speech type is recognized by using the speech recognition model configured with the target model parameters.
- the meta-learning adaptation model is used to realize adaptive adjustment of the model parameters of the speech recognition model by means of artificial intelligence technology, which not only reduces the instability of manual design but also enables the model parameters to be updated for different application scenarios, thereby ensuring the accuracy of speech recognition.
- FIG. 1 shows a schematic flowchart of a meta-learning-based adaptive speech recognition method provided by an embodiment of the present application
- FIG. 2 shows a schematic flowchart of another meta-learning-based adaptive speech recognition method provided by an embodiment of the present application
- FIG. 3 shows a schematic flowchart of a meta-learning adaptation model training process provided by an embodiment of the present application
- FIG. 4 shows a schematic flowchart of a meta-learning-based adaptive speech recognition system provided by an embodiment of the present application
- FIG. 5 shows a schematic structural diagram of a meta-learning-based adaptive speech recognition device provided by an embodiment of the present application
- FIG. 6 shows a schematic structural diagram of another meta-learning-based adaptive speech recognition device provided by an embodiment of the present application.
- the technical solution of the present application relates to the technical field of artificial intelligence, for example, it may specifically relate to speech processing technology to realize speech recognition.
- the data involved in this application such as sample speech data and/or target speech, etc., may be stored in a database, or may be stored in a blockchain, which is not limited in this application.
- an embodiment of the present application provides a meta-learning-based adaptive speech recognition method. As shown in FIG. 1 , the method includes:
- the sample speech data corresponds to a preset number of unlabeled speech data.
- a preset speech processing tool such as the Kaldi ASR tool
- data preprocessing can include pre-emphasis, framing, windowing and other operations. Through data preprocessing, aliasing and high-order harmonic distortion caused by defects of the human vocal organs and the acquisition equipment can be eliminated, thereby improving the quality of the speech signal.
- MFCCs (Mel-Frequency Cepstral Coefficients)
- Filter-Bank features
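The preprocessing steps above (pre-emphasis, framing, windowing) can be sketched as follows. This is a minimal numpy illustration; the frame length, hop size and pre-emphasis coefficient are common defaults for 16 kHz audio, not values taken from the application:

```python
import numpy as np

def preprocess_frames(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing and Hamming windowing of a raw waveform."""
    # pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # split into overlapping frames (20-30 ms each at typical sample rates)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # windowing reduces spectral leakage at the frame boundaries
    return frames * np.hamming(frame_len)
```

MFCC or Filter-Bank features would then be computed per windowed frame, for instance with a tool such as the Kaldi toolkit mentioned above.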
- since the test-set data cannot be used in the training process, it is also necessary to preprocess the sample speech data according to the meta-learning method.
- the data is sampled or divided into multiple data blocks, so that during training the current data block can be used for adaptive training and the loss can be reduced on the next data block corresponding to the current data block.
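The block division just described can be sketched as follows (helper names are illustrative; `numpy.array_split` stands in for whatever sampling or partitioning scheme is actually used):

```python
import numpy as np

def split_into_blocks(features, labels, n_blocks):
    """Divide sample data into consecutive (feature, label) blocks."""
    return list(zip(np.array_split(features, n_blocks),
                    np.array_split(labels, n_blocks)))

def adaptation_pairs(blocks):
    """Pair each block with the next: adapt on block c, evaluate on c+1."""
    return [(blocks[c], blocks[c + 1]) for c in range(len(blocks) - 1)]
```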
- the execution body of the present application may be a speech recognition system for realizing speaker adaptation, in which a pre-trained speech recognition model and a meta-learning adaptation model are configured, and the model parameters of the speech recognition model can be adjusted according to the adaptation data of different speech types.
- the target speech type is the same speech type as the speaker type to be recognized, and the speech type can be customized and divided according to actual application requirements.
- in addition, due to environmental influences, people in different regions have different types of accents, so voice types can also be divided by environmental region, such as Northeastern dialect, Sichuan dialect, Shandong dialect and Cantonese.
- the target voice type can be divided and selected according to actual application requirements. For example, when voice recognition tasks for different age groups are involved (such as infant education), the target voice type can be divided and selected by age group; if a one-to-one intelligent voice service is involved, each person can be regarded as an independent group type; if a mass-market intelligent voice service is involved, the target voice type can be divided and selected by environmental region. Further, after the target group type is selected, the initial model parameters of the speech recognition model can be adjusted to the target model parameters matching the target voice type by using the meta-learning adaptation model.
- the target speech under the target speech type can then be recognized in a targeted manner based on the speech recognition model configured with the target model parameters, thereby improving the recognition accuracy.
- the present application can first pre-train the speech recognition model with sample speech data, and further, taking the speaker adaptation task as a task in meta-learning, design a meta-learning adaptation model that adjusts the model parameters of the speech recognition model according to different speech types. Then, once the target speech type is determined, the initial model parameters of the speech recognition model can be adjusted, based on the meta-learning adaptation model, to the target model parameters matching the target speech type, and the speech recognition model configured with the target model parameters can be used to achieve targeted and precise recognition of the target speech under the target speech type.
- the meta-learning adaptation model is used to realize adaptive adjustment of the model parameters of the speech recognition model by means of artificial intelligence technology, which not only reduces the instability of manual design but also enables the model parameters to be updated for different application scenarios, thereby ensuring the accuracy of speech recognition.
- as shown in FIG. 2 , an embodiment of the present application provides another meta-learning-based adaptive speech recognition method. The method includes:
- pre-emphasis of the sample speech data emphasizes the high-frequency part of the speech, removes the influence of lip radiation, and increases the high-frequency resolution of the speech.
- the analysis of the sample speech should be based on the "short-term" basis.
- windowing and framing processing are then performed: the sample speech is divided into segments, each called a "frame", and the characteristics of each segment are analyzed. If a frame is too long, the time-varying characteristics of the speech signal cannot be captured; if it is too short, the characteristics of the speech signal cannot be extracted. Each frame is therefore typically 20-30 ms.
- a vector of speech features corresponding to each frame of sample speech can be obtained.
- the speech features of the entire sample data can then be obtained by integrating the per-frame speech feature vectors according to their frame numbers, where the frame number indicates the temporal order of the speech in each frame.
- the speech recognition task can be regarded as a sequence-to-sequence problem.
- the speech features in the sample speech are calculated frame by frame and finally integrated according to the frame number of each frame to obtain the speech sequence feature of the sample speech, expressed as: x = (x 1 , x 2 , ..., x T ), where T is the total number of frames of the speech sequence and x t is the speech feature contained in the t-th frame.
- an existing voice conversion algorithm can be used to obtain the text features corresponding to each frame of sample voice, which can be further integrated into the text sequence feature corresponding to the sample voice, expressed as: y = (y 1 , y 2 , ..., y U ), where U is the total length of the text corresponding to the speech and y u is the u-th text feature.
- the speech recognition model obtained by pre-training with sample speech data can be applied to basic speech recognition tasks, but when performing targeted speech recognition (such as recognition of accented speech or of infant speech), the recognition effect is often not accurate enough; it is therefore necessary to perform the model parameter correction process in the subsequent steps so that the speech recognition model can achieve targeted recognition.
- a deep neural network model based on Connectionist Temporal Classification (CTC) can be used; the CTC-based model is trained by predicting the output of each frame in the speech sequence and comparing it with the real sample labels to calculate the model training error.
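The CTC criterion referred to above can be illustrated with the standard CTC forward algorithm. This is a pure-numpy sketch for a single utterance with a non-empty label sequence; a production system would use a framework implementation rather than this loop:

```python
import numpy as np

def _logsumexp(*xs):
    xs = [x for x in xs if x != -np.inf]
    if not xs:
        return -np.inf
    m = max(xs)
    return m + np.log(sum(np.exp(x - m) for x in xs))

def ctc_loss(log_probs, labels, blank=0):
    """Negative log-likelihood of `labels` under the CTC forward algorithm.

    log_probs: (T, V) array of per-frame log-softmax outputs.
    labels:    non-empty list of label ids (no blanks)."""
    T, _ = log_probs.shape
    ext = [blank]                      # extended sequence: blank, l1, blank, ...
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)   # forward log-probabilities
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a = _logsumexp(a, alpha[t - 1, s - 1])
            # skip transition allowed between different non-blank labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = _logsumexp(a, alpha[t - 1, s - 2])
            alpha[t, s] = a + log_probs[t, ext[s]]
    return -_logsumexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])
```

The loss sums over all frame-level alignments that collapse to the label sequence, which is what lets the model be trained without frame-level annotations.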
- step 202 in this embodiment may specifically include: inputting the first speech feature into the speech recognition model to obtain a text output result; calculating a first loss function from the text output result and the first text feature; and, if it is determined that the first loss function is smaller than the first preset threshold, determining that the speech recognition model meets the first training standard.
- the first preset threshold may be set according to the actual training accuracy requirement and should be a value greater than 0 and less than 1; the smaller the set value, the higher the training accuracy of the speech recognition model.
- the meta-learning adaptation model is trained based on a meta-learning technique, wherein the meta-learning is an algorithm that uses a model related to a learning task instead of manually designing an adjustment criterion.
- the task of the meta-learning adaptation model is to adjust the parameters of the original model with the help of a small amount of adaptive speech data, so that it performs better on speech recognition.
- step 203 in this embodiment may specifically include: dividing the sample speech data into a preset number of data blocks, and extracting the second speech feature and the second text feature of each data block.
- the second speech feature and the second text feature in this embodiment differ from the first speech feature and the first text feature in step 202: the second speech feature and second text feature correspond to the feature sequences of the individual data blocks into which the sample data is divided, while the first speech feature and first text feature correspond to the feature sequence of the entire sample speech data.
- the initial model parameters of the pre-trained speech recognition model (pre-training model 1) and the second speech feature and second text feature of data block 1 can be used to calculate the loss value, loss gradient and new model parameters of the meta-learning adaptation model on data block 1; the new model parameters then replace the initial model parameters in pre-training model 1 to obtain model 2 corresponding to the speech recognition model.
- next, the loss value, loss gradient and new model parameters of the meta-learning adaptation model on data block 2 are calculated, and the new model parameters obtained at this time are replaced into model 2 to obtain model 3 corresponding to the speech recognition model.
- then the loss value, loss gradient and new model parameters of the meta-learning adaptation model on data block 3 are calculated, and the new model parameters obtained at this time are replaced into model 3 to obtain model 4 corresponding to the speech recognition model, and so on until all data blocks have been used to train the meta-learning adaptation model.
- whether the meta-learning adaptation model has passed training is determined according to the loss function; when training passes, the model parameters determined on the last data block can be taken as the new model parameters of the speech recognition model in the testing phase.
- the specific implementation process of training the meta-learning adaptation model may be: extracting the initial model parameters of the speech recognition model; if the current data block is determined to be the first divided data block, calculating the loss value, loss gradient and new model parameters of the meta-learning adaptation model on the first data block according to the initial model parameters and the second speech feature and second text feature of the current data block; if the current data block is determined not to be the first data block, calculating the loss value, loss gradient and new model parameters of the meta-learning adaptation model on the current data block based on the new model parameters of the previous data block and the second speech feature and second text feature of the current data block; if it is determined that all data blocks have been trained, using the loss values, loss gradients and new model parameters calculated on each data block to determine the second loss function of the meta-learning adaptation model; and if it is determined that the second loss function is smaller than the second preset threshold, determining that the meta-learning adaptation model meets the second training standard.
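The block-wise training flow just described can be sketched with a toy stand-in model. Here a linear least-squares model replaces the speech recognition network and a plain gradient step replaces the LSTM meta-learner; all names are illustrative, not the application's own:

```python
import numpy as np

def block_meta_train(blocks, theta_init, lr=0.1):
    """Adapt on block c, then measure loss on block c+1 (the 'second loss').

    blocks: list of (x, y) pairs; a linear model y ~ x @ theta stands in
    for the speech recognition model."""
    theta = theta_init.copy()
    second_loss = 0.0
    for c in range(len(blocks) - 1):
        x, y = blocks[c]
        grad = 2 * x.T @ (x @ theta - y) / len(y)   # loss gradient on block c
        theta = theta - lr * grad                   # "new model parameters"
        x_next, y_next = blocks[c + 1]              # evaluate on the next block
        second_loss += np.mean((x_next @ theta - y_next) ** 2)
    return theta, second_loss
```

Training would stop once `second_loss` falls below the second preset threshold; the parameters produced on the last block are kept for the testing phase.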
- the second preset threshold may be set according to the actual training accuracy requirement, and should be a value greater than 0 and less than 1.
- the second loss function is calculated as: J = Σ c L(y c+1 , f(x c+1 ; θ′)), where J is the second loss function, y c+1 is the second text feature of the (c+1)-th data block, x c+1 is the second speech feature of the (c+1)-th data block, θ′ is the new model parameters calculated by the meta-learning adaptation model on the c-th data block, and L(y c+1 , f(x c+1 ; θ′)) is the loss value calculated by the meta-learning adaptation model on the (c+1)-th data block.
- the network structure of the meta-learning adaptation model may adopt a two-layer Long Short-Term Memory (LSTM) network.
- the input of the first LSTM layer at time t includes the original model parameters θ t , the cross-entropy loss L t on the data block, and its corresponding gradient ∇ θ L t , from which the hidden-layer representation h t is obtained.
- from the second LSTM layer, the forget gate parameter f t and the input gate parameter i t are obtained; combined with the original parameters θ t and the corresponding gradient ∇ θ L t ,
- the new model parameters can be obtained: θ t+1 = f t ⊙ θ t − i t ⊙ ∇ θ L t .
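The gated update θ t+1 = f t ⊙ θ t − i t ⊙ ∇ θ L t can be sketched as follows. For illustration only, each gate network is reduced to a single affine layer over the per-parameter inputs (parameter value, loss, gradient); in the application the gates come from a trained two-layer LSTM, and the weight names below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def meta_update(theta, loss, grad, W_f, b_f, W_i, b_i):
    """One meta-learner step: theta_{t+1} = f_t * theta_t - i_t * grad_t."""
    # per-parameter input features: (current value, scalar loss, gradient)
    feats = np.stack([theta, np.full_like(theta, loss), grad], axis=-1)
    f_t = sigmoid(feats @ W_f + b_f)   # forget gate: how much of theta to keep
    i_t = sigmoid(feats @ W_i + b_i)   # input gate: per-parameter learning rate
    return f_t * theta - i_t * grad
```

With f close to 1 and i acting as a learned, per-parameter step size, this generalizes plain gradient descent, which is why the meta-learner can replace hand-tuned learning rates and step counts.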
- the meta-learning adaptation model can be used to adaptively adjust the model parameters of the speech recognition model according to the actual application scenario. Specifically, to obtain the target model parameters matching the target speech type, a small amount of adaptive speech data matching the target speech type is extracted, and the initial model parameters of the speech recognition model, together with the loss and loss gradient on the adaptive speech data, are input into the meta-learning adaptation model, yielding the target model parameters of the speech recognition model that match the target speech type. The adaptation data must belong to the same speech type as the target speech to be recognized.
- for example, if the speech recognition of the present application is applied to an infant education scenario, a small amount of infant voice data can be used as the adaptive voice data to determine the target model parameters matching the infant voice type, so as to obtain high accuracy in infant speech recognition.
- if the speech recognition of the present application is applied to a mass-market intelligent speech service, the target speech to be recognized comes from groups of people in various regions. To avoid regional accents affecting the speech recognition effect, the corresponding adaptive voice data can be selected according to the region to be recognized; for example, if the region to be recognized is the Northeast, a small amount of Northeastern voice data can be selected as the adaptive voice data, so as to obtain a speech recognition model that eliminates accent interference.
- the target model parameters can then be updated into the speech recognition model, and the updated speech recognition model can be used to accurately recognize the target speech under the target speech type, thereby achieving a better speech recognition effect than the pre-trained speech recognition model.
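The test-time adaptation step can be sketched as follows. A toy linear model and a generic `meta_learner` callable stand in for the speech recognition model and the trained meta-learning adaptation model; all names are hypothetical:

```python
import numpy as np

def adapt_to_target(theta_init, adapt_x, adapt_y, meta_learner):
    """Compute loss and gradient on a small matched adaptation sample and
    let the trained meta-learner emit the target model parameters.

    meta_learner: any callable (theta, loss, grad) -> new theta."""
    residual = adapt_x @ theta_init - adapt_y     # toy linear stand-in model
    loss = float(np.mean(residual ** 2))
    grad = 2 * adapt_x.T @ residual / len(adapt_y)
    return meta_learner(theta_init, loss, grad)
```

The same call would be made once per target speech type, each time with a small matched adaptation sample, and the returned parameters then configure the recognition model.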
- the present application can first pre-train the speech recognition model with sample speech data, and further regard the speaker adaptation task as a task in meta-learning, designing a meta-learning adaptation model that adjusts the model parameters of the speech recognition model according to different speech types. Then, once the target speech type is determined, the initial model parameters of the speech recognition model can be adjusted, based on the meta-learning adaptation model, to the target model parameters matching the target speech type, and the speech recognition model configured with the target model parameters can be used to achieve targeted and precise recognition of the target speech under the target speech type.
- the meta-learning adaptation model is used to realize adaptive adjustment of the model parameters of the speech recognition model, which not only reduces the instability of manual design but also enables the model parameters to be updated for different application scenarios, thereby ensuring the accuracy of speech recognition. Moreover, when using the meta-learning adaptation model to determine the model parameters, only a small amount of adaptive speech data is needed; although such limited adaptation data would ordinarily make the updated model parameters prone to overfitting to the application scenario, the meta-learned update rule also reduces the risk of overfitting during parameter updates.
- the speech recognition system may refer to the schematic flowchart of the meta-learning-based adaptive speech recognition system shown in FIG. 4 .
- after feature extraction, the extracted first speech feature and first text feature are used to pre-train the speech recognition model; then the meta-learning adaptation model is trained according to the second speech feature and second text feature corresponding to the sample speech data and the original model parameters of the speech recognition model.
- for the adaptive voice data matching the target voice type, after its speech features and text features are extracted, they are input together with the initial model parameters of the speech recognition model into the trained meta-learning adaptation model to obtain the target model parameters matching the target voice type.
- the initial model parameters of the speech recognition model are updated to the target model parameters, and then the updated speech recognition model (speaker adaptation model) can be used to recognize the target speech under the target speech type.
- an embodiment of the present application provides an adaptive speech recognition device based on meta-learning.
- the device includes: a training module 31 , an adjustment module 32 , and a recognition module 33 ;
- the training module 31 can be used to train the speech recognition model and the meta-learning adaptation model by using the preprocessed sample speech data
- the adjustment module 32 can be used to adjust the initial model parameters of the speech recognition model to the target model parameters matching the target speech type based on the meta-learning adaptation model;
- the recognition module 33 can be used to recognize the target speech under the target speech type by using the speech recognition model configured with the target model parameters.
- the training module 31 may specifically include: a processing unit 311 , a first training unit 312 , and a second training unit 313 ;
- the processing unit 311 can be used to preprocess the sample voice data, and mark the first voice feature and the first text feature corresponding to the sample voice data, and the preprocessing at least includes pre-emphasis processing, framing processing, and windowing processing;
- a first training unit 312 which can be used to train a speech recognition model that meets the first training standard based on the first speech feature and the first text feature;
- the second training unit 313 can be configured to use the sample speech data and the speech recognition model to train a meta-learning adaptation model that meets the second training standard.
- the first training unit 312 can be specifically configured to input the first speech feature into the speech recognition model to obtain a text output result; calculate a first loss function according to the text output result and the first text feature; and, if it is determined that the first loss function is smaller than the first preset threshold, determine that the speech recognition model meets the first training standard.
- the second training unit 313 can be specifically configured to divide the sample voice data into a preset number of data blocks, and extract the second voice feature and the second text feature of each data block; according to the second voice feature , a second text feature, and a speech recognition model to train a meta-learning adaptation model that meets the second training criteria.
- the second training unit 313 can be specifically used to extract the initial model parameters of the speech recognition model; if it is determined that the current data block is the first divided data block, calculate, according to the initial model parameters and the second speech feature and second text feature of the current data block, the loss value, loss gradient and new model parameters of the meta-learning adaptation model on the first data block; if it is determined that the current data block is not the first data block, calculate the loss value, loss gradient and new model parameters of the meta-learning adaptation model on the current data block based on the new model parameters of the previous data block and the second speech feature and second text feature of the current data block; if it is determined that all data blocks have been trained, use the loss values, loss gradients and new model parameters calculated on each data block to determine the second loss function of the meta-learning adaptation model; and if it is determined that the second loss function is smaller than the second preset threshold, determine that the meta-learning adaptation model meets the second training standard.
- J is the second loss function
- y c+1 is the second text feature of the c+1 data block
- x c+1 is the second speech feature of the c+1 data block
- ⁇ ′ is the meta-learning adaptation model in c
- L(y c+1 , f(x c+1 ; ⁇ ′)) is the loss value calculated by the meta-learning adaptation model in the c+1 data block.
- the adjustment module 32 may specifically include: an extraction unit 321 and an acquisition unit 322;
- the obtaining unit 322 is configured to input the initial model parameters of the speech recognition model and the adaptive speech data into the meta-learning adaptation model, and obtain the target model parameters matching the target speech type.
- the recognition module 33 can be specifically configured to update the initial model parameters of the speech recognition model to the target model parameters, so as to use the updated speech recognition model to recognize the target speech under the target speech type.
- this embodiment further provides a storage medium on which computer-readable instructions (or a computer program) are stored; when executed by a processor, the instructions implement the above-mentioned meta-learning-based adaptive speech recognition method shown in FIG. 1 to FIG. 2 .
- the storage medium involved in this application may be a readable storage medium, or may be referred to as a computer-readable storage medium.
- the storage medium such as a readable storage medium, may be non-volatile, such as a non-volatile readable storage medium; or, may also be volatile, such as a volatile readable storage medium.
- the technical solution of the present application can be embodied in the form of a software product, and the software product can be stored in a non-volatile storage medium (which may be a CD-ROM, USB flash drive, removable hard disk, etc.), including several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various implementation scenarios of the present application.
- a computer device which may be a personal computer, a server, or a network device, etc.
- this embodiment also provides a computer device, which includes a storage medium and a processor;
- the storage medium may be a non-volatile storage medium for storing a computer program; a processor for executing the computer program to implement the above-mentioned meta-learning-based adaptive speech recognition method shown in FIG. 1 to FIG. 2 .
- the processor may be configured to perform: using the preprocessed sample speech data to train a speech recognition model and a meta-learning adaptation model; based on the meta-learning adaptation model, adjusting the initial model parameters of the speech recognition model to target model parameters matching the target speech type; and using the speech recognition model configured with the target model parameters to recognize the target speech under the target speech type.
- when the processor executes the program, other steps of the method in the foregoing embodiments may also be implemented, which will not be repeated here.
- the computer device may further include a user interface, a network interface, a camera, a radio frequency (Radio Frequency, RF) circuit, a sensor, an audio circuit, a WI-FI module, and the like.
- the user interface may include a display screen (Display), an input unit such as a keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, and the like.
- Optional network interfaces may include standard wired interfaces, wireless interfaces (such as WI-FI interfaces), and the like.
- a computer device does not constitute a limitation on the physical device, and may include more or less components, or combine some components, or arrange different components.
- the non-volatile storage medium may further include an operating system and a network communication module.
- An operating system is a program that manages the hardware and software resources of the above-mentioned computer equipment, and supports the operation of information processing programs and other software and/or programs.
- the network communication module is used to realize the communication between various components inside the non-volatile storage medium, and communicate with other hardware and software in the information processing entity device.
- the present application can first pre-train the speech recognition model with sample speech data, and further regard the speaker adaptation task as a task in meta-learning, designing a meta-learning adaptation model that adjusts the model parameters of the speech recognition model according to different speech types. Then, once the target speech type is determined, the initial model parameters of the speech recognition model can be adjusted, based on the meta-learning adaptation model, to the target model parameters matching the target speech type, and the speech recognition model configured with the target model parameters can be used to achieve targeted and precise recognition of the target speech under the target speech type.
- the meta-learning adaptation model is used to adaptively adjust the model parameters of the speech recognition model, which not only reduces the instability of manual design but also allows the model parameters to be updated for different application scenarios, thereby ensuring the accuracy of speech recognition. Moreover, when the meta-learning adaptation model is used to determine the model parameters of the speech recognition model, only a small amount of adaptation speech data is needed, so the risk of overfitting during parameter updating can also be reduced.
- the accompanying drawing is only a schematic diagram of a preferred implementation scenario, and the modules or processes in the drawing are not necessarily required for implementing the present application.
- according to the description of the implementation scenario, the modules of the apparatus may be distributed in the device of that scenario, or may, with corresponding changes, be located in one or more devices different from that of the present implementation scenario.
- the modules of the above implementation scenarios may be combined into one module, or may be further split into multiple sub-modules.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
- Machine Translation (AREA)
Abstract
Description
This application claims priority to the Chinese patent application filed with the China Patent Office on December 10, 2020, with application number 202011434900.9 and entitled "Meta-learning-based adaptive speech recognition method, apparatus, device and medium", the entire contents of which are incorporated herein by reference.
The present application relates to the technical field of artificial intelligence, and in particular to a meta-learning-based adaptive speech recognition method, apparatus, device and medium.
With the development of deep learning, automatic speech recognition systems have achieved remarkable results and are used in various scenarios of daily life. The most widely used speech recognition application today is the intelligent assistant, which allows users to communicate naturally with machines through speech to assist their work. However, most scenarios served by intelligent assistants involve a single user, and their speech recognition targets a single speaker. In this case, the performance of the automatic speech recognition system can be improved by adjusting the acoustic model parameters to compensate for the mismatch between training and testing conditions. This method of adjusting existing parameters for an unknown speaker is called speaker adaptation.
The inventors have realized that an effective speaker adaptation method relies on selecting suitable acoustic model parameters and suitable parameter update rules to avoid overfitting on limited training data. To effectively adjust the acoustic model parameters for different speakers, manually designed adjustment criteria (such as the number of adaptation steps, the learning rate, etc.) are mainly used at present. However, this approach requires the adjustment criteria to be designed manually in advance for different speaker types; the design process is cumbersome, the workload is large, and it cannot cover all parameter adjustment situations, which easily leads to poor speech recognition results.
SUMMARY OF THE INVENTION
In view of this, the present application provides a meta-learning-based adaptive speech recognition method, apparatus, device and medium, which mainly solves the problem that current speaker-adaptive adjustment of a speech recognition model requires adjustment criteria to be designed manually in advance for different speaker types, a design process that is cumbersome and labor-intensive and cannot cover all parameter adjustment situations, thereby leading to poor speech recognition results.
According to one aspect of the present application, a meta-learning-based adaptive speech recognition method is provided, the method comprising:
training a speech recognition model and a meta-learning adaptation model by using preprocessed sample speech data;
adjusting, based on the meta-learning adaptation model, initial model parameters of the speech recognition model to target model parameters matching a target speech type; and
recognizing target speech under the target speech type by using the speech recognition model configured with the target model parameters.
According to another aspect of the present application, a meta-learning-based adaptive speech recognition apparatus is provided, comprising:
a training module, configured to train a speech recognition model and a meta-learning adaptation model by using preprocessed sample speech data;
an adjustment module, configured to adjust, based on the meta-learning adaptation model, initial model parameters of the speech recognition model to target model parameters matching a target speech type; and
a recognition module, configured to recognize target speech under the target speech type by using the speech recognition model configured with the target model parameters.
According to yet another aspect of the present application, a storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the following method:
training a speech recognition model and a meta-learning adaptation model by using preprocessed sample speech data;
adjusting, based on the meta-learning adaptation model, initial model parameters of the speech recognition model to target model parameters matching a target speech type; and
recognizing target speech under the target speech type by using the speech recognition model configured with the target model parameters.
According to still another aspect of the present application, a computer device is provided, comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the following method when executing the program:
training a speech recognition model and a meta-learning adaptation model by using preprocessed sample speech data;
adjusting, based on the meta-learning adaptation model, initial model parameters of the speech recognition model to target model parameters matching a target speech type; and
recognizing target speech under the target speech type by using the speech recognition model configured with the target model parameters.
In the present application, the meta-learning adaptation model is used to adaptively adjust the model parameters of the speech recognition model. This use of artificial intelligence technology not only reduces the instability of manual design but also allows the model parameters to be updated for different application scenarios, thereby ensuring the accuracy of speech recognition.
The drawings described herein are used to provide a further understanding of the present application and constitute a part of the present application. The schematic embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of the present application. In the drawings:
FIG. 1 shows a schematic flowchart of a meta-learning-based adaptive speech recognition method provided by an embodiment of the present application;
FIG. 2 shows a schematic flowchart of another meta-learning-based adaptive speech recognition method provided by an embodiment of the present application;
FIG. 3 shows a schematic flowchart of a training process of a meta-learning adaptation model provided by an embodiment of the present application;
FIG. 4 shows a schematic flowchart of a meta-learning-based adaptive speech recognition system provided by an embodiment of the present application;
FIG. 5 shows a schematic structural diagram of a meta-learning-based adaptive speech recognition apparatus provided by an embodiment of the present application;
FIG. 6 shows a schematic structural diagram of another meta-learning-based adaptive speech recognition apparatus provided by an embodiment of the present application.
The present application will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that the embodiments in the present application and the features of the embodiments may be combined with each other where no conflict arises.
The technical solution of the present application relates to the field of artificial intelligence, and may specifically relate to speech processing technology for realizing speech recognition. Optionally, the data involved in this application, such as the sample speech data and/or the target speech, may be stored in a database or in a blockchain, which is not limited in this application.
To address the problem that, when performing speaker-adaptive adjustment of a speech recognition model, adjustment criteria currently need to be designed manually in advance for different speaker types, with a design process that is cumbersome and labor-intensive and cannot cover all parameter adjustment situations, thereby leading to poor speech recognition results, an embodiment of the present application provides a meta-learning-based adaptive speech recognition method. As shown in FIG. 1, the method includes:
101. Train a speech recognition model and a meta-learning adaptation model by using preprocessed sample speech data.
The sample speech data corresponds to a preset number of unlabeled speech utterances. Before the speech recognition model and the meta-learning adaptation model are trained with the sample speech data, a preset speech processing tool (such as the Kaldi ASR toolkit) is used to preprocess the sample speech data. Data preprocessing may include operations such as pre-emphasis, framing and windowing; through preprocessing, the influence on speech signal quality of factors such as aliasing and higher-order harmonic distortion, caused by defects of the human vocal organs and of the acquisition equipment, can be eliminated. In addition, to facilitate the analysis of speech characteristics, feature extraction tools (such as Mel-Frequency Cepstral Coefficients (MFCCs) or Filter-Bank features) are used to perform feature extraction; specifically, both speech features and text features need to be extracted.
In addition, when the meta-learning adaptation model is trained with the sample speech data, the test set data cannot be used during training, so the sample speech data also needs to be preprocessed into data blocks in the manner of meta-learning. Specifically, the sample speech data can be sampled or divided into multiple data blocks, so that during training the current data block can be used for adaptation training and a loss reduction can be obtained on the next data block corresponding to the current one.
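The block-wise preprocessing described above can be sketched as follows; the block count and the toy feature arrays are illustrative assumptions rather than values prescribed by this application:

```python
import numpy as np

def split_into_blocks(features, labels, num_blocks):
    """Divide sample speech data into a preset number of data blocks.
    During meta-training, block c is used for adaptation and the
    loss reduction is measured on block c + 1."""
    feat_blocks = np.array_split(features, num_blocks)
    label_blocks = np.array_split(labels, num_blocks)
    return list(zip(feat_blocks, label_blocks))

# Toy example: 100 frames of 13-dimensional features with per-frame labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 13))
labels = rng.integers(0, 5, size=100)

blocks = split_into_blocks(features, labels, num_blocks=5)
# Adaptation pairs: adapt on block c, evaluate on block c + 1.
pairs = [(blocks[c], blocks[c + 1]) for c in range(len(blocks) - 1)]
print(len(blocks), len(pairs))  # 5 4
```

Because adaptation is always scored on the following block, five blocks yield four (adapt, evaluate) pairs, which is what prevents the meta-learner from ever training on its own evaluation data.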
The execution subject of the present application may be a speech recognition system for realizing speaker adaptation. The speech recognition system is configured with a pre-trained speech recognition model and with a meta-learning adaptation model capable of adjusting the model parameters of the speech recognition model according to adaptation data of different speech types.
102. Based on the meta-learning adaptation model, adjust the initial model parameters of the speech recognition model to target model parameters matching the target speech type.
The target speech type is the speech type matching the type of the speaker to be recognized, and speech types can be divided in a customized manner according to actual application requirements. Several examples of speech type division follow. Because the audio and pitch corresponding to different age groups or genders often differ, speaker types can be divided by age or gender, for example into the elderly, adults, teenagers and infants by age, or into men and women by gender. Correspondingly, because even within the same age group different people have different timbres due to differences in their vocal cords, each person can also be treated as an independent speech type. In addition, because of environmental influences, people in different regions have different types of accents, so speech types can also be defined by region, such as Northeastern Mandarin, Sichuan dialect, Shandong dialect, Cantonese, and so on.
Correspondingly, in this embodiment, the target speech type can be divided and selected according to actual application requirements. For speech recognition tasks involving different age groups (such as infant education), target speech types can be selected by age group; for one-to-one intelligent voice services, each person can be treated as an independent type; for mass-oriented intelligent voice services, target speech types can be selected by region, and so on. Further, after the target type is selected, the meta-learning adaptation model can be used to adjust the initial model parameters of the speech recognition model to target model parameters matching the target speech type.
103. Recognize the target speech under the target speech type by using the speech recognition model configured with the target model parameters.
In a specific application scenario of this embodiment, after the target model parameters matching the target speech type are obtained, the target speech under the target speech type can be recognized in a targeted manner based on the speech recognition model configured with the target model parameters, thereby improving recognition accuracy.
Compared with performing adaptive speech recognition through manually designed adjustment criteria, the meta-learning-based adaptive speech recognition method of this embodiment can first pre-train the speech recognition model on sample speech data, treat the speaker adaptation task as a task in meta-learning, and design a meta-learning adaptation model for adjusting the model parameters of the speech recognition model according to different speech types. When the target speech type is determined, the initial model parameters of the speech recognition model can then be adjusted, based on the meta-learning adaptation model, to target model parameters matching the target speech type, and the speech recognition model configured with the target model parameters can achieve targeted, precise recognition of the target speech under the target speech type. In the present application, the meta-learning adaptation model is used to adaptively adjust the model parameters of the speech recognition model; this use of artificial intelligence not only reduces the instability of manual design but also allows the model parameters to be updated for different application scenarios, thereby ensuring the accuracy of speech recognition.
Further, as a refinement and extension of the specific implementation of the above embodiment, in order to fully describe the specific implementation process of this embodiment, another meta-learning-based adaptive speech recognition method is provided. As shown in FIG. 2, the method includes:
201. Preprocess the sample speech data, and mark the first speech features and first text features corresponding to the sample speech data, where the preprocessing at least includes pre-emphasis, framing and windowing.
The purpose of applying pre-emphasis to the sample speech data is to emphasize the high-frequency part of the speech, remove the influence of lip radiation, and increase the high-frequency resolution of the speech. Pre-emphasis is generally realized by a first-order FIR high-pass digital filter with transfer function H(z) = 1 - a·z^(-1), where a is the pre-emphasis coefficient and 0.9 < a < 1.0. Let the speech sample value at time n be x(n); the result after pre-emphasis is y(n) = x(n) - a·x(n-1), where a = 0.98 is taken here.
In addition, since the characteristics of a speech signal remain basically unchanged within a short time range, that is, the speech signal is short-time stationary, the analysis of the sample speech should be carried out on a "short-time" basis. The sample speech data is therefore windowed and divided into frames, and the sample speech is analyzed segment by segment, each segment being one "frame". If a frame is too long, the time-varying characteristics of the speech signal cannot be captured; if it is too short, the features of the speech signal cannot be extracted. Each frame is therefore set to 20-30 ms. After the sample speech is windowed and framed, a vector of speech features can be obtained for each frame of sample speech. The speech features of the entire sample data can then be obtained by assembling the frames, that is, by integrating the per-frame speech feature vectors according to their frame sequence numbers, where the frame sequence number indicates the temporal order of the speech corresponding to each frame.
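The pre-emphasis, framing and windowing steps above can be sketched as follows; the 16 kHz sampling rate, the 25 ms frame length, the 10 ms hop and the Hamming window are illustrative assumptions:

```python
import numpy as np

def pre_emphasis(x, a=0.98):
    # y(n) = x(n) - a * x(n-1); the first sample is kept unchanged.
    return np.append(x[0], x[1:] - a * x[:-1])

def frame_and_window(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a signal into 20-30 ms frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * hop_len: i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    return frames  # shape: (num_frames, frame_len)

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)          # 1 s of toy audio at 16 kHz
emphasized = pre_emphasis(speech)        # high-frequency emphasis
frames = frame_and_window(emphasized)    # one windowed segment per frame
print(frames.shape)
```

Each row of `frames` is the short-time segment from which a per-frame feature vector (e.g. MFCCs) would subsequently be computed.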
In this embodiment, the speech recognition task is regarded as a sequence-to-sequence problem. Specifically, the speech features in the sample speech are first computed frame by frame, and the speech features are then integrated according to the sequence number of each frame to obtain the speech sequence features of the sample speech, expressed as:
X = [x_0, ..., x_T]
where T is the total number of frames in the speech sequence, and x_t denotes the speech features contained in the t-th frame.
Correspondingly, an existing speech-to-text conversion algorithm can be used to obtain the text features corresponding to each frame of the sample speech, which are further integrated to obtain the text sequence features corresponding to the sample speech, expressed as:
Y = [y_0, ..., y_U]
where U is the total length of the text corresponding to the speech, and y_u denotes the u-th text feature.
202. Train a speech recognition model meeting the first training standard based on the first speech features and the first text features.
In this embodiment, the speech recognition model pre-trained with the sample speech data can be applied to basic speech recognition tasks. However, in targeted speech recognition (such as recognition of accented speech or of infant speech), the recognition effect is often not accurate enough, so the model parameter correction process of the subsequent steps needs to be performed so that the speech recognition model can achieve targeted recognition. The speech recognition model of this embodiment may adopt a deep neural network model based on Connectionist Temporal Classification (CTC); during training, a CTC-based model predicts the output of each frame in the speech sequence and compares it with the true sample labels to compute the model training error. In the present application, the speech recognition model is trained with the first speech features and first text features corresponding to the sample speech data. The network structure may generally adopt structures such as LSTM/CNN/GRU; considering the memory and computation limitations of mobile scenarios, fewer network layers should be used.
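The CTC criterion mentioned above can be illustrated with the standard forward (alpha) recursion over a label sequence extended with blanks; the three-symbol vocabulary and the per-frame probabilities below are illustrative assumptions, not part of this application:

```python
import numpy as np

def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of a (non-empty) label sequence under
    per-frame log-probabilities, via the CTC forward (alpha) recursion."""
    T, _ = log_probs.shape
    ext = [blank]                      # target extended with blanks:
    for label in target:               # [b, y1, b, y2, b, ...]
        ext += [label, blank]
    S = len(ext)
    alpha = np.full((T, S), -1e30)     # log-domain forward variables
    alpha[0, 0] = log_probs[0, blank]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                       # stay
            if s > 0:
                cands.append(alpha[t - 1, s - 1])           # advance by one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])           # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    # Valid CTC paths end in the last label or the trailing blank.
    return float(-np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2]))

# Two frames, vocabulary {blank, 1, 2}, identical per-frame distributions.
log_probs = np.log(np.array([[0.5, 0.3, 0.2],
                             [0.5, 0.3, 0.2]]))
loss = ctc_loss(log_probs, [1])
# Frame paths collapsing to [1]: (blank,1), (1,blank), (1,1),
# so P = 0.5*0.3 + 0.3*0.5 + 0.3*0.3 = 0.39 and loss = -ln(0.39).
```

Summing the path probabilities by dynamic programming is what lets CTC compare per-frame outputs against a label sequence of different length without a frame-level alignment.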
Correspondingly, in order to train a speech recognition model meeting the first training standard according to the first speech features and the first text features, step 202 of the embodiment may specifically include: inputting the first speech features into the speech recognition model to obtain a text output result; calculating a first loss function from the text output result and the first text features; and, if it is determined that the first loss function is smaller than a first preset threshold, determining that the speech recognition model meets the first training standard. The first preset threshold may be set according to the actual training accuracy requirement and should be a value greater than 0 and less than 1; the larger the value, the higher the training accuracy of the speech recognition model.
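The stopping rule of step 202 (halt once the first loss function falls below the first preset threshold) can be sketched with a toy classifier standing in for the CTC network; the logistic model, the data and the threshold value are illustrative assumptions:

```python
import numpy as np

def cross_entropy(p, y):
    # Mean binary cross-entropy between predicted probabilities p and labels y.
    eps = 1e-9
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def train_until_standard(x, y, threshold=0.1, lr=0.5, max_steps=5000):
    """Gradient descent that stops once the first loss function drops
    below the first preset threshold (a value greater than 0, less than 1)."""
    w, b = 0.0, 0.0
    for step in range(max_steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))   # model output
        loss = cross_entropy(p, y)
        if loss < threshold:                     # meets the first training standard
            return w, b, loss, step
        grad = p - y                             # d(loss)/d(logit) for sigmoid + CE
        w -= lr * float(np.mean(grad * x))
        b -= lr * float(np.mean(grad))
    return w, b, loss, max_steps

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b, final_loss, steps = train_until_standard(x, y)
print(steps, final_loss < 0.1)   # training stopped once the standard was met
```

Lowering the threshold toward 0 demands more training steps and higher accuracy, which mirrors the statement that a stricter first preset threshold means higher training precision.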
203. Train a meta-learning adaptation model meeting the second training standard by using the sample speech data and the speech recognition model.
In this embodiment, the meta-learning adaptation model is trained based on meta-learning technology, where meta-learning is an algorithm that uses a model related to the learning task in place of manually designed adjustment criteria. Under the setting of this scheme, the task of the meta-learning adaptation model is to adjust the parameters of the original model, with the help of a small amount of adaptation speech data, so that its speech recognition effect improves. When the meta-learning adaptation model is trained with the sample speech data, the test set data cannot be used during training, so the sample speech data needs to be preprocessed into data blocks in the manner of meta-learning; specifically, the sample speech data can be sampled or divided into multiple data blocks, so that during training the current data block can be used for adaptation training and a loss reduction can be obtained on the next data block corresponding to the current one. Correspondingly, when the sample speech data and the speech recognition model are used to train the meta-learning adaptation model, step 203 of the embodiment may specifically include: dividing the sample speech data into a preset number of data blocks, and extracting the second speech features and second text features of each data block; and training, according to the second speech features, the second text features and the speech recognition model, a meta-learning adaptation model meeting the second training standard. It should be noted that the second speech features and second text features in this step differ from the first speech features and first text features in step 202: the second speech features and second text features correspond to the feature sequences of the individual data blocks into which the sample speech data is divided, whereas the first speech features and first text features correspond to the feature sequence of the entire sample speech data.
In the process of training the meta-learning adaptation model according to the second speech features, the second text features and the speech recognition model, reference may be made to the schematic flowchart of the training process shown in FIG. 3. If the current data block is the first divided data block, the initial model parameters of the pre-trained speech recognition model (pre-trained model 1) and the second speech features and second text features of data block 1 can be used to compute the loss value, loss gradient and new model parameters of the meta-learning adaptation model on data block 1; the new model parameters replace the initial model parameters of pre-trained model 1, yielding model 2 corresponding to the speech recognition model. Further, according to the model parameters of model 2 and the second speech features and second text features of data block 2, the loss value, loss gradient and new model parameters of the meta-learning adaptation model on data block 2 are computed, and the new model parameters obtained at this time replace those of model 2, yielding model 3 corresponding to the speech recognition model. Likewise, according to the model parameters of model 3 and the second speech features and second text features of data block 3, the loss value, loss gradient and new model parameters on data block 3 are computed, and the new parameters replace those of model 3, yielding model 4, and so on, until the training of the meta-learning adaptation model on all data blocks is completed. Finally, through the training on all data blocks, the second loss function of the meta-learning adaptation model can be computed, and whether the meta-learning adaptation model passes the training is determined from this loss function; when it passes, the model parameters determined on the last data block can be taken as the new model parameters of the speech recognition model in the testing phase.
Correspondingly, the specific implementation of training the meta-learning adaptation model according to the second speech features, the second text features and the speech recognition model may be as follows: extract the initial model parameters of the speech recognition model; if the current data block is determined to be the first divided data block, compute, according to the initial model parameters and the second speech features and second text features of the current data block, the loss value, loss gradient and new model parameters of the meta-learning adaptation model on the first data block; if the current data block is not the first data block, compute, according to the new model parameters of the previous data block and the second speech features and second text features of the current data block, the loss value, loss gradient and new model parameters of the meta-learning adaptation model on the current data block; if it is determined that all data blocks have completed training, determine the second loss function of the meta-learning adaptation model from the loss values, loss gradients and new model parameters computed on the data blocks; and if the second loss function is determined to be smaller than a second preset threshold, determine that the meta-learning adaptation model meets the second training standard. The second preset threshold may be set according to the actual training accuracy requirement and should be a value greater than 0 and less than 1; the larger the value, the higher the training accuracy of the meta-learning adaptation model.
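The chained per-block procedure above can be sketched as follows; for brevity, a plain gradient step stands in for the meta-learner's parameter update, and the scalar linear model and toy data are illustrative assumptions:

```python
import numpy as np

def loss_and_grad(theta, x, y):
    # Mean squared error of a scalar linear model f(x; theta) = theta * x.
    pred = theta * x
    return float(np.mean((pred - y) ** 2)), float(np.mean(2 * (pred - y) * x))

def train_over_blocks(blocks, theta0, lr=0.1):
    """Adapt on block c, then score the new parameters on block c + 1.
    The second loss function J accumulates the next-block losses."""
    theta = theta0
    J = 0.0
    for c in range(len(blocks) - 1):
        x_c, y_c = blocks[c]
        _, grad = loss_and_grad(theta, x_c, y_c)
        theta = theta - lr * grad                 # "new model parameters" after block c
        x_next, y_next = blocks[c + 1]
        loss_next, _ = loss_and_grad(theta, x_next, y_next)
        J += loss_next                            # measured on the next data block
    return theta, J

rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = 2.0 * x                                       # the "true" parameter is 2.0
blocks = [(x[i:i + 10], y[i:i + 10]) for i in range(0, 40, 10)]
theta_final, J = train_over_blocks(blocks, theta0=0.0)
# theta_final moves from 0.0 toward 2.0 as the blocks are processed in turn,
# mirroring the model 1 -> model 2 -> model 3 -> model 4 chain.
```

In the actual scheme, minimizing J by backpropagating through this chain is what trains the meta-learner; here J is only evaluated, not differentiated.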
The second loss function can be characterized as:

J(θ) = Σ_c L(y_{c+1}, f(x_{c+1}; θ′_c))

where J is the second loss function, y_{c+1} is the second text feature of data block c+1, x_{c+1} is the second speech feature of data block c+1, θ′_c is the new model parameters computed by the meta-learning adaptation model on data block c, and L(y_{c+1}, f(x_{c+1}; θ′_c)) is the loss value computed by the meta-learning adaptation model on data block c+1.
In the present application, the network structure of the meta-learning adaptation model may adopt a two-layer Long Short-Term Memory network (LSTM). The input of the first LSTM layer at time t includes the original model parameters θ_t, the cross-entropy loss L_t on the data block, and its corresponding gradient ∇_{θ_t}L_t. The first LSTM layer produces a hidden representation h_t of this input. Feeding h_t into the second LSTM layer yields the forget-gate parameters f_t and the input-gate parameters i_t, which, combined with the original parameters θ_t and the corresponding gradient, give the new model parameters:

θ_{t+1} = f_t ⊙ θ_t − i_t ⊙ ∇_{θ_t}L_t
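The gate-based combination of old parameters and gradient can be sketched as follows. The per-coordinate gate inputs and the weight vectors `w_f`, `w_i` are simplifying assumptions standing in for what the two-layer LSTM would learn; the hidden representation h_t is elided:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_update(theta, grad, loss, w_f, w_i):
    """Sketch of the update theta_{t+1} = f_t * theta_t - i_t * grad_t.
    The forget gate f_t and input gate i_t are produced per coordinate from
    the state [theta_t, grad_t, loss_t]; w_f and w_i stand in for the gate
    weights the two-layer LSTM would learn."""
    state = np.stack([theta, grad, np.full_like(theta, loss)], axis=1)  # (n, 3)
    f_t = sigmoid(state @ w_f)  # how much of the old parameters to keep
    i_t = sigmoid(state @ w_i)  # per-coordinate learned step size
    return f_t * theta - i_t * grad
```

With f_t near 1 and i_t small, the update reduces to a conservative gradient step; the gates let the meta-learner modulate both retention and step size per parameter.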
204. Based on the meta-learning adaptation model, adjust the initial model parameters of the speech recognition model to target model parameters matching the target speech type.
For this embodiment, after a meta-learning adaptation model meeting the preset training standard has been trained, it can be used to adaptively adjust the model parameters of the speech recognition model according to the actual application scenario. Specifically, to obtain target model parameters matching the target speech type, a small amount of adaptation speech data matching the target speech type is extracted; the initial model parameters of the speech recognition model, together with the loss and loss gradient on the adaptation data, are fed into the meta-learning adaptation model, which outputs the target model parameters of the speech recognition model matching the target speech type. The adaptation data must belong to the same speech type as the target speech to be recognized. For example, if the speech recognition of the present application is applied to infant education, the target speech to be recognized is that of infants, so when the meta-learning adaptation model adjusts the model parameters of the recognition model, a small amount of infant speech data can be used as adaptation data to determine target model parameters matching the infant speech type, yielding a speech recognition model with higher accuracy on infant speech.
As another example, if the speech recognition of the present application is applied to a mass-market intelligent speech service, the target speech to be recognized comes from people in various regions. To avoid the influence of regional accents on recognition, the adaptation speech data can be selected according to the region to be recognized: if that region is Northeast China, a small amount of Northeastern speech data can be selected as adaptation data, so as to obtain a speech recognition model that can eliminate accent interference.
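The adaptation step described in these examples can be sketched as follows, under the assumption that a trained meta-learner is available as a callable `meta_update` and that a toy linear model stands in for the recognizer; only a small adaptation set of the target speech type is used:

```python
import numpy as np

def adapt_to_target(theta0, adapt_x, adapt_y, meta_update, n_steps=3):
    """Starting from the pretrained initial parameters theta0, compute the
    loss and loss gradient on a small adaptation set of the target speech
    type and feed them through the trained meta-learner `meta_update` to
    obtain target model parameters. The linear model f(x) = x @ theta is a
    stand-in for the recognizer."""
    theta = theta0.copy()
    for _ in range(n_steps):
        pred = adapt_x @ theta
        loss = float(np.mean((pred - adapt_y) ** 2))
        grad = 2.0 * adapt_x.T @ (pred - adapt_y) / len(adapt_y)
        theta = meta_update(theta, grad, loss)  # meta-learner emits new parameters
    return theta
```

For instance, a plain gradient-descent stand-in for the meta-learner would be `meta_update = lambda th, g, l: th - 0.2 * g`.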
205. Use the speech recognition model configured with the target model parameters to recognize the target speech of the target speech type.
After the target model parameters matching the adaptation speech data have been determined, they can be written into the speech recognition model, and the updated model can then be used to recognize the target speech of the target speech type accurately, achieving a better recognition result than the pretrained speech recognition model alone.
With the above meta-learning-based adaptive speech recognition method, the present application first pretrains the speech recognition model on sample speech data, and further treats the speaker adaptation task as a task within meta-learning, designing a meta-learning adaptation model that adjusts the model parameters of the speech recognition model for different speech types. When the target speech type is determined, the initial model parameters of the speech recognition model are adjusted, via the meta-learning adaptation model, to target model parameters matching the target speech type, and the speech recognition model configured with those parameters then performs targeted, accurate recognition of the target speech. In the present application, using a meta-learning adaptation model to adjust the recognizer's parameters adaptively not only reduces the instability of manual design but also allows the parameter update to be tailored to different application scenarios, thereby ensuring recognition accuracy. Moreover, only a small amount of adaptation speech data is needed when the meta-learning adaptation model determines the recognizer's parameters; although little adaptation data would ordinarily make the updated parameters prone to overfitting the application scenario, the meta-learned update reduces this overfitting risk.
For this embodiment, in a specific application scenario, the speech recognition system may follow the flow of the meta-learning-based adaptive speech recognition system shown in FIG. 4. Specifically, after the sample speech data is determined and preprocessed, feature extraction is performed; the extracted first speech features and first text features are used to pretrain the speech recognition model, and the meta-learning adaptation model is then trained from the second speech features and second text features of the sample speech data together with the original model parameters of the speech recognition model. Afterwards, adaptation speech data matching the target speech type is used: its speech and text features are extracted and fed, together with the initial model parameters of the speech recognition model, into the trained meta-learning adaptation model to obtain target model parameters matching the target speech type. The initial model parameters of the speech recognition model are updated to the target model parameters, and the updated speech recognition model (the speaker-adapted model) is then used to recognize the target speech of the target speech type.
Further, as a specific implementation of the methods shown in FIG. 1 and FIG. 2, an embodiment of the present application provides a meta-learning-based adaptive speech recognition apparatus. As shown in FIG. 5, the apparatus includes a training module 31, an adjustment module 32 and a recognition module 33.

The training module 31 can be used to train the speech recognition model and the meta-learning adaptation model with the preprocessed sample speech data.

The adjustment module 32 can be used to adjust, based on the meta-learning adaptation model, the initial model parameters of the speech recognition model to target model parameters matching the target speech type.

The recognition module 33 can be used to recognize the target speech of the target speech type with the speech recognition model configured with the target model parameters.
In a specific application scenario, to obtain the speech recognition model and the meta-learning adaptation model through pretraining, as shown in FIG. 6, the training module 31 may specifically include a processing unit 311, a first training unit 312 and a second training unit 313.

The processing unit 311 can be used to preprocess the sample speech data and label the first speech features and first text features corresponding to the sample speech data; the preprocessing includes at least pre-emphasis, framing and windowing.
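A minimal sketch of the three preprocessing steps named above (pre-emphasis, framing, windowing); the frame length, hop size, pre-emphasis coefficient, and Hamming window are illustrative defaults (25 ms / 10 ms at 16 kHz), not values fixed by the present application:

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing, and windowing of a 1-D speech signal.
    Returns an array of shape (n_frames, frame_len)."""
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: slice into overlapping frames, zero-padding the tail
    n_frames = 1 + max(0, int(np.ceil((len(emphasized) - frame_len) / hop)))
    padded = np.zeros(frame_len + (n_frames - 1) * hop)
    padded[:len(emphasized)] = emphasized
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = padded[idx]
    # Windowing: taper each frame to reduce spectral leakage
    return frames * np.hamming(frame_len)
```

One second of 16 kHz audio thus yields 99 frames of 400 samples each, ready for feature extraction.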
The first training unit 312 can be used to train a speech recognition model meeting the first training standard from the first speech features and first text features.

The second training unit 313 can be used to train a meta-learning adaptation model meeting the second training standard from the sample speech data and the speech recognition model.
Correspondingly, the first training unit 312 can specifically be used to input the first speech features into the speech recognition model to obtain a text output result; compute the first loss function from the text output result and the first text features; and, if the first loss function is smaller than a first preset threshold, judge that the speech recognition model meets the first training standard.
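This first-training-standard check can be sketched as follows, assuming the model's text output takes the form of per-frame vocabulary logits and using cross-entropy as the first loss function; the threshold value is illustrative:

```python
import numpy as np

def passes_first_training_standard(logits, labels, threshold=0.1):
    """Compute the first loss function (mean cross-entropy between the
    model's text output and the labeled first text features) and compare it
    with the first preset threshold. Returns (loss, passed)."""
    # Softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Mean negative log-probability of the correct labels
    loss = float(-np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean())
    return loss, loss < threshold
```

A model producing near-uniform outputs fails the check (loss near log of the vocabulary size), while confident, correct outputs pass.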
In a specific application scenario, the second training unit 313 can specifically be used to divide the sample speech data into a preset number of data blocks, extract the second speech features and second text features of each data block, and train a meta-learning adaptation model meeting the second training standard from the second speech features, the second text features and the speech recognition model.

Correspondingly, when training a meta-learning adaptation model meeting the second training standard from the second speech features, the second text features and the speech recognition model, the second training unit 313 can specifically be used to: extract the initial model parameters of the speech recognition model; if the current data block is the first divided data block, calculate the loss value, loss gradient and new model parameters of the meta-learning adaptation model on the first data block from the initial model parameters and the second speech and text features of that block; if the current data block is not the first, calculate the loss value, loss gradient and new model parameters on the current block from the previous block's new model parameters and the current block's second speech and text features; once all data blocks have been trained, use the loss values, loss gradients and new model parameters computed on each block to determine the second loss function of the meta-learning adaptation model; and, if the second loss function is smaller than the second preset threshold, judge that the meta-learning adaptation model meets the second training standard.
The second loss function can be characterized as:

J(θ) = Σ_c L(y_{c+1}, f(x_{c+1}; θ′_c))

where J is the second loss function, y_{c+1} is the second text feature of data block c+1, x_{c+1} is the second speech feature of data block c+1, θ′_c is the new model parameters computed by the meta-learning adaptation model on data block c, and L(y_{c+1}, f(x_{c+1}; θ′_c)) is the loss value computed by the meta-learning adaptation model on data block c+1.
In a specific application scenario, to determine target model parameters matching the target speech type based on the meta-learning adaptation model, as shown in FIG. 6, the adjustment module 32 may specifically include an extraction unit 321 and an acquisition unit 322.

The extraction unit 321 is used to extract adaptation speech data matching the target speech type.

The acquisition unit 322 is used to input the initial model parameters of the speech recognition model and the adaptation speech data into the meta-learning adaptation model to obtain target model parameters matching the target speech type.

Correspondingly, the recognition module 33 can specifically be used to update the initial model parameters of the speech recognition model to the target model parameters, so that the updated speech recognition model recognizes the target speech of the target speech type.
It should be noted that, for other corresponding descriptions of the functional units involved in the meta-learning-based adaptive speech recognition apparatus provided in this embodiment, reference may be made to the corresponding descriptions of FIG. 1 to FIG. 2, which are not repeated here.

Based on the methods shown in FIG. 1 to FIG. 2, this embodiment correspondingly further provides a storage medium storing computer-readable instructions (or a computer program) which, when executed by a processor, implement the meta-learning-based adaptive speech recognition method shown in FIG. 1 to FIG. 2.

Optionally, the storage medium involved in this application may be a readable storage medium, also called a computer-readable storage medium. It may be non-volatile, such as a non-volatile readable storage medium, or volatile, such as a volatile readable storage medium.

Based on this understanding, the technical solution of the present application can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods of the various implementation scenarios of the present application.

Based on the methods shown in FIG. 1 to FIG. 2 and the virtual apparatus embodiments shown in FIG. 5 and FIG. 6, this embodiment further provides a computer device comprising a storage medium and a processor. The storage medium, which may be non-volatile, stores a computer program; the processor executes the computer program to implement the meta-learning-based adaptive speech recognition method shown in FIG. 1 to FIG. 2. For example, the processor may be configured to: train the speech recognition model and the meta-learning adaptation model with preprocessed sample speech data; adjust, based on the meta-learning adaptation model, the initial model parameters of the speech recognition model to target model parameters matching the target speech type; and recognize the target speech of the target speech type with the speech recognition model configured with the target model parameters. Optionally, the processor may also implement the other steps of the method in the foregoing embodiments, which are not repeated here.
Optionally, the computer device may further include a user interface, a network interface, a camera, a radio frequency (RF) circuit, sensors, an audio circuit, a Wi-Fi module, and the like. The user interface may include a display, an input unit such as a keyboard, and optionally a USB interface, a card-reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (such as a Wi-Fi interface), and the like.

Those skilled in the art can understand that the computer-device structure provided in this embodiment does not limit the physical device, which may include more or fewer components, combine certain components, or arrange components differently.

The non-volatile storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device and supports the operation of the information-processing program and other software and/or programs. The network communication module is used to implement communication between the components inside the non-volatile storage medium and with other hardware and software in the information-processing device.

From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform, or by hardware.

By applying the technical solution of the present application, compared with the prior art, the speech recognition model is first pretrained on sample speech data, and the speaker adaptation task is further treated as a task within meta-learning, with a meta-learning adaptation model designed to adjust the model parameters of the speech recognition model for different speech types. When the target speech type is determined, the initial model parameters of the speech recognition model are adjusted, via the meta-learning adaptation model, to target model parameters matching the target speech type, and the speech recognition model configured with those parameters then performs targeted, accurate recognition of the target speech. Using a meta-learning adaptation model to adjust the recognizer's parameters adaptively not only reduces the instability of manual design but also allows the parameter update to be tailored to different application scenarios, thereby ensuring recognition accuracy. Moreover, only a small amount of adaptation speech data is needed; although little adaptation data would ordinarily make the updated parameters prone to overfitting the application scenario, the meta-learned update reduces this overfitting risk.
Those skilled in the art can understand that the accompanying drawings are only schematic diagrams of preferred implementation scenarios, and the modules or processes in the drawings are not necessarily required to implement the present application. Those skilled in the art can also understand that the modules of the apparatus in an implementation scenario may be distributed in the apparatus as described, or may be located, with corresponding changes, in one or more apparatuses different from this implementation scenario; the modules of the above implementation scenarios may be combined into one module or further split into multiple sub-modules.

The above serial numbers of the present application are for description only and do not represent the merits of the implementation scenarios. The above disclosure covers only a few specific implementation scenarios of the present application; however, the present application is not limited thereto, and any variation conceivable by those skilled in the art shall fall within the protection scope of the present application.
Claims (20)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011434900.9 | 2020-12-10 | ||
| CN202011434900.9A CN112562648A (en) | 2020-12-10 | 2020-12-10 | Adaptive speech recognition method, apparatus, device and medium based on meta learning |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022121155A1 true WO2022121155A1 (en) | 2022-06-16 |
Family
ID=75060346
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/083002 Ceased WO2022121155A1 (en) | 2020-12-10 | 2021-03-25 | Meta learning-based adaptive speech recognition method and apparatus, device and medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN112562648A (en) |
| WO (1) | WO2022121155A1 (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116052649A (en) * | 2023-01-10 | 2023-05-02 | 厦门大学 | Loss weight self-adaptive element learning method in low-resource speech recognition |
| CN116090553A (en) * | 2023-04-10 | 2023-05-09 | 环球数科集团有限公司 | Artificial intelligence automatic processing system based on meta learning |
| CN116631436A (en) * | 2023-04-06 | 2023-08-22 | 平安健康保险股份有限公司 | Gender recognition model processing method, device, computer equipment and storage medium |
Families Citing this family (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113052324B (en) * | 2021-03-24 | 2022-08-02 | 支付宝(杭州)信息技术有限公司 | User abnormal pattern recognition method, device and equipment |
| CN114038465B (en) * | 2021-04-28 | 2022-08-23 | 北京有竹居网络技术有限公司 | Voice processing method and device and electronic equipment |
| CN113838466B (en) * | 2021-06-16 | 2024-02-06 | 腾讯科技(深圳)有限公司 | Speech recognition method, device, equipment and storage medium |
| CN115691464A (en) * | 2021-07-30 | 2023-02-03 | 佛山市顺德区美的电子科技有限公司 | Model training method, speech synthesis method, device, terminal and storage medium |
| CN113539246B (en) * | 2021-08-20 | 2022-10-18 | 贝壳找房(北京)科技有限公司 | Voice recognition method and device |
| CN114453852A (en) * | 2022-02-16 | 2022-05-10 | 上海海事大学 | Method and system for controlling mechanical arm to assemble blade based on voice recognition |
| CN115132171B (en) * | 2022-06-28 | 2024-10-29 | 中国人民解放军战略支援部队信息工程大学 | Task-based focus loss-improving multilingual element learning voice recognition method |
| CN119741917A (en) * | 2023-09-25 | 2025-04-01 | 荣耀终端股份有限公司 | Speech recognition method and electronic equipment |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7885812B2 (en) * | 2006-11-15 | 2011-02-08 | Microsoft Corporation | Joint training of feature extraction and acoustic model parameters for speech recognition |
| CN108777146A (en) * | 2018-05-31 | 2018-11-09 | 平安科技(深圳)有限公司 | Speech model training method, method for distinguishing speek person, device, equipment and medium |
| CN111243576A (en) * | 2020-01-16 | 2020-06-05 | 腾讯科技(深圳)有限公司 | Speech recognition and model training method, device, equipment and storage medium |
| CN111312256A (en) * | 2019-10-31 | 2020-06-19 | 平安科技(深圳)有限公司 | Voice identity recognition method and device and computer equipment |
| CN111613212A (en) * | 2020-05-13 | 2020-09-01 | 携程旅游信息技术(上海)有限公司 | Speech recognition method, system, electronic device and storage medium |
| CN111727473A (en) * | 2018-02-22 | 2020-09-29 | 索尼公司 | Information processing apparatus, information processing method, and program |
| CN111916067A (en) * | 2020-07-27 | 2020-11-10 | 腾讯科技(深圳)有限公司 | Training method and device of voice recognition model, electronic equipment and storage medium |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111062495B (en) * | 2019-11-28 | 2024-03-19 | 深圳市华尊科技股份有限公司 | Machine learning method and related device |
| CN111724083B (en) * | 2020-07-21 | 2023-10-13 | 腾讯科技(深圳)有限公司 | Training method and device for financial risk identification model, computer equipment and medium |
- 2020-12-10: CN application CN202011434900.9A, published as CN112562648A (status: active, Pending)
- 2021-03-25: PCT application PCT/CN2021/083002, published as WO2022121155A1 (status: not active, Ceased)
Also Published As
| Publication number | Publication date |
|---|---|
| CN112562648A (en) | 2021-03-26 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21901872; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21901872; Country of ref document: EP; Kind code of ref document: A1 |