CN118136001A - Speech recognition model training method, device, equipment and storage medium
- Publication number: CN118136001A
- Application number: CN202410184786.0A
- Authority: CN
- Country: China
- Prior art keywords: training, acoustic, original, signal data, model
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; no legal analysis has been performed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/01—Assessment or evaluation of speech recognition systems
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/26—Speech to text systems
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
Abstract
The invention relates to the field of speech technology, and in particular to a speech recognition model training method, device, equipment and storage medium. The method comprises: obtaining original voice signal data of different users and encrypting the original voice signal data to form an original encrypted training library; extracting features from the original voice signal data in the original encrypted training library through an acoustic feature extraction algorithm to obtain Mel frequency cepstrum coefficients and filter bank features; pre-constructing an acoustic training model and performing rolling training on it with the Mel frequency cepstrum coefficients and filter bank features to obtain a plurality of acoustic feature mapping results; obtaining corresponding text data from each acoustic feature mapping result; and verifying each piece of text data to obtain the optimal acoustic training model as the speech recognition model. By processing the voice signals in advance, the method reduces the training difficulty of the overall model.
Description
Technical Field
The present invention relates to the field of speech technology, and in particular, to a method, apparatus, device, and storage medium for training a speech recognition model.
Background
When recognizing complex speech signals, the main difficulties are usually noise, echo and environmental interference. A speech recognition model must also adapt to different speech characteristics, including different accents, speaking rates, pronunciations and speaking styles, and different users may use different words or phrases to express the same meaning. All of these factors require large-scale training of the speech recognition model, which consumes a large amount of computing resources and data. Such large-scale training is difficult to carry out in resource-constrained environments, so the overall training difficulty is high.
It can be seen that there is a need for improvement in the art.
Disclosure of Invention
In order to overcome the above defects of the prior art, the invention aims to provide a speech recognition model training method, device, equipment and storage medium that process the voice signals in advance, thereby improving the recognition performance of the trained model and effectively reducing the overall training difficulty.
The first aspect of the present invention provides a method for training a speech recognition model, comprising: acquiring original voice signal data of different users, and carrying out encryption processing on the original voice signal data to form an original encryption training library; extracting features of original voice signal data in an original encryption training library through an acoustic feature extraction algorithm to obtain a Mel frequency cepstrum coefficient and a filter bank feature; pre-constructing an acoustic training model, and performing rolling training on the acoustic training model through a Mel frequency cepstrum coefficient and a filter bank characteristic to obtain a plurality of acoustic feature mapping results; acquiring corresponding text data according to each acoustic feature mapping result; and performing result verification on each text data to obtain the optimal text data and an acoustic training model corresponding to the optimal text data, and taking the acoustic training model as a voice recognition model.
Optionally, in a first implementation manner of the first aspect of the present invention, the obtaining original speech signal data of different users and performing encryption processing on the original speech signal data to form an original encrypted training library includes: encrypting the original voice signal data of different users through an AES encryption algorithm to form encrypted voice data; constructing an encryption database based on the DBMS database, and configuring access rights to the encryption database; and synchronizing the encrypted voice data to the encrypted database after the access authority configuration to form an original encrypted training library.
Optionally, in a second implementation manner of the first aspect of the present invention, the feature extraction, by an acoustic feature extraction algorithm, of the original voice signal data in the original encrypted training library to obtain Mel frequency cepstrum coefficients and filter bank features includes: performing a Fourier transform on the original voice signal data in the original encrypted training library through the acoustic feature extraction algorithm to obtain a spectrum signal; performing modulus-square conversion on the spectrum signal to obtain a power spectrum signal; and performing Mel band conversion on the power spectrum signal using a Mel filter bank so as to obtain the Mel frequency cepstrum coefficients and filter bank features.
Optionally, in a third implementation manner of the first aspect of the present invention, before performing the Fourier transform on the original voice signal data in the original encrypted training library by the acoustic feature extraction algorithm to obtain the spectrum signal, the method further includes: decrypting and extracting the original voice signal data from the original encrypted training library; performing pre-emphasis processing on the original voice signal data through a high-pass filter so as to obtain frequency-flattened original voice signal data; framing the frequency-flattened original voice signal data so as to obtain framed original voice signal data; and windowing the framed original voice signal data to obtain windowed original voice signal data.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the pre-constructing an acoustic training model, and performing rolling training on the acoustic training model through mel frequency cepstrum coefficients and filter bank features to obtain a plurality of acoustic feature mapping results, includes: constructing an acoustic training model based on the hidden Markov model; generating an acoustic training set according to a preset weight proportion, a Mel frequency cepstrum coefficient and a filter bank characteristic; and performing rolling training on the acoustic training model according to the acoustic training set to obtain a plurality of different acoustic feature mapping results.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the obtaining corresponding text data according to each acoustic feature mapping result includes: performing association analysis on each acoustic feature mapping result to obtain word elements and phoneme elements; pre-constructing a vocabulary table and a pronunciation dictionary library; text data is retrieved from the vocabulary and pronunciation dictionary based on the word elements and the phoneme elements.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the performing result verification on each text data to obtain optimal text data and an acoustic training model corresponding to the text data, and taking the acoustic training model as a speech recognition model includes: obtaining comparison text information according to the original voice signal data; performing character comparison on each text data according to the comparison text information to obtain a plurality of similarity results; and according to the preset similarity threshold value, each similarity result is checked to obtain an optimal similarity result, and according to the optimal similarity result, a corresponding acoustic training model is obtained and used as a voice recognition model.
The second aspect of the present invention provides a speech recognition model training apparatus, comprising: the encryption module is used for acquiring the original voice signal data of different users and carrying out encryption processing on the original voice signal data to form an original encryption training library; the feature module is used for carrying out feature extraction on the original voice signal data in the original encryption training library through an acoustic feature extraction algorithm so as to obtain a Mel frequency cepstrum coefficient and a filter bank feature; the training module is used for pre-constructing an acoustic training model, and carrying out rolling training on the acoustic training model through the Mel frequency cepstrum coefficient and the filter bank characteristics so as to obtain a plurality of acoustic feature mapping results; the acquisition module is used for acquiring corresponding text data according to the mapping result of each acoustic feature; and the verification module is used for verifying the results of the text data to obtain the optimal text data and the acoustic training model corresponding to the optimal text data, and taking the acoustic training model as a voice recognition model.
Optionally, in a first implementation manner of the second aspect of the present invention, the encryption module includes: an encryption unit for encrypting original voice signal data of different users by an AES encryption algorithm to form encrypted voice data; the authority unit is used for constructing an encrypted database based on the DBMS database and configuring access authority of the encrypted database; and the configuration unit is used for synchronizing the encrypted voice data to the encrypted database after the access authority configuration so as to form an original encrypted training library.
Optionally, in a second implementation manner of the second aspect of the present invention, the feature module includes: the transformation unit is used for performing a Fourier transform on the original voice signal data in the original encrypted training library through an acoustic feature extraction algorithm so as to obtain a spectrum signal; the scaling unit is used for performing modulus-square conversion on the spectrum signal to obtain a power spectrum signal; and the conversion unit is used for performing Mel band conversion on the power spectrum signal using the Mel filter bank so as to obtain Mel frequency cepstrum coefficients and filter bank features.
Optionally, in a third implementation manner of the second aspect of the present invention, the feature module further includes: the decryption unit is used for decrypting and extracting the original voice signal data from the original encrypted training library; the pre-emphasis unit is used for performing pre-emphasis processing on the original voice signal data through the high-pass filter so as to obtain frequency-flattened original voice signal data; the framing unit is used for framing the frequency-flattened original voice signal data so as to obtain framed original voice signal data; and the windowing unit is used for windowing the framed original voice signal data to obtain windowed original voice signal data.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the training module includes: the building unit is used for building an acoustic training model based on the hidden Markov model; the generation unit is used for generating an acoustic training set according to the preset weight proportion, the Mel frequency cepstrum coefficient and the characteristics of the filter bank; and the training unit is used for carrying out rolling training on the acoustic training model according to the acoustic training set so as to obtain a plurality of different acoustic feature mapping results.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the acquiring module includes: the association unit is used for carrying out association analysis on each acoustic feature mapping result so as to obtain word elements and phoneme elements; the pre-building unit is used for pre-building a vocabulary table and a pronunciation dictionary library; and the retrieval unit is used for retrieving the text data from the vocabulary and the pronunciation dictionary base according to the word elements and the phoneme elements.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the verification module includes: the acquisition unit is used for acquiring comparison text information from the original voice signal data; the comparison unit is used for performing character comparison on each piece of text data against the comparison text information so as to obtain a plurality of similarity results; and the checking unit is used for checking each similarity result against a preset similarity threshold to obtain the optimal similarity result, and obtaining the corresponding acoustic training model according to the optimal similarity result as the speech recognition model.
A third aspect of the present invention provides a speech recognition model training device comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the speech recognition model training device to perform the steps of the speech recognition model training method described above.
A fourth aspect of the present invention provides a computer readable storage medium having instructions stored thereon which, when executed by a processor, implement the steps of the speech recognition model training method of any of the above.
According to the technical scheme, obtaining original voice signal data from different users improves the diversity of the training data sources and avoids over-fitting of the model during training; the original voice signal data is encrypted after acquisition, preventing illegal users from stealing the users' voice information and improving the safety of the original voice signal data in use. An acoustic feature extraction algorithm converts the original voice signal data into features the system can process, namely Mel frequency cepstrum coefficients and filter bank features, which capture the spectral and time-domain information in the speech. During model training, processing the voice signals in advance effectively reduces the training difficulty of the acoustic training model and improves its training effect. The output results of the different acoustic training models are obtained during training, the corresponding words or phonemes are mapped from each acoustic feature mapping result, the text data of the speech is formed from those words and phonemes, and each piece of text data is verified against the original voice signal data to judge whether its translation is accurate, so that the acoustic training model with the highest recognition accuracy is obtained as the speech recognition model.
Drawings
FIG. 1 is a first flowchart of a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 2 is a second flowchart of a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 3 is a third flowchart of a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 4 is a fourth flowchart of a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 5 is a fifth flowchart of a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 6 is a sixth flowchart of a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a speech recognition model training device according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of another structure of a speech recognition model training device according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a speech recognition model training device according to an embodiment of the present invention.
Detailed Description
The invention provides a speech recognition model training method, device, equipment and storage medium. Obtaining original voice signal data from different users improves the diversity of the training data sources and avoids over-fitting of the model during training; the original voice signal data is encrypted upon acquisition, preventing illegal users from stealing the users' voice information and improving the safety of the original voice signal data in use. An acoustic feature extraction algorithm converts the original voice signal data into features the system can process, namely Mel frequency cepstrum coefficients and filter bank features, which capture the spectral and time-domain information in the speech. During model training, processing the voice signals in advance effectively reduces the training difficulty of the acoustic training model and improves its training effect. The output results of the different acoustic training models are obtained during training, the corresponding words or phonemes are mapped from each acoustic feature mapping result, the text data of the speech is formed from those words and phonemes, and each piece of text data is verified against the original voice signal data to judge whether its translation is accurate, so that the acoustic training model with the highest recognition accuracy is obtained as the speech recognition model.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present invention is described below with reference to fig. 1, where an embodiment of a method for training a speech recognition model according to the embodiment of the present invention includes:
101. Acquiring original voice signal data of different users, and carrying out encryption processing on the original voice signal data to form an original encryption training library;
In this embodiment, original voice signal data of different users are obtained online or offline to improve the diversity of the original voice signal data and avoid over-fitting of the model during training. After the original voice signal data is obtained, it is encrypted to prevent illegal users from stealing the users' voice information (including voice content, voiceprint information and the like), improving the safety of the original voice signal data in use.
102. Extracting features of original voice signal data in an original encryption training library through an acoustic feature extraction algorithm to obtain a Mel frequency cepstrum coefficient and a filter bank feature;
In this embodiment, the acoustic feature extraction algorithm converts the original speech signal data into a representation the system can process, namely Mel frequency cepstrum coefficients and filter bank features, which together capture the spectral and time-domain information in the speech.
The Mel frequency cepstrum coefficients are the coefficients that make up the Mel frequency cepstrum, derived from the cepstrum of an audio segment. The difference between the cepstrum and the Mel frequency cepstrum is that the frequency bands of the latter are equally spaced on the Mel scale, which approximates the human auditory system more closely than the linearly spaced bands of the normal cepstrum. The filter bank features reflect the characteristics of the filters employed by the acoustic feature extraction algorithm.
103. Pre-constructing an acoustic training model, and performing rolling training on the acoustic training model through a Mel frequency cepstrum coefficient and a filter bank characteristic to obtain a plurality of acoustic feature mapping results;
104. Acquiring corresponding text data according to each acoustic feature mapping result;
105. Performing result verification on each text data to obtain the optimal text data and the acoustic training model corresponding to the optimal text data, and taking that acoustic training model as the speech recognition model;
In this embodiment, a training set is constructed from the Mel frequency cepstrum coefficients and filter bank features, and the acoustic training model is rolling-trained on this set. Because the voice signals have been processed in advance, the training difficulty of the acoustic training model is effectively reduced and its training effect improved. The output results of the different acoustic training models (namely the acoustic feature mapping results) are obtained during training, and the corresponding words or phonemes are mapped from each acoustic feature mapping result, so that the text data of the speech is formed from the words and phonemes. Each piece of text data is then verified against the original voice signal data to judge whether its translation is accurate, and the optimal text data and the corresponding acoustic training model are obtained according to the accuracy of each piece of text data, so that the acoustic training model with the highest recognition accuracy is taken as the speech recognition model. In addition, if the verification results fall short of expectations, the training process is adjusted and optimized according to those results to obtain an optimal speech recognition model.
In the embodiment of the invention, obtaining original voice signal data from different users improves the diversity of the training data sources and avoids over-fitting of the model during training; encrypting the original voice signal data after acquisition prevents illegal users from stealing the users' voice information and improves the safety of the original voice signal data in use. The acoustic feature extraction algorithm converts the original voice signal data into features the system can process, namely the Mel frequency cepstrum coefficients and filter bank features, which capture the spectral and time-domain information in the speech. During model training, processing the voice signals in advance effectively reduces the training difficulty of the acoustic training model and improves its training effect; the output results of the different acoustic training models are obtained during training, the corresponding words or phonemes are mapped from each acoustic feature mapping result, the text data of the speech is formed from those words and phonemes, and each piece of text data is verified against the original voice signal data to judge whether its translation is accurate, so that the acoustic training model with the highest recognition accuracy is obtained as the speech recognition model.
Referring to fig. 2, a second embodiment of a speech recognition model training method according to an embodiment of the present invention includes:
201. encrypting the original voice signal data of different users through an AES encryption algorithm to form encrypted voice data;
202. Constructing an encryption database based on the DBMS database, and configuring access rights to the encryption database;
203. And synchronizing the encrypted voice data to the encrypted database after the access authority configuration to form an original encrypted training library.
In this embodiment, when the original voice signal data of a user is obtained, it is encrypted with the AES encryption algorithm, which processes the original voice signal data with a plurality of round keys, preventing illegal users from brute-forcing the user's original voice signal data and improving the safety of the training source in use. In addition, the encryption function built into the DBMS is used to improve the security of the whole encrypted database, and access rights are configured on the encrypted database to prevent illegal users from maliciously entering the original encrypted training library and stealing other users' original voice signal data.
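As an illustrative sketch only (the patent specifies AES but no mode, key size or library), the encryption step might be implemented in Python with the third-party `cryptography` package, here assuming AES-GCM with a random per-record nonce:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_speech(raw_pcm: bytes, key: bytes) -> bytes:
    """Encrypt raw speech bytes; the 96-bit nonce is prepended to the ciphertext."""
    nonce = os.urandom(12)                       # fresh nonce for every record
    return nonce + AESGCM(key).encrypt(nonce, raw_pcm, None)

def decrypt_speech(blob: bytes, key: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None)

key = AESGCM.generate_key(bit_length=256)        # keep outside the DBMS, e.g. in a KMS
```

The encrypted blobs would then be inserted into the access-controlled database table; keeping the key outside the database means a database breach alone does not expose the voice data.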
Referring to fig. 3, a third embodiment of a speech recognition model training method according to an embodiment of the present invention includes:
301. Decrypting and extracting original voice signal data from an original encryption training library;
302. Pre-emphasis processing is carried out on the original voice signal data through a high-pass filter so as to obtain frequency-flattened original voice signal data;
In this embodiment, the original voice signal data used for training is extracted from the original encrypted training library by key decryption. A high-pass filter is then used to boost the high-frequency part of the original voice signal data so that the spectrum becomes flat and remains usable over the whole band from low to high frequency, allowing the spectrum to be computed with the same signal-to-noise ratio throughout. Pre-emphasis also eliminates the effects of the vocal cords and lips produced during phonation, compensates the high-frequency components of the voice signal suppressed by the vocal system, highlights the high-frequency formants, and improves the accuracy of feature conversion.
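A minimal sketch of this step, assuming the conventional first-order high-pass form y[n] = x[n] - a*x[n-1] with a = 0.97 (the patent does not fix a coefficient):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass filter that flattens the spectrum: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```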
303. Framing the frequency-flattened original voice signal data so as to obtain framed original voice signal data;
In this embodiment, N sampling points are first grouped into one observation unit, called a frame. Typically N is 256 or 512, covering roughly 20-30 ms. To avoid excessive variation between two adjacent frames, adjacent frames overlap by M sampling points, where M is usually about 1/2 or 1/3 of N. The speech signal used for speech recognition typically has a sampling frequency of 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 samples corresponds to 256/8000 × 1000 = 32 ms.
304. Windowing is carried out on the original voice signal data after framing so as to obtain windowed original voice signal data;
In this embodiment, after framing the signal, each frame is multiplied by the window function, and values outside the window are set to 0, eliminating the signal discontinuities that may arise at the two ends of each frame. Common window functions, chosen according to their frequency-domain characteristics, include the rectangular window, the Hamming window and the Hanning window.
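Framing and windowing together can be sketched as follows, assuming N = 256 samples per frame, an overlap of M = N/2, a Hamming window (one of the options listed above), and an input at least one frame long; these are illustrative defaults, not values fixed by the patent:

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Split the signal into overlapping frames (hop = N - M) and apply a Hamming window."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)   # values outside each window are effectively zeroed
```

At an 8 kHz sampling rate these defaults give 32 ms frames with 16 ms of overlap.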
305. Performing Fourier transform on the original voice signal data in the original encryption training library through an acoustic feature extraction algorithm to obtain a frequency spectrum signal;
306. Performing modular square conversion on the spectrum signal to obtain a power spectrum signal;
307. And carrying out Mel band conversion on the power spectrum signal by using a Mel filter group so as to obtain Mel frequency cepstrum coefficient and filter group characteristics.
In this embodiment, after the original speech signal data has been pre-processed, a Fourier transform is applied. Because the characteristics of a signal are usually hard to see from its time-domain waveform, the signal is converted into its energy distribution in the frequency domain, where different energy distributions can represent the characteristics of different voices. After multiplication by the Hamming window, each frame undergoes a fast Fourier transform to obtain the energy distribution over the spectrum; applying the fast Fourier transform to each framed, windowed signal yields the spectrum of each frame. Taking the modulus square of the spectrum of the voice signal gives its power spectrum. Finally, the power spectrum is passed through a Mel filter bank, a set of M triangular band-pass filters spaced on the Mel scale (M is chosen close to the number of critical bands). The triangular band-pass filters smooth the spectrum, eliminating harmonics, highlighting the formants of the original voice and reducing the amount of computation.
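Putting steps 305-307 together, a self-contained sketch of the spectrum, power-spectrum and Mel-band chain; the filter count, FFT size and the log/DCT details are conventional choices assumed here, not values taken from the patent:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular band-pass filters spaced evenly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fbank

def fbank_and_mfcc(frames: np.ndarray, sr=8000, n_fft=256, n_filters=26, n_mfcc=13):
    spectrum = np.fft.rfft(frames, n_fft)                 # step 305: Fourier transform
    power = (np.abs(spectrum) ** 2) / n_fft               # step 306: modulus square
    fb = power @ mel_filterbank(n_filters, n_fft, sr).T   # step 307: Mel band conversion
    log_fbank = np.log(fb + 1e-10)                        # filter bank features
    mfcc = dct(log_fbank, type=2, axis=1, norm='ortho')[:, :n_mfcc]  # cepstral coefficients
    return log_fbank, mfcc
```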
Referring to fig. 4, a fourth embodiment of a speech recognition model training method according to an embodiment of the present invention includes:
401. constructing an acoustic training model based on the hidden Markov model;
402. generating an acoustic training set according to a preset weight proportion, a Mel frequency cepstrum coefficient and a filter bank characteristic;
403. and performing rolling training on the acoustic training model according to the acoustic training set to obtain a plurality of different acoustic feature mapping results.
In this embodiment, an acoustic training model with automatic analysis capability is constructed using a hidden Markov model. The Mel frequency cepstrum coefficients and filter bank features are combined according to a preset weight proportion to form an acoustic training set, and the acoustic training model is then rolling-trained on this set over successive cycles, so as to obtain the acoustic feature mapping results produced by a plurality of different models. A sketch of how the two feature streams might be combined and batched for rolling training follows the next paragraph.
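As promised above, a sketch of the weighted training set and rolling batches; the 0.6/0.4 split and the window/step sizes are illustrative assumptions, not values from the patent:

```python
import numpy as np

def build_training_set(mfcc: np.ndarray, log_fbank: np.ndarray,
                       w_mfcc: float = 0.6, w_fbank: float = 0.4) -> np.ndarray:
    """Concatenate the weighted feature streams frame by frame."""
    return np.hstack([w_mfcc * mfcc, w_fbank * log_fbank])

def rolling_batches(features: np.ndarray, window: int = 1000, step: int = 500):
    """Yield overlapping windows of frames, one per rolling-training cycle."""
    for start in range(0, max(1, len(features) - window + 1), step):
        yield features[start:start + window]
```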
It should be noted that the process employed by the hidden Markov model is a stochastic process with the Markov property, hence the name Markov process. A Markov process is one in which states evolve continuously; after modelling it, the result is called a Markov model. When each state transition depends only on the previous n states, the model is called an n-order Markov model, where n is the number of states influencing the transition. The simplest Markov process is a first-order process, in which each state transition depends only on the state immediately before it. A Markov chain is a Markov process with discrete states: a series of random variables S1, …, St, whose range (the set of all their possible values) is called the "state space", St being the state at time t. A Markov chain with N states has N² possible state transitions; the probability of each, namely the probability of moving from one state to another, is called a state transition probability. A hidden Markov model describes a Markov process with hidden, unknown parameters. In an ordinary Markov model, each state represents an observable event and the states are directly visible to the observer, so the state transition probabilities are the only parameters. In a hidden Markov model the states are not directly visible; only some variables influenced by the states are visible, and each state has a probability distribution over the symbols it may output, so the sequence of output symbols reveals some information about the sequence of states.
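As a toy illustration of the machinery described above (a hypothetical two-state model, not the patent's acoustic model), the forward algorithm computes the probability that an HMM produced a given observation sequence:

```python
import numpy as np

def forward(obs, pi, A, B) -> float:
    """P(observation sequence | HMM) via the forward recursion.
    pi: initial state probabilities (N,); A: state transition matrix (N, N);
    B: emission probabilities (N, n_symbols); obs: list of observed symbol indices."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate through transitions, then emit
    return float(alpha.sum())

pi = np.array([0.6, 0.4])               # two hidden states
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])             # N^2 = 4 state transition probabilities
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])             # each state's distribution over output symbols
print(forward([0, 1, 0], pi, A, B))
```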
Referring to fig. 5, a fifth embodiment of a speech recognition model training method according to an embodiment of the present invention includes:
501. Performing association analysis on each acoustic feature mapping result to obtain word elements and phoneme elements;
502. pre-constructing a vocabulary table and a pronunciation dictionary library;
503. Text data is retrieved from the vocabulary and pronunciation dictionary based on the word elements and the phoneme elements.
In this embodiment, the mapping relation corresponding to each acoustic feature mapping result is found to obtain the retrieval conditions (namely the word elements and phoneme elements). The word elements and phoneme elements are then used to look up the corresponding text data in the pre-constructed vocabulary and pronunciation dictionary library, giving the recognition result of the speech; the accuracy of the speech recognition can then be judged from this text data.
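A minimal sketch of the lookup in step 503, with a hypothetical two-entry pronunciation dictionary (the real vocabulary and pronunciation dictionary library are pre-constructed and far larger):

```python
# Hypothetical pronunciation dictionary: phoneme tuple -> word.
pronunciation_dict = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"):   "world",
}

def phonemes_to_text(phoneme_groups) -> str:
    """Map each decoded phoneme group to a word; unknown groups are flagged."""
    return " ".join(pronunciation_dict.get(tuple(g), "<unk>") for g in phoneme_groups)

print(phonemes_to_text([["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]]))  # "hello world"
```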
Referring to fig. 6, a sixth embodiment of a speech recognition model training method according to an embodiment of the present invention includes:
601. obtaining comparison text information according to the original voice signal data;
602. Performing character comparison on each text data according to the comparison text information to obtain a plurality of similarity results;
603. and according to the preset similarity threshold value, each similarity result is checked to obtain an optimal similarity result, and according to the optimal similarity result, a corresponding acoustic training model is obtained and used as a voice recognition model.
In this embodiment, when verifying the results, comparison text information is obtained in advance from the original speech signal data as the textual reference. The comparison text information is compared with each piece of text data to obtain a similarity result for each; the similarity results serve as the evaluation standard for the acoustic training models, and the acoustic training model with the highest score is taken as the speech recognition model.
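Steps 601-603 could be sketched with a simple character-level similarity measure; `difflib.SequenceMatcher` from the Python standard library stands in for whatever comparison the patent intends, and the 0.8 threshold is illustrative:

```python
from difflib import SequenceMatcher

def best_model(reference: str, hypotheses: dict, threshold: float = 0.8):
    """Score each model's transcript against the reference text and return the
    best-scoring model name, or None if nothing clears the threshold."""
    scores = {name: SequenceMatcher(None, reference, text).ratio()
              for name, text in hypotheses.items()}
    name, score = max(scores.items(), key=lambda kv: kv[1])
    return (name if score >= threshold else None), score

print(best_model("hello world", {"model_a": "hello word", "model_b": "hollow wood"}))
```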
The method for training a speech recognition model in the embodiment of the present invention is described above, and the device for training a speech recognition model in the embodiment of the present invention is described below, referring to fig. 7, where an embodiment of the device for training a speech recognition model in the embodiment of the present invention includes:
the encryption module 701 is configured to obtain original voice signal data of different users, and encrypt the original voice signal data to form an original encrypted training library;
The feature module 702 is configured to perform feature extraction on original speech signal data in an original encrypted training library by using an acoustic feature extraction algorithm, so as to obtain mel frequency cepstrum coefficients and filter bank features;
The training module 703 is configured to pre-construct an acoustic training model, and perform rolling training on the acoustic training model through mel frequency cepstrum coefficients and filter bank features, so as to obtain a plurality of acoustic feature mapping results;
The acquiring module 704 is configured to acquire corresponding text data according to each acoustic feature mapping result;
and the verification module 705 is configured to perform result verification on each text data to obtain optimal text data and an acoustic training model corresponding to the optimal text data, and take the acoustic training model as a speech recognition model.
In this embodiment, the encryption module 701 obtains original voice signal data of different users, improving the diversity of the training data sources and avoiding over-fitting of the model during training, and encrypts the original voice signal data upon acquisition, preventing illegal users from stealing the users' voice information and improving the safety of the original voice signal data in use. The feature module 702 converts the original voice signal data into features the system can process using an acoustic feature extraction algorithm, namely Mel frequency cepstrum coefficients and filter bank features, which capture the spectral and time-domain information in the voice. During model training, processing the voice signals in advance effectively reduces the training difficulty of the acoustic training model and improves its training effect. The training module 703 and the acquiring module 704 obtain the output results of the different acoustic training models during training and map the corresponding words or phonemes from each acoustic feature mapping result, so that the text data of the voice is formed from the words and phonemes. The verification module 705 verifies each piece of text data against the original voice signal data to judge whether its translation is accurate, so that the acoustic training model with the highest recognition accuracy is obtained as the speech recognition model.
Referring to fig. 8, another embodiment of a speech recognition model training apparatus according to an embodiment of the present invention includes:
the encryption module 701 is configured to obtain original voice signal data of different users, and encrypt the original voice signal data to form an original encrypted training library;
The feature module 702 is configured to perform feature extraction on original speech signal data in an original encrypted training library by using an acoustic feature extraction algorithm, so as to obtain mel frequency cepstrum coefficients and filter bank features;
The training module 703 is configured to pre-construct an acoustic training model, and perform rolling training on the acoustic training model through mel frequency cepstrum coefficients and filter bank features, so as to obtain a plurality of acoustic feature mapping results;
The acquiring module 704 is configured to acquire corresponding text data according to each acoustic feature mapping result;
and the verification module 705 is configured to perform result verification on each text data to obtain optimal text data and an acoustic training model corresponding to the optimal text data, and take the acoustic training model as a speech recognition model.
In this embodiment, the encryption module 701 includes: an encryption unit 7011 for encrypting original voice signal data of different users by an AES encryption algorithm to form encrypted voice data; a rights unit 7012 for constructing an encrypted database based on the DBMS database and performing access rights configuration on the encrypted database; the configuration unit 7013 is configured to synchronize the encrypted voice data to the encrypted database after the access rights are configured, so as to form an original encrypted training library.
In this embodiment, the feature module 702 includes: the transformation unit 7025, configured to perform a Fourier transform on the original speech signal data in the original encrypted training library by an acoustic feature extraction algorithm, so as to obtain a spectrum signal; the scaling unit 7026, configured to perform modulus-square conversion on the spectrum signal to obtain a power spectrum signal; and the conversion unit 7027, configured to perform Mel band conversion on the power spectrum signal using a Mel filter bank to obtain Mel frequency cepstrum coefficients and filter bank features.
In this embodiment, the feature module 702 further includes: the decryption unit 7021, configured to decrypt and extract the original speech signal data from the original encrypted training library; the pre-emphasis unit 7022, configured to perform pre-emphasis processing on the original speech signal data through a high-pass filter, so as to obtain frequency-flattened original speech signal data; the framing unit 7023, configured to perform framing processing on the frequency-flattened original speech signal data, so as to obtain framed original speech signal data; and the windowing unit 7024, configured to perform windowing processing on the framed original speech signal data, so as to obtain windowed original speech signal data.
In this embodiment, the training module 703 includes: a construction unit 7031 for constructing an acoustic training model based on the hidden markov model; a generating unit 7032, configured to generate an acoustic training set according to a preset weight proportion, a mel frequency cepstrum coefficient and a filter bank feature; the training unit 7033 is configured to perform rolling training on the acoustic training model according to the acoustic training set, so as to obtain a plurality of different acoustic feature mapping results.
In this embodiment, the obtaining module 704 includes: a correlation unit 7041, configured to perform correlation analysis on each acoustic feature mapping result, so as to obtain a word element and a phoneme element; a pre-building unit 7042 for pre-building a vocabulary and a pronunciation dictionary library; the retrieval unit 7043 is used for retrieving text data from the vocabulary and pronunciation dictionary library according to the word elements and the phoneme elements.
In this embodiment, the verification module 705 includes: an acquisition unit 7051 for acquiring comparison text information from the original voice signal data; a comparison unit 7052, configured to perform character comparison on each text data according to the comparison text information, so as to obtain a plurality of similarity results; the checking unit 7053 is configured to check each similarity result according to a preset similarity threshold, so as to obtain an optimal similarity result, and obtain a corresponding acoustic training model according to the optimal similarity result as a speech recognition model.
The speech recognition model training apparatus in the embodiment of the present invention is described in detail above in fig. 7 and 8 from the point of view of modularized functional entities, and the speech recognition model training device in the embodiment of the present invention is described in detail below from the point of view of hardware processing.
Fig. 9 is a schematic structural diagram of a speech recognition model training device according to an embodiment of the present invention. The speech recognition model training device 800 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 810 (e.g., one or more processors), a memory 820, and one or more storage media 830 (e.g., one or more mass storage devices) storing application programs 833 or data 832. The memory 820 and storage medium 830 may be transitory or persistent storage. The programs stored on the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations on the speech recognition model training device 800. Further, the processor 810 may be configured to communicate with the storage medium 830 and execute the series of instruction operations in the storage medium 830 on the speech recognition model training device 800, implementing the steps of the speech recognition model training method provided by the method embodiments described above.
The speech recognition model training device 800 may also include one or more power supplies 840, one or more wired or wireless network interfaces 850, one or more input/output interfaces 860, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the structure shown in Fig. 9 does not limit the speech recognition model training device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and which may also be a volatile computer readable storage medium, having stored therein instructions that, when executed on a computer, cause the computer to perform the steps of a speech recognition model training method.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system or apparatus and unit described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that the foregoing is merely a preferred embodiment of the present invention and is not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or substitute equivalents for some of their technical features. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A method for training a speech recognition model, comprising:
acquiring original voice signal data of different users, and carrying out encryption processing on the original voice signal data to form an original encryption training library;
Extracting features of original voice signal data in an original encryption training library through an acoustic feature extraction algorithm to obtain a Mel frequency cepstrum coefficient and a filter bank feature;
pre-constructing an acoustic training model, and performing rolling training on the acoustic training model through a Mel frequency cepstrum coefficient and a filter bank characteristic to obtain a plurality of acoustic feature mapping results;
Acquiring corresponding text data according to each acoustic feature mapping result;
And performing result verification on each text data to obtain the optimal text data and an acoustic training model corresponding to the optimal text data, and taking the acoustic training model as a voice recognition model.
2. The method for training a speech recognition model according to claim 1, wherein the steps of obtaining the original speech signal data of different users and encrypting the original speech signal data to form an original encrypted training library comprise:
encrypting the original voice signal data of different users through an AES encryption algorithm to form encrypted voice data;
Constructing an encryption database based on the DBMS database, and configuring access rights to the encryption database;
and synchronizing the encrypted voice data to the encrypted database after the access authority configuration to form an original encrypted training library.
3. The method according to claim 1, wherein the feature extraction of the original speech signal data in the original encrypted training library by the acoustic feature extraction algorithm to obtain mel-frequency cepstral coefficients and filter bank features comprises:
performing Fourier transform on the original voice signal data in the original encryption training library through an acoustic feature extraction algorithm to obtain a frequency spectrum signal;
performing modular square conversion on the spectrum signal to obtain a power spectrum signal;
And carrying out Mel band conversion on the power spectrum signal by using a Mel filter bank so as to obtain the Mel frequency cepstrum coefficients and filter bank features.
4. The method according to claim 3, further comprising, before performing the Fourier transform on the original speech signal data in the original encrypted training library to obtain the spectrum signal:
decrypting and extracting the original speech signal data from the original encrypted training library;
pre-emphasizing the original speech signal data through a high-pass filter to obtain pre-emphasized original speech signal data;
framing the pre-emphasized original speech signal data to obtain framed original speech signal data; and
windowing the framed original speech signal data to obtain windowed original speech signal data.
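The three preparatory steps of claim 4 could look as follows in a minimal numpy sketch; the 0.97 pre-emphasis coefficient, 25 ms/10 ms frame geometry, and Hamming window are conventional assumptions rather than claim requirements.

```python
# Illustrative pre-processing chain: pre-emphasis, framing, windowing.
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    # Pre-emphasis: first-order high-pass filter y[n] = x[n] - alpha*x[n-1],
    # boosting high frequencies to flatten the spectral tilt.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split into overlapping frames (25 ms / 10 ms at 16 kHz
    # for the defaults above).
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: taper each frame with a Hamming window to reduce
    # leakage at the frame edges.
    return frames * np.hamming(frame_len)

windowed = preprocess(np.random.randn(16000))  # 1 s of dummy audio
```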
5. The method according to claim 1, wherein pre-constructing the acoustic training model and performing rolling training on the acoustic training model with the mel-frequency cepstral coefficients and the filter bank features to obtain the plurality of acoustic feature mapping results comprises:
constructing the acoustic training model based on a hidden Markov model;
generating an acoustic training set from the mel-frequency cepstral coefficients and the filter bank features according to a preset weight ratio; and
performing rolling training on the acoustic training model with the acoustic training set to obtain a plurality of different acoustic feature mapping results.
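One possible reading of claim 5, sketched with the third-party `hmmlearn` package: the two feature streams are combined under preset weights into a training set, and "rolling training" is interpreted as refitting the hidden Markov model on successively larger slices, each round yielding one acoustic feature mapping result (here, a state sequence). The 0.6/0.4 weights, five-state topology, and this interpretation of rolling training are all assumptions of the sketch.

```python
# Illustrative sketch only: weighted combination of the two feature
# streams and "rolling" refits of a GaussianHMM on growing data slices.
import numpy as np
from hmmlearn import hmm

def rolling_train(mfcc, fbank, w_mfcc=0.6, w_fbank=0.4, n_rounds=3):
    # Preset weight ratio: scale each stream, then join per frame.
    train_set = np.hstack([w_mfcc * mfcc, w_fbank * fbank])
    results = []
    for r in range(1, n_rounds + 1):
        chunk = train_set[: len(train_set) * r // n_rounds]  # growing slice
        model = hmm.GaussianHMM(n_components=5, covariance_type="diag",
                                n_iter=20, random_state=0)
        model.fit(chunk)
        # One "acoustic feature mapping result": the state sequence the
        # round's model assigns to the full feature set.
        results.append((model, model.predict(train_set)))
    return results

rng = np.random.default_rng(0)
outputs = rolling_train(rng.normal(size=(200, 13)),   # 200 frames of MFCCs
                        rng.normal(size=(200, 26)))   # and fbank energies
```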
6. The method according to claim 1, wherein obtaining the corresponding text data according to each acoustic feature mapping result comprises:
performing association analysis on each acoustic feature mapping result to obtain word elements and phoneme elements;
pre-constructing a vocabulary and a pronunciation dictionary; and
retrieving the text data from the vocabulary and the pronunciation dictionary based on the word elements and the phoneme elements.
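A toy sketch of the lookup in claim 6: word elements are matched against a vocabulary, and phoneme element sequences are resolved through a pronunciation dictionary by greedy longest match. The entries and the matching strategy are illustrative assumptions; the claim prescribes neither.

```python
# Illustrative lexicon lookup: all entries below are toy data.
vocabulary = {"hello", "world"}
pronunciation_dict = {                 # phoneme sequence -> word
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}

def to_text(word_elements, phoneme_elements):
    words = [w for w in word_elements if w in vocabulary]
    i = 0
    while i < len(phoneme_elements):
        for j in range(len(phoneme_elements), i, -1):  # longest match first
            seq = tuple(phoneme_elements[i:j])
            if seq in pronunciation_dict:
                words.append(pronunciation_dict[seq])
                i = j
                break
        else:
            i += 1                     # skip an unmatchable phoneme
    return " ".join(words)

print(to_text(["hello"], ["W", "ER", "L", "D"]))  # -> "hello world"
```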
7. The method according to claim 1, wherein performing result verification on each piece of text data to obtain the optimal text data and the acoustic training model corresponding to the optimal text data, and using that acoustic training model as the speech recognition model comprises:
obtaining reference text information according to the original speech signal data;
performing character-level comparison of each piece of text data against the reference text information to obtain a plurality of similarity results; and
checking each similarity result against a preset similarity threshold to obtain an optimal similarity result, and taking the acoustic training model corresponding to the optimal similarity result as the speech recognition model.
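The verification step of claim 7 might be sketched as below, using Python's standard `difflib` ratio as the character-level similarity measure (the claim names no specific metric) and a preset threshold to filter candidates before the best-scoring model is selected.

```python
# Illustrative verification: score decoded texts against the reference,
# filter by a preset threshold, and keep the best model.
from difflib import SequenceMatcher

def select_best(reference, decoded_texts, models, threshold=0.8):
    scored = []
    for text, model in zip(decoded_texts, models):
        sim = SequenceMatcher(None, reference, text).ratio()
        if sim >= threshold:           # preset similarity threshold
            scored.append((sim, text, model))
    if not scored:
        return None                    # no candidate passed verification
    return max(scored, key=lambda t: t[0])  # (similarity, text, model)

ref = "turn on the light"
texts = ["turn on the light", "turn off the light", "burn on the night"]
print(select_best(ref, texts, ["model_1", "model_2", "model_3"]))
```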
8. A speech recognition model training device, comprising:
an encryption module, configured to acquire original speech signal data of different users and encrypt the original speech signal data to form an original encrypted training library;
a feature module, configured to perform feature extraction on the original speech signal data in the original encrypted training library through an acoustic feature extraction algorithm to obtain mel-frequency cepstral coefficients and filter bank features;
a training module, configured to pre-construct an acoustic training model and perform rolling training on the acoustic training model with the mel-frequency cepstral coefficients and the filter bank features to obtain a plurality of acoustic feature mapping results;
an acquisition module, configured to obtain corresponding text data according to each acoustic feature mapping result; and
a verification module, configured to perform result verification on each piece of text data to obtain optimal text data and an acoustic training model corresponding to the optimal text data, and to use that acoustic training model as the speech recognition model.
9. A speech recognition model training apparatus, comprising: a memory and at least one processor, the memory having instructions stored therein;
wherein the at least one processor invokes the instructions in the memory to cause the speech recognition model training apparatus to perform the steps of the speech recognition model training method of any one of claims 1-7.
10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of the speech recognition model training method of any one of claims 1-7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410184786.0A CN118136001A (en) | 2024-02-19 | 2024-02-19 | Speech recognition model training method, device, equipment and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118136001A true CN118136001A (en) | 2024-06-04 |
Family
ID=91238542
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410184786.0A Pending CN118136001A (en) | 2024-02-19 | 2024-02-19 | Speech recognition model training method, device, equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118136001A (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119025680A (en) * | 2024-08-26 | 2024-11-26 | 它思科技(天津)有限公司 | Medical text classification method and device based on speech recognition and large language model |
| CN119397329A (en) * | 2024-10-25 | 2025-02-07 | 深圳瑞捷技术股份有限公司 | A method and device for automatically classifying housing quality and safety complaints |
| CN120048263A (en) * | 2025-04-25 | 2025-05-27 | 广东美电贝尔科技集团股份有限公司 | Intelligent voice recognition and instruction execution system based on artificial intelligence in duty process |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |