WO2017135148A1 - Acoustic model learning method, speech recognition method, acoustic model learning device, speech recognition device, acoustic model learning program, and speech recognition program
- Publication number: WO2017135148A1
- Application number: PCT/JP2017/002740
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- acoustic
- parameter
- acoustic model
- feature
- model
- Prior art date
- Legal status: Ceased
Classifications
- G10L21/0208—Noise filtering (under G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation)
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L15/063—Training (under G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/16—Speech classification or search using artificial neural networks
- G10L2021/02082—Noise filtering, the noise being echo, reverberation of the speech
(All under G10L—Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding.)
Definitions
- the present invention relates to an acoustic model learning method, a speech recognition method, an acoustic model learning device, a speech recognition device, an acoustic model learning program, and a speech recognition program.
- HMM: Hidden Markov Model
- the characteristics of the recognition speech data often do not match those of the training speech data because of ambient noise and speaker diversity. That is, a mismatch in acoustic conditions, such as the acoustic environment (including ambient noise) in which the training speech data and the recognition speech data were each recorded and the speaker characteristics of each speaker, lowers speech recognition accuracy. For this reason, speech recognition technology is required to be robust against acoustic conditions.
- as a robust speech recognition technique, a technique is known in which the parameters of an acoustic model are re-estimated using adaptation data so that the acoustic model matches the recognition speech data (see, for example, Non-Patent Document 2).
- for learning such parameters, an error back-propagation method or the like is widely used (see, for example, Non-Patent Document 3).
- however, the acoustic condition under which the speech data for learning the acoustic model is observed is not necessarily the same as the acoustic condition under which the recognition speech data is observed, so there is a mismatch between the acoustic model and the speech feature amounts during speech recognition. As a result, speech recognition accuracy decreases.
- the parameters of the acoustic model are adapted using adaptation data under the same acoustic conditions as the speech data for recognition.
- the speech used for parameter estimation requires a label that describes the speech (for example, a speaker ID or a transcription). Therefore, when adapting the parameters of the acoustic model to the observed recognition speech data, an enormous amount of computation is required, and there is a problem that high-speed parameter adaptation cannot be performed.
- an object of an example of the embodiments disclosed in the present application is to achieve parameter adaptation of an acoustic model with high accuracy and high speed during speech recognition.
- the acoustic model learning method of the present invention includes: a feature amount extraction step of extracting a feature amount indicating a feature of speech data; an acoustic condition feature amount calculation step of calculating, using the acoustic condition calculation model, an acoustic condition feature amount indicating a feature of the acoustic condition of the speech data, based on the feature amount and an acoustic condition calculation model parameter that characterizes the calculation model of the acoustic condition represented by a neural network; an acoustic model parameter correction step of generating a corrected parameter, which is a parameter obtained by correcting, based on the acoustic condition feature amount, an acoustic model parameter that characterizes an acoustic model represented by a neural network to which the output layer of the acoustic condition calculation model is coupled; an acoustic model parameter update step of updating the acoustic model parameter based on the corrected parameter and the feature amount; and an acoustic condition calculation model parameter update step of updating the acoustic condition calculation model parameter based on the corrected parameter and the feature amount.
- the acoustic model learning device of the present invention includes: a feature amount extraction unit that extracts a feature amount indicating a feature of speech data; an acoustic condition feature amount calculation unit that calculates, using the acoustic condition calculation model, an acoustic condition feature amount indicating a feature of the acoustic condition of the speech data, based on the feature amount and an acoustic condition calculation model parameter that characterizes the calculation model of the acoustic condition represented by a neural network; an acoustic model parameter correction unit that generates a corrected parameter, which is a parameter obtained by correcting, based on the acoustic condition feature amount, an acoustic model parameter that characterizes an acoustic model represented by a neural network to which the output layer of the acoustic condition calculation model is coupled; an acoustic model parameter update unit that updates the acoustic model parameter based on the corrected parameter and the feature amount; and an acoustic condition calculation model parameter update unit that updates the acoustic condition calculation model parameter based on the corrected parameter and the feature amount.
- according to the disclosed technique, parameter adaptation of an acoustic model can be realized with high accuracy and high speed during speech recognition.
- FIG. 1 is a diagram illustrating an example of a configuration of a speech recognition apparatus according to a first conventional technique.
- FIG. 2 is a diagram illustrating an example of an outline of processing according to the first conventional technique.
- FIG. 3 is a flowchart showing an example of a speech recognition process according to the first conventional technique.
- FIG. 4 is a diagram illustrating an example of the configuration of a speech recognition apparatus according to the second conventional technique.
- FIG. 5 is a flowchart showing an example of a voice recognition process according to the second prior art.
- FIG. 6 is a diagram illustrating an example of a configuration of an acoustic model relearning apparatus according to the third related art.
- FIG. 7 is a flowchart showing an example of an acoustic model relearning process according to the third prior art.
- FIG. 8 is a diagram illustrating an example of an outline of a conventional acoustic condition adaptive acoustic model.
- FIG. 9 is a diagram illustrating an example of an outline of the acoustic condition adaptive acoustic model according to the embodiment.
- FIG. 10 is a diagram illustrating an example of the configuration of the acoustic model learning device according to the embodiment.
- FIG. 11 is a flowchart illustrating an example of the acoustic model learning process according to the embodiment.
- FIG. 12 is a diagram illustrating an example of the configuration of the speech recognition apparatus according to the embodiment.
- FIG. 13 is a flowchart illustrating an example of a voice recognition process according to the embodiment.
- FIG. 14 is a diagram illustrating an example of a computer that realizes the acoustic model learning device and the speech recognition device according to the embodiment by executing a program.
- hereinafter, exemplary embodiments of the acoustic model learning method, the speech recognition method, the acoustic model learning device, the speech recognition device, the acoustic model learning program, and the speech recognition program disclosed in the present application will be described. First, the prior art assumed by the exemplary embodiments will be described. Thereafter, an example of an embodiment of the acoustic model learning method, the speech recognition method, the acoustic model learning device, the speech recognition device, the acoustic model learning program, and the speech recognition program disclosed in the present application will be described.
- in the following, when A is a vector it is written as "vector A", when A is a matrix it is written as "matrix A", and when A is a scalar it is simply written as "A".
- when A is a set, it is written as "set A".
- the function f applied to the vector A is written as f(vector A).
- when "^A" is written for A, which is a vector, a matrix, or a scalar, it is equivalent to the symbol "A" with "^" written immediately above it.
- A^T denotes the transpose of A.
- FIG. 1 is a diagram illustrating an example of a configuration of a speech recognition apparatus according to a first conventional technique.
- the speech recognition apparatus 20a according to the first prior art includes a feature amount extraction unit 201a, an HMM state output probability calculation unit 205a, and a word string search unit 206a.
- the voice recognition device 20a is connected to the storage unit 250a.
- the storage unit 250a stores an acoustic model and a language model in advance.
- the acoustic model is a model of acoustic features of speech.
- the language model is composed of a large number of symbol sequences such as phonemes and words. For example, it can be said that the language model is a model of a word string generation probability.
- an acoustic model for speech recognition is a left-to-right HMM for each phoneme, and includes the output probability distribution of each state of the HMM, calculated by a neural network (hereinafter referred to as NN (Neural Network)).
- NN: Neural Network
- the acoustic model stored in the storage unit 250a includes, as NN parameters, the HMM state transition probability for each symbol such as a phoneme, the weight matrix W_i and the bias vector b_i of the i-th hidden layer, and parameters such as the activation function.
- i is the index of the hidden layer.
- the language model is composed of a large number of symbol sequences S j such as phonemes and words, and P (S j ) is a probability (language probability) of the symbol sequence S j obtained by the language model.
- the symbol series S j is a series of symbols composed of phonemes, words, and the like that can be a speech recognition result.
- the feature quantity extraction unit 201a reads the recognition voice data and extracts the voice feature quantity from the recognition voice data.
- Features include, for example, MFCC (Mel Frequency Cepstral Coefficients), LMFC (log Mel filterbank coefficients), ΔMFCC (first derivative of MFCC), ΔΔMFCC (second derivative of MFCC), logarithmic (spectral) power, and Δlog power (first derivative of logarithmic power).
- for each frame, the feature extraction unit 201a concatenates the feature amounts obtained from that frame and about 5 consecutive frames before and after it, forming a sequence feature vector o_n of roughly 10 to 2000 dimensions (n = 1, ..., N is the frame index, a natural number). Then, as shown in the following equation (1), the feature extraction unit 201a generates a feature amount vector O that collects the sequence feature vectors o_n of all frames.
- the feature vector O is data represented by a D-dimensional vector from the first to the Nth frame. For example, the frame length is about 30 ms, and the frame shift length is about 10 ms.
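- the following is a minimal Python sketch (not part of the patent text; array sizes and the padding strategy are assumptions) of the frame-splicing step described above: the feature amounts of each frame and of about 5 frames before and after it are concatenated into the sequence feature vector o_n, and the vectors of all frames are stacked into the feature amount vector O.

```python
import numpy as np

def splice_frames(features, context=5):
    """features: (N, D) per-frame features (e.g. MFCC or log mel filterbank).
    Returns O with shape (N, D * (2 * context + 1)), one spliced vector o_n per frame."""
    N, D = features.shape
    # Pad by repeating the first/last frame so every frame has full context
    # (the padding strategy is an assumption, not specified in the text).
    padded = np.vstack([np.repeat(features[:1], context, axis=0),
                        features,
                        np.repeat(features[-1:], context, axis=0)])
    O = np.zeros((N, D * (2 * context + 1)))
    for n in range(N):
        O[n] = padded[n:n + 2 * context + 1].reshape(-1)
    return O

if __name__ == "__main__":
    mfcc = np.random.randn(100, 40)      # 100 frames of 40-dimensional dummy features
    O = splice_frames(mfcc, context=5)
    print(O.shape)                       # (100, 440)
```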
- the HMM state output probability calculation unit 205a reads the acoustic model parameter Λ from the storage unit 250a and, based on the read acoustic model parameter Λ, calculates the output probability of each HMM state of the acoustic model for each frame n of the feature amount vector O.
- the output probability of each HMM state is the output of the neural network expressed by equation (2) below, as described, for example, in Reference 1 "G. Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups," IEEE SIGNAL PROCESSING MAGAZINE, Vol. 29, No. 6, pp. 82-97, 2012."
- a neural network representing an acoustic model for speech recognition in the prior art has one or more hidden layers between the input and the output. The input of the neural network is the sequence feature vector o_n, which is fed to the first hidden layer.
- the output of the neural network is the output probability of each HMM state, produced from the last hidden layer.
- the calculation in each hidden layer performed by the output probability calculation unit 205a in the HMM state includes two processes: a process using linear transformation and a process using an activation function.
- the linear transformation in each hidden layer is expressed by the following equation (2).
- the vector x_{0,n}, which is the input of the neural network, is the sequence feature vector o_n.
- the output of the activation function is as shown in the following equation (3).
- the vector x_{i,n} is the output of the i-th hidden layer.
- σ is an activation function such as a sigmoid function.
- σ(vector z_{i,n}) is calculated for each element of the vector. That is, in the i-th hidden layer, the HMM state output probability calculation unit 205a applies the linear transformation of the above equation (2) to the vector x_{i-1,n}, which is the output of the preceding (i-1)-th hidden layer, to obtain the vector z_{i,n}, and then outputs the vector x_{i,n}, which is the result of applying the processing of the above equation (3) to the vector z_{i,n}.
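- as an illustration, the per-frame forward computation of equations (2) and (3) can be sketched as follows (a simplified example with assumed layer sizes, not the patent's implementation): each hidden layer applies a linear transformation followed by an element-wise sigmoid, and a softmax on the final layer yields the HMM state output probabilities.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(o_n, weights, biases):
    """o_n: spliced input feature vector; weights/biases: one entry per layer."""
    x = o_n                                  # x_{0,n} = o_n
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W @ x + b                        # equation (2): linear transformation
        if i < len(weights) - 1:
            x = sigmoid(z)                   # equation (3): hidden-layer activation
        else:
            x = softmax(z)                   # output layer: HMM state output probabilities
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dims = [440, 512, 512, 1000]             # input, two hidden layers, number of HMM states
    Ws = [rng.standard_normal((dims[i + 1], dims[i])) * 0.01 for i in range(3)]
    bs = [np.zeros(dims[i + 1]) for i in range(3)]
    p = forward(rng.standard_normal(440), Ws, bs)
    print(p.shape, round(p.sum(), 6))        # (1000,) 1.0
```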
- the word string search unit 206a generates J (J is a natural number) competing candidate symbol sequences S_j based on the output probabilities of the HMM states calculated by the HMM state output probability calculation unit 205a, and calculates, for each competing candidate symbol sequence S_j, an acoustic score indicating the likelihood of matching the acoustic model.
- a symbol is, for example, a phoneme.
- j = 1, 2, ..., J.
- the word string search unit 206a also calculates, for each competing candidate symbol sequence S_j, a language score indicating the likelihood of matching the language model.
- the word string search unit 206a then searches, using the language model stored in the storage unit 250a, for the competing candidate symbol sequence that is most probable as the word string corresponding to the recognition speech data among the J competing candidate symbol sequences S_j, that is, the candidate with the highest score obtained by integrating the acoustic score and the language score, and outputs the found candidate symbol sequence as the word string ^S of the recognition result.
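- the score integration performed by the word string search unit can be sketched as below; this toy example (the language-model weight and the candidate data are assumptions) only shows how the candidate with the highest combined acoustic and language score is selected, whereas an actual decoder performs a search over the HMM and the language model.

```python
def best_candidate(candidates, lm_weight=10.0):
    """candidates: list of (symbol_sequence, acoustic_log_score, language_log_score)."""
    return max(candidates, key=lambda c: c[1] + lm_weight * c[2])[0]

if __name__ == "__main__":
    hyps = [("recognize speech", -120.0, -8.0),
            ("wreck a nice beach", -118.0, -14.0)]
    print(best_candidate(hyps))   # prints the hypothesis with the best combined score
```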
- FIG. 3 is a flowchart showing an example of a speech recognition process according to the first conventional technique.
- the speech recognition apparatus 20a reads the acoustic model parameter ⁇ from the storage unit 250a (step S201a).
- the speech recognition apparatus 20a reads a language model from the storage unit 250a (step S202a).
- the voice recognition device 20a reads the voice data for recognition (step S203a).
- the speech recognition apparatus 20a extracts a feature amount of speech from the read recognition speech data, and generates a feature amount vector O (step S204a).
- the speech recognition apparatus 20a calculates the output probability of each HMM state of the acoustic model for each frame n of the feature vector O based on the read acoustic model parameter ⁇ (step S205a).
- next, based on the output probability of each HMM state calculated by the HMM state output probability calculation unit 205a, the speech recognition apparatus 20a generates competing candidate symbol sequences S_j, calculates an acoustic score for each candidate symbol sequence S_j, and searches the language model stored in the storage unit 250a for the candidate symbol sequence with the highest score obtained by integrating the acoustic score and the language score (step S206a).
- the speech recognition apparatus 20a outputs the search result in step S206a as a word string ⁇ S that is the recognition result (step S207a).
- next, a second conventional technique is described in which speech recognition is performed after correcting (re-estimating) the parameters of the acoustic model (hereinafter referred to as acoustic model correction) in order to match the acoustic model to the feature amounts at the time of recognition.
- the second conventional technology is, for example, a speech recognition technology described in Reference 2 “H.
- FIG. 4 is a diagram showing an example of the configuration of a speech recognition apparatus according to the second prior art.
- the speech recognition apparatus 20b according to the second prior art that performs acoustic model correction includes a feature amount extraction unit 201b, an HMM state output probability calculation unit 205b, and a word string search unit 206b.
- the voice recognition device 20b is connected to the storage unit 250b.
- the storage unit 250b is the same as the storage unit 250a of the first prior art, but stores the corrected acoustic model parameters for the stored acoustic model.
- the feature quantity extraction unit 201b reads the recognition voice data and generates a feature quantity vector O.
- the output probability calculation unit 205b of the HMM state calculates the output probability of each HMM state based on the acoustic model parameter ⁇ ⁇ corrected in advance and the feature amount vector O generated by the feature amount extraction unit 201b.
- the word string search unit 206b receives the output probability of each HMM state and the language model read from the storage unit 250b, and outputs a word string ⁇ S as a recognition result.
- FIG. 5 is a flowchart showing an example of a voice recognition process according to the second prior art.
- the specific processing of the speech recognition apparatus 20b is the same as that of the speech recognition apparatus 20a of the first prior art, except that the acoustic model read in step S201b is a corrected acoustic model.
- FIG. 6 is a diagram illustrating an example of a configuration of an acoustic model relearning apparatus according to the third related art.
- the acoustic model relearning apparatus 10c includes a feature amount extraction unit 101c and an acoustic model correction unit 104c.
- the acoustic model relearning device 10c is connected to the storage unit 150c.
- the storage unit 150c does not store the language model, but stores only the acoustic model parameter ⁇ .
- the feature quantity extraction unit 101c reads the adaptation audio data and generates a feature quantity vector Or .
- the feature amount extraction unit 101c performs the same processing as the feature amount extraction unit 201b of the speech recognition device 20b.
- the corrected acoustic model parameter ^Λ calculated by the acoustic model relearning apparatus 10c is obtained by correcting the acoustic model parameter Λ using adaptation speech data recorded under the same acoustic conditions as the recognition speech data and a label associated with that adaptation speech data.
- the label may be one produced manually by transcription (supervised) or one obtained automatically by speech recognition according to the first or second prior art (unsupervised).
- the correction of the acoustic model parameter ⁇ using the supervised label is called supervised correction.
- the correction of the acoustic model parameter ⁇ using the unsupervised label is called unsupervised correction.
- the acoustic model correction unit 104c corrects (re-estimates) the acoustic model parameter Λ using the acoustic model parameter Λ read from the storage unit 150c, the feature amount vector O_r generated by the feature amount extraction unit 101c, and the input label ^S_r.
- specifically, using the adaptation data (the feature amount vector O_r of the adaptation speech data) and the correct symbol sequence S_r corresponding to the feature amount vector O_r, the acoustic model correction unit 104c re-estimates the acoustic model parameter ^Λ so that the objective function F_Λ of the following equation (4) is maximized.
- the re-estimated acoustic model parameter ⁇ is used, for example, by the output probability calculation unit 205b (see FIG. 4) in the HMM state of the speech recognition apparatus 20b according to the second conventional technique.
- the acoustic model used by the acoustic model correction unit 104c is NN.
- for the objective function F_Λ, for example, cross entropy is used.
- the optimization problem of the above equation (4) is solved by the stochastic gradient descent (SGD) method, and the derivative with respect to the parameters to be corrected is described, for example, in Reference 3 "S.
- SGD Stochastic Gradient Descent
- a small value such as 0.0001 is often used for the learning rate, which is a variable of SGD.
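- one SGD step of the re-estimation in equation (4) might look like the following single-layer sketch (illustrative only; a real acoustic model has many layers, and the layer sizes here are assumptions): the cross-entropy criterion is improved by moving the parameters along the gradient obtained by error back-propagation with a small learning rate such as 0.0001.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sgd_step(W, b, o_r, d_r, lr=0.0001):
    """o_r: adaptation feature vector; d_r: index of the correct HMM state."""
    p = softmax(W @ o_r + b)            # HMM state posteriors
    grad_z = p.copy()
    grad_z[d_r] -= 1.0                  # gradient of the cross-entropy loss w.r.t. z
    W -= lr * np.outer(grad_z, o_r)     # step that improves the objective F_Lambda
    b -= lr * grad_z
    return W, b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, b = rng.standard_normal((1000, 440)) * 0.01, np.zeros(1000)
    W, b = sgd_step(W, b, rng.standard_normal(440), d_r=3)
```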
- FIG. 7 is a flowchart showing an example of an acoustic model relearning process according to the third prior art.
- the acoustic model re-learning apparatus 10c reads the acoustic model parameter ⁇ from the storage unit 150c (step S101c).
- the acoustic model relearning device 10c reads a language model from a storage unit (not shown), for example, the storage unit 250b (see FIG. 4) of the speech recognition device 20b (step S102c).
- the acoustic model relearning device 10c reads the adaptation speech data (step S103c).
- the acoustic model relearning device 10c reads the correct symbol sequence S_r (step S104c).
- the acoustic model relearning device 10c extracts feature amounts from the adaptation speech data to generate the feature amount vector O_r (step S105c).
- the acoustic model re-learning apparatus 10c corrects (re-estimates) the acoustic model parameter ⁇ using the feature vector O r and the input label ⁇ S r (step S106c).
- the acoustic model re-learning apparatus 10c re-estimates and outputs the acoustic model parameter ⁇ ⁇ obtained by correcting the acoustic model parameter ⁇ (step S107c).
- the acoustic model parameters of CADNN, a conventional acoustic condition adaptive acoustic model described later with reference to FIG. 8, depend on an acoustic condition feature amount given from the outside and change according to that feature amount.
- in CADNN learning, acoustic model parameters corresponding to each acoustic condition are first learned using the speech feature amounts and the acoustic condition feature amounts.
- at recognition time, the acoustic condition feature amount of the speech to be recognized is calculated, and new acoustic model parameters matching that acoustic condition are automatically estimated and determined based on the acoustic condition feature amount and the acoustic model parameters learned in advance.
- the acoustic condition feature amount can be calculated without using a correct label (speaker ID or transcription), and can be calculated from a small amount of speech data (several seconds).
- however, the acoustic condition feature amount calculation unit used in CADNN is designed independently of the speech recognition apparatus and is not designed based on a criterion that optimizes speech recognition performance. As a result, it has been difficult to perform highly accurate speech recognition using conventional acoustic condition feature amounts.
- CADNN-JT: Context Adaptive Deep Neural Network with joint training
- with CADNN-JT, it is possible to simultaneously optimize the parameters of the acoustic condition feature amount calculation model and the acoustic model parameters.
- in CADNN-JT, the calculation model of the acoustic condition feature amounts is represented by a neural network, and the output layer of that neural network is coupled to a part of the conventional CADNN neural network.
- FIG. 8 is a diagram illustrating an example of an outline of a conventional acoustic condition adaptive acoustic model.
- in CADNN, which is a conventional acoustic condition adaptive acoustic model, a hidden layer of the NN is decomposed (factorized) for each element of a vector indicating the acoustic condition feature amounts (hereinafter referred to as the "acoustic condition feature vector").
- although FIG. 8 shows, as an example, a state in which one hidden layer (the i-th hidden layer) is decomposed, at least one hidden layer or all hidden layers may be decomposed. The output of the linear transformation of the decomposed hidden layer is calculated as in the following equation (5).
- y_{k,n} in the above equation (5), which will be described in detail later, is the k-th element of the acoustic condition feature vector y_n ∈ R^K extracted from the n-th frame (R^K is the K-dimensional real space, and K is a natural number indicating the number of acoustic conditions); it is referred to as the acoustic condition feature value y_{k,n}.
- the weight matrix W i, k in the above equation (5) is a linear transformation matrix for the acoustic condition feature value y k, n in the i-th hidden layer.
- bias vector b i, k in the above equation (5) is a bias vector related to the acoustic condition feature value y k, n in the i-th hidden layer.
- CADNN expresses the hidden layer by breaking it down into K acoustic condition elements.
- the decomposition of the hidden layer for each acoustic condition feature value may also be expressed as in the following equation (6) or equation (7).
- the acoustic model parameters at the time of speech recognition are calculated as in the following equations (8-1) and (8-2), and are thereby automatically adapted to the acoustic conditions based on the acoustic condition feature values y_{k,n} at the time of speech recognition.
- the acoustic condition feature value y k, n represents an acoustic condition.
- in conventional CADNN, the acoustic condition feature value y_{k,n} is calculated by a system independent of the speech recognition apparatus. For example, in the case of speaker adaptation, if the training speech data is divided into speaker classes, a model of each speaker class can be learned (Reference 5 "N. Dehak et al., "Front-End Factor Analysis for Speaker Verification," IEEE Trans. Audio, Speech, Language Process., Vol. 19, No. 4, pp. 788-798, 2011"). In CADNN, the posterior probability of each speaker class is calculated for each test utterance using the model of each speaker class and is used as y_{k,n}.
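- the decomposed (factorized) hidden layer of CADNN can be sketched as follows (an illustrative example with assumed sizes and data): the per-condition weight matrices W_{i,k} and bias vectors b_{i,k} are combined with the acoustic condition feature values y_{k,n} as in equations (8-1) and (8-2), and the combined parameters are then applied as an ordinary linear transformation, equation (5).

```python
import numpy as np

def adapted_linear(x_prev, W_k, b_k, y_n):
    """W_k: (K, out, in) per-condition weights; b_k: (K, out) per-condition biases;
    y_n: (K,) acoustic condition feature values."""
    W_adapt = np.tensordot(y_n, W_k, axes=1)   # eq (8-1): sum_k y_{k,n} W_{i,k}
    b_adapt = y_n @ b_k                        # eq (8-2): sum_k y_{k,n} b_{i,k}
    return W_adapt @ x_prev + b_adapt          # eq (5): adapted linear transformation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, d_in, d_out = 4, 512, 512               # e.g. 4 speaker or noise classes
    z = adapted_linear(rng.standard_normal(d_in),
                       rng.standard_normal((K, d_out, d_in)) * 0.01,
                       rng.standard_normal((K, d_out)) * 0.01,
                       np.array([0.7, 0.2, 0.05, 0.05]))  # class posteriors as y_n
    print(z.shape)   # (512,)
```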
- FIG. 9 is a diagram illustrating an example of an outline of the acoustic condition adaptive acoustic model according to the embodiment.
- in CADNN-JT, which is the acoustic condition adaptive acoustic model according to the embodiment, a neural network is used as the calculation model of the acoustic condition feature amounts representing the acoustic conditions, and the parameters of that neural network are optimized simultaneously with the CADNN parameters. That is, the acoustic model and the acoustic condition calculation model are each represented by a neural network having one or more hidden layers, and the output layer of the acoustic condition calculation model is coupled to the acoustic model.
- as the input of the acoustic condition feature amount calculation model, a second input feature value u_n is used.
- the vector u_n may be, for example, an i-vector, which represents speaker characteristics and is frequently used in speaker recognition, or it may be the speech feature amount o_n.
- that is, the second input feature value u_n may be the same as the input feature amount or may be different from it.
- the acoustic condition feature quantity y k, n calculated by the acoustic condition feature quantity calculation model is calculated as shown in the following equation (9).
- the vector u_n = [u_{1,n}, ..., u_{J,n}] is the J-dimensional second input feature amount.
- the function f () is a function for calculating the acoustic condition feature quantity
- ⁇ is a parameter of the function f ().
- here, the case where f() is a multilayer neural network (DNN) is described, but f() may be a recurrent neural network (RNN) or a convolutional neural network (CNN).
- Ω is the set of parameters for the linear transformation in each layer of the neural network, where W'_i is a transformation matrix, b'_i is a bias vector, and I' is the number obtained by adding 1 to the total number of hidden layers of the neural network of the acoustic condition feature amount calculation model, that is, the total number of hidden layers and output layers.
- a sigmoid function or the like can be used as the activation function of the hidden layer.
- a softmax function, a linear function, or the like can be used as an output layer activation function (activation function).
- Each speaker can be represented as an acoustic condition by using the acoustic condition feature value y k, n as the posterior probability of the speaker class.
- the noise environment can be expressed as the acoustic condition by using the acoustic condition feature value y k, n as the posterior probability of the noise environment class.
- since y_{k,n} is a feature amount that can basically be calculated from a few seconds of speech data, a large number of acoustic model parameters (W_i, b_i) can be adapted to the acoustic conditions using only a few seconds of speech data.
- the output of each hidden layer is calculated as in the following equation (10) by applying an activation function to the output vector z_{i,n} of the linear transformation.
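- the forward computation of CADNN-JT described by equations (9), (8-1), (8-2), (5), and (10) can be sketched as follows (sizes and data are assumptions): a small condition network f(u_n; Ω) with a softmax output layer produces y_n, which in turn drives the factorized hidden layer of the acoustic model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def condition_features(u_n, W1, b1, W2, b2):
    """Equation (9): y_n = f(u_n; Omega), here one hidden layer + softmax output."""
    h = sigmoid(W1 @ u_n + b1)
    return softmax(W2 @ h + b2)

def factorized_hidden(x_prev, W_k, b_k, y_n):
    """Equations (8-1)/(8-2)/(5)/(10): condition-adapted linear layer + activation."""
    W_adapt = np.tensordot(y_n, W_k, axes=1)
    b_adapt = y_n @ b_k
    return sigmoid(W_adapt @ x_prev + b_adapt)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    J, K, d_in, d_out = 100, 4, 512, 512        # u_n dim, #conditions, layer sizes
    u_n = rng.standard_normal(J)                # second input feature (e.g. an i-vector)
    y_n = condition_features(u_n,
                             rng.standard_normal((64, J)) * 0.01, np.zeros(64),
                             rng.standard_normal((K, 64)) * 0.01, np.zeros(K))
    x_i = factorized_hidden(rng.standard_normal(d_in),
                            rng.standard_normal((K, d_out, d_in)) * 0.01,
                            rng.standard_normal((K, d_out)) * 0.01,
                            y_n)
    print(y_n.shape, x_i.shape)                 # (4,) (512,)
```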
- in CADNN-JT, the weight matrices W_{i,k} and bias vectors b_{i,k}, which are the linear transformation parameters of each hidden layer decomposed for each acoustic condition feature value y_{k,n}, and the parameter Ω of the function that calculates the acoustic condition feature amounts are optimized simultaneously.
- the calculation result of the acoustic condition feature value is used in the calculation process in the factorized hidden layer (see the equations (5) and (9)). That is, since the neural network of the acoustic condition feature quantity calculation model and the factorized hidden layer are combined, the learning procedure of the conventional neural network (error backpropagation and SGD) (Reference 6 “D. Yu and L. Deng, “Automatic Speech Recognition: A Deep Learning Approach”, “Springer, 2015.”) can be used to simultaneously optimize the parameters of all neural networks. In this case, the differentiation of the linear transformation parameter of the hidden layer is as shown in the following formulas (11-1) and (11-2).
- F in the above formulas (11-1) and (11-2) represents an optimization criterion (for example, Cross Entropy).
- the vector ⁇ i, n represents a back-propagated error and is calculated as in the following equation (12).
- a Hadamard product is an element-by-element product of a matrix or vector.
- the above equation (12) is the same as the error back-propagation equation of the prior art, but the weight matrix W_{i+1,n} and the vector z_{i,n} used in equation (12) are quantities newly introduced by CADNN-JT, calculated based on the above equations (8-1) and (8-2) and the above equation (5) (or the above equation (6) or equation (7)).
- the error vector ⁇ I, n is an error term.
- the error vector ⁇ I, n is obtained from a vector x i, n which is a network output (HMM state output probability) calculated based on the input feature vectors Y and NN and the input correct symbol sequence S r. based on the correct HMM state d n to be, as in the prior art, an error that back propagation, is calculated as follows (13).
- ⁇ ′ i, n represents an error that has been propagated back to the neural network of the acoustic condition feature quantity calculation model, and is calculated as in the following equation (15).
- ⁇ ′ i, p, n is an error propagated back in the p- th i-th layer
- z k, i, p, n is the p- th dimension of z k, i, n .
- z k, i, n is calculated as in the following equation (17).
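- as a toy illustration of the joint gradient computation behind equations (11) to (17) (a single factorized output layer and a one-layer condition network; all sizes and data are assumptions), the back-propagated error δ yields gradients both for the per-condition acoustic parameters W_k, b_k and, through y_n, for the condition-network parameters, which is what allows both networks to be optimized simultaneously.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
J, K, D, S = 20, 3, 30, 10            # u dim, #conditions, x dim, #HMM states
u = rng.standard_normal(J)            # second input feature u_n
x = rng.standard_normal(D)            # acoustic input feature
d = 2                                 # correct HMM state index
Wc, bc = rng.standard_normal((K, J)) * 0.1, np.zeros(K)           # condition net (Omega)
Wk, bk = rng.standard_normal((K, S, D)) * 0.1, np.zeros((K, S))   # factorized layer (Lambda)

# Forward pass
y = softmax(Wc @ u + bc)                       # eq (9): condition features y_n
z_k = np.einsum('ksd,d->ks', Wk, x) + bk       # per-condition outputs, cf. eq (17)
z = y @ z_k                                    # eqs (8-1)/(8-2)/(5): adapted output
p = softmax(z)                                 # HMM state posteriors

# Back-propagation with a cross-entropy criterion F
delta = p.copy(); delta[d] -= 1.0              # eq (13): output-layer error
grad_Wk = np.einsum('k,s,d->ksd', y, delta, x) # cf. eq (11-1): gradient for W_{i,k}
grad_bk = np.outer(y, delta)                   # cf. eq (11-2): gradient for b_{i,k}
grad_y = z_k @ delta                           # error routed into y_n
grad_a = (np.diag(y) - np.outer(y, y)) @ grad_y  # through the softmax of eq (9)
grad_Wc = np.outer(grad_a, u)                  # gradient for the condition-network weights
grad_bc = grad_a                               # gradient for the condition-network biases
print(grad_Wk.shape, grad_Wc.shape)            # (3, 10, 30) (3, 20)
```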
- FIG. 10 is a diagram illustrating an example of the configuration of the acoustic model learning device according to the embodiment.
- the acoustic model learning device 10 according to the embodiment includes a feature amount extraction unit 101, a second feature amount extraction unit 102, a condition feature amount calculation unit 103, an acoustic model parameter correction unit 104, an HMM state output probability calculation unit 105, an error calculation unit 121, an acoustic model parameter differential value calculation unit 122, an acoustic model parameter update unit 123, a parameter differential value calculation unit 124 of the condition feature amount calculation unit, a parameter update unit 125 of the condition feature amount calculation unit, and a convergence determination unit 126.
- the acoustic model learning device 10 is connected to the storage unit 150.
- the storage unit 150 stores in advance the acoustic model parameter Λ, a set of parameters characterizing the acoustic model, and the acoustic condition calculation model parameter Ω characterizing the acoustic condition calculation model (n = 1, 2, ..., N are natural numbers).
- N represents the total number of frames of one utterance for which the acoustic condition feature vector y_n of each frame, described later, is calculated.
- W'_i is a transformation matrix, b'_i is a bias vector, and I', mentioned in the description of equation (9), is the number obtained by adding 1 to the total number of hidden layers of the neural network of the acoustic condition feature amount calculation model, that is, the total number of hidden layers and output layers.
- the acoustic condition calculation model is a model for generating an acoustic condition feature vector -Y described later.
- the acoustic condition feature amount is a feature for each speaker, a sex of the speaker, an acoustic environment related to noise or reverberation, or the like.
- the feature amount extraction unit 101 reads learning speech data observed with a microphone or the like, and generates a feature amount vector O from the learning speech data. That is, the feature quantity extraction unit 101 extracts feature quantities from the learning speech data.
- the specific processing of the feature amount extraction unit 101 is the same as that of the feature amount extraction unit 201a of the first conventional technique, the feature amount extraction unit 201b of the second conventional technique, and the feature amount extraction unit 101c of the third conventional technique.
- the second feature quantity extraction unit 102 reads the learning speech data, extracts a second feature quantity vector series U as shown in the following equation (18), and outputs it to the conditional feature quantity calculation unit 103.
- the second feature amount extraction unit 102 may perform the same processing as the feature amount extraction unit 101 and extract the feature amount vector O as the second feature amount vector, or it may extract a feature amount different from the feature amount vector O, such as an i-vector.
- N is the total number of frames of one utterance for which the second feature vector is calculated, and n is an integer from 1 to N. That is, the second feature quantity vector series U includes the second feature quantity in each frame from the first to the Nth frame.
- the second feature amount represents, for example, the characteristics of speaker characteristics or environment (noise, reverberation).
- the second feature amount in each frame is expressed by an L-dimensional vector.
- each feature vector u_n need not take a different value in each frame; it may be fixed to the same value over a period of several seconds, or fixed to the same value during one utterance.
- the condition feature amount calculation unit 103 reads the acoustic condition calculation model parameter Ω that characterizes the acoustic condition calculation model and the second feature amount extracted by the second feature amount extraction unit 102, and calculates the acoustic condition feature amount indicating the feature of the acoustic condition. The condition feature amount calculation unit 103 then outputs the calculated acoustic condition feature amounts to the acoustic model parameter correction unit 104 as a feature amount vector Y, as shown in the following equation (19).
- N represents the total number of frames of the utterance for which the acoustic condition feature vector y_n of each frame is calculated, and n is a natural number from 1 to N. That is, the acoustic condition feature vector Y contains the acoustic condition feature vector y_n of each frame from the 1st to the N-th frame, and the acoustic condition feature vector y_n of each frame is represented by a K-dimensional vector.
- the acoustic condition feature vector y_n of each frame need not take a different value in each frame; it may be fixed to the same value over several seconds, or fixed to the same value during one utterance.
- the acoustic model parameter correction unit 104 corrects the acoustic model parameter Λ that characterizes the acoustic model, read from the storage unit 150, by the above equations (8-1) and (8-2), based on the acoustic condition feature vector Y generated by the condition feature amount calculation unit 103. Note that the initial values of the acoustic model parameter Λ corrected by the acoustic model parameter correction unit 104 are parameters determined by random numbers, acoustic model parameters learned by the first to third conventional techniques, or the like. The acoustic model parameter correction unit 104 outputs the corrected parameter ^Λ generated by the correction to the HMM state output probability calculation unit 105.
- the HMM state output probability calculation unit 105 calculates the output probability of each HMM state based on the acoustic model parameter ^Λ corrected by the acoustic model parameter correction unit 104 and the feature amount vector O generated by the feature amount extraction unit 101.
- the specific processing of the HMM state output probability calculation unit 105 is the same as the HMM state output probability calculation unit 205a of the first prior art and the second prior art HMM state output probability calculation unit 205b.
- the error calculation unit 121 calculates the error vector δ_{I,n} according to the above equation (13), based on the output probability of each HMM state calculated by the HMM state output probability calculation unit 105 and the input correct symbol sequence ^S_r (correct HMM states). Further, the error calculation unit 121 calculates the error vector δ'_{I',n}, which represents the error back-propagated to the neural network of the acoustic condition feature amount calculation model, according to the above equation (16).
- the acoustic model parameter differential value calculation unit 122 calculates the acoustic model parameter differential values based on the error vector δ_{I,n} calculated by the error calculation unit 121 and the acoustic model parameter ^Λ corrected by the acoustic model parameter correction unit 104.
- the acoustic model parameter differential value calculation unit 122 calculates the acoustic model parameter differential values using the above equations (11-1), (11-2), and (12), which express the back-propagated error. Alternatively, the acoustic model parameter differential value calculation unit 122 may perform the calculation by the conventional stochastic gradient descent (SGD) method (see Reference 6). Momentum and L2 regularization, which are often used to speed up parameter learning, can also be used together.
- SGD Stochastic Gradient Descent
- the acoustic model parameter update unit 123 updates the acoustic model parameter Λ by the following equations (20-1) and (20-2), based on the acoustic model parameter Λ read from the storage unit 150 and the acoustic model parameter differential values calculated by the acoustic model parameter differential value calculation unit 122. In this way, the acoustic model parameter update unit 123 updates the acoustic model parameters based on values calculated from the corrected acoustic model parameters and the feature amounts.
- in equations (20-1) and (20-2), the weight matrix ^W_{i,k} and bias vector ^b_{i,k} are the updated acoustic model parameters ^Λ, and the weight matrix ¯W_{i,k} and bias vector ¯b_{i,k} are the acoustic model parameters ¯Λ obtained in the previous step.
- ρ is the learning rate, a variable of SGD, and is set to a small value such as 0.1 to 0.0001.
- ⁇ is a parameter for acoustic model parameter correction.
- the parameter differential value calculation unit 124 of the condition feature amount calculation unit calculates the parameter differential values of the condition feature amount calculation unit based on the error vector δ_{I,n} calculated by the error calculation unit 121 and the acoustic condition calculation model parameter Ω.
- the parameter differential value calculation unit 124 of the condition feature amount calculation unit calculates these parameter differential values using the above equations (14-1), (14-2), and (15), which express the back-propagated error.
- the parameter differential value calculation unit 124 of the conditional feature quantity calculation unit can also use the same method as the acoustic model parameter differential value calculation unit 122.
- the parameter update unit 125 of the condition feature amount calculation unit updates the acoustic condition calculation model parameter Ω by the following equations (21-1) and (21-2), based on the acoustic condition calculation model parameter Ω read from the storage unit 150 and the parameter differential values calculated by the parameter differential value calculation unit 124 of the condition feature amount calculation unit. In this way, the parameter update unit 125 of the condition feature amount calculation unit updates the acoustic condition calculation model parameters based on values calculated from the corrected parameters and the feature amounts.
- in equations (21-1) and (21-2), the weight matrix ^W'_{i,k} and bias vector ^b'_{i,k} are the updated acoustic condition calculation model parameters ^Ω, and the weight matrix ¯W'_{i,k} and bias vector ¯b'_{i,k} are the acoustic condition calculation model parameters ¯Ω before the update.
- ρ' is the learning rate, a variable of SGD, and is set to a small value such as 0.1 to 0.0001.
- γ' is a parameter for correcting the acoustic condition calculation model parameters.
- the convergence determination unit 126 determines, with respect to the acoustic model parameter ^Λ updated by the acoustic model parameter update unit 123 and the updated acoustic condition calculation model parameter ^Ω, whether the learning (estimation) of the acoustic model parameter Λ and the acoustic condition calculation model parameter Ω satisfies a predetermined convergence condition. When it determines that the predetermined convergence condition is satisfied, the convergence determination unit 126 outputs the acoustic model parameters ^Λ at the time the convergence condition is satisfied as the output values of the acoustic model learning device 10.
- the acoustic model parameters ⁇ output from the acoustic model learning device 10 are stored in the storage unit 150, for example.
- when the convergence determination unit 126 determines that the predetermined convergence condition is not satisfied, it outputs the acoustic model parameters ^Λ at the time of the convergence determination to the acoustic model parameter correction unit 104 and the acoustic condition calculation model parameters ^Ω at the time of the convergence determination to the condition feature amount calculation unit 103, and the condition feature amount calculation unit 103, the acoustic model parameter correction unit 104, the HMM state output probability calculation unit 105, the error calculation unit 121, the acoustic model parameter differential value calculation unit 122, the acoustic model parameter update unit 123, and the convergence determination unit 126 repeat their processing.
- the acoustic model parameter ^Λ and the acoustic condition calculation model parameter ^Ω at the time the predetermined convergence condition is determined to be satisfied may further be stored in the storage unit 150 and used as initial values of the respective parameters in the next processing.
- the convergence determination unit 126 performs the determination based on a predetermined condition such as, for example, (1) when the difference between the acoustic model parameter ¯Λ (or the acoustic condition calculation model parameter ¯Ω) obtained in the previous step and the updated acoustic model parameter ^Λ (or the acoustic condition calculation model parameter ^Ω) is less than a threshold, (2) when the number of iterations of the convergence determination exceeds a predetermined number, or (3) when performance evaluated using a part of the training speech data deteriorates by a predetermined value or more in terms of a predetermined performance index.
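- the convergence test corresponding to conditions (1) to (3) above might be sketched as follows (threshold values are placeholders, not values from the text): stop when the parameter change falls below a threshold, when the iteration count exceeds a limit, or when a performance index measured on held-out training data degrades by more than a tolerance.

```python
import numpy as np

def converged(prev_params, new_params, iteration, prev_score, new_score,
              eps=1e-4, max_iter=50, tolerance=0.5):
    """prev_params/new_params: dicts of ndarrays; scores: performance on held-out data."""
    delta = max(np.max(np.abs(new_params[k] - prev_params[k])) for k in new_params)
    if delta < eps:                          # (1) parameter change below threshold
        return True
    if iteration >= max_iter:                # (2) iteration limit reached
        return True
    if new_score < prev_score - tolerance:   # (3) held-out performance degraded
        return True
    return False
```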
- FIG. 11 is a flowchart illustrating an example of the acoustic model learning process according to the embodiment.
- the acoustic model learning device 10 reads an acoustic model (acoustic model parameter ⁇ ) from the storage unit 150 (step S101).
- the acoustic model learning device 10 reads the acoustic condition calculation model (acoustic condition calculation model parameter ⁇ ) from the storage unit 150 (step S102).
- the acoustic model learning device 10 reads the learning voice data (step S103).
- the acoustic model learning unit 10 reads the correct symbol sequence -S r (step S104).
- the acoustic model learning device 10 extracts the feature amount vector O from the learning speech data (step S105). Next, the acoustic model learning device 10 extracts the second feature amount vector series U from the learning speech data (step S106). Next, the acoustic model learning device 10 calculates the acoustic condition feature vector Y by the above equation (9) from the acoustic condition calculation model parameter Ω and the second feature amount vector series (step S107). Next, the acoustic model learning device 10 corrects the acoustic model parameter Λ read from the storage unit 150 based on the acoustic condition feature vector Y by the above equations (8-1) and (8-2) (step S108). Next, the acoustic model learning device 10 calculates the output probability of each HMM state based on the corrected acoustic model parameter ^Λ and the feature amount vector O (step S109).
- next, the acoustic model learning device 10 calculates the error vector δ_{I,n} by the above equation (13) based on the output probability of each HMM state and the input correct symbol sequence ^S_r, and calculates the error vector δ'_{I',n} by the above equation (16) (step S110).
- the acoustic model learning device 10 calculates an acoustic model parameter differential value based on the error vector ⁇ I, n and the corrected acoustic model parameter ⁇ (step S111).
- next, the acoustic model learning device 10 updates the acoustic model parameter Λ by the above equations (20-1) and (20-2), based on the acoustic model parameter Λ read from the storage unit 150 and the acoustic model parameter differential values (step S112).
- next, the acoustic model learning device 10 calculates the acoustic condition calculation model parameter differential values based on the error vector δ'_{I',n} and the acoustic condition calculation model parameter Ω (step S113).
- next, the acoustic model learning device 10 updates the acoustic condition calculation model parameter Ω by the above equations (21-1) and (21-2), based on the acoustic condition calculation model parameter Ω read from the storage unit 150 and the acoustic condition calculation model parameter differential values (step S114).
- next, the acoustic model learning device 10 determines, with respect to the updated acoustic model parameter ^Λ and acoustic condition calculation model parameter ^Ω, whether the learning of the acoustic model parameter Λ and the acoustic condition calculation model parameter Ω satisfies the predetermined convergence condition (step S115). When the learning of the acoustic model parameter Λ and the acoustic condition calculation model parameter Ω satisfies the predetermined convergence condition (Yes in step S115), the acoustic model learning device 10 moves the process to step S116.
- on the other hand, when the learning of the acoustic model parameter Λ or the acoustic condition calculation model parameter Ω does not satisfy the predetermined convergence condition (No in step S115), the acoustic model learning device 10 returns the process to step S107 or step S108.
- in step S116, the acoustic model learning device 10 outputs the acoustic model parameters ^Λ at the time the predetermined convergence condition is determined to be satisfied, as the output values of the acoustic model learning device 10 (step S116).
- the acoustic model parameter ^Λ and the acoustic condition calculation model parameter ^Ω at the time the predetermined convergence condition is determined to be satisfied may further be stored in the storage unit 150 and used as initial values of the respective parameters in the next processing.
- FIG. 12 is a diagram illustrating an example of the configuration of the speech recognition apparatus according to the embodiment.
- the speech recognition apparatus 20 includes a feature amount extraction unit 201, a second feature amount extraction unit 202, a condition feature amount calculation unit 203, an acoustic model parameter correction unit 204, an HMM state output probability calculation unit 205, and a word string search unit 206.
- the voice recognition device 20 is connected to the storage unit 250.
- the storage unit 250 stores in advance the acoustic model (acoustic model parameter Λ) updated by the acoustic model learning device 10, the acoustic condition calculation model (acoustic condition calculation model parameter Ω), the language model, the acoustic model parameter correction parameter γ, and the acoustic condition calculation model parameter correction parameter γ'.
- the feature amount extraction unit 201 reads the recognition speech data observed with a microphone or the like, extracts the feature amount from the recognition speech data, and generates a feature amount vector O. That is, the feature amount extraction unit 201 extracts a feature amount from the recognition voice data. Specific processing of the feature quantity extraction unit 201 is the same as that of the feature quantity extraction unit 101 of the acoustic model learning device 10.
- the second feature amount extraction unit 202 reads the recognition speech data observed with a microphone or the like, extracts feature amounts from the recognition speech data, and generates the second feature amount vector series U.
- the specific processing of the second feature amount extraction unit 202 is the same as that of the second feature amount extraction unit 102 of the acoustic model learning device 10.
- the condition feature quantity calculation unit 203 reads the acoustic condition calculation model parameter ⁇ and the second feature quantity extracted by the second feature quantity extraction unit 202, and calculates the acoustic condition feature quantity by Expression (9). In addition, the condition feature quantity calculation unit 203 outputs the calculated acoustic condition feature quantity to the acoustic model parameter correction unit 204 as a feature quantity vector Y as shown in Equation (19).
- the specific processing of the conditional feature quantity calculation unit 203 is the same as that of the conditional feature quantity calculation unit 103 of the acoustic model learning device 10.
- the acoustic model parameter correction unit 204 corrects the acoustic model parameter Λ by the above equations (8-1) and (8-2), based on the acoustic model parameter Λ read from the storage unit 250 and the acoustic condition feature vector Y generated by the condition feature amount calculation unit 203.
- the specific processing of the acoustic model parameter correction unit 204 is the same as that of the acoustic model parameter correction unit 104 of the acoustic model learning device 10.
- the HMM state output probability calculation unit 205 calculates the output probability of each HMM state based on the acoustic model parameter ^Λ corrected by the acoustic model parameter correction unit 204 and the feature amount vector O generated by the feature amount extraction unit 201.
- the specific processing of the HMM state output probability calculation unit 205 is the same as that of the HMM state output probability calculation unit 105 of the acoustic model learning device 10.
- the word string search unit 206 outputs a word string using the output probability of the HMM state calculated based on the feature amount and the corrected parameter and the generation probability of the language model. That is, the word string search unit 206 searches the language model read from the storage unit 250 based on the output probability of each HMM state calculated by the output probability calculation unit 205 of the HMM state, and the word string as a speech recognition result ⁇ S is output.
- the specific processing of the word string search unit 206 is the same as the word string search unit 206a of the first conventional speech recognition apparatus 20a and the word string search unit 206b of the second conventional voice recognition apparatus 20b.
- FIG. 13 is a flowchart illustrating an example of a voice recognition process according to the embodiment.
- the speech recognition apparatus 20 reads an acoustic model (acoustic model parameter ⁇ ) from the storage unit 250 (step S201).
- the speech recognition apparatus 20 reads a calculation model of acoustic conditions from the storage unit 250 (step S202).
- the speech recognition apparatus 20 reads a language model from the storage unit 250 (step S203).
- the voice recognition device 20 reads the voice data for recognition (step S204).
- the speech recognition apparatus 20 reads the acoustic model parameter correction parameter ⁇ and the acoustic model calculation model parameter correction parameter ⁇ ′ from the storage unit 250 (step S205).
- next, the speech recognition apparatus 20 extracts the feature amount vector O from the recognition speech data (step S206).
- next, the speech recognition apparatus 20 extracts the second feature amount vector series U from the recognition speech data (step S207).
- next, the speech recognition apparatus 20 calculates the acoustic condition feature vector Y by the above equation (9) from the acoustic condition calculation model parameter Ω and the second feature amount vector series (step S208).
- next, the speech recognition apparatus 20 corrects the acoustic model parameter Λ read from the storage unit 250 based on the acoustic condition feature vector Y by the above equations (8-1) and (8-2) (step S209).
- the speech recognition apparatus 20 calculates the output probability of each HMM state based on the corrected acoustic model parameter ⁇ and the feature vector O (step S210).
- the speech recognition apparatus 20 searches for the language model read from the storage unit 250 based on the output probability of each HMM state (step S211).
- the speech recognition apparatus 20 outputs the word string ⁇ S as a speech recognition result from the search result in step S211 (step S212).
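The sketch below strings steps S201 to S212 together, using the hypothetical helpers sketched above; every component is passed in as a callable, since the patent does not prescribe any particular implementation of feature extraction, decoding, or storage access.

```python
def recognize(storage, speech_data, extract_features, extract_second_features,
              condition_features, correct_parameters, state_posteriors, decode):
    """Illustrative flow of FIG. 13; all callables are hypothetical stand-ins."""
    acoustic_model = storage["acoustic_model"]           # S201: acoustic model parameters
    condition_model = storage["condition_model"]         # S202: acoustic condition calculation model
    language_model = storage["language_model"]           # S203: language model
    # S204: speech_data is the recognition speech data
    # S205: the correction parameters (eta, eta') would also be read from storage here

    O = extract_features(speech_data)                    # S206: feature vector sequence O
    U = extract_second_features(speech_data)             # S207: second feature vector sequence U
    Y = condition_features(condition_model, U)           # S208: acoustic condition feature vector Y
    corrected = correct_parameters(acoustic_model, Y)    # S209: Equations (8-1) and (8-2)
    posteriors = state_posteriors(acoustic_model, corrected, O)  # S210: HMM state output probabilities
    return decode(posteriors, language_model)            # S211-S212: word string search and output
```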
- In the embodiment, the case where an acoustic model based on a DNN (CADNN) is combined with the acoustic condition calculation model has been described. However, the model is not limited to a DNN; acoustic models based on various neural networks, such as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), or a BLSTM (Bidirectional Long Short-Term Memory) network, can be combined with the acoustic condition calculation model and formulated in the same way.
- The acoustic model in the present invention is not limited to one based on the HMM, and can be any acoustic model that calculates output probabilities using a neural network.
- For example, the acoustic model in the present invention may be a model based on CTC (Connectionist Temporal Classification) or an encoder-decoder.
- The feature quantity extraction unit 101 and the second feature quantity extraction unit 102 extract feature quantities indicating features of the speech data.
- The condition feature quantity calculation unit 103 calculates, using the acoustic condition calculation model, an acoustic condition feature quantity indicating the feature of the acoustic condition of the speech data, based on the acoustic condition calculation model parameter that characterizes the acoustic condition calculation model represented by a neural network and on the feature quantity.
- The acoustic model parameter correction unit 104 generates a corrected parameter, which is a parameter obtained by correcting, based on the acoustic condition feature quantity, the acoustic model parameter that characterizes the acoustic model represented by a neural network to which the output layer of the acoustic condition calculation model is coupled.
- The acoustic model parameter update unit 123 updates the acoustic model parameter based on the corrected parameter and the feature quantity.
- The parameter update unit 125 of the condition feature quantity calculation unit updates the acoustic condition calculation model parameter based on the corrected parameter and the feature quantity.
- The embodiment has the feature that the acoustic condition feature quantity can be calculated without using correct labels (speaker IDs or transcriptions) and from a small amount of speech data (a few seconds). As a result, fast acoustic model adaptation becomes possible.
- The embodiment can adapt the acoustic model to acoustic conditions using a small amount of speech data, and can achieve higher speech recognition performance than the conventional technology without switching acoustic models for each acoustic condition as in the conventional technology.
- Since the embodiment can optimize the neural network representing the acoustic condition calculation model based on the error back-propagated through the neural network representing the acoustic model, the parameters of the acoustic condition calculation model and the acoustic model parameters can be optimized simultaneously. Therefore, all of the neural networks, including the acoustic condition calculation model, can be optimized simultaneously under the optimization criterion for speech recognition, and the speech recognition accuracy is improved.
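To make the joint optimization concrete, the following is a minimal sketch of how both networks could be trained together by back-propagating a frame-level cross-entropy error through the coupled model. PyTorch and the specific architecture (one adapted layer, tanh and ReLU nonlinearities, SGD) are illustrative assumptions only; the patent does not prescribe a framework or these hyperparameters.

```python
import torch
import torch.nn as nn

class CADNNJT(nn.Module):
    """Illustrative coupled model: condition network + condition-adapted acoustic network."""

    def __init__(self, feat_dim, second_dim, hidden, num_states, num_conditions):
        super().__init__()
        # acoustic condition calculation model (outputs condition posteriors)
        self.cond = nn.Sequential(nn.Linear(second_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, num_conditions), nn.Softmax(dim=-1))
        # condition-dependent bases of the adapted layer of the acoustic model
        self.W_bases = nn.Parameter(0.01 * torch.randn(num_conditions, hidden, feat_dim))
        self.b_bases = nn.Parameter(torch.zeros(num_conditions, hidden))
        self.out = nn.Linear(hidden, num_states)

    def forward(self, O, U):
        Y = self.cond(U).mean(dim=0)                      # utterance-level condition vector
        W = torch.einsum('k,kij->ij', Y, self.W_bases)    # corrected weights (Eq. (8-1) analogue)
        b = Y @ self.b_bases                              # corrected bias (Eq. (8-2) analogue)
        h = torch.relu(O @ W.t() + b)
        return self.out(h)                                # HMM state logits

model = CADNNJT(feat_dim=40, second_dim=40, hidden=256, num_states=500, num_conditions=8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # one optimizer updates both networks
criterion = nn.CrossEntropyLoss()
# Training step (O: (T, 40) features, U: (T, 40) second features, labels: (T,) HMM state targets):
# loss = criterion(model(O, U), labels); loss.backward(); optimizer.step()
```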
- The feature quantity extraction unit 101 and the second feature quantity extraction unit 102 may extract, as the feature quantities, a first feature quantity and a second feature quantity different from the first feature quantity.
- the condition feature quantity calculation unit 103 calculates the acoustic condition feature quantity based on the acoustic condition calculation model parameter and the second feature quantity.
- the acoustic model parameter updating unit 123 updates the acoustic model parameter based on the corrected parameter and the first feature amount.
- the parameter update unit 125 of the condition feature quantity calculation unit updates the acoustic condition calculation model parameter based on the corrected parameter and the second feature quantity.
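As one hypothetical example of this variant (the particular feature types are our assumption and are not specified in the text), the first feature quantity could be frame-level filterbank features fed to the acoustic model, while the second feature quantity could be a compact utterance-level summary fed to the acoustic condition calculation model.

```python
import numpy as np

def extract_first_features(fbank_frames):
    """First feature quantity: frame-level features for the acoustic model, shape (T, D)."""
    return fbank_frames

def extract_second_features(fbank_frames):
    """Second feature quantity: an utterance-level summary for the condition calculation model."""
    mean = fbank_frames.mean(axis=0)
    std = fbank_frames.std(axis=0)
    return np.concatenate([mean, std])[None, :]   # shape (1, 2D)
```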
- For speech recognition, the feature quantity extraction unit 201 and the second feature quantity extraction unit 202 extract feature quantities indicating features of the speech data.
- The condition feature quantity calculation unit 203 for speech recognition calculates, using the acoustic condition calculation model, the acoustic condition feature quantity indicating the feature of the acoustic condition of the speech data, based on the acoustic condition calculation model parameter and on the feature quantity.
- The acoustic model parameter correction unit 204 for speech recognition generates a corrected parameter, which is a parameter obtained by correcting the acoustic model parameter based on the acoustic condition feature quantity.
- The word string search unit 206 outputs a word string using the output probability of each HMM state, calculated based on the feature quantity and the corrected parameter, and the generation probability of the language model. Since speech recognition can be performed using acoustic condition calculation model parameters that satisfy the optimization criterion for speech recognition, the accuracy of speech recognition is improved.
- Table 1 shows the results (word error rate) when the acoustic model is adapted to the speaker, without supervision, for each utterance in the speech recognition task AURORA4. Since the acoustic condition feature quantity is calculated for each utterance (about several seconds), fast acoustic model adaptation is performed based on a small amount of data. Three methods are compared: the baseline (speech recognition based on a conventional neural network), CADNN, and CADNN-JT, the method according to the present invention used in the embodiment. Table 1 shows that the present invention achieves higher performance than conventional speech recognition (the baseline) and the conventional CADNN.
- Each component of the acoustic model learning device 10 shown in FIG. 10 and the speech recognition device 20 shown in FIG. 12 is functionally conceptual and does not necessarily need to be physically configured as illustrated.
- The specific forms of distribution and integration of the functions of the acoustic model learning device 10 and the speech recognition device 20 are not limited to those illustrated, and all or a part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
- For example, the feature quantity extraction unit 101 and the second feature quantity extraction unit 102 may be an integrated functional unit that outputs different feature quantities to the HMM state output probability calculation unit 105 and the condition feature quantity calculation unit 103, respectively. The same applies to the feature quantity extraction unit 201 and the second feature quantity extraction unit 202.
- The acoustic model learning device 10 and the speech recognition device 20 may be an integrated device.
- In that case, functional units having the same functions in the acoustic model learning device 10 and the speech recognition device 20, such as the feature quantity extraction unit 101 and the feature quantity extraction unit 201 or the HMM state output probability calculation unit 105 and the HMM state output probability calculation unit 205, may each be implemented as a single functional unit.
- Each process in the acoustic model learning device 10 and the speech recognition device 20 is not limited to the illustrated one; the processing order may be changed, and the processes may be integrated or separated.
- the processing order of steps S101 to S104 and steps S201 to S205 in the embodiment may be changed.
- all or some of the processes performed in the acoustic model learning device 10 and the speech recognition device 20 may be realized by a processing device such as a CPU and a program that is analyzed and executed by the processing device.
- Each process performed in the acoustic model learning device 10 and the speech recognition device 20 may be realized as hardware by wired logic.
- As one embodiment, the acoustic model learning device and the speech recognition device can be implemented by installing, on a desired computer, an acoustic model learning program or a speech recognition program that executes the above acoustic model learning or speech recognition as packaged software or online software.
- For example, by causing an information processing apparatus to execute the above acoustic model learning program or speech recognition program, the information processing apparatus can be made to function as the acoustic model learning device or the speech recognition device.
- the information processing apparatus referred to here includes a desktop or notebook personal computer.
- the information processing apparatus includes mobile communication terminals such as smartphones, mobile phones and PHS (Personal Handyphone System), and slate terminals such as PDA (Personal Digital Assistant).
- the acoustic model learning device and the speech recognition device can be implemented as a server device that uses the terminal device used by the user as a client and provides the client with services related to acoustic model learning or speech recognition.
- the acoustic model learning apparatus is implemented as a server apparatus that provides an acoustic model learning service that receives learning speech data as an input and outputs an acoustic condition calculation model as an output.
- the voice recognition device is implemented as a server device that provides a voice recognition service that receives recognition voice data as an input and outputs a word string as a recognition result.
- the acoustic model learning device and the speech recognition device may be implemented as a Web server, or may be implemented as a cloud that provides the above-described acoustic model learning or speech recognition service by outsourcing.
- FIG. 14 is a diagram illustrating an example of a computer in which an acoustic model learning device or a speech recognition device is realized by executing a program.
- the computer 1000 includes a memory 1010 and a CPU 1020, for example.
- the computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
- the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012.
- the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
- the hard disk drive interface 1030 is connected to the hard disk drive 1090.
- the disk drive interface 1040 is connected to the disk drive 1100.
- a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100.
- the serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example.
- the video adapter 1060 is connected to the display 1130, for example.
- The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the acoustic model learning device or the speech recognition device is implemented as a program module 1093 in which computer-executable code is described.
- the program module 1093 is stored in the hard disk drive 1090, for example.
- a program module 1093 for executing processing similar to the functional configuration in the acoustic model learning device or the speech recognition device is stored in the hard disk drive 1090.
- the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
- the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 and executes them as necessary.
- The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), etc.) and read by the CPU 1020 via the network interface 1070.
Description
The first conventional technique is, for example, the speech recognition technique described in Reference 1: G. Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition, The shared views of four research groups," IEEE SIGNAL PROCESSING MAGAZINE, Vol. 29, No. 6, pp. 82-97, 2012. FIG. 1 is a diagram illustrating an example of the configuration of a speech recognition apparatus according to the first conventional technique. As shown in FIG. 1, the speech recognition apparatus 20a according to the first conventional technique has a feature quantity extraction unit 201a, an HMM state output probability calculation unit 205a, and a word string search unit 206a. The speech recognition apparatus 20a is connected to a storage unit 250a.
Here, in general, acoustic conditions such as the acoustic environment and speaker characteristics differ between the training of an acoustic model and recognition. Therefore, in the speech recognition of the first conventional technique, the acoustic model does not match the feature quantities at recognition time, and sufficient recognition performance cannot be obtained. In order to match the acoustic model with the feature quantities at recognition time, there is a second conventional technique that performs speech recognition after correcting (re-estimating) the parameters of the acoustic model (hereinafter referred to as acoustic model correction). The second conventional technique is, for example, the speech recognition technique described in Reference 2: H. Liao, "SPEAKER ADAPTATION OF CONTEXT DEPENDENT DEEP NEURAL NETWORKS," in Proc. of ICASSP'13, 2013, pp. 7947-7951. The differences of the second conventional technique, which performs acoustic model correction, from the first conventional technique are described below.
The following describes the case where an acoustic model relearning device 10c having the acoustic model correction (re-estimation) function according to the third conventional technique is applied to the speech recognition apparatus 20b according to the second conventional technique. FIG. 6 is a diagram illustrating an example of the configuration of the acoustic model relearning device according to the third conventional technique. The acoustic model relearning device 10c has a feature quantity extraction unit 101c and an acoustic model correction unit 104c. The acoustic model relearning device 10c is connected to a storage unit 150c.
Embodiments of the acoustic model learning method, speech recognition method, acoustic model learning device, speech recognition device, acoustic model learning program, and speech recognition program disclosed in the present application are described below. The following embodiments are merely examples and do not limit the technology disclosed in the present application. The embodiments described below and other embodiments may be combined as appropriate to the extent that no contradiction arises.
Reference 4: M. Delcroix, K. Kinoshita, T. Hori, T. Nakatani, "Context adaptive deep neural networks for fast acoustic model adaptation," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4535-4539, describes CADNN (Context Adaptive Deep Neural Network), an acoustic model whose parameters are trained in association with acoustic condition feature quantities.
FIG. 8 is a diagram illustrating an overview of a conventional acoustic-condition-adaptive acoustic model. In CADNN, a conventional acoustic-condition-adaptive acoustic model, a hidden layer of the NN is decomposed for each element of the vector indicating the acoustic condition feature quantity (hereinafter referred to as the "acoustic condition feature quantity vector"), as shown in FIG. 8. FIG. 8 shows, as an example, a state in which one hidden layer (the i-th hidden layer) is decomposed, but at least one hidden layer or all hidden layers may be decomposed. The output after the linear transformation of the decomposed hidden layer is calculated as in the following Expression (5).
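Expression (5) is not reproduced in this part of the document; the sketch below shows one common form of such a decomposed layer from the CADNN literature, in which each element of the acoustic condition feature quantity vector y weights its own linear transformation and the weighted sum gives the layer's pre-activation output. Treat this as an assumed illustration rather than the patent's exact expression.

```python
import numpy as np

def decomposed_layer_output(h_prev, W_k, b_k, y):
    """h_prev: (in_dim,); W_k: (K, out_dim, in_dim); b_k: (K, out_dim); y: (K,)."""
    # Weighted sum of the K sub-layer linear transformations; equivalently, the
    # condition-combined weights could be formed first and applied once.
    return sum(y[k] * (W_k[k] @ h_prev + b_k[k]) for k in range(len(y)))
```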
FIG. 9 is a diagram illustrating an overview of the acoustic-condition-adaptive acoustic model according to the embodiment. As shown in FIG. 9, in CADNN-JT, the acoustic-condition-adaptive acoustic model according to the embodiment, a neural network is used as the calculation model of the acoustic condition feature quantity representing the acoustic condition, and the parameters of that neural network are optimized jointly with the parameters of the conventional CADNN. That is, the acoustic model and the acoustic condition calculation model are each represented by a neural network having one or more hidden layers, and the output layer of the acoustic condition calculation model is coupled to the acoustic model.
(Configuration of the acoustic model learning device according to the embodiment)
Based on the mathematical background described above, an example of the embodiment is described below. FIG. 10 is a diagram illustrating an example of the configuration of the acoustic model learning device according to the embodiment. As shown in FIG. 10, the acoustic model learning device 10 according to the embodiment has a feature quantity extraction unit 101, a second feature quantity extraction unit 102, a condition feature quantity calculation unit 103, an acoustic model parameter correction unit 104, an HMM state output probability calculation unit 105, an error calculation unit 121, an acoustic model parameter derivative calculation unit 122, an acoustic model parameter update unit 123, a parameter derivative calculation unit 124 of the condition feature quantity calculation unit, a parameter update unit 125 of the condition feature quantity calculation unit, and a convergence determination unit 126. The acoustic model learning device 10 is connected to a storage unit 150.
FIG. 11 is a flowchart illustrating an example of the acoustic model learning process according to the embodiment. First, the acoustic model learning device 10 reads the acoustic model (acoustic model parameter Λ) from the storage unit 150 (step S101). Next, the acoustic model learning device 10 reads the acoustic condition calculation model (acoustic condition calculation model parameter Ω) from the storage unit 150 (step S102). Next, the acoustic model learning device 10 reads the learning speech data (step S103). Next, the acoustic model learning device 10 reads the correct symbol sequence -Sr (step S104).
FIG. 12 is a diagram illustrating an example of the configuration of the speech recognition device according to the embodiment. As shown in FIG. 12, the speech recognition device 20 according to the embodiment has a feature quantity extraction unit 201, a second feature quantity extraction unit 202, a condition feature quantity calculation unit 203, an acoustic model parameter correction unit 204, an HMM state output probability calculation unit 205, and a word string search unit 206. The speech recognition device 20 is connected to a storage unit 250.
FIG. 13 is a flowchart illustrating an example of the speech recognition process according to the embodiment. First, the speech recognition device 20 reads the acoustic model (acoustic model parameter Λ) from the storage unit 250 (step S201). Next, the speech recognition device 20 reads the acoustic condition calculation model from the storage unit 250 (step S202). Next, the speech recognition device 20 reads the language model from the storage unit 250 (step S203). Next, the speech recognition device 20 reads the recognition speech data (step S204). Next, the speech recognition device 20 reads the acoustic model parameter correction parameter η and the acoustic condition calculation model parameter correction parameter η′ from the storage unit 250 (step S205).
In the embodiment, the case where an acoustic model based on a DNN (CADNN) is combined with the acoustic condition calculation model has been described. However, the model is not limited to a DNN; acoustic models based on various neural networks, such as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), or a BLSTM (Bidirectional Long Short-Term Memory) network, can be combined with the acoustic condition calculation model and formulated in the same way. The acoustic model in the present invention is not limited to one based on the HMM, and can be any acoustic model that calculates output probabilities using a neural network. For example, the acoustic model in the present invention may be a model based on CTC (Connectionist Temporal Classification) or an encoder-decoder.
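As a hypothetical illustration of one such alternative (not a procedure stated in the text), the condition-adapted acoustic network could be trained with a CTC objective instead of frame-level HMM-state cross-entropy; the framework and objective below are assumptions.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
# logits: (T, batch, num_labels) from a condition-adapted acoustic network
# log_probs = logits.log_softmax(dim=-1)
# loss = ctc(log_probs, targets, input_lengths, target_lengths)
# loss.backward() would then propagate the error through both networks, as before.
```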
In the acoustic model learning device 10 of the embodiment, the feature quantity extraction unit 101 and the second feature quantity extraction unit 102 extract feature quantities indicating features of the speech data. The condition feature quantity calculation unit 103 calculates, using the acoustic condition calculation model, an acoustic condition feature quantity indicating the feature of the acoustic condition of the speech data, based on the acoustic condition calculation model parameter characterizing the acoustic condition calculation model represented by a neural network and on the feature quantity. The acoustic model parameter correction unit 104 generates a corrected parameter, which is a parameter obtained by correcting, based on the acoustic condition feature quantity, the acoustic model parameter characterizing the acoustic model represented by a neural network to which the output layer of the acoustic condition calculation model is coupled. The acoustic model parameter update unit 123 updates the acoustic model parameter based on the corrected parameter and the feature quantity. The parameter update unit 125 of the condition feature quantity calculation unit updates the acoustic condition calculation model parameter based on the corrected parameter and the feature quantity.
Each component of the acoustic model learning device 10 shown in FIG. 10 and the speech recognition device 20 shown in FIG. 12 is functionally conceptual and does not necessarily need to be physically configured as illustrated. That is, the specific forms of distribution and integration of the functions of the acoustic model learning device 10 and the speech recognition device 20 are not limited to those illustrated, and all or a part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the feature quantity extraction unit 101 and the second feature quantity extraction unit 102 may be an integrated functional unit that outputs different feature quantities to the HMM state output probability calculation unit 105 and the condition feature quantity calculation unit 103, respectively. The same applies to the feature quantity extraction unit 201 and the second feature quantity extraction unit 202.
As one embodiment, the acoustic model learning device and the speech recognition device can be implemented by installing, on a desired computer, an acoustic model learning program or a speech recognition program that executes the above acoustic model learning or speech recognition as packaged software or online software. For example, by causing an information processing apparatus to execute the above acoustic model learning program or speech recognition program, the information processing apparatus can be made to function as the acoustic model learning device or the speech recognition device. The information processing apparatus referred to here includes desktop and notebook personal computers. In addition, the category of information processing apparatuses includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants).
20 Speech recognition device
101, 201 Feature quantity extraction unit
102, 202 Second feature quantity extraction unit
103, 203 Condition feature quantity calculation unit
104, 204 Acoustic model parameter correction unit
105, 205 HMM state output probability calculation unit
121 Error calculation unit
122 Acoustic model parameter derivative calculation unit
123 Acoustic model parameter update unit
124 Parameter derivative calculation unit of the condition feature quantity calculation unit
125 Parameter update unit of the condition feature quantity calculation unit
126 Convergence determination unit
206 Word string search unit
150, 250 Storage unit
Claims (7)
- 1. An acoustic model learning method comprising: a feature quantity extraction step of extracting a speech feature quantity indicating a feature of speech data; an acoustic condition feature quantity calculation step of calculating, using a calculation model of an acoustic condition represented by a neural network, an acoustic condition feature quantity indicating a feature of the acoustic condition of the speech data, based on an acoustic condition calculation model parameter characterizing the acoustic condition calculation model and on the speech feature quantity; an acoustic model parameter correction step of generating a corrected parameter, which is a parameter obtained by correcting, based on the acoustic condition feature quantity, an acoustic model parameter characterizing an acoustic model represented by a neural network to which an output layer of the acoustic condition calculation model is coupled; an acoustic model parameter update step of updating the acoustic model parameter based on the corrected parameter and the speech feature quantity; and an acoustic condition calculation model parameter update step of updating the acoustic condition calculation model parameter based on the corrected parameter and the speech feature quantity.
- 2. The acoustic model learning method according to claim 1, wherein the feature quantity extraction step extracts, as the speech feature quantity, a first speech feature quantity and a second feature quantity different from the first speech feature quantity; the acoustic condition feature quantity calculation step calculates the acoustic condition feature quantity based on the acoustic condition calculation model parameter and the second feature quantity; the acoustic model parameter update step updates the acoustic model parameter based on the corrected parameter and the first speech feature quantity; and the acoustic condition calculation model parameter update step updates the acoustic condition calculation model parameter based on the corrected parameter and the second feature quantity.
- 3. A speech recognition method for performing speech recognition using the acoustic model parameter and the acoustic condition calculation model parameter updated by the acoustic model learning method according to claim 1 and a language model that models the generation probability of word strings, the speech recognition method comprising: a feature quantity extraction step for speech recognition of extracting a speech feature quantity indicating a feature of speech data; an acoustic condition feature quantity calculation step for speech recognition of calculating, using the acoustic condition calculation model, an acoustic condition feature quantity indicating a feature of the acoustic condition of the speech data, based on the acoustic condition calculation model parameter and on the speech feature quantity; an acoustic model parameter correction step for speech recognition of generating a corrected parameter, which is a parameter obtained by correcting the acoustic model parameter based on the acoustic condition feature quantity; and a word string search step of outputting a word string using the output probability of the acoustic model calculated based on the speech feature quantity and the corrected parameter, and the generation probability of the language model.
- 4. An acoustic model learning device comprising: a feature quantity extraction unit that extracts a speech feature quantity indicating a feature of speech data; an acoustic condition feature quantity calculation unit that calculates, using a calculation model of an acoustic condition represented by a neural network, an acoustic condition feature quantity indicating a feature of the acoustic condition of the speech data, based on an acoustic condition calculation model parameter characterizing the acoustic condition calculation model and on the speech feature quantity; an acoustic model parameter correction unit that generates a corrected parameter, which is a parameter obtained by correcting, based on the acoustic condition feature quantity, an acoustic model parameter characterizing an acoustic model to which an output layer of the acoustic condition calculation model is coupled; an acoustic model parameter update unit that updates the acoustic model parameter based on the corrected parameter and the speech feature quantity; and an acoustic condition calculation model parameter update unit that updates the acoustic condition calculation model parameter based on the corrected parameter and the speech feature quantity.
- 5. A speech recognition device that performs speech recognition using the acoustic model parameter and the acoustic condition calculation model parameter updated by the acoustic model learning device according to claim 4 and a language model that models the generation probability of word strings, the speech recognition device comprising: a feature quantity extraction unit for speech recognition that extracts a speech feature quantity indicating a feature of speech data; an acoustic condition feature quantity calculation unit for speech recognition that calculates, using the acoustic condition calculation model, an acoustic condition feature quantity indicating a feature of the acoustic condition of the speech data, based on the acoustic condition calculation model parameter and on the speech feature quantity; an acoustic model parameter correction unit for speech recognition that generates a corrected parameter, which is a parameter obtained by correcting the acoustic model parameter based on the acoustic condition feature quantity; and a word string search unit that outputs a word string using the output probability of the acoustic model calculated based on the speech feature quantity and the corrected parameter, and the generation probability of the language model.
- 6. An acoustic model learning program that causes a computer to function as the acoustic model learning device according to claim 4.
- 7. A speech recognition program that causes a computer to function as the speech recognition device according to claim 5.
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2017565514A JP6637078B2 (ja) | 2016-02-02 | 2017-01-26 | 音響モデル学習装置、音響モデル学習方法及びプログラム |
| US16/074,367 US11264044B2 (en) | 2016-02-02 | 2017-01-26 | Acoustic model training method, speech recognition method, acoustic model training apparatus, speech recognition apparatus, acoustic model training program, and speech recognition program |
| CN201780009153.4A CN108701452B (zh) | 2016-02-02 | 2017-01-26 | 音频模型学习方法、语音识别方法、音频模型学习装置、语音识别装置及记录介质 |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2016018016 | 2016-02-02 | ||
| JP2016-018016 | 2016-02-02 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2017135148A1 true WO2017135148A1 (ja) | 2017-08-10 |
Family
ID=59499773
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2017/002740 Ceased WO2017135148A1 (ja) | 2016-02-02 | 2017-01-26 | 音響モデル学習方法、音声認識方法、音響モデル学習装置、音声認識装置、音響モデル学習プログラムおよび音声認識プログラム |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US11264044B2 (ja) |
| JP (1) | JP6637078B2 (ja) |
| CN (1) | CN108701452B (ja) |
| WO (1) | WO2017135148A1 (ja) |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2018031812A (ja) * | 2016-08-22 | 2018-03-01 | 日本電信電話株式会社 | 音声データ処理装置、音声データ処理方法および音声データ処理プログラム |
| WO2019171925A1 (ja) * | 2018-03-08 | 2019-09-12 | 日本電信電話株式会社 | 言語モデルを利用する装置、方法及びプログラム |
| JP2020012928A (ja) * | 2018-07-17 | 2020-01-23 | 国立研究開発法人情報通信研究機構 | 耐雑音音声認識装置及び方法、並びにコンピュータプログラム |
| CN111540364A (zh) * | 2020-04-21 | 2020-08-14 | 同盾控股有限公司 | 音频识别方法、装置、电子设备及计算机可读介质 |
| CN111862952A (zh) * | 2019-04-26 | 2020-10-30 | 华为技术有限公司 | 一种去混响模型训练方法及装置 |
| JP2022028772A (ja) * | 2019-08-23 | 2022-02-16 | サウンドハウンド,インコーポレイテッド | オーディオデータおよび画像データに基づいて人の発声を解析する車載装置および発声処理方法、ならびにプログラム |
| US11551694B2 (en) | 2021-01-05 | 2023-01-10 | Comcast Cable Communications, Llc | Methods, systems and apparatuses for improved speech recognition and transcription |
| JP2024547129A (ja) * | 2021-12-22 | 2024-12-26 | ビゴ テクノロジー ピーティーイー. リミテッド | モデル訓練及び音色変換方法、装置、デバイス及び媒体 |
Families Citing this family (19)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3692634B1 (en) * | 2017-10-04 | 2025-09-10 | Google LLC | Methods and systems for automatically equalizing audio output based on room characteristics |
| JP6891144B2 (ja) * | 2018-06-18 | 2021-06-18 | ヤフー株式会社 | 生成装置、生成方法及び生成プログラム |
| US10380997B1 (en) * | 2018-07-27 | 2019-08-13 | Deepgram, Inc. | Deep learning internal state index-based search and classification |
| CN109979436B (zh) * | 2019-04-12 | 2020-11-13 | 南京工程学院 | 一种基于频谱自适应法的bp神经网络语音识别系统及方法 |
| CN110503944B (zh) * | 2019-08-29 | 2021-09-24 | 思必驰科技股份有限公司 | 语音唤醒模型的训练和使用方法及装置 |
| CN114627863B (zh) * | 2019-09-24 | 2024-03-22 | 腾讯科技(深圳)有限公司 | 一种基于人工智能的语音识别方法和装置 |
| CN110827801B (zh) * | 2020-01-09 | 2020-04-17 | 成都无糖信息技术有限公司 | 一种基于人工智能的自动语音识别方法及系统 |
| US12321853B2 (en) * | 2020-01-17 | 2025-06-03 | Syntiant | Systems and methods for neural network training via local target signal augmentation |
| CN111415682A (zh) * | 2020-04-03 | 2020-07-14 | 北京乐界乐科技有限公司 | 一种用于乐器的智能评测方法 |
| CN111477249A (zh) * | 2020-04-03 | 2020-07-31 | 北京乐界乐科技有限公司 | 一种用于乐器的智能评分方法 |
| US11244668B2 (en) * | 2020-05-29 | 2022-02-08 | TCL Research America Inc. | Device and method for generating speech animation |
| CN112466285B (zh) * | 2020-12-23 | 2022-01-28 | 北京百度网讯科技有限公司 | 离线语音识别方法、装置、电子设备及存储介质 |
| CN113035177B (zh) * | 2021-03-11 | 2024-02-09 | 平安科技(深圳)有限公司 | 声学模型训练方法及装置 |
| CN113129870B (zh) * | 2021-03-23 | 2022-03-25 | 北京百度网讯科技有限公司 | 语音识别模型的训练方法、装置、设备和存储介质 |
| CN113327585B (zh) * | 2021-05-31 | 2023-05-12 | 杭州芯声智能科技有限公司 | 一种基于深度神经网络的自动语音识别方法 |
| CN117546235A (zh) * | 2021-06-22 | 2024-02-09 | 发那科株式会社 | 语音识别装置 |
| US11862147B2 (en) * | 2021-08-13 | 2024-01-02 | Neosensory, Inc. | Method and system for enhancing the intelligibility of information for a user |
| CN113823275A (zh) * | 2021-09-07 | 2021-12-21 | 广西电网有限责任公司贺州供电局 | 一种用于电网调度的语音识别方法及系统 |
| CN114842837B (zh) * | 2022-07-04 | 2022-09-02 | 成都启英泰伦科技有限公司 | 一种快速声学模型训练方法 |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2012053218A (ja) * | 2010-08-31 | 2012-03-15 | Nippon Hoso Kyokai <Nhk> | 音響処理装置および音響処理プログラム |
| JP2015102806A (ja) * | 2013-11-27 | 2015-06-04 | 国立研究開発法人情報通信研究機構 | 統計的音響モデルの適応方法、統計的音響モデルの適応に適した音響モデルの学習方法、ディープ・ニューラル・ネットワークを構築するためのパラメータを記憶した記憶媒体、及び統計的音響モデルの適応を行なうためのコンピュータプログラム |
Family Cites Families (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2733955B2 (ja) * | 1988-05-18 | 1998-03-30 | 日本電気株式会社 | 適応型音声認識装置 |
| JP4230254B2 (ja) * | 2003-03-12 | 2009-02-25 | 日本電信電話株式会社 | 音声生成モデル話者適応化方法、その装置、そのプログラム及びその記録媒体 |
| JP4950600B2 (ja) * | 2006-09-05 | 2012-06-13 | 日本電信電話株式会社 | 音響モデル作成装置、その装置を用いた音声認識装置、これらの方法、これらのプログラム、およびこれらの記録媒体 |
| JP5738216B2 (ja) * | 2012-02-27 | 2015-06-17 | 日本電信電話株式会社 | 特徴量補正パラメータ推定装置、音声認識システム、特徴量補正パラメータ推定方法、音声認識方法及びプログラム |
| JP5982297B2 (ja) * | 2013-02-18 | 2016-08-31 | 日本電信電話株式会社 | 音声認識装置、音響モデル学習装置、その方法及びプログラム |
| US9177550B2 (en) * | 2013-03-06 | 2015-11-03 | Microsoft Technology Licensing, Llc | Conservatively adapting a deep neural network in a recognition system |
| CN104143327B (zh) * | 2013-07-10 | 2015-12-09 | 腾讯科技(深圳)有限公司 | 一种声学模型训练方法和装置 |
| CN104376842A (zh) * | 2013-08-12 | 2015-02-25 | 清华大学 | 神经网络语言模型的训练方法、装置以及语音识别方法 |
| GB2517503B (en) * | 2013-08-23 | 2016-12-28 | Toshiba Res Europe Ltd | A speech processing system and method |
| US9401148B2 (en) * | 2013-11-04 | 2016-07-26 | Google Inc. | Speaker verification using neural networks |
| US9324321B2 (en) * | 2014-03-07 | 2016-04-26 | Microsoft Technology Licensing, Llc | Low-footprint adaptation and personalization for a deep neural network |
| CN104157290B (zh) * | 2014-08-19 | 2017-10-24 | 大连理工大学 | 一种基于深度学习的说话人识别方法 |
| KR102449837B1 (ko) * | 2015-02-23 | 2022-09-30 | 삼성전자주식회사 | 신경망 학습 방법 및 장치, 및 인식 방법 및 장치 |
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2012053218A (ja) * | 2010-08-31 | 2012-03-15 | Nippon Hoso Kyokai <Nhk> | 音響処理装置および音響処理プログラム |
| JP2015102806A (ja) * | 2013-11-27 | 2015-06-04 | 国立研究開発法人情報通信研究機構 | 統計的音響モデルの適応方法、統計的音響モデルの適応に適した音響モデルの学習方法、ディープ・ニューラル・ネットワークを構築するためのパラメータを記憶した記憶媒体、及び統計的音響モデルの適応を行なうためのコンピュータプログラム |
Non-Patent Citations (2)
| Title |
|---|
| MARC DELCROIX ET AL.: "Context adaptive deep neural networks for fast acoustic model adaptation", 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTIC, SPEECH AND SIGNAL PROCESSING, April 2015 (2015-04-01), XP033064557 * |
| MARC DELCROIX ET AL.: "Context adaptive deep neural networks for fast acoustic model adaptation in noisy conditions", 2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, March 2016 (2016-03-01), XP032901609 * |
Cited By (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2018031812A (ja) * | 2016-08-22 | 2018-03-01 | 日本電信電話株式会社 | 音声データ処理装置、音声データ処理方法および音声データ処理プログラム |
| WO2019171925A1 (ja) * | 2018-03-08 | 2019-09-12 | 日本電信電話株式会社 | 言語モデルを利用する装置、方法及びプログラム |
| JP2019159464A (ja) * | 2018-03-08 | 2019-09-19 | 日本電信電話株式会社 | 言語モデルを利用する装置、方法及びプログラム |
| JP7231181B2 (ja) | 2018-07-17 | 2023-03-01 | 国立研究開発法人情報通信研究機構 | 耐雑音音声認識装置及び方法、並びにコンピュータプログラム |
| WO2020017226A1 (ja) * | 2018-07-17 | 2020-01-23 | 国立研究開発法人情報通信研究機構 | 耐雑音音声認識装置及び方法、並びにコンピュータプログラム |
| JP2020012928A (ja) * | 2018-07-17 | 2020-01-23 | 国立研究開発法人情報通信研究機構 | 耐雑音音声認識装置及び方法、並びにコンピュータプログラム |
| CN111862952A (zh) * | 2019-04-26 | 2020-10-30 | 华为技术有限公司 | 一种去混响模型训练方法及装置 |
| CN111862952B (zh) * | 2019-04-26 | 2024-04-12 | 华为技术有限公司 | 一种去混响模型训练方法及装置 |
| JP2022028772A (ja) * | 2019-08-23 | 2022-02-16 | サウンドハウンド,インコーポレイテッド | オーディオデータおよび画像データに基づいて人の発声を解析する車載装置および発声処理方法、ならびにプログラム |
| JP7525460B2 (ja) | 2019-08-23 | 2024-07-30 | サウンドハウンド,インコーポレイテッド | オーディオデータおよび画像データに基づいて人の発声を解析するコンピューティングデバイスおよび発声処理方法、ならびにプログラム |
| CN111540364A (zh) * | 2020-04-21 | 2020-08-14 | 同盾控股有限公司 | 音频识别方法、装置、电子设备及计算机可读介质 |
| US11551694B2 (en) | 2021-01-05 | 2023-01-10 | Comcast Cable Communications, Llc | Methods, systems and apparatuses for improved speech recognition and transcription |
| US11869507B2 (en) | 2021-01-05 | 2024-01-09 | Comcast Cable Communications, Llc | Methods, systems and apparatuses for improved speech recognition and transcription |
| JP2024547129A (ja) * | 2021-12-22 | 2024-12-26 | ビゴ テクノロジー ピーティーイー. リミテッド | モデル訓練及び音色変換方法、装置、デバイス及び媒体 |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2017135148A1 (ja) | 2018-11-29 |
| US20210193161A1 (en) | 2021-06-24 |
| US11264044B2 (en) | 2022-03-01 |
| CN108701452B (zh) | 2023-09-26 |
| CN108701452A (zh) | 2018-10-23 |
| JP6637078B2 (ja) | 2020-01-29 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17747306; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 201780009153.4; Country of ref document: CN |
| | WWE | Wipo information: entry into national phase | Ref document number: 2017565514; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 17747306; Country of ref document: EP; Kind code of ref document: A1 |