WO2018153200A1 - HLSTM model-based acoustic modeling method and device, and storage medium - Google Patents
HLSTM model-based acoustic modeling method and device, and storage medium
- Publication number
- WO2018153200A1 (PCT/CN2018/073887)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- hlstm
- training
- state
- lstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Description
Cross-Reference to Related Applications
This application is based on and claims priority to Chinese Patent Application No. 201710094191.6, filed on February 21, 2017, the entire contents of which are incorporated herein by reference.
The present disclosure relates to the field of speech recognition technologies, and in particular, to an acoustic modeling method, apparatus, and storage medium based on the Highway Long Short Time Memory (HLSTM) model.
In recent years, large-vocabulary continuous speech recognition systems have made significant progress. Traditional speech recognition systems use the Hidden Markov Model (HMM) to express the time-varying characteristics of speech signals and the Gaussian Mixture Model (GMM) to model the pronunciation diversity of speech signals. Later, deep learning techniques were introduced into speech recognition research, significantly improving the performance of speech recognition systems and truly pushing speech recognition to a commercially usable level. Because of the huge practical value of speech recognition technology, the field has become a research hotspot for technology giants, Internet companies, and well-known universities. After the Deep Neural Network (DNN) was introduced into speech recognition, sequence-discriminative training of neural networks and the application of the Convolutional Neural Network (CNN) to speech recognition were studied further.
Subsequently, the Long Short Time Memory (LSTM) model was introduced into acoustic modeling; compared with a simple feedforward network, the LSTM model has stronger acoustic modeling capability. As the amount of data keeps growing, the number of layers of the acoustic model neural network needs to be increased to improve modeling capability. However, as the LSTM network becomes deeper, training becomes more difficult and the vanishing-gradient problem appears. To avoid vanishing gradients, an HLSTM model based on the LSTM model was proposed, which introduces direct connections between the memory cells of adjacent layers of the LSTM model.
The HLSTM model enables deeper network structures to be applied in practical recognition systems and greatly improves recognition accuracy. Although the deep HLSTM model has stronger modeling capability, the additional layers and the newly introduced connections (the direct connections mentioned above) also give the acoustic model a more complex network structure, so the forward calculation takes longer and decoding ultimately becomes slower. Therefore, how to improve performance without increasing the complexity of the acoustic model is a problem to be solved.
Summary of the Invention
Embodiments of the present disclosure provide an acoustic modeling method, apparatus, and storage medium based on an HLSTM model.
The technical solutions of the embodiments of the present disclosure are implemented as follows:
An embodiment of the present disclosure provides an acoustic modeling method based on an HLSTM model, including:
training a randomly initialized HLSTM model based on a preset function, and optimizing the training result;
passing training data through the optimized HLSTM model for forward calculation;
training a randomly initialized LSTM model based on the result of the forward calculation and the preset function, the obtained model being the acoustic model of a speech recognition system;
wherein the network parameters of the HLSTM model are the same as those of the LSTM model.
In the above solution, training the randomly initialized HLSTM model based on a preset function and optimizing the training result includes:
training the randomly initialized HLSTM model using a cross entropy objective function;
optimizing the trained HLSTM model according to the state-level minimum Bayes risk criterion.
In the above solution, the cross entropy objective function is:

$$F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_t\,\log p(y\mid X_t)$$

where $F_{CE}$ denotes the cross entropy objective function; $\hat{y}_t$ is the annotation value of the speech feature at time $t$ at the output point of state $y$; $p(y\mid X_t)$ is the output of the network at state point $y$ for the speech feature $X_t$ at time $t$; $X$ denotes the training data; $S$ is the number of output state points; and $N$ is the total duration of the speech features.
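As an illustration only (not part of the original disclosure), the frame-level CE objective can be evaluated as in the following minimal sketch, which assumes PyTorch, hard per-frame state labels, and a stand-in linear network in place of the HLSTM:

```python
# Minimal sketch of the frame-level cross entropy objective (assumes PyTorch).
import torch
import torch.nn as nn

N, S, D = 100, 2821, 260                 # frames, output states, feature dim
net = nn.Linear(D, S)                    # stand-in for the HLSTM network

features = torch.randn(N, D)             # X_t: per-frame speech features
labels = torch.randint(0, S, (N,))       # per-frame state annotations y_t

logits = net(features)
# CrossEntropyLoss applies log-softmax internally, so this evaluates
# F_CE = -(1/N) * sum_t log p(y_t | X_t) (averaged rather than summed).
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()
```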
In the above solution, the objective function corresponding to the state-level minimum Bayes risk criterion is:

$$F_{SMBR} = \sum_{u}\frac{\sum_{W} p(O_u\mid S_W)^{k}\,P(W)\,A(W,W_u)}{\sum_{W'} p(O_u\mid S_{W'})^{k}\,P(W')}$$

where $W_u$ is the annotated text of the speech; $W$ and $W'$ are the annotations corresponding to decoding paths of the seed model; $p(O_u\mid S)$ is the acoustic likelihood probability; $A(W,W_u)$ represents the number of correctly annotated states in the decoding state sequence; the seed model is the HLSTM model obtained after the optimization; $u$ is the index of the utterance number in the training data; $k$ is the acoustic score coefficient; $O_u$ is the speech feature of the $u$-th utterance; $S$ represents the state sequence of a decoding path; and $P(W)$ and $P(W')$ are both language model probability scores.
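For intuition only (again not part of the original disclosure), the SMBR objective for one utterance can be evaluated directly when the competing decoding paths are enumerated explicitly; in practice they come from a decoding lattice, and all scores below are made up:

```python
# Toy evaluation of the SMBR objective for a single utterance u (NumPy).
import numpy as np

k = 0.1  # acoustic score coefficient (hypothetical value)

# Per competing path W: (acoustic log-likelihood log p(O_u|S_W),
# language model probability score P(W), correct-state count A(W, W_u)).
paths = [(-120.0, 0.020, 45),
         (-123.0, 0.010, 40),
         (-125.0, 0.005, 30)]

log_ac = np.array([p[0] for p in paths])
lm     = np.array([p[1] for p in paths])
acc    = np.array([p[2] for p in paths])

weights = np.exp(k * log_ac) * lm                 # p(O_u|S_W)^k * P(W)
f_smbr_u = (weights * acc).sum() / weights.sum()  # expected correct states
print(f_smbr_u)  # SMBR training maximizes the sum of this over utterances
```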
In the above solution, the number of network layers of the HLSTM model is greater than or equal to the number of network layers of the LSTM model.
In the above solution, training the randomly initialized LSTM model based on the result of the forward calculation and the preset function includes:
obtaining the output result of each frame from the forward calculation;
training the randomly initialized LSTM model based on the output result of each frame and the cross entropy objective function, wherein $\hat{y}_t$ in the cross entropy objective function is the per-frame output result obtained by the forward calculation.
An embodiment of the present disclosure further provides an acoustic modeling apparatus based on an HLSTM model, including:
an HLSTM model processing module configured to train a randomly initialized HLSTM model based on a preset function and optimize the training result;
a calculation module configured to pass training data through the optimized HLSTM model for forward calculation;
an LSTM model processing module configured to train a randomly initialized Long Short Time Memory (LSTM) model based on the result of the forward calculation and the preset function, the obtained model being the acoustic model of a speech recognition system;
wherein the network parameters of the HLSTM model are the same as those of the LSTM model.
In the above solution, the HLSTM model processing module includes:
a first training unit configured to train the randomly initialized HLSTM model using a cross entropy objective function;
an optimization unit configured to optimize the trained HLSTM model according to the state-level minimum Bayes risk criterion.
In the above solution, the cross entropy objective function is:

$$F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_t\,\log p(y\mid X_t)$$

where $F_{CE}$ denotes the cross entropy objective function; $\hat{y}_t$ is the annotation value of the speech feature at time $t$ at the output point of state $y$; $p(y\mid X_t)$ is the output of the network at state point $y$ for the speech feature $X_t$ at time $t$; $X$ denotes the training data; $S$ is the number of output state points; and $N$ is the total duration of the speech features.
In the above solution, the objective function corresponding to the state-level minimum Bayes risk criterion is:

$$F_{SMBR} = \sum_{u}\frac{\sum_{W} p(O_u\mid S_W)^{k}\,P(W)\,A(W,W_u)}{\sum_{W'} p(O_u\mid S_{W'})^{k}\,P(W')}$$

where $W_u$ is the annotated text of the speech; $W$ and $W'$ are the annotations corresponding to decoding paths of the seed model; $p(O_u\mid S)$ is the acoustic likelihood probability; $A(W,W_u)$ represents the number of correctly annotated states in the decoding state sequence; the seed model is the HLSTM model obtained after the optimization; $u$ is the index of the utterance number in the training data; $k$ is the acoustic score coefficient; $O_u$ is the speech feature of the $u$-th utterance; $S$ represents the state sequence of a decoding path; and $P(W)$ and $P(W')$ are both language model probability scores.
In the above solution, the LSTM model processing module includes:
an obtaining unit configured to obtain the output result of each frame from the forward calculation;
a second training unit configured to train the randomly initialized LSTM model based on the output result of each frame and the cross entropy objective function, wherein $\hat{y}_t$ in the cross entropy objective function is the per-frame output result obtained by the forward calculation.
An embodiment of the present disclosure further provides a storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of any of the above methods.
With the HLSTM model-based acoustic modeling method, apparatus, and storage medium provided by the embodiments of the present disclosure, a randomly initialized HLSTM model is trained based on a preset function and the training result is optimized; training data is passed through the optimized HLSTM model for forward calculation; and a randomly initialized LSTM model is trained based on the result of the forward calculation and the preset function, the obtained model being the acoustic model of a speech recognition system, where the network parameters of the HLSTM model are the same as those of the LSTM model. The embodiments of the present disclosure transfer the network information of the optimized HLSTM model to the LSTM network through posterior probabilities, thereby improving the performance of the LSTM baseline model without increasing model complexity.
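For orientation only, the overall flow just summarized can be sketched as pseudocode; every helper name below is a hypothetical placeholder, not an API from the disclosure:

```python
# High-level sketch of the claimed training flow (hypothetical helper names).
def build_acoustic_model(train_data):
    hlstm = random_init_hlstm()                    # random initialization
    hlstm = train_with_ce(hlstm, train_data)       # preset (CE) function
    hlstm = optimize_with_smbr(hlstm, train_data)  # optimize training result

    frame_outputs = forward_compute(hlstm, train_data)  # per-frame outputs

    lstm = random_init_lstm(network_params_of(hlstm))   # same network params
    lstm = train_with_ce(lstm, train_data, labels=frame_outputs)
    return lstm   # acoustic model of the speech recognition system
```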
FIG. 1 is a schematic flowchart of an acoustic modeling method based on an HLSTM model according to an embodiment of the present disclosure;
FIG. 2 is a network structure diagram of a bidirectional HLSTM model according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an acoustic modeling apparatus based on an HLSTM model according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an HLSTM model processing module according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an LSTM model processing module according to an embodiment of the present disclosure.
The present disclosure is described in detail below in conjunction with specific embodiments.
FIG. 1 is a schematic flowchart of an acoustic modeling method based on an HLSTM model according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes:
Step 101: Train a randomly initialized HLSTM model based on a preset function, and optimize the training result.
Step 102: Pass training data through the optimized HLSTM model for forward calculation.
Step 103: Train a randomly initialized LSTM model based on the result of the forward calculation and the preset function; the obtained model is the acoustic model of a speech recognition system.
Here, the network parameters of the HLSTM model are the same as those of the LSTM model.
The HLSTM model and the LSTM model may both be bidirectional or both unidirectional. The network parameters may include: the number of input layer nodes, the number of output layer nodes, the input observation vector, the number of hidden layer nodes, the recursive delay, and the mapping layer connected after each hidden layer.
The embodiments of the present disclosure transfer the network information of the optimized HLSTM model to the LSTM network through posterior probabilities, thereby improving the performance of the LSTM baseline model without increasing model complexity.
As an example, the randomly initialized HLSTM model is shown in FIG. 2; the dashed boxes indicate the inter-layer memory cell connections (direct connections) added on the basis of the LSTM model. Because the HLSTM model introduces direct connections between the memory cells of adjacent layers, the vanishing-gradient problem can be avoided and network training becomes easier, so a deeper structure can be used in practical applications. On the other hand, constrained by the number of parameters, the number of network layers cannot be increased indefinitely, because a model with too many parameters relative to the amount of training data causes over-fitting. In actual use, the number of network layers of the HLSTM model can be adjusted according to the amount of training data available.
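To make the direct connection concrete, the sketch below implements one step of a highway LSTM cell, simplified from one published formulation (Zhang et al., 2016). This is an assumption for illustration only, since the patent defers the exact connection formula to FIG. 2:

```python
# Sketch of a Highway-LSTM step (assumes PyTorch). A carry gate d adds a
# direct connection from the lower layer's memory cell to this layer's cell;
# simplified from one published formulation, not the patent's exact formula.
import torch
import torch.nn as nn

class HighwayLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)  # standard LSTM gates
        self.carry = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, state, c_lower):
        h, c = self.cell(x, state)                        # usual LSTM update
        d = torch.sigmoid(self.carry(torch.cat([x, c_lower], dim=-1)))
        c = c + d * c_lower        # highway (direct) inter-layer connection
        return h, c

# Usage: cells in layer l+1 receive c_lower from layer l at the same time step.
cell = HighwayLSTMCell(260, 1024)
x, h0, c0, c_low = (torch.randn(8, 260), torch.zeros(8, 1024),
                    torch.zeros(8, 1024), torch.randn(8, 1024))
h1, c1 = cell(x, (h0, c0), c_low)
```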
In the embodiments of the present disclosure, training the randomly initialized HLSTM model based on a preset function and optimizing the training result includes:
training the randomly initialized HLSTM model using a cross entropy objective function;
optimizing the trained HLSTM model according to the state-level minimum Bayes risk criterion.
Here, the cross entropy objective function is:

$$F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_t\,\log p(y\mid X_t)$$

where $F_{CE}$ denotes the cross entropy objective function; $\hat{y}_t$ is the annotation value of the speech feature at time $t$ at the output point of state $y$; $p(y\mid X_t)$ is the output of the network at state point $y$ for the speech feature $X_t$ at time $t$; $X$ denotes the training data; $S$ is the number of output state points; and $N$ is the total duration of the speech features.
Here, the objective function corresponding to the state-level minimum Bayes risk criterion is:

$$F_{SMBR} = \sum_{u}\frac{\sum_{W} p(O_u\mid S_W)^{k}\,P(W)\,A(W,W_u)}{\sum_{W'} p(O_u\mid S_{W'})^{k}\,P(W')}$$

where $W_u$ is the annotated text of the speech; $W$ and $W'$ are the annotations corresponding to decoding paths of the seed model; $p(O_u\mid S)$ is the acoustic likelihood probability; $A(W,W_u)$ represents the number of correctly annotated states in the decoding state sequence; the seed model is the HLSTM model obtained after the optimization; $u$ is the index of the utterance number in the training data; $k$ is the acoustic score coefficient; $O_u$ is the speech feature of the $u$-th utterance; $S$ represents the state sequence of a decoding path; and $P(W)$ and $P(W')$ are both language model probability scores.
In the embodiments of the present disclosure, the number of network layers of the HLSTM model is greater than or equal to the number of network layers of the LSTM model.
In the embodiments of the present disclosure, training the randomly initialized LSTM model based on the result of the forward calculation and the preset function includes:
obtaining the output result of each frame from the forward calculation;
training the randomly initialized LSTM model based on the output result of each frame and the cross entropy objective function, wherein $\hat{y}_t$ in the cross entropy objective function is the per-frame output result obtained by the forward calculation.
Comparative experiments between HLSTM and LSTM show that the performance gain obtained by discriminative training on the LSTM model with direct connections introduced (i.e., the HLSTM model) is significantly larger than the gain obtained on the LSTM model; therefore, discriminative training is very meaningful for improving the performance of the HLSTM model.
An embodiment of the present disclosure further provides an acoustic modeling apparatus based on an HLSTM model, which is used to implement the above embodiments and specific implementations; details that have already been described are not repeated. As used below, the terms "module" and "unit" may refer to a combination of software and/or hardware that implements a predetermined function. As shown in FIG. 3, the apparatus includes:
an HLSTM model processing module 301 configured to train a randomly initialized HLSTM model based on a preset function and optimize the training result;
a calculation module 302 configured to pass training data through the optimized HLSTM model for forward calculation;
an LSTM model processing module 303 configured to train a randomly initialized LSTM model based on the result of the forward calculation and the preset function, the obtained model being the acoustic model of a speech recognition system;
wherein the network parameters of the HLSTM model are the same as those of the LSTM model.
The embodiments of the present disclosure transfer the network information of the optimized HLSTM model to the LSTM network through posterior probabilities, thereby improving the performance of the LSTM baseline model without increasing model complexity.
As an example, the randomly initialized HLSTM model is shown in FIG. 2; the dashed boxes indicate the inter-layer memory cell connections (direct connections) added on the basis of the LSTM model, with the connection formula shown in FIG. 2. Because the HLSTM model introduces direct connections between the memory cells of adjacent layers, the vanishing-gradient problem can be avoided and network training becomes easier, so a deeper structure can be used in practical applications. On the other hand, constrained by the number of parameters, the number of network layers cannot be increased indefinitely, because a model with too many parameters relative to the amount of training data causes over-fitting. In actual use, the number of network layers of the HLSTM model can be adjusted according to the amount of training data available.
In the embodiments of the present disclosure, as shown in FIG. 4, the HLSTM model processing module 301 includes:
a first training unit 3011 configured to train the randomly initialized HLSTM model using a cross entropy objective function;
an optimization unit 3012 configured to optimize the trained HLSTM model according to the state-level minimum Bayes risk criterion.
Here, the cross entropy objective function is:

$$F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_t\,\log p(y\mid X_t)$$

where $F_{CE}$ denotes the cross entropy objective function; $\hat{y}_t$ is the annotation value of the speech feature at time $t$ at the output point of state $y$; $p(y\mid X_t)$ is the output of the network at state point $y$ for the speech feature $X_t$ at time $t$; $X$ denotes the training data; $S$ is the number of output state points; and $N$ is the total duration of the speech features.
Here, the objective function corresponding to the state-level minimum Bayes risk criterion is:

$$F_{SMBR} = \sum_{u}\frac{\sum_{W} p(O_u\mid S_W)^{k}\,P(W)\,A(W,W_u)}{\sum_{W'} p(O_u\mid S_{W'})^{k}\,P(W')}$$

where $W_u$ is the annotated text of the speech; $W$ and $W'$ are the annotations corresponding to decoding paths of the seed model; $p(O_u\mid S)$ is the acoustic likelihood probability; $A(W,W_u)$ represents the number of correctly annotated states in the decoding state sequence; the seed model is the HLSTM model obtained after the optimization; $u$ is the index of the utterance number in the training data; $k$ is the acoustic score coefficient; $O_u$ is the speech feature of the $u$-th utterance; $S$ represents the state sequence of a decoding path; and $P(W)$ and $P(W')$ are both language model probability scores.
In the embodiments of the present disclosure, as shown in FIG. 5, the LSTM model processing module 303 includes:
an obtaining unit 3031 configured to obtain the output result of each frame from the forward calculation;
a second training unit 3032 configured to train the randomly initialized LSTM model based on the output result of each frame and the cross entropy objective function, wherein $\hat{y}_t$ in the cross entropy objective function is the per-frame output result obtained by the forward calculation.
In the embodiments of the present disclosure, the number of network layers of the HLSTM model is greater than or equal to the number of network layers of the LSTM model.
Comparative experiments between HLSTM and LSTM show that the performance gain obtained by discriminative training on the LSTM model with direct connections introduced (i.e., the HLSTM model) is significantly larger than the gain obtained on the LSTM model; therefore, discriminative training is very meaningful for improving the performance of the HLSTM model.
In practical applications, the HLSTM model processing module 301, the calculation module 302, the LSTM model processing module 303, the first training unit 3011, the optimization unit 3012, the obtaining unit 3031, and the second training unit 3032 may be implemented by a processor in the HLSTM model-based acoustic modeling apparatus.
The present disclosure is described below in conjunction with a specific scenario embodiment.
In this embodiment, a deep bidirectional HLSTM model with stronger modeling capability is trained as the "teacher" model, a randomly initialized bidirectional LSTM model serves as the "student" model, and the "teacher" model is used to train the "student" model, which has a relatively small number of parameters. The specific method is described as follows:
1. Training the "teacher" model
First, the HLSTM model is randomly initialized; its network structure is shown in FIG. 2. Because HLSTM introduces direct connections between the memory cells of adjacent layers, the vanishing-gradient problem is avoided and network training becomes easier, so a deeper structure can be used in practical applications. On the other hand, constrained by the number of parameters, the number of network layers cannot be increased indefinitely, because a model with too many parameters relative to the amount of training data causes over-fitting. In actual use, the number of HLSTM network layers can be adjusted according to the amount of training data available. In this embodiment, the training data may be 300 h (hours), and the HLSTM model used has 6 layers, namely: an input layer, an output layer, and four hidden layers between them.
The HLSTM model is trained by iterative updates using the Cross Entropy (CE) objective function. The CE objective function formula is as follows:

$$F_{CE} = -\sum_{t=1}^{N}\sum_{y=1}^{S}\hat{y}_t\,\log p(y\mid X_t)$$

where $F_{CE}$ denotes the cross entropy objective function; $\hat{y}_t$ is the annotation value of the speech feature at time $t$ at the output point of state $y$; $p(y\mid X_t)$ is the output of the network at state point $y$ for the speech feature $X_t$ at time $t$; $X$ denotes the training data; $S$ is the number of output state points; and $N$ is the total duration of the speech features.
The HLSTM model generated by training with the CE objective function already has good recognition performance. On this basis, the model is further optimized using a discriminative sequence-level training criterion, namely the State-level Minimum Bayes Risk (SMBR) criterion. The difference from acoustic model training under the CE criterion is that, by optimizing a function related to the system recognition rate, the discriminative sequence-level training criterion strives to learn more class-discriminative information from both positive and negative training samples on a limited training set. Its objective function is as follows:
$$F_{SMBR} = \sum_{u}\frac{\sum_{W} p(O_u\mid S_W)^{k}\,P(W)\,A(W,W_u)}{\sum_{W'} p(O_u\mid S_{W'})^{k}\,P(W')}$$

where $W_u$ is the annotated text of the speech; $W$ and $W'$ are the annotations corresponding to decoding paths of the seed model; $p(O_u\mid S)$ is the acoustic likelihood probability; $A(W,W_u)$ represents the number of correctly annotated states in the decoding state sequence; the seed model is the HLSTM model obtained after the optimization; $u$ is the index of the utterance number in the training data; $k$ is the acoustic score coefficient; $O_u$ is the speech feature of the $u$-th utterance; $S$ represents the state sequence of a decoding path; and $P(W)$ and $P(W')$ are both language model probability scores. Comparative experiments between HLSTM and LSTM show that the performance gain obtained by discriminative training on the model with the new connections introduced (the HLSTM model) is significantly larger than the gain obtained on the LSTM model; therefore, discriminative training is very meaningful for improving the performance of the HLSTM model. At this point, the trained model is the "teacher" model.
2. Training the "student" model
An LSTM model with three hidden layers is randomly initialized; the other parameters of the model are consistent with the "teacher" model. Next, the information learned by the HLSTM model needs to be transferred to the LSTM model. The information transfer method of the embodiments of the present disclosure is to pass the training data through the "teacher" model for forward calculation, obtain the output corresponding to each input frame, use the obtained outputs as labels, and train the "student" model using the CE criterion mentioned above as the objective function. The trained LSTM model serves as the acoustic model used by the speech recognition system.
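As an illustrative sketch of this transfer step (not the disclosure's own code), one common reading is to use the "teacher" posteriors as soft per-frame targets for the CE criterion; the snippet assumes PyTorch and pre-built `teacher` and `student` modules emitting per-frame logits of equal output dimension:

```python
# One distillation step: CE of the student against the teacher's per-frame
# posteriors (assumes PyTorch; `teacher` and `student` are prebuilt modules).
import torch

def distill_step(teacher, student, optimizer, features):
    with torch.no_grad():
        q = torch.softmax(teacher(features), dim=-1)   # teacher posteriors
    log_p = torch.log_softmax(student(features), dim=-1)
    loss = -(q * log_p).sum(dim=-1).mean()             # -sum_y q(y) log p(y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```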
An advantage of the embodiments of the present disclosure is improving the performance of the LSTM baseline model without increasing model complexity. Although the HLSTM model has stronger modeling capability and higher recognition performance, the decoding real-time factor is also one of the indicators for evaluating the performance of a recognition system. The HLSTM model exceeds the LSTM model in both parameter scale and model complexity, which inevitably slows down decoding. Transferring the HLSTM network information to the LSTM network through posterior probabilities improves the performance of the LSTM baseline model; although an unavoidable performance loss occurs during the information transfer, i.e., the "student" model performs below the "teacher" model, it still performs above a directly trained LSTM model.
The method embodiment is described below in conjunction with specific model parameters.
步骤一:提取训练数据的语音特征。利用EM算法迭代更新GMM-HMM系统均值方差,使用GMM-HMM系统对特征数据做强制对齐,得到三因子聚类状态标注。Step 1: Extract the speech features of the training data. The EM algorithm is used to iteratively update the mean variance of the GMM-HMM system, and the GMM-HMM system is used to force the alignment of the feature data to obtain the three-factor cluster state annotation.
Step 2: Train a bidirectional HLSTM model based on the cross-entropy criterion.
In this embodiment, a six-layer bidirectional HLSTM model with 190M parameters is used. The specific configuration is as follows: the input layer has 260 nodes, and the input observation vector is extended by 2 frames of context on each side; the four hidden layers each have 1024 nodes, with recurrent delays of 1, 2, 3, and 4 respectively; each hidden layer is followed by a 512-dimensional projection layer to reduce the dimensionality and the number of parameters. The output layer has 2821 nodes, corresponding to 2821 clustered tri-phone states.
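As an illustration of the 2-frame context expansion, a minimal splicing sketch follows; the 52-dimensional base feature size is an inference from the 260 input nodes (260 = 5 × 52) and is not stated in the application:

```python
import numpy as np

def splice(frames, context=2):
    """Stack each frame with its +/- `context` neighbours (edges padded by
    repetition), e.g. turning 52-dim features into a 260-dim network input."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

spliced = splice(np.random.randn(300, 52))   # shape (300, 260)
```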
Step 3: Using the model generated in step 2 as the seed model, iteratively update the bidirectional HLSTM model based on the state-level minimum Bayes risk criterion.
Step 4: Pass the training data through the bidirectional HLSTM model generated in step 3 in a forward computation to obtain the output vectors.
Step 5: Use the output vectors obtained in step 4 as labels for the corresponding input features, and train a bidirectional LSTM model with three hidden layers and 120M parameters. The network parameters of this model are consistent with those of the HLSTM model in step 2.
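Using the definitions from the earlier sketch, steps 4 and 5 reduce to a loop like the following; the batching, optimizer, update count, and random features are hypothetical placeholders, and the real teacher would be the sMBR-updated bidirectional HLSTM from step 3 rather than the stand-in used here:

```python
teacher = StudentLSTM()   # stand-in for the bidirectional HLSTM of step 3
student = StudentLSTM()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

feats = torch.randn(8, 200, FEAT_DIM)   # 8 utterances x 200 frames of features
for _ in range(10):                      # illustrative number of updates
    print(distillation_step(teacher, student, opt, feats))
```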
Those skilled in the art will appreciate that the embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage and optical storage) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
On this basis, an embodiment of the present disclosure further provides a storage medium, specifically a computer-readable storage medium, on which a computer program is stored; when the computer program is executed by a processor, the steps of the method of the embodiments of the present disclosure are implemented.
The above description covers only preferred embodiments of the present disclosure and is not intended to limit the scope of protection of the present disclosure.
In the solution provided by the embodiments of the present disclosure, a randomly initialized HLSTM model is trained based on a preset function and the training result is optimized; the training data are passed through the optimized HLSTM model in a forward computation; based on the result of the forward computation and the preset function, a randomly initialized LSTM model is trained, and the resulting model is the acoustic model of the speech recognition system, wherein the HLSTM model and the LSTM model have the same network parameters. The embodiments of the present disclosure transfer the network information of the optimized HLSTM model to the LSTM network through posterior probabilities, thereby improving the performance of the LSTM baseline model without increasing model complexity.
Claims (12)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710094191.6 | 2017-02-21 | | |
| CN201710094191.6A (published as CN108461080A) | 2017-02-21 | 2017-02-21 | A kind of Acoustic Modeling method and apparatus based on HLSTM models |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2018153200A1 (en) | 2018-08-30 |
Family
ID=63222056
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2018/073887 (WO2018153200A1, ceased) | Hlstm model-based acoustic modeling method and device, and storage medium | | 2018-01-23 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN108461080A (en) |
| WO (1) | WO2018153200A1 (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110569700B (en) * | 2018-09-26 | 2020-11-03 | 创新先进技术有限公司 | Method and device for optimizing damage identification result |
| CN111709513B (en) * | 2019-03-18 | 2023-06-09 | 百度在线网络技术(北京)有限公司 | Training system and method for long-term and short-term memory network LSTM and electronic equipment |
| CN110751941B (en) * | 2019-09-18 | 2023-05-26 | 平安科技(深圳)有限公司 | Speech synthesis model generation method, device, equipment and storage medium |
- 2017-02-21: CN application CN201710094191.6A filed; published as CN108461080A (status: active, pending)
- 2018-01-23: PCT application PCT/CN2018/073887 filed; published as WO2018153200A1 (status: not active, ceased)
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106170800A (en) * | 2014-09-12 | 2016-11-30 | 微软技术许可有限责任公司 | Student DNN is learnt via output distribution |
| CN104538028A (en) * | 2014-12-25 | 2015-04-22 | 清华大学 | Continuous voice recognition method based on deep long and short term memory recurrent neural network |
| CN105810193A (en) * | 2015-01-19 | 2016-07-27 | 三星电子株式会社 | Method and apparatus for training language model, and method and apparatus for recognizing language |
| CN105529023A (en) * | 2016-01-25 | 2016-04-27 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and device |
| CN106098059A (en) * | 2016-06-23 | 2016-11-09 | 上海交通大学 | customizable voice awakening method and system |
| CN106328122A (en) * | 2016-08-19 | 2017-01-11 | 深圳市唯特视科技有限公司 | Voice identification method using long-short term memory model recurrent neural network |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110517679A (en) * | 2018-11-15 | 2019-11-29 | 腾讯科技(深圳)有限公司 | A kind of audio data processing method and device, storage medium of artificial intelligence |
| CN110517679B (en) * | 2018-11-15 | 2022-03-08 | 腾讯科技(深圳)有限公司 | Artificial intelligence audio data processing method and device and storage medium |
| US11158303B2 (en) | 2019-08-27 | 2021-10-26 | International Business Machines Corporation | Soft-forgetting for connectionist temporal classification based automatic speech recognition |
Also Published As
| Publication number | Publication date |
|---|---|
| CN108461080A (en) | 2018-08-28 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN106845411B (en) | Video description generation method based on deep learning and probability map model | |
| CN108733792B (en) | An Entity Relationship Extraction Method | |
| CN107562792B (en) | A Question Answer Matching Method Based on Deep Learning | |
| Zhang et al. | Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition | |
| CN106649514B (en) | System and Method for Human-Inspired Simple Question Answering (HISQA) | |
| CN106156003B (en) | A kind of question sentence understanding method in question answering system | |
| WO2022022163A1 (en) | Text classification model training method, device, apparatus, and storage medium | |
| CN109065032B (en) | External corpus speech recognition method based on deep convolutional neural network | |
| WO2018153200A1 (en) | Hlstm model-based acoustic modeling method and device, and storage medium | |
| CN116662552A (en) | Financial text data classification method, device, terminal equipment and medium | |
| CN107818164A (en) | A kind of intelligent answer method and its system | |
| JP2019159654A (en) | Time-series information learning system, method, and neural network model | |
| CN108292305A (en) | Method for handling sentence | |
| CN106502985A (en) | A kind of neural network modeling approach and device for generating title | |
| CN108647191B (en) | A Sentiment Dictionary Construction Method Based on Supervised Sentiment Text and Word Vectors | |
| CN104376842A (en) | Neural network language model training method and device and voice recognition method | |
| CN113255366B (en) | An Aspect-level Text Sentiment Analysis Method Based on Heterogeneous Graph Neural Network | |
| WO2021208455A1 (en) | Neural network speech recognition method and system oriented to home spoken environment | |
| US10529322B2 (en) | Semantic model for tagging of word lattices | |
| CN109036467A (en) | CFFD extracting method, speech-emotion recognition method and system based on TF-LSTM | |
| CN108846063A (en) | Method, device, equipment and computer readable medium for determining answers to questions | |
| CN111126040A (en) | Biomedical named entity identification method based on depth boundary combination | |
| CN111914555B (en) | Automatic relation extraction system based on Transformer structure | |
| CN108109615A (en) | A kind of construction and application method of the Mongol acoustic model based on DNN | |
| CN116982054A (en) | Sequence-to-sequence neural network system using lookahead tree search |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18757896; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 18757896; Country of ref document: EP; Kind code of ref document: A1 |