Summary of the invention
The object of the present invention is to provide a language model training method and system that realize a distributed version of the Katz algorithm, can effectively train a language model on massive data, can effectively alleviate the data-sparsity problem, and thereby support improved speech recognition accuracy.
To address the above problem, the invention provides a language model training method, comprising:
performing one round of MapReduce on a corpus to count the word-frequency statistics of N-grams, where N is a positive integer greater than or equal to 2;
performing one round of MapReduce on the word-frequency statistics of the N-grams to obtain the COC (count-of-counts) statistics of the N-grams;
performing one round of MapReduce on the word-frequency statistics of the N-grams to obtain the probability values of the N-grams;
performing multiple rounds of MapReduce to calculate the back-off coefficients of the unigrams through the m-grams respectively, where m = N-1; and
aggregating the probability values and back-off coefficients to obtain a language model in ARPA format.
Optionally, in the above method, the step of performing one round of MapReduce on the corpus to count the word-frequency statistics of the N-grams comprises:
a Map operation outputs the first word of each N-gram as the key; and
a Shuffle operation distributes the N-grams with different keys to different Reduce nodes, and on each Reduce node a Reduce operation counts the number of occurrences of each N-gram as its word-frequency statistic.
Optionally, in the above method, the step of performing one round of MapReduce on the word-frequency statistics of the N-grams to obtain the COC statistics of the N-grams comprises:
presetting a boundary value K for the discount calculation, where K is a positive integer;
a Map operation outputs, as the key, each N-gram word-frequency statistic that is less than or equal to K;
a Shuffle operation distributes the N-gram word-frequency statistics with different keys to different Reduce nodes; and
aggregating the word-frequency statistics to obtain the COC statistics of the N-grams.
Optionally, in the above method, the step of performing one round of MapReduce on the word-frequency statistics of the N-grams to obtain the probability values of the unigrams through the N-grams comprises:
a Map operation outputs, as the key, the first word of the N-gram corresponding to each N-gram word-frequency statistic;
a Shuffle operation distributes the N-gram word frequencies with different keys to different Reduce nodes;
on each Reduce node, the COC statistics are imported and the discount factors of the corresponding N-gram word-frequency statistics are calculated; and
the probability values of the unigrams through the N-grams are calculated according to the discount factors.
Optionally, in the above method, the step of performing multiple rounds of MapReduce to calculate the back-off coefficients of the unigrams through the m-grams respectively comprises:
performing one round of MapReduce to calculate the back-off coefficients of the unigrams; and
performing multiple rounds of MapReduce to calculate the back-off coefficients of the bigrams through the m-grams respectively.
Optionally, in the above method, the step of performing one round of MapReduce to calculate the unigram back-off coefficients comprises:
distributing the data of all the unigrams and the data of the bigrams to the Reduce nodes;
outputting the first word of each unigram or bigram as the key;
a Shuffle operation distributes the unigrams and bigrams with different keys to different Reduce nodes; and
on each node, applying the Katz smoothing algorithm to the word-frequency statistics of the unigrams and bigrams to calculate the back-off coefficients of the unigrams corresponding to the unigram and bigram data on that node.
Optionally, in the above method, the step of calculating the back-off coefficients of the bigrams through the m-grams respectively comprises:
distributing the data of all the m-grams and the data of the (m+1)-grams to the nodes;
outputting the second-to-last word of each m-gram or (m+1)-gram as the key;
a Shuffle operation distributes the m-grams and (m+1)-grams with different keys to different nodes; and
on each node, calculating, from the word-frequency statistics of the m-grams and (m+1)-grams, the back-off coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node.
According to another aspect of the present invention, a language model training system is provided, comprising:
a word frequency module, configured to perform one round of MapReduce on a corpus to count the word-frequency statistics of N-grams, where N is a positive integer greater than or equal to 2;
a COC module, configured to perform one round of MapReduce on the word-frequency statistics of the N-grams to obtain the COC statistics of the N-grams;
a probability module, configured to perform one round of MapReduce on the word-frequency statistics of the N-grams to obtain the probability values of the N-grams;
a back-off coefficient module, configured to perform multiple rounds of MapReduce to calculate the back-off coefficients of the unigrams through the m-grams respectively, where m = N-1; and
a summarizing module, configured to aggregate the probability values and back-off coefficients to obtain a language model in ARPA format.
Optionally, in the above system, the word frequency module performs a Map operation that outputs the first word of each N-gram as the key; a Shuffle operation distributes the N-grams with different keys to different Reduce nodes; and on each Reduce node a Reduce operation counts the number of occurrences of each N-gram as its word-frequency statistic.
Optionally, in the above system, the COC module presets a boundary value K for the discount calculation, where K is a positive integer; a Map operation outputs, as the key, each N-gram word-frequency statistic that is less than or equal to K; a Shuffle operation distributes the N-gram word-frequency statistics with different keys to different Reduce nodes; and the word-frequency statistics are aggregated to obtain the COC statistics of the N-grams.
Optionally, in the above system, the probability module performs a Map operation that outputs, as the key, the first word of the N-gram corresponding to each N-gram word-frequency statistic; a Shuffle operation distributes the N-gram word frequencies with different keys to different Reduce nodes; on each Reduce node the COC statistics are imported and the discount factors of the corresponding N-gram word-frequency statistics are calculated; and the probability values of the unigrams through the N-grams are calculated according to the discount factors.
Optionally, in the above system, the back-off coefficient module comprises:
a unigram unit, configured to perform one round of MapReduce to calculate the back-off coefficients of the unigrams; and
a multi-gram unit, configured to perform multiple rounds of MapReduce to calculate the back-off coefficients of the bigrams through the m-grams respectively.
Optionally, in the above system, the unigram unit distributes the data of all the unigrams and the data of the bigrams to the Reduce nodes, outputs the first word of each unigram or bigram as the key, has a Shuffle operation distribute the unigrams and bigrams with different keys to different Reduce nodes, and on each node applies the Katz smoothing algorithm to the word-frequency statistics of the unigrams and bigrams to calculate the back-off coefficients of the unigrams corresponding to the unigram and bigram data on that node.
Optionally, in the above system, the multi-gram unit distributes the data of all the m-grams and the data of the (m+1)-grams to the nodes, outputs the second-to-last word of each m-gram or (m+1)-gram as the key, has a Shuffle operation distribute the m-grams and (m+1)-grams with different keys to different nodes, and on each node calculates, from the word-frequency statistics of the m-grams and (m+1)-grams, the back-off coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node.
Compared with the prior art, the present invention adopts a data structure based on a hash prefix tree to split up and recombine massive data skillfully: the data are distributed to the nodes of a cluster, the corresponding statistics are accumulated, and the operations are then carried out concurrently to obtain a language model based on massive data. The invention thus realizes a distributed version of the Katz algorithm, effectively trains a language model on massive data, effectively alleviates the data-sparsity problem, and improves the recognition accuracy of speech recognition.
Embodiment
The language model training method and system proposed by the present invention are described further below with reference to the drawings and specific embodiments.
As shown in Fig. 1 and Fig. 2, the present invention provides a language model training method that uses MapReduce as the tool for distributed cluster management; the distributed Katz smoothing algorithm comprises the following steps:
Step S1: one round of MapReduce (also written Map/Reduce) is performed on the corpus to count the word-frequency statistics (WordCount) of the N-grams, where N is a positive integer greater than or equal to 2. A Map operation outputs the first word of each N-gram as the key, and a Shuffle operation distributes the N-grams with different keys to different Reduce nodes; for example, the unigram "we" and the bigram "we not" will be assigned to the same Reduce node. On each Reduce node, a Reduce operation counts the number of occurrences of each N-gram as its word-frequency statistic. Specifically, MapReduce is a programming model: in 2004 Google published a paper introducing its MapReduce system; in 2005 the web crawler project Nutch implemented an open-source MapReduce system; and in 2006 MapReduce became an independent open-source project. See Hadoop: The Definitive Guide by Tom White, translated by Zhou Minqi et al., Tsinghua University Press.
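Purely as an illustrative sketch of Step S1 (the patent fixes the keying and counting rules, not an API), a Hadoop-streaming-style mapper and reducer might look as follows; the tokenization and the maximum order N are assumptions:

```python
from collections import Counter

N = 3  # assumed maximum N-gram order

def map_step(lines):
    """Emit (first word, n-gram) pairs for every 1..N-gram of each segmented line."""
    for line in lines:
        words = line.split()  # assumes the corpus is already word-segmented
        for i in range(len(words)):
            for n in range(1, N + 1):
                if i + n <= len(words):
                    # key = first word, so n-grams sharing it reach one Reduce node
                    print(words[i], " ".join(words[i:i + n]), sep="\t")

def reduce_step(pairs):
    """On one Reduce node, count occurrences of each n-gram sent to this node."""
    counts = Counter(ngram for _key, ngram in pairs)
    for ngram, c in counts.items():
        print(ngram, c, sep="\t")
```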
Step S2: one round of MapReduce is performed on the word-frequency statistics of the N-grams to obtain the COC statistics (count of counts: the number of N-grams sharing each word frequency). A boundary value K for the discount calculation is preset, where K is a positive integer; a Map operation outputs, as the key, each N-gram word-frequency statistic that is less than or equal to K; a Shuffle operation distributes the N-gram word-frequency statistics with different keys to different Reduce nodes; and the word-frequency statistics are aggregated to obtain the COC statistics of the N-grams. Specifically, the COC statistics record the number of words with each word frequency from 1 to K; K is the boundary value of the discount calculation, word frequencies beyond K-1 need no discounting, and the COC statistics are used in the discount computation on the word frequencies.
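A similarly hedged sketch of the Step S2 COC round; the boundary K and the record layout are illustrative assumptions:

```python
from collections import defaultdict

K = 5  # assumed boundary value of the discount calculation

def map_coc(count_records):
    """count_records: (ngram, frequency) pairs from Step S1; emit only freq <= K."""
    for ngram, freq in count_records:
        if freq <= K:
            print(freq, len(ngram.split()), sep="\t")  # key = the frequency itself

def reduce_coc(pairs):
    """On one Reduce node, tally per order n and frequency k how many n-grams share k."""
    coc = defaultdict(int)
    for k, order in pairs:
        coc[(order, k)] += 1
    for (order, k), n in sorted(coc.items()):
        print(f"COC[{order}][{k}]", n, sep="\t")
```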
Step S3: one round of MapReduce is performed on the word-frequency statistics of the N-grams to calculate the probability values of the N-grams. A Map operation outputs, as the key, the first word of the N-gram corresponding to each N-gram word-frequency statistic; a Shuffle operation distributes the N-gram word frequencies with different keys to different Reduce nodes; on each Reduce node the COC statistics are first imported and the discount factors (discounting) of the corresponding N-gram word-frequency statistics are calculated; the probability values of the unigrams through the N-grams are then calculated from the discount factors; and finally all the N-gram probabilities are aggregated.
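A sketch of the reducer side of Step S3, under the same caveats; D is a discount-factor table of the form used in the worked example below, total_unigram_freq the corpus word total, and the record layout an assumption. Because every N-gram and its history share the same first word, the history count is available on the same node:

```python
def reduce_probability(records, D, total_unigram_freq):
    """records: (ngram, frequency) pairs that share one first-word key.

    D[n][r] is the discount factor for order-n n-grams of frequency r
    (absent entries mean r is above the boundary and is not discounted)."""
    counts = dict(records)
    for ngram, freq in counts.items():
        words = ngram.split()
        n = len(words)
        discounted = D.get(n, {}).get(freq, 1.0) * freq
        if n == 1:
            denom = total_unigram_freq            # unigram: whole-corpus total
        else:
            denom = counts[" ".join(words[:-1])]  # history lives on this node
        print(ngram, discounted / denom, sep="\t")
```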
Step S4: one round of MapReduce is performed to calculate the unigram back-off coefficients. The data of all the unigrams and the data of the bigrams are first distributed to the nodes; the first word of each unigram or bigram is then output as the key; a Shuffle operation distributes the unigrams and bigrams with different keys to different Reduce nodes; and finally, on each node, the Katz smoothing algorithm is applied to the word-frequency statistics of the unigrams and bigrams to calculate the back-off coefficients of the unigrams corresponding to the unigram and bigram data on that node. Specifically, calculating a unigram back-off coefficient requires the word-frequency statistics of both unigrams and bigrams; the mode of data allocation differs from step S1 in that each node must be allocated the data of all the unigrams together with the bigrams assigned to it by the first word of the unigram or bigram. This guarantees that the unigram back-off coefficient each node calculates is global, i.e. the same as the value that would be calculated if all the data were placed on one node.
Step S5: one round of MapReduce is performed to calculate the m-gram back-off coefficients (starting with m = 2). The data of all the m-grams and the data of the (m+1)-grams are first distributed to the Reduce nodes; the second-to-last word of each m-gram or (m+1)-gram is then output as the key; a Shuffle operation distributes the m-grams and (m+1)-grams with different keys to different nodes; and finally, on each node, the back-off coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are calculated from their word-frequency statistics. Specifically, each node must be allocated both m-gram and (m+1)-gram data, and the allocation is keyed on the second-to-last word of the m-grams and (m+1)-grams; for example, "we not", "strive for by us", "inspiring ours", etc. will be assigned to the same node. This guarantees that each node has sufficient data to compute the m-gram back-off coefficients; the choice of partition key is sketched below.
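A minimal sketch of that key choice, with hypothetical function and stage names (the patent fixes only the keying rule, not an API):

```python
def partition_key(ngram, stage):
    """Shuffle key deciding which Reduce node receives this n-gram.

    "count"    - Step S1: key by the first word;
    "backoff1" - Step S4: unigram back-off, also keyed by the first word;
    "backoffm" - Step S5: m-gram back-off (m >= 2), keyed by the
                 second-to-last word, so every m-gram and (m+1)-gram a
                 node needs lands on the same node.
    """
    words = ngram.split()
    if stage in ("count", "backoff1"):
        return words[0]
    if stage == "backoffm":
        return words[-2] if len(words) >= 2 else words[0]
    raise ValueError(stage)

# The bigram "I want" and the trigram "today I want" share the key "I",
# so the node owning key "I" can compute the back-off weight of "today I".
assert partition_key("I want", "backoffm") == partition_key("today I want", "backoffm")
```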
m is incremented by one, and whether m is less than or equal to N-1 is judged (step S6 of Fig. 1);
if so, step S5 is repeated until m = N-1 and all the required back-off coefficients have been calculated;
if not, the probability values and back-off coefficients are aggregated to obtain the language model in ARPA format, and the calculation exits (step S7 of Fig. 1).
In steps S5 and S6 of this method, the second-to-last word of each N-gram is used as the key for data distribution. This guarantees that, when the order-m back-off coefficients are calculated, all the order-m and order-(m+1) data they require are assigned to the same node, which ensures the correctness and validity of the data allocation. Compared with a single machine: suppose the number of order-m and order-(m+1) N-grams is H and the dictionary contains X words (the everyday words of Chinese can number in the hundreds of thousands); such a distribution guarantees that each key in the cluster corresponds on average to only H/X N-grams, giving full play to the advantage of distributed computation.
This method proposes a new way of training a natural language model on big data. It overcomes the memory limitation of a single machine under the current state of the art, and the scale of the language model is extensible as the cluster grows. In training natural language models on Web corpora, the method can process big data very effectively and train large-scale language models; ideally the data-handling capacity can expand to hundreds of thousands of times that of a single machine (in practice the processing capacity will be affected by factors such as the balance of the data distribution and the scale of the server cluster). As data accumulate, the data-sparsity problem is relatively alleviated, thereby providing speech recognition with a language model close to the true environment and effectively improving its recognition accuracy.
The following simple example illustrates how the distributed training of a trigram language model is realized.
Suppose the input corpus has three sentences: "today I want go-out", "today 's weather nice", and "I want go-out go wash-car" (English glosses of the segmented Chinese sentences, one gloss per word).
After the word segmentation operation, the raw unigram-to-trigram data obtained are:
Sentence one:
" today "
" today I "
" today, I wanted "
" I "
" I want "
" I will go out "
" want "
" to go out "
" go out "
Sentence two:
" today "
" today "
" weather of today "
" "
" weather "
" weather pretty good "
" weather "
" weather is pretty good "
" well "
Sentence three:
" I "
" I want "
" I will go out "
" want "
" to go out "
" to go out "
" go out "
" go out "
" carwash of going out "
" go "
" remove carwash "
" carwash "
As shown in Fig. 1 and Fig. 2, the concrete distributed training flow of the trigram language model is as follows:
Step S1:
After Map and Shuffle, the N-grams whose key is "today" will be assigned to one Reduce node, such as:
" today "
" today I "
" today, I wanted "
" today "
" today "
" weather of today "
After the Reduce count, the word-frequency statistics are obtained:
" today " 2
" today I " 1
" today, I wanted " 1
" today " 1
" weather of today " 1
The other N-grams are processed similarly.
Step S2:
Suppose the COC statistic of the order-N N-grams with word frequency k is denoted COC[N][k].
After Map and Shuffle, the entries whose key is the word frequency "2" will be assigned to the same Reduce node, such as:
" today " 2
" I " 2
" I want " 2
" I will go out " 2
" want " 2
" to go out " 2
" go out " 2
The Reduce operation then counts: there are 4 unigrams with word frequency 2, so COC[1][2] = 4; there are 2 such bigrams, i.e. COC[2][2] = 2; and there is 1 such trigram, i.e. COC[3][2] = 1. All the other COC statistics can be obtained by the same procedure.
Step S3:
Before the MapReduce distributed operation begins, the discount factor corresponding to each order must be calculated. The discount factors of Katz smoothing are calculated by the Good-Turing method:

d_r = ( (r+1)/r × n_{r+1}/n_r − (k+1) n_{k+1}/n_1 ) / ( 1 − (k+1) n_{k+1}/n_1 )

wherein r is a word-frequency statistic (the count of an N-gram), d_r denotes the discount factor corresponding to word frequency r, n_r denotes the COC statistic, i.e. the number of N-grams whose word frequency is r, and k is the boundary value of the discount calculation; word frequencies greater than k need no discount calculation (a sketch of this computation follows the worked example below). Suppose the calculated discount factor for N-grams whose word-frequency statistic is k is denoted D[N][k]. After Map and Shuffle, the N-gram word-frequency statistics whose key is "today" will be assigned to one Reduce node, such as:
" today " 2
" today I " 1
" today, I wanted " 1
" today " 1
" weather of today " 1
Take calculating the probability of "today" as an example:
P("today") = Count("today") × D[1][2] / (the sum of the word frequencies of all unigrams),
and the probability that "I" appears after "today":
P("I" | "today") = Count("today I") × D[2][1] / Count("today").
Step S4:
After Map and Shuffle, the data stored at one node are all the unigrams, together with the N-grams whose key is "today":
" today " 2
" today I " 1
" today, I wanted " 1
" today " 1
" weather of today " 1
" I " 2
" want " 2
" go out " 2
" " 1
" weather " 1
" well " 1
" go " 1
" carwash " 1
On this node, the back-off coefficient of the unigram "today" will be calculated with the Katz smoothing algorithm; correspondingly, every node will calculate the back-off coefficients of the unigrams corresponding to its keys.
Step S5:
After Map and Shuffle, the data stored at a certain node are the bigram and trigram word-frequency data whose key is "I":
"I want" 2
"today I want" 1
On each node, the back-off coefficients corresponding to the first two words of the trigrams can be calculated; on this node, correspondingly, that is "today I".
The back-off coefficient formula is as follows:

α(xy) = ( 1 − Σ_{z: C(xyz)>0} P_Katz(z | xy) ) / ( 1 − Σ_{z: C(xyz)>0} P_Katz(z | y) )

wherein α(xy) denotes the back-off coefficient corresponding to the bigram xy, C(xyz) denotes the word frequency of the trigram xyz, and P_Katz denotes the N-gram probability after the discount operation. It follows that the numerator must sum the overall probability of all trigrams with prefix xy whose word frequency is greater than zero, and the denominator must sum the overall probability of all bigrams with prefix y and suffix z such that the word frequency of the corresponding trigram xyz is greater than zero.
It follows that calculating the back-off coefficient corresponding to "today I" requires all the trigrams whose prefix is "today I" and the word-frequency statistics of all the bigrams whose prefix is "I"; with this allocation scheme, all the corresponding data happen to be placed on the same node.
With more sentences as input, this node would calculate the back-off coefficients of all the bigrams whose last word is "I", such as "today I", "tomorrow I", "is I", etc., as sketched below.
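As a hedged sketch of the per-node computation this allocation enables; p_katz is assumed to map an n-gram to its discounted conditional probability from Step S3, and the function name is hypothetical:

```python
def backoff_weight(xy, trigram_counts, p_katz):
    """Katz back-off coefficient alpha(xy) for a bigram xy = "x y".

    trigram_counts holds the trigrams on this node; keying by the
    second-to-last word guarantees every trigram with prefix xy is here."""
    _x, y = xy.split()
    seen = [t.split()[2] for t, c in trigram_counts.items()
            if c > 0 and t.startswith(xy + " ")]
    num = 1.0 - sum(p_katz[f"{xy} {z}"] for z in seen)  # mass left after history xy
    den = 1.0 - sum(p_katz[f"{y} {z}"] for z in seen)   # same suffixes z, history y
    return num / den

# e.g. on the node holding key "I": backoff_weight("today I", counts, p_katz)
```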
Step S6:
m is incremented by one, and whether m is less than or equal to N-1 is judged. In a trigram language model the trigrams have no back-off coefficients, so the flow proceeds directly to step S7. (In higher-order models, the calculation of the trigram back-off coefficients is analogous to that of the bigram back-off coefficients: the trigram and 4-gram data would be distributed with the second-to-last word of the trigram as the key.)
Step S7:
The probability values and back-off coefficients are aggregated into a language model in ARPA format. ARPA is the currently accepted storage format standard for N-gram language models and is not described in detail here.
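For orientation only: an ARPA-format model lists each n-gram with its log10 probability and, for orders below N, its back-off weight. A tiny fragment with invented probability values for the example above might look like:

```
\data\
ngram 1=9
ngram 2=8
ngram 3=6

\1-grams:
-0.81 today -0.52
-0.81 I -0.47
...

\2-grams:
-0.69 today I -0.30
...

\3-grams:
-0.37 today I want

\end\
```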
As shown in Fig. 3, according to another aspect of the present invention, a language model training system is also provided, comprising a word frequency module 1, a COC module 2, a probability module 3, a back-off coefficient module 4 and a summarizing module 5.
The word frequency module 1 is configured to perform one round of MapReduce on the corpus to count the word-frequency statistics of the N-grams, where N is a positive integer greater than or equal to 2. The word frequency module 1 performs a Map operation that outputs the first word of each N-gram as the key; a Shuffle operation distributes the N-grams with different keys to different Reduce nodes; and on each Reduce node a Reduce operation counts the number of occurrences of each N-gram as its word-frequency statistic.
The COC module 2 is configured to perform one round of MapReduce on the word-frequency statistics of the N-grams to obtain the COC statistics. The COC module 2 presets a boundary value K for the discount calculation, where K is a positive integer; a Map operation outputs, as the key, each N-gram word-frequency statistic that is less than or equal to K; a Shuffle operation distributes the N-gram word-frequency statistics with different keys to different Reduce nodes; and the word-frequency statistics are aggregated to obtain the COC statistics of the N-grams.
The probability module 3 is configured to perform one round of MapReduce on the word-frequency statistics of the N-grams to calculate the probability values of the N-grams. The probability module 3 performs a Map operation that outputs, as the key, the first word of the N-gram corresponding to each N-gram word-frequency statistic; a Shuffle operation distributes the N-gram word frequencies with different keys to different Reduce nodes; on each Reduce node the COC statistics are first imported and the discount factors of the corresponding N-gram word-frequency statistics are calculated; the probability values of the unigrams through the N-grams are then calculated from the discount factors; and finally all the N-gram probabilities are aggregated.
The back-off coefficient module 4 is configured to perform multiple rounds of MapReduce to calculate the back-off coefficients of the unigrams through the m-grams respectively, where m = N-1. The back-off coefficient module 4 comprises:
A unigram unit 41, configured to perform one round of MapReduce to calculate the unigram back-off coefficients (m = 1). The unigram unit 41 first distributes the data of all the unigrams and the data of the bigrams to the Reduce nodes; then outputs the first word of each unigram or bigram as the key; a Shuffle operation distributes the unigrams and bigrams with different keys to different Reduce nodes; and finally, on each node, the Katz smoothing algorithm is applied to the word-frequency statistics of the unigrams and bigrams to calculate the back-off coefficients of the unigrams corresponding to the unigram and bigram data on that node. Specifically, each node must be allocated the data of all the unigrams together with the bigrams assigned to it by the first word of the unigram or bigram; this guarantees that the unigram back-off coefficient each node calculates is global, i.e. the same as the value that would be calculated if all the data were placed on one node.
A multi-gram unit 42, configured to perform multiple rounds of MapReduce to calculate the back-off coefficients of the bigrams through the m-grams (2 ≤ m ≤ N-1). The multi-gram unit 42 first distributes the data of all the m-grams and the data of the (m+1)-grams to the nodes; then outputs the second-to-last word of each m-gram or (m+1)-gram as the key; a Shuffle operation distributes the m-grams and (m+1)-grams with different keys to different nodes; and finally, on each node, the back-off coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are calculated from their word-frequency statistics. Specifically, the data allocation here is keyed on the second-to-last word of the m-grams and (m+1)-grams; for example, "we not", "strive for by us", "inspiring ours", etc. will be assigned to the same node, which guarantees that each node has sufficient data to compute the m-gram back-off coefficients.
The summarizing module 5 is configured to aggregate the probability values and back-off coefficients to obtain the language model in ARPA format.
The present invention adopts a data structure based on a hash prefix tree to split up and recombine massive data skillfully: the data are distributed to the nodes of a cluster, the corresponding statistics are accumulated, and the operations are then carried out concurrently to obtain a language model based on massive data. The invention thus realizes a distributed version of the Katz algorithm, effectively trains a language model on massive data, effectively alleviates the data-sparsity problem, and improves the recognition accuracy of speech recognition.
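The patent does not prescribe a concrete implementation of the hash-based prefix tree; purely as an illustrative sketch, such a structure for accumulating N-gram counts might look like this (all names hypothetical):

```python
class TrieNode:
    """Hash-based prefix-tree node: each path from the root spells an N-gram."""
    def __init__(self):
        self.children = {}  # word -> TrieNode; the hash map gives O(1) child lookup
        self.count = 0      # word-frequency statistic of the N-gram ending here

    def add(self, words):
        """Walk/extend the trie along the words and increment the final count."""
        node = self
        for w in words:
            node = node.children.setdefault(w, TrieNode())
        node.count += 1

# Example: accumulate the trigram "today 's weather" twice
root = TrieNode()
root.add(["today", "'s", "weather"])
root.add(["today", "'s", "weather"])
```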
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be consulted against one another. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points may be found in the description of the method.
Those skilled in the art may further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered to exceed the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to encompass them.