
CN102509549B - Language model training method and system

Info

Publication number: CN102509549B
Authority: CN (China)
Prior art keywords: tuple, word frequency, key value, statistic, frequency statistic
Legal status: Active (granted)
Application number: CN 201110301029
Other languages: Chinese (zh)
Other versions: CN102509549A
Inventors: 孙宏亮, 蔡洪滨
Original assignee: Shengle Information Technology (Shanghai) Co., Ltd.
Current assignee: SHANGHAI GEAK ELECTRONICS Co., Ltd.
Filing date: 2011-09-28
Publication of CN102509549A: 2012-06-20
Publication of CN102509549B (grant): 2013-08-14

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a language model training method and system. The method comprises the following steps: performing one round of MapReduce on the training corpus to count the word-frequency statistics of the N-grams; performing one round of MapReduce on the N-gram word-frequency statistics to obtain the count-of-counts (COC) statistics of the N-grams; performing one round of MapReduce on the N-gram word-frequency statistics to obtain the probability values of the N-grams; performing multiple rounds of MapReduce to compute the backoff coefficients of the 1-grams through the m-grams, respectively; and summarizing the probability values and backoff coefficients to obtain a language model in ARPA format. The invention adopts a data structure based on a hash prefix tree, skillfully partitions and combines the massive data and distributes it across the nodes of a cluster, counts the corresponding data values, and performs the computation concurrently to obtain a language model based on massive data. In this way a distributed version of the Katz algorithm is realized, a language model based on massive data is trained effectively, the data-sparsity problem is effectively alleviated, and the recognition rate is improved.

Description

Language model training method and system
Technical field
The present invention relates to a language model training method and system.
Background technology
Jelinek and his team pioneered the application of statistical language models, and the N-gram language model has proven simple and effective over time. N-gram language models suffer from a fundamental data-sparsity problem: the training data is never sufficient for the actual application environment, so when predicting the probability of an N-gram that did not occur in the training data, a zero-probability problem always arises. The corresponding solution is a smoothing algorithm. Among these, the Katz algorithm is the classical smoothing algorithm proposed in 1987; for a detailed description of this algorithm see: Katz, Slava M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3): 400-401, March.
With the rapid development of the Internet, accumulating and processing massive data has gradually become very important, and distributed computing has developed rapidly during this period. The development of speech recognition and machine translation requires language models backed by big data in order to improve the recognition rate. A single machine, or a small number of machines, cannot meet the needs of current applications: single-machine computing and memory resources are limited, processing massive data and training language models on one machine is time-consuming and labor-intensive, and the data-sparsity problem cannot be solved effectively at the same time.
Therefore, a language model training method and system is urgently needed that can effectively train a language model on massive data while effectively alleviating the data-sparsity problem, thereby supporting the recognition rate of speech recognition.
Summary of the invention
The object of the present invention is to provide a language model training method and system that realize a distributed version of the Katz algorithm, can effectively train a language model based on massive data, can effectively alleviate the data-sparsity problem, and provide support for the recognition rate of speech recognition.
To address the above problem, the invention provides a language model training method, comprising:
performing one round of MapReduce on the training corpus to count the word-frequency statistics of the N-grams, where N is a positive integer greater than or equal to 2;
performing one round of MapReduce on the N-gram word-frequency statistics to obtain the count-of-counts (COC) statistics of the N-grams;
performing one round of MapReduce on the N-gram word-frequency statistics to obtain the probability values of the N-grams;
performing multiple rounds of MapReduce to compute the backoff coefficients of the 1-grams through the m-grams, respectively, where m = N-1; and
summarizing the probability values and backoff coefficients to obtain a language model in ARPA format.
Optionally, in the above method, the step of performing one round of MapReduce on the corpus to count the N-gram word-frequency statistics comprises:
the Map operation outputs the first word of each N-gram as the key;
the Shuffle operation assigns the N-grams with different keys to different Reduce nodes, and on each Reduce node the Reduce operation counts the number of occurrences of each N-gram as its word-frequency statistic.
Optionally, in the above method, the step of performing one round of MapReduce on the N-gram word-frequency statistics to obtain the COC statistics of the N-grams comprises:
presetting a boundary value K for the discount calculation, where K is a positive integer;
the Map operation outputs, as the key, the word-frequency statistic of each N-gram whose frequency is less than or equal to K;
the Shuffle operation assigns the N-gram word-frequency statistics with different keys to different Reduce nodes;
summarizing the word-frequency statistics to obtain the COC statistics of the N-grams.
Optionally, in the above method, the step of performing one round of MapReduce on the N-gram word-frequency statistics to obtain the probability values of the 1-grams through the N-grams comprises:
the Map operation outputs, as the key, the first word of the N-gram corresponding to each N-gram word-frequency statistic;
the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes;
importing the COC statistics on each Reduce node and calculating the discount factors of the corresponding N-gram word-frequency statistics;
calculating the probability values of the 1-grams through the N-grams according to the discount factors.
Optionally, in the above method, the step of performing multiple rounds of MapReduce to compute the backoff coefficients of the 1-grams through the m-grams, respectively, comprises:
performing one round of MapReduce to compute the backoff coefficients of the 1-grams;
performing multiple rounds of MapReduce to compute the backoff coefficients of the 2-grams through the m-grams, respectively.
Optionally, in the above method, the step of performing one round of MapReduce to compute the backoff coefficients of the 1-grams comprises:
distributing the data of all 1-grams and the data of the 2-grams to each Reduce node;
outputting the first word of each 1-gram or 2-gram as the key;
the Shuffle operation assigns the 1-grams and 2-grams with different keys to different Reduce nodes;
on each node, using the Katz smoothing algorithm to compute, from the word-frequency statistics of the 1-grams and 2-grams, the backoff coefficients of the 1-grams corresponding to the 1-gram and 2-gram data on that node.
Optionally, in the above method, the step of computing the backoff coefficients of the 2-grams through the m-grams, respectively, comprises:
distributing the data of all m-grams and the data of the (m+1)-grams to each node;
outputting the second-to-last word of each m-gram or (m+1)-gram as the key;
the Shuffle operation assigns the m-grams and (m+1)-grams with different keys to different nodes;
on each node, computing, from the word-frequency statistics of the m-grams and (m+1)-grams, the backoff coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node.
According to another aspect of the present invention, a language model training system is provided, comprising:
a word-frequency module, for performing one round of MapReduce on the corpus to count the word-frequency statistics of the N-grams, where N is a positive integer greater than or equal to 2;
a COC module, for performing one round of MapReduce on the N-gram word-frequency statistics to obtain the COC statistics of the N-grams;
a probability module, for performing one round of MapReduce on the N-gram word-frequency statistics to obtain the probability values of the N-grams;
a backoff-coefficient module, for performing multiple rounds of MapReduce to compute the backoff coefficients of the 1-grams through the m-grams, respectively, where m = N-1; and
a summarizing module, for summarizing the probability values and backoff coefficients to obtain a language model in ARPA format.
Optionally, in the above system, the word-frequency module performs the Map operation to output the first word of each N-gram as the key; the Shuffle operation assigns the N-grams with different keys to different Reduce nodes; and on each Reduce node the Reduce operation counts the number of occurrences of each N-gram as its word-frequency statistic.
Optionally, in the above system, the COC module presets a boundary value K for the discount calculation, where K is a positive integer; the Map operation outputs, as the key, the word-frequency statistic of each N-gram whose frequency is less than or equal to K; the Shuffle operation assigns the N-gram word-frequency statistics with different keys to different Reduce nodes; and the word-frequency statistics are summarized to obtain the COC statistics of the N-grams.
Optionally, in the above system, the probability module performs the Map operation to output, as the key, the first word of the N-gram corresponding to each N-gram word-frequency statistic; the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes; the COC statistics are imported on each Reduce node to calculate the discount factors of the corresponding N-gram word-frequency statistics; and the probability values of the 1-grams through the N-grams are calculated according to the discount factors.
Optionally, in the above system, the backoff-coefficient module comprises:
a 1-gram unit, for performing one round of MapReduce to compute the backoff coefficients of the 1-grams;
a multi-gram unit, for performing multiple rounds of MapReduce to compute the backoff coefficients of the 2-grams through the m-grams, respectively.
Optionally, in the above system, the 1-gram unit distributes the data of all 1-grams and the data of the 2-grams to each Reduce node and outputs the first word of each 1-gram or 2-gram as the key; the Shuffle operation assigns the 1-grams and 2-grams with different keys to different Reduce nodes; and on each node the Katz smoothing algorithm is used to compute, from the word-frequency statistics of the 1-grams and 2-grams, the backoff coefficients of the 1-grams corresponding to the 1-gram and 2-gram data on that node.
Optionally, in the above system, the multi-gram unit distributes the data of all m-grams and the data of the (m+1)-grams to each node and outputs the second-to-last word of each m-gram or (m+1)-gram as the key; the Shuffle operation assigns the m-grams and (m+1)-grams with different keys to different nodes; and on each node the backoff coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are computed from the word-frequency statistics of the m-grams and (m+1)-grams.
Compared with the prior art, the present invention adopts a data structure based on a hash prefix tree, skillfully partitions and combines the massive data, distributes the data to each node of a cluster, counts the corresponding data values, and then performs the computation concurrently to obtain a language model based on massive data. In this way a distributed version of the Katz algorithm is realized, a language model based on massive data is trained effectively, the data-sparsity problem can be alleviated effectively, and the recognition rate is improved.
Description of drawings
Fig. 1 is a flow chart of the language model training method of one embodiment of the invention;
Fig. 2 is a flow chart of the distributed training of a trigram language model of one embodiment of the invention;
Fig. 3 is a functional block diagram of the language model training system of one embodiment of the invention.
Embodiment
The language model training method and system proposed by the present invention are further described below in conjunction with the drawings and specific embodiments.
As shown in Figs. 1 and 2, the invention provides a language model training method. The method uses MapReduce as the tool for distributed cluster management, and the distributed Katz smoothing algorithm comprises the following steps:
Step S1: perform one round of MapReduce (also referred to as Map/Reduce: splitting the data into blocks and then repartitioning the blocks by key) on the corpus to count the word-frequency statistics (word counts) of the N-grams, where N is a positive integer greater than or equal to 2. The Map operation outputs the first word of each N-gram as the key; the Shuffle operation assigns the N-grams with different keys to different Reduce nodes (for example, the 1-gram "we" and the 2-gram "we not" are assigned to the same Reduce node); and on each Reduce node the Reduce operation counts the number of occurrences of each N-gram as its word-frequency statistic. Specifically, MapReduce is a programming model: in 2004 Google published the paper introducing their MapReduce system, in 2005 the web-crawler project Nutch implemented an open-source MapReduce system, and in 2006 MapReduce became an independent open-source project; see Hadoop: The Definitive Guide, by Tom White, Chinese translation by Zhou Minqi et al., Tsinghua University Press;
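For illustration only (the patent text contains no code), Step S1 can be sketched as a minimal single-process Python program; the function names, the whitespace tokenization and the toy driver standing in for the Shuffle step are assumptions, and a real deployment would run the map and reduce functions under a MapReduce framework such as Hadoop:

```python
from collections import defaultdict

def map_ngrams(sentence, n_max=3):
    """Map phase: emit (key, ngram) pairs, keyed by the first word of each N-gram."""
    words = sentence.split()  # assumption: the corpus is already word-segmented
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            ngram = tuple(words[i:i + n])
            yield ngram[0], ngram  # the first word is the Shuffle key

def reduce_word_counts(ngrams_on_node):
    """Reduce phase: count how many times each N-gram routed to this node occurs."""
    counts = defaultdict(int)
    for ngram in ngrams_on_node:
        counts[ngram] += 1
    return dict(counts)

# Toy driver standing in for the Shuffle step: group by key, then reduce per "node".
corpus = [
    "today I want go-out",           # "go-out" stands in for a single segmented word
    "today 's weather pretty-good",
]
shuffled = defaultdict(list)
for sentence in corpus:
    for key, ngram in map_ngrams(sentence):
        shuffled[key].append(ngram)
word_counts = {key: reduce_word_counts(ngrams) for key, ngrams in shuffled.items()}
```

On a four-word sentence this emits nine 1-grams to 3-grams, which matches the per-sentence lists in the worked trigram example further below.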
Step S2: perform one round of MapReduce on the N-gram word-frequency statistics to obtain the COC statistics of the N-grams (count of counts, i.e. the number of N-grams that share a given frequency). This includes presetting a boundary value K for the discount calculation, where K is a positive integer; the Map operation outputs, as the key, the word-frequency statistic of each N-gram whose frequency is less than or equal to K; the Shuffle operation assigns the N-gram word-frequency statistics with different keys to different Reduce nodes; and the word-frequency statistics are summarized to obtain the COC statistics of the N-grams. Specifically, the COC statistic represents the number of N-grams whose word-frequency statistic takes each value from 1 to K; K is the boundary value of the discount calculation, word frequencies exceeding K-1 need not be discounted, and the COC statistics are used in the discounting of the word frequencies;
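Continuing the same illustration (again not the patent's code), the count-of-counts table of Step S2 can be derived from a flat map of N-grams to their frequencies; the boundary value k = 5 and the table layout are assumptions:

```python
from collections import defaultdict

def coc_statistics(flat_counts, k=5):
    """coc[n][r] = number of n-grams whose word frequency is exactly r, for 1 <= r <= k.

    flat_counts maps n-gram tuples to word-frequency statistics. In the Map phase
    only frequencies <= k are emitted, keyed by the frequency value; the Reduce
    phase then sums the entries per (order, frequency).
    """
    coc = defaultdict(lambda: defaultdict(int))
    for ngram, count in flat_counts.items():
        if count <= k:
            coc[len(ngram)][count] += 1
    return coc
```

With the frequencies of the worked trigram example below, this yields coc[1][2] = 4, coc[2][2] = 2 and coc[3][2] = 1, matching the COC values computed in Step S2 of that example.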
Step S3: perform one round of MapReduce on the N-gram word-frequency statistics to calculate the probability values of the N-grams. The Map operation outputs, as the key, the first word of the N-gram corresponding to each N-gram word-frequency statistic; the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes; on each Reduce node the COC statistics are first imported and the discount factors (discounting) of the corresponding N-gram word-frequency statistics are calculated; then the probability values of the 1-grams through the N-grams are calculated according to the discount factors; finally all N-gram probabilities are gathered;
Step S4: perform one round of MapReduce to compute the backoff coefficients of the 1-grams. First, the data of all 1-grams and the data of the 2-grams are distributed to each node; then the first word of each 1-gram or 2-gram is output as the key; the Shuffle operation assigns the 1-grams and 2-grams with different keys to different Reduce nodes; finally, on each node, the Katz smoothing algorithm is used to compute, from the word-frequency statistics of the 1-grams and 2-grams, the backoff coefficients of the 1-grams corresponding to the 1-gram and 2-gram data on that node. Specifically, computing a 1-gram backoff coefficient requires the word-frequency statistics of the 1-grams and 2-grams, and the data allocation differs from that of Step S1: each node is allocated 1-gram and 2-gram data, assigned to a particular node of the cluster according to the first word of the 1-gram or 2-gram. This guarantees that the unigram backoff coefficients computed on each node are global, i.e. identical to the values that would be obtained by placing all of the data on a single node;
Step S5: perform one round of MapReduce to compute the backoff coefficients of the m-grams (starting with m = 2). First, the data of all m-grams and the data of the (m+1)-grams are distributed to each Reduce node; then the second-to-last word of each m-gram or (m+1)-gram is output as the key; the Shuffle operation assigns the m-grams and (m+1)-grams with different keys to different nodes; finally, on each node, the backoff coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are computed from the word-frequency statistics of the m-grams and (m+1)-grams. Specifically, each node must be allocated m-gram and (m+1)-gram data, and the allocation is keyed by the second-to-last word of the m-grams and (m+1)-grams; for example, "we not", "strive for by us" and "inspiring ours" (phrases that share the same second-to-last word) are assigned to the same node, which guarantees that each node has sufficient data to compute the m-gram backoff coefficients;
Increase m by one and judge whether m is less than or equal to N-1 (Step S6 in Fig. 1);
If so, repeat Step S5 until m = N-1 and all of the required backoff coefficients have been computed;
If not, summarize the probability values and backoff coefficients to obtain the language model in ARPA format and exit the computation (Step S7 in Fig. 1).
In Steps S5 and S6 of this method, the second-to-last word of each N-gram is used as the key for data distribution. This guarantees that, when computing the backoff coefficients of order m, the order-m and order-(m+1) data that are needed are assigned to the same node, which ensures the correctness and validity of the data allocation. Compared with a single machine, if the number of order-m and order-(m+1) N-grams is H and the dictionary contains X words (common words in Chinese can number in the hundreds of thousands), such a distribution guarantees that each node of the cluster holds on average only H/X of the corresponding N-grams, giving full play to the advantage of distributed computation.
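As an aside (not part of the patent text), the key choice described in Steps S4 and S5 can be summarized in one small helper; the name shuffle_key and the tuple representation of N-grams are assumptions:

```python
def shuffle_key(ngram, m):
    """Shuffle key for the backoff round of order m (hypothetical helper).

    Step S4 (m == 1): key by the first word, so a 1-gram and every 2-gram
    beginning with that word land on the same Reduce node.
    Step S5 (m >= 2): key by the second-to-last word, so the m-grams and
    (m+1)-grams needed for one m-gram backoff coefficient land on the same node.
    """
    if m == 1:
        return ngram[0]
    return ngram[-2]

# For the bigram round (m = 2): the bigram ("I", "want") and the trigram
# ("today", "I", "want") both map to the key "I", so the node holding them can
# compute the backoff coefficient of the bigram ("today", "I").
assert shuffle_key(("I", "want"), 2) == "I"
assert shuffle_key(("today", "I", "want"), 2) == "I"
```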
This method proposes a new way of training language models on big data. It overcomes the memory limits of a single machine under current technology, and the scale of the language model is extensible as the cluster scale grows. When training a natural-language model on Web corpora, this method can handle big data very effectively and train large-scale language models; ideally the data-processing capacity can be expanded to hundreds of thousands of times that of a single machine (in practice the processing capacity is affected by factors such as the balance of the data distribution and the scale of the server cluster). As the amount of data accumulates, the data-sparsity problem is relatively alleviated, thereby providing speech recognition with a language model close to the real environment and effectively improving its recognition rate.
A simple example is used below to illustrate how the distributed training of a trigram language model is realized.
Suppose the input corpus contains three sentences: "today I want to go out", "today's weather is pretty good", "I want to go out to wash the car".
After the word-segmentation operation, the raw 1-gram to 3-gram data obtained are:
Sentence one:
"today"
"today I"
"today I want"
"I"
"I want"
"I want to go out"
"want"
"want to go out"
"go out"
Sentence two:
"today"
"today's"
"today's weather"
"'s"
"'s weather"
"'s weather is pretty good"
"weather"
"weather is pretty good"
"pretty good"
Sentence three:
"I"
"I want"
"I want to go out"
"want"
"want to go out"
"want to go out to wash"
"go out"
"go out to wash"
"go out to wash the car"
"wash"
"wash the car"
"car"
As shown in Figs. 1 and 2, the concrete distributed training flow of the trigram language model is as follows:
Step S1:
After the Map and Shuffle operations, the N-grams whose key is "today" are assigned to one Reduce node, for example:
"today"
"today I"
"today I want"
"today"
"today's"
"today's weather"
After the Reduce counting, the word-frequency statistics are obtained:
"today" 2
"today I" 1
"today I want" 1
"today's" 1
"today's weather" 1
All the other N-grams are processed in the same way.
Step S2:
Suppose the COC statistic of the N-grams whose word frequency is k is denoted COC[N][k].
After the Map and Shuffle operations, the word-frequency entries whose key is "2" are assigned to the same Reduce node, for example:
"today" 2
"I" 2
"I want" 2
"I want to go out" 2
"want" 2
"want to go out" 2
"go out" 2
The Reduce operation then counts: there are 4 1-grams with word frequency 2, so COC[1][2] = 4; there are 2 2-grams, so COC[2][2] = 2; and there is 1 3-gram, so COC[3][2] = 1. All the other COC statistics are obtained by the same procedure.
Step S3:
Before the distributed MapReduce operation begins, the discount factor corresponding to each N-gram needs to be calculated. The discount factor of Katz smoothing is calculated based on the Good-Turing method:
d_r = \frac{\dfrac{(r+1)\, n_{r+1}}{r\, n_r} - \dfrac{(k+1)\, n_{k+1}}{n_1}}{1 - \dfrac{(k+1)\, n_{k+1}}{n_1}}
where r is the word-frequency statistic (the count of an N-gram), d_r denotes the discount factor corresponding to word frequency r, n_r denotes the COC statistic, i.e. the number of N-grams whose word frequency is r, and k is the boundary value of the discount calculation; word frequencies greater than k need no discount calculation. Suppose the discount factor of an N-gram whose word-frequency statistic is k is denoted D[N][k]. After the Map and Shuffle operations, the N-gram word-frequency statistics whose key is "today" are assigned to one Reduce node, for example:
"today" 2
"today I" 1
"today I want" 1
"today's" 1
"today's weather" 1
Taking the calculation of the probability of "today" as an example:
P("today") = Count("today") × D[1][2] / (the sum of the word frequencies of all 1-grams),
and the probability that "I" occurs after "today":
P("I" | "today") = Count("today I") × D[2][1] / Count("today").
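For illustration only, the discount factor formula of Step S3 and the two probability calculations above can be written as the following sketch; the function names, the coc table layout and the assumption that every needed count-of-counts entry is non-zero are mine, not the patent's:

```python
def discount_factor(r, coc, k):
    """Good-Turing/Katz discount factor d_r for a frequency r with 1 <= r <= k.

    coc[r] is the count-of-counts for this N-gram order (number of N-grams with
    frequency exactly r); entries up to k + 1 must exist and be non-zero.
    Frequencies above k are not discounted (factor 1.0).
    """
    if r > k:
        return 1.0
    good_turing = (r + 1) * coc[r + 1] / (r * coc[r])
    cutoff = (k + 1) * coc[k + 1] / coc[1]
    return (good_turing - cutoff) / (1.0 - cutoff)

def p_unigram(word, counts, coc1, k, total_unigram_freq):
    """Discounted unigram probability, mirroring
    P("today") = Count("today") * D[1][Count("today")] / (sum of all 1-gram frequencies)."""
    c = counts[(word,)]
    return c * discount_factor(c, coc1, k) / total_unigram_freq

def p_conditional(history, word, counts, coc_n, k):
    """Discounted conditional probability of a seen N-gram, mirroring
    P("I" | "today") = Count("today I") * D[2][Count("today I")] / Count("today")."""
    ngram = history + (word,)
    c = counts[ngram]
    return c * discount_factor(c, coc_n, k) / counts[history]
```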
Step S4:
After the Map and Shuffle operations, the data stored on one node are all the 1-grams together with the word frequencies of the n-grams whose key is "today":
"today" 2
"today I" 1
"today I want" 1
"today's" 1
"today's weather" 1
"I" 2
"want" 2
"go out" 2
"'s" 1
"weather" 1
"pretty good" 1
"wash" 1
"car" 1
On this node, the Katz smoothing algorithm is used to calculate the backoff coefficient of the 1-gram "today"; correspondingly, on each node the backoff coefficients of the 1-grams corresponding to that node's keys are calculated.
Step S5:
After the Map and Shuffle operations, the data stored on a certain node are the 2-gram and 3-gram word-frequency data whose key is "I":
"I want" 2
"today I want" 1
On each node, the backoff coefficients corresponding to the first two words of the 3-grams can be calculated; on this node that corresponds to "today I".
The backoff coefficient formula is as follows:
\alpha(xy) = \frac{1 - \sum_{z:\, C(xyz) > 0} P_{Katz}(z \mid xy)}{1 - \sum_{z:\, C(xyz) > 0} P_{Katz}(z \mid y)}
where α(xy) denotes the backoff coefficient corresponding to the 2-gram xy, C(xyz) denotes the word frequency of the 3-gram xyz, and P_Katz denotes the N-gram probability after the discounting operation. Thus the numerator sums the probabilities of all 3-grams with prefix xy whose word frequency is greater than zero, and the denominator sums, over the same suffixes z, the probabilities of the 2-grams with prefix y for which the corresponding 3-gram xyz has a word frequency greater than zero.
It follows that calculating the backoff coefficient corresponding to "today I" requires all 3-grams with prefix "today I" and the word-frequency statistics of all 2-grams with prefix "I". With this allocation scheme, all of the corresponding data happen to be placed on the same node.
When more sentences are provided as input, this node can calculate the backoff coefficients of all 2-grams whose second word is "I", such as "today I", "tomorrow I", "it is I", and so on.
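Again for illustration only, the backoff coefficient α(xy) of Step S5 can be computed as in the following sketch; counts, the p_katz helper and the tuple-based history are assumed interfaces rather than the patent's own data structures:

```python
def backoff_alpha(history, counts, p_katz):
    """Backoff coefficient alpha(x y) for a bigram history, mirroring the formula above.

    counts maps n-gram tuples to word frequencies; p_katz(ngram) is assumed to
    return the discounted probability of the last word given the preceding words.
    history is a tuple such as ("today", "I").
    """
    numerator = 1.0
    denominator = 1.0
    for ngram, c in counts.items():
        # every observed (m+1)-gram x y z that extends this history
        if len(ngram) == len(history) + 1 and ngram[:-1] == history and c > 0:
            z = ngram[-1]
            numerator -= p_katz(history + (z,))      # P_Katz(z | x y)
            denominator -= p_katz((history[-1], z))  # P_Katz(z | y)
    return numerator / denominator
```

On the node of the example above, backoff_alpha(("today", "I"), counts, p_katz) needs exactly the 3-grams with prefix "today I" and the 2-grams with prefix "I", which is why keying by the second-to-last word places all of the required data on one node.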
Step S6:
Increase m by one and judge whether m is less than or equal to N-1. In the trigram language model, 3-grams have no backoff coefficients, so the flow proceeds directly to Step S7. Otherwise, the calculation of the 3-gram backoff coefficients is similar to that of the 2-gram backoff coefficients: the 3-gram and 4-gram data would be distributed using the second-to-last word of the 3-gram as the key.
Step S7:
ARPA is the currently recognized standard storage format for N-gram language models and is not described further here.
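As a hedged illustration of Step S7 (the patent gives no code and only names the format), a minimal ARPA writer might look like the following; it follows the widely used ARPA conventions of base-10 log values and a backoff column for every order below the highest, and the function name and dictionary layout are assumptions:

```python
import math

def write_arpa(path, probs, alphas, n_max=3):
    """Write a minimal ARPA-format language model.

    probs[n] maps n-gram tuples to probabilities; alphas[n] maps n-gram tuples to
    backoff coefficients for the orders 1 .. n_max-1. Log values are base 10.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write("\\data\\\n")
        for n in range(1, n_max + 1):
            f.write(f"ngram {n}={len(probs[n])}\n")
        for n in range(1, n_max + 1):
            f.write(f"\n\\{n}-grams:\n")
            for ngram, p in probs[n].items():
                line = f"{math.log10(p):.6f}\t{' '.join(ngram)}"
                if n < n_max:  # the highest order carries no backoff column
                    line += f"\t{math.log10(alphas[n][ngram]):.6f}"
                f.write(line + "\n")
        f.write("\n\\end\\\n")
```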
As shown in Fig. 3, according to another aspect of the present invention, a language model training system is also provided, comprising a word-frequency module 1, a COC module 2, a probability module 3, a backoff-coefficient module 4 and a summarizing module 5.
The word-frequency module 1 is used for performing one round of MapReduce on the corpus to count the word-frequency statistics of the N-grams, where N is a positive integer greater than or equal to 2. The word-frequency module 1 performs the Map operation to output the first word of each N-gram as the key; the Shuffle operation assigns the N-grams with different keys to different Reduce nodes; and on each Reduce node the Reduce operation counts the number of occurrences of each N-gram as its word-frequency statistic.
The COC module 2 is used for performing one round of MapReduce on the N-gram word-frequency statistics to obtain the COC statistics. The COC module 2 presets a boundary value K for the discount calculation, where K is a positive integer; the Map operation outputs, as the key, the word-frequency statistic of each N-gram whose frequency is less than or equal to K; the Shuffle operation assigns the N-gram word-frequency statistics with different keys to different Reduce nodes; and the word-frequency statistics are summarized to obtain the COC statistics of the N-grams.
The probability module 3 is used for performing one round of MapReduce on the N-gram word-frequency statistics to calculate the probability values of the N-grams. The probability module 3 performs the Map operation to output, as the key, the first word of the N-gram corresponding to each N-gram word-frequency statistic; the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes; on each Reduce node the COC statistics are first imported and the discount factors of the corresponding N-gram word-frequency statistics are calculated; then the probability values of the 1-grams through the N-grams are calculated according to the discount factors; finally all N-gram probabilities are gathered.
The backoff-coefficient module 4 is used for performing multiple rounds of MapReduce to compute the backoff coefficients of the 1-grams through the m-grams, respectively, where m = N-1. The backoff-coefficient module 4 comprises:
a 1-gram unit 41, for performing one round of MapReduce to compute the backoff coefficients of the 1-grams (m = 1). The 1-gram unit 41 first distributes the data of all 1-grams and the data of the 2-grams to each Reduce node, then outputs the first word of each 1-gram or 2-gram as the key; the Shuffle operation assigns the 1-grams and 2-grams with different keys to different Reduce nodes; finally, on each node, the Katz smoothing algorithm is used to compute, from the word-frequency statistics of the 1-grams and 2-grams, the backoff coefficients of the 1-grams corresponding to the 1-gram and 2-gram data on that node. Specifically, each node is allocated 1-gram and 2-gram data, assigned to a particular node of the cluster according to the first word of the 1-gram or 2-gram; this guarantees that the unigram backoff coefficients computed on each node are global, i.e. identical to the values that would be obtained by placing all of the data on a single node.
a multi-gram unit 42, for performing multiple rounds of MapReduce to compute the backoff coefficients of the 2-grams through the m-grams (2 <= m <= N-1), respectively. The multi-gram unit 42 first distributes the data of all m-grams and the data of the (m+1)-grams to each node, then outputs the second-to-last word of each m-gram or (m+1)-gram as the key; the Shuffle operation assigns the m-grams and (m+1)-grams with different keys to different nodes; finally, on each node, the backoff coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are computed from the word-frequency statistics of the m-grams and (m+1)-grams. Specifically, the data allocation here is keyed by the second-to-last word of the m-grams and (m+1)-grams; for example, "we not", "strive for by us" and "inspiring ours" (phrases that share the same second-to-last word) are assigned to the same node, which guarantees that each node has sufficient data to compute the m-gram backoff coefficients.
The summarizing module 5 is used for summarizing the probability values and backoff coefficients to obtain the language model in ARPA format.
The present invention adopts a data structure based on a hash prefix tree, skillfully partitions and combines the massive data, distributes the data to each node of a cluster, counts the corresponding data values, and then performs the computation concurrently to obtain a language model based on massive data. A distributed version of the Katz algorithm is thereby realized, a language model based on massive data is trained effectively, the data-sparsity problem can be alleviated effectively, and the recognition rate is improved.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the identical or similar parts of the embodiments can be understood by referring to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively simple, and the relevant parts can be understood by referring to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are executed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to realize the described functions for each specific application, but such realization should not be considered to go beyond the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include these changes and modifications.

Claims (4)

1. A language model training method, characterized in that it comprises:
performing one round of MapReduce on the corpus to count the word-frequency statistics of the N-grams, where N is a positive integer greater than or equal to 2, wherein the step of performing one round of MapReduce on the corpus to count the N-gram word-frequency statistics comprises: the Map operation outputting the first word of each N-gram as the key, the Shuffle operation assigning the N-grams with different keys to different Reduce nodes, and on each Reduce node the Reduce operation counting the number of occurrences of each N-gram as its word-frequency statistic;
performing one round of MapReduce on the N-gram word-frequency statistics to obtain the COC statistics of the N-grams, wherein the step of performing one round of MapReduce on the N-gram word-frequency statistics to obtain the COC statistics of the N-grams comprises: presetting a boundary value K for discount calculation, where K is a positive integer, the Map operation outputting, as the key, the word-frequency statistic of each N-gram whose frequency is less than or equal to K, the Shuffle operation assigning the N-gram word-frequency statistics with different keys to different Reduce nodes, and summarizing the word-frequency statistics to obtain the COC statistics of the N-grams;
performing one round of MapReduce on the N-gram word-frequency statistics to obtain the probability values of the 1-grams through the N-grams, wherein the step of performing one round of MapReduce on the N-gram word-frequency statistics to obtain the probability values of the 1-grams through the N-grams comprises: the Map operation outputting, as the key, the first word of the N-gram corresponding to each N-gram word-frequency statistic, the Shuffle operation assigning the N-gram word frequencies with different keys to different Reduce nodes, importing the COC statistics on each Reduce node and calculating the discount factors of the corresponding N-gram word-frequency statistics, and calculating the probability values of the 1-grams through the N-grams according to the discount factors;
performing multiple rounds of MapReduce to compute the backoff coefficients of the 1-grams through the m-grams, respectively, where m = N-1, wherein the step of performing multiple rounds of MapReduce to compute the backoff coefficients of the 1-grams through the m-grams, respectively, comprises: performing one round of MapReduce to compute the backoff coefficients of the 1-grams, and performing multiple rounds of MapReduce to compute the backoff coefficients of the 2-grams through the m-grams, respectively, wherein the step of performing one round of MapReduce to compute the backoff coefficients of the 1-grams comprises: distributing the data of all 1-grams and the data of the 2-grams to each Reduce node, outputting the first word of each 1-gram or 2-gram as the key, the Shuffle operation assigning the 1-grams and 2-grams with different keys to different Reduce nodes, and on each node using the Katz smoothing algorithm to compute, from the word-frequency statistics of the 1-grams and 2-grams, the backoff coefficients of the 1-grams corresponding to the 1-gram and 2-gram data on that node; and
summarizing the probability values and backoff coefficients to obtain a language model in ARPA format.
2. The language model training method as claimed in claim 1, characterized in that the step of computing the backoff coefficients of the 2-grams through the m-grams, respectively, comprises:
distributing the data of all m-grams and the data of the (m+1)-grams to each node;
outputting the second-to-last word of each m-gram or (m+1)-gram as the key;
the Shuffle operation assigning the m-grams and (m+1)-grams with different keys to different nodes; and
on each node, computing, from the word-frequency statistics of the m-grams and (m+1)-grams, the backoff coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node.
3. A language model training system, characterized in that it comprises:
a word-frequency module, for performing one round of MapReduce on the corpus to count the word-frequency statistics of the N-grams, where N is a positive integer greater than or equal to 2, wherein the word-frequency module performs the Map operation to output the first word of each N-gram as the key, the Shuffle operation assigns the N-grams with different keys to different Reduce nodes, and on each Reduce node the Reduce operation counts the number of occurrences of each N-gram as its word-frequency statistic;
a COC module, for performing one round of MapReduce on the N-gram word-frequency statistics to obtain the COC statistics of the N-grams, wherein the COC module presets a boundary value K for discount calculation, where K is a positive integer, the Map operation outputs, as the key, the word-frequency statistic of each N-gram whose frequency is less than or equal to K, the Shuffle operation assigns the N-gram word-frequency statistics with different keys to different Reduce nodes, and the word-frequency statistics are summarized to obtain the COC statistics of the N-grams;
a probability module, for performing one round of MapReduce on the N-gram word-frequency statistics to obtain the probability values of the 1-grams through the N-grams, wherein the probability module performs the Map operation to output, as the key, the first word of the N-gram corresponding to each N-gram word-frequency statistic, the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes, the COC statistics are imported on each Reduce node to calculate the discount factors of the corresponding N-gram word-frequency statistics, and the probability values of the 1-grams through the N-grams are calculated according to the discount factors;
a backoff-coefficient module, for performing multiple rounds of MapReduce to compute the backoff coefficients of the 1-grams through the m-grams, respectively, where m = N-1, wherein the backoff-coefficient module comprises a 1-gram unit, for performing one round of MapReduce to compute the backoff coefficients of the 1-grams, and a multi-gram unit, for performing multiple rounds of MapReduce to compute the backoff coefficients of the 2-grams through the m-grams, respectively, wherein the 1-gram unit distributes the data of all 1-grams and the data of the 2-grams to each Reduce node, outputs the first word of each 1-gram or 2-gram as the key, the Shuffle operation assigns the 1-grams and 2-grams with different keys to different Reduce nodes, and on each node the Katz smoothing algorithm is used to compute, from the word-frequency statistics of the 1-grams and 2-grams, the backoff coefficients of the 1-grams corresponding to the 1-gram and 2-gram data on that node; and
a summarizing module, for summarizing the probability values and backoff coefficients to obtain a language model in ARPA format.
4. The language model training system as claimed in claim 3, characterized in that the multi-gram unit distributes the data of all m-grams and the data of the (m+1)-grams to each node, outputs the second-to-last word of each m-gram or (m+1)-gram as the key, the Shuffle operation assigns the m-grams and (m+1)-grams with different keys to different nodes, and on each node the backoff coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are computed from the word-frequency statistics of the m-grams and (m+1)-grams.
CN 201110301029 2011-09-28 2011-09-28 Language model training method and system Active CN102509549B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110301029 CN102509549B (en) 2011-09-28 2011-09-28 Language model training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110301029 CN102509549B (en) 2011-09-28 2011-09-28 Language model training method and system

Publications (2)

Publication Number Publication Date
CN102509549A CN102509549A (en) 2012-06-20
CN102509549B true CN102509549B (en) 2013-08-14

Family

ID=46221624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110301029 Active CN102509549B (en) 2011-09-28 2011-09-28 Language model training method and system

Country Status (1)

Country Link
CN (1) CN102509549B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514230B (en) * 2012-06-29 2018-06-05 北京百度网讯科技有限公司 A kind of method and apparatus being used for according to language material sequence train language model
CN103631771A (en) * 2012-08-28 2014-03-12 株式会社东芝 Method and device for improving linguistic model
CN103871404B (en) * 2012-12-13 2017-04-12 北京百度网讯科技有限公司 Language model training method, query method and corresponding device
CN104112447B (en) * 2014-07-28 2017-08-25 安徽普济信息科技有限公司 Method and system for improving accuracy of statistical language model
KR102167719B1 (en) * 2014-12-08 2020-10-19 삼성전자주식회사 Method and apparatus for training language model, method and apparatus for recognizing speech
CN106156010B (en) * 2015-04-20 2019-10-11 阿里巴巴集团控股有限公司 Translate training method, device, system and translation on line method and device
CN106055543B (en) * 2016-05-23 2019-04-09 南京大学 A Spark-Based Training Method for Large-Scale Phrase Translation Models
CN107436865B (en) * 2016-05-25 2020-10-16 阿里巴巴集团控股有限公司 Word alignment training method, machine translation method and system
CN106257441B (en) * 2016-06-30 2019-03-15 电子科技大学 A training method of skip language model based on word frequency
CN106649269A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Method and device for extracting spoken sentences
CN108021712B (en) * 2017-12-28 2021-12-31 中南大学 Method for establishing N-Gram model
CN110379416B (en) * 2019-08-15 2021-10-22 腾讯科技(深圳)有限公司 Neural network language model training method, device, equipment and storage medium
CN112862662A (en) * 2021-03-12 2021-05-28 云知声智能科技股份有限公司 Method and equipment for distributed training of transform-xl language model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8332207B2 (en) * 2007-03-26 2012-12-11 Google Inc. Large language models in machine translation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1673997A (en) * 2004-03-26 2005-09-28 微软公司 Representation of a deleted interpolation n-gram language model in ARPA standard format
CN101764835A (en) * 2008-12-25 2010-06-30 华为技术有限公司 Task allocation method and device based on MapReduce programming framework
CN101604522A (en) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 The embedded Chinese and English mixing voice recognition methods and the system of unspecified person

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229286A (en) * 2017-05-27 2018-06-29 北京市商汤科技开发有限公司 Language model generates and application process, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN102509549A (en) 2012-06-20

Similar Documents

Publication Publication Date Title
CN102509549B (en) Language model training method and system
US9864807B2 (en) Identifying influencers for topics in social media
US10310812B2 (en) Matrix ordering for cache efficiency in performing large sparse matrix operations
CN102426610B (en) Microblog rank searching method and microblog searching engine
US10726018B2 (en) Semantic matching and annotation of attributes
CN107608953B (en) A word vector generation method based on variable-length context
US9092422B2 (en) Category-sensitive ranking for text
CN114168740B (en) Transformer concurrency fault diagnosis method based on graph convolution neural network and knowledge graph
Ordentlich et al. Network-efficient distributed word2vec training system for large vocabularies
CN112181659B (en) Cloud simulation memory resource prediction model construction method and memory resource prediction method
Wang et al. Design and Application of a Text Clustering Algorithm Based on Parallelized K-Means Clustering.
CN114819148B (en) Language model compression method based on uncertainty estimation knowledge distillation
CN101119302A (en) A method for mining frequent patterns in recent time windows on transactional data streams
US20150170030A1 (en) Determining geo-locations of users from user activities
CN109710621A (en) Keyword Search KSANEW Algorithm Combining Semantic Class Nodes and Edge Weights
US10769140B2 (en) Concept expansion using tables
CN106126721A (en) The data processing method of a kind of real-time calculating platform and device
CN109828965B (en) Data processing method and electronic equipment
CN104778205A (en) Heterogeneous information network-based mobile application ordering and clustering method
CN106156142A (en) The processing method of a kind of text cluster, server and system
CN107436865A (en) A kind of word alignment training method, machine translation method and system
JP6261669B2 (en) Query calibration system and method
CN113190763A (en) Information recommendation method and system
CN108268982A (en) A kind of extensive active power distribution network decomposition strategy evaluation method and device
CN105095239A (en) Uncertain graph query method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHANGHAI GUOKE ELECTRONIC CO., LTD.

Free format text: FORMER OWNER: SHENGYUE INFORMATION TECHNOLOGY (SHANGHAI) CO., LTD.

Effective date: 20140919

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20140919

Address after: 201203, room 1, building 380, 108 Yin Yin Road, Shanghai, Pudong New Area

Patentee after: Shanghai Guoke Electronic Co., Ltd.

Address before: 201203 Shanghai Guo Shou Jing Road, Zhangjiang High Tech Park of Pudong New Area No. 356 building 3 Room 102

Patentee before: Shengle Information Technology (Shanghai) Co., Ltd.

CP03 Change of name, title or address

Address after: Room 127, building 3, 356 GuoShouJing Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 200120

Patentee after: SHANGHAI GEAK ELECTRONICS Co.,Ltd.

Address before: Room 108, building 1, 380 Yinbei Road, Pudong New Area, Shanghai 201203

Patentee before: Shanghai Nutshell Electronics Co.,Ltd.

CP03 Change of name, title or address