
MXPA97002521A - Method and apparatus for a better language recognition system - Google Patents

Method and apparatus for a better language recognition system

Info

Publication number
MXPA97002521A
MXPA97002521A (application MXPA/A/1997/002521A; also published as MX9702521A)
Authority
MX
Mexico
Prior art keywords
phrase
cost
words
language
word
Application number
MXPA/A/1997/002521A
Other languages
Spanish (es)
Other versions
MX9702521A (en)
Inventor
Alshawi Hiyan
Original Assignee
Lucent Technologies Inc
Priority claimed from US08/631,874 (US5870706A)
Application filed by Lucent Technologies Inc
Publication of MX9702521A
Publication of MXPA97002521A

Abstract

The present invention relates to a method for recognizing language, wherein a signal indicative of the language to be recognized is generated, comprising the steps of: generating strings of candidate words for the signal; and selecting among the candidates using a language model that uses a plurality of finite-state machines, each machine having the ability to recognize a pair of sequences, one sequence scanned to the left and the other scanned to the right, and each machine corresponding to a word in a vocabulary of the language model.

Description

METHOD AND APPARATUS FOR AN IMPROVED LANGUAGE RECOGNITION SYSTEM

FIELD OF THE INVENTION

The present invention relates in general to language recognition. More particularly, the present invention relates to an improved language recognition system that uses a model of probabilistic lexical associations.

BACKGROUND OF THE INVENTION

Speech recognition is a process by which an unknown speech utterance (the "input signal") is identified. Speech recognition typically involves a signal processing step in which a plurality of word-string hypotheses, that is, possible sequences of words, are proposed for the input signal. The task is then to recognize or identify the "best" word string from the set of hypotheses, that is, from the proposed word strings that are consistent with the input signal. Speech recognition systems use a language model for this purpose.

Typical speech recognition systems can employ a quantitative language model. Quantitative models associate a "cost" with each hypothesis, selecting the lowest-cost hypothesis as the recognized word string. An example of a quantitative model is a probabilistic language model. Probabilistic models assign probabilities to word strings and choose the string that has the highest probability of representing a given input signal. The probability can be calculated using a variety of methods. One such method, referred to as the N-gram model, specifies the probability of a word in a string conditioned on the previous N words in the string. See, for example, Jelinek et al., "Principles of Lexical Language Modeling for Speech Recognition," Advances in Speech Signal Processing, pp. 651-699 (1992). This article and all other articles mentioned in this specification are incorporated herein by reference. The N-gram model is lexically sensitive, since the parameters of the model are associated with particular lexical items, that is, words. This sensitivity allows the model to capture local distribution patterns that are idiosyncratic to particular words.

A second method, referred to as stochastic context-free grammar, uses a tree-like data structure in which the words of the input signal appear as leaf nodes of a tree. The probability assigned to a candidate string is the sum of the probabilities of all the tree derivations in which the words of the candidate string appear as leaf nodes. See, for example, Jelinek et al., "Computation of the Probability of Initial Substring Generation by Stochastic Context-Free Grammars," Computational Linguistics, v. 17(3), pp. 315-324 (1991). Context-free grammars model structural properties, that is, the probability that a phrase of a particular category, for example a noun phrase or a verb phrase, decomposes into sub-phrases of specific categories.

Both of the aforementioned methods for estimating probability have disadvantages. The N-gram model, despite its lexical sensitivity, suffers from its failure to capture significant long-range associations between words. Because grammar is ignored, useful information that can only be derived from the grammatical relationships between words is lost.
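For concreteness, the probability that the N-gram model assigns to a word string can be written as follows. This is the standard textbook formulation, given here for illustration rather than quoted from the patent:

```latex
P(w_1 \dots w_T) = \prod_{t=1}^{T} P\big(w_t \mid w_{t-N}, \dots, w_{t-1}\big)
```

Because each factor conditions only on a bounded window of recent words, any dependency that spans more than that window is invisible to the model, which is precisely the weakness discussed above.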
While a stochastic context-free grammar is sensitive to these grammatical relationships, it fails to capture the associations between lexical items that reflect semantic information and that make one string much more likely than another. A language model that fails to consider both structural and semantic information inevitably suffers a loss of precision.

Probability models of the prior art are typically compiled into one large state machine. The aforementioned disadvantage of lexically sensitive probability models is partly due to this structure. The machines usually implemented for speech recognition are typically limited to moving from left to right through the word-string hypotheses, processing the words of a string word by word. As a result, long-range associations between words are lost. Compiling stochastic context-free grammars, or more precisely approximations of these grammars, into a large state machine does not limit the ability of these models to capture long-range associations; as discussed previously, those associations are captured by the nature of the model itself. There is another disadvantage, however, related to the use of one large single state machine, and it affects both types of probability models. When the model is compiled into a large state machine, the complete vocabulary, or lexicon, of the language model must be contained in it. In the typical case of a software implementation, these state machines become too large for computers with limited RAM. Thus, there is a need for a language model that has both lexical and structural sensitivity and that, when implemented in software, is compact enough to be installed on computers that have limited RAM.

SUMMARY OF THE INVENTION

Methods and apparatus for improved language recognition systems and language models are described. In accordance with the present invention, a plurality of "small" finite-state machines drive the language model. Each of these machines has the ability to recognize a pair of sequences, one scanned to the left and the other scanned to the right. Each finite-state machine, here referred to as a lexical head machine, corresponds to a word in the vocabulary of the language model. According to the present methods, only the lexical head machines that correspond to the words contained in the word-string hypotheses are activated.

Active lexical head machines construct, or derive, phrases from the words contained in the word-string hypotheses by a series of left or right transitions. A plurality of these phrases is created by the lexical head machines for the various words as they form associations with other words in the word-string hypotheses. The lexical head machines incrementally calculate a "cost" for the derived phrases. The cost is related to the probability that the derived phrase corresponds to the input language signal. The lowest-cost phrase is chosen as the phrase that corresponds to the input language signal.

As noted above, the present method uses a limited set of "small" lexical head machines that correspond to the words in the word-string hypotheses, rather than one "large" state machine that incorporates the entire vocabulary. As such, the present methods and apparatus can be implemented using significantly less RAM than the language recognition systems of the prior art.
The finite-state machines of the present invention, which recognize a pair of sequences, are distinct from the so-called "two-way" finite-state machines that can move either left or right but recognize only a single sequence. Such two-way machines are known in the art and have the same recognition power as finite-state machines that can move only from left to right. See, for example, Hopcroft et al., Introduction to Automata Theory, Languages, and Computation (Addison-Wesley, 1979). Notwithstanding these two-way state machines, the state machines typically employed in the prior art for speech recognition are usually limited to processing word strings by moving from left to right, whereas the lexical head machines employed in the present invention simultaneously explore the input to the left and to the right of particular words in the middle of the string. This results in more accurate predictions of adjoining words, since processing can start with a less common word, which limits the possibilities for the adjoining word. Consider the example sentence "I want the transistor." A state machine restricted to processing from left to right would have to select the word that follows "the," that is, "I want the ?". Presumably, a large number of words in the particular vocabulary used can plausibly follow the word "the" in the example phrase. The lexical head machines of the present invention, which process in either direction, are free to start with the word "transistor" and choose the preceding word. There are far fewer choices of words that may plausibly precede the word "transistor" than of words that may follow the word "the."

By virtue of using a plurality of small lexical head machines, the present methods and apparatus are sensitive both lexically and structurally. Lexical associations are captured because each head-machine transition involves costs tied to particular lexical items, that is, to word associations. The structural associations implicit in the hierarchical organization of a phrase are captured as a result of the cascading of the lexical head machines.

BRIEF DESCRIPTION OF THE DRAWINGS

Additional features of the invention will become more apparent from the following detailed description of specific embodiments thereof, when read in conjunction with the accompanying drawings, wherein:

FIGURE 1 illustrates a method according to the present invention for implementing a speech recognition system;
FIGURE 2 is an illustration of a word network;
FIGURE 3 is a diagram showing state transitions for an exemplary lexical head machine in accordance with the present invention;
FIGURE 4 is an embodiment of a method according to the present invention for generating a plurality of phrase records from which to choose the best of a plurality of word-string hypotheses;
FIGURE 5 shows an exemplary sub-phrase derivation produced by the present methods and apparatus for a sample word string; and
FIGURE 6 shows the lexical head machines and the transitions required to generate the sub-phrase derivation of FIGURE 5.

DETAILED DESCRIPTION

The present invention relates to language modeling methods for use in a variety of language recognition applications. The role of the language model in language recognition is to identify the "best" word string from a set of word-string hypotheses developed by other parts of the language recognition system. The present invention will be described in the context of speech recognition. It will be understood, however, that the present methods are applicable to all modalities of language recognition, including, without limitation, handwriting recognition and optical character recognition. It will be appreciated that the present methods can be implemented in software or hardware.
FIGURE 1 is an illustration of a speech recognition method according to the present invention.
Reduced to its basic concept, the method comprises speech signal processing SSP, language analysis LA and application processing AP. The method starts with the speech signal processing SSP, where a speech signal is accepted and a set of word-string hypotheses consistent with that speech signal is generated. In a speech recognition system, the word-string hypotheses are generated by what is referred to as an "acoustic model." Such models are well known to those skilled in the art. In more detail, the speech signal processing SSP includes converting an analog speech signal to a digitized speech signal in operation block 10, and searching a recognition network and generating word-string hypotheses in operation block 15. However the present language recognition methods are used, this signal processing generates the word-string hypotheses as a sequencing of words, whether by processing a speech signal or whether the processing pertains to other modalities of language recognition. This sequencing of words can take the form of, without limitation, an explicit set of candidate word strings or, preferably, a word-network data structure. The word network is a well-known construction for storing a collection of possible strings that allows sub-strings to be shared. The techniques referred to in operation blocks 10 and 15 are well known in the art.

The language analysis LA accepts the word-string hypotheses and uses a language model according to the present teachings, from which it chooses the best word string. The methods and apparatus of the present invention relate in particular to this aspect of the language recognition process. The present language model can thus be implemented in a language recognition system, such as the speech recognition system presently described. In more detail, the word-string hypotheses are received from the speech signal processing SSP in operation block 20. The language model is applied to generate and rank a list of possible word strings, or phrases, corresponding to the signal in operation block 22. In operation block 24, the best word string is chosen, and as indicated in operation block 26, the best word string is sent to the application processing AP. The application processing thus accepts the best string and then processes that string as appropriate, for example for translation, transcription or the like.

Having described where the methods and apparatus of the present language model fit into a language recognition process or system according to the present invention, the present language model and the methods for its implementation will now be described in detail. As previously described, the language analysis LA receives a set of word-string hypotheses. Preferably, these word strings are in the form of a word network, that is, a directed acyclic graph. An exemplary word network is illustrated in FIGURE 2. The word network has a set of initial nodes I, represented in FIGURE 2 by i0, and a set of final nodes J, represented in FIGURE 2 by j1 and j2. The hypotheses represented by the word network correspond to possible paths from the set of initial nodes I to the set of final nodes J. The word network is further characterized by a plurality of "network arcs" or "word arcs," each labeled with a word, w, between two positions that represent time points of the speech. Each arc is also labeled with a cost, c, which reflects how well the word matches that portion of the input signal.
For example, in FIGURE 2 the word arcs are labeled w0, c0 through w8, c8. The network and its costs are generated during the speech signal processing SSP using techniques well known in the art. The word-arc cost is accumulated by the present method and thus contributes to the cost of a phrase, thereby playing a role in determining the best phrase. The set of arcs in the input word network can thus be represented as a set of records of the form <w, i, j, c0>, where i and j are indices of network nodes. For a network produced from a speech signal, the usual interpretation of such an arc record is that the word w corresponds to the input speech signal from time position i to time position j with cost c0.

Presented with a word network, the lexical head machines for the words present in the network are activated. Each lexical head machine consists of a finite set of states Q and a cost-action table T. The entries in the action table can be start actions, left transitions, right transitions or stop actions. The notation C(A, m) represents the total cost of a sequence of actions A = a1 ... ak performed by the lexical head machine m, where a1 is a start action and ak is a stop action. C(A, m) is thus the sum of the costs of the actions in the sequence A.
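The arc records and the machine definition just given can be made concrete with a minimal Python sketch. The class and field names are illustrative assumptions, not terminology from the patent:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class WordArc:
    """Arc record <w, i, j, c0>: word w spans nodes i..j at acoustic cost c0."""
    w: str
    i: int
    j: int
    c0: float

@dataclass
class HeadMachine:
    """A lexical head machine: a finite state set Q and a cost-action table T."""
    states: Tuple[str, ...]                    # Q
    start_cost: Dict[str, float]               # C(start, q, m); a state absent here
                                               # behaves as having infinite start cost
    stop_cost: Dict[str, float]                # C(stop, q, m)
    left: Dict[str, List[Tuple[str, float]]]   # q -> [(q', cost of the left action)]
    right: Dict[str, List[Tuple[str, float]]]  # q -> [(q', cost of the right action)]

def sequence_cost(action_costs: List[float]) -> float:
    """C(A, m): the sum of the costs of the actions a1 ... ak taken by machine m,
    where a1 is a start action and ak is a stop action."""
    return sum(action_costs)
```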
FIGURE 3 is a diagram showing state transitions for an exemplary lexical head machine in accordance with the present invention. The nodes q1-q6 represent the various states of the lexical head machine. The arcs between the nodes show the state transitions, in which the head machine consumes a phrase, indicated by wn, for example w2. The state transitions are characterized as left or right actions, indicated by the direction of the arrow adjacent to the phrase. For example, the exemplary lexical head machine moves from state q1 to state q2 by consuming the phrase w2 in a left transition. The machine tracks two position pointers in the string: a left transition moves the left pointer to the left, and a right transition moves the right pointer to the right. The arrowheads at q1 and q2 indicate that there is a finite start-action cost in these states; in other words, these states are likely starting points for the phrase. The other states have infinite start-action costs, and are thus unlikely starting points for the phrase. The concentric circles at q3 and q6 indicate that there is a finite stop-action cost in these states.

The lexical head machine for a word w builds, or derives, a phrase, that is, an ordering of the words in the network, by a series of left or right transitions. At each transition, the phrase is extended to "consume" an adjacent phrase, which was in turn formed as a "sub-phrase derivation" by another lexical head machine for a word w'. This move corresponds to forming an association between w, "the head," and w', "the dependent." In this way, a plurality of such phrases is created by the lexical head machines for the various words as they form associations with other words in the word network. An example of a sub-phrase derivation, of the lexical head machines and of the actual transitions of these machines, as generated according to the present invention to recognize a sample phrase, is presented later in this specification.

The method proceeds by adding these phrases, in their different states of completion, to a phrase network. This network, distinct from the word network, is a set of phrase records, each corresponding to a particular state of execution of a lexical head machine for some word. A phrase record has the following fields: <w, s, i, j, q, m, c>. In the record, w is the head of a phrase, possibly incomplete, that spans positions i to j in its current stage of completion. The phrase is constructed according to a lexical head machine m, whose current state is q. In addition, s is the list of output words built up to that point, and c is the running cost associated with the phrase hypothesis. The running cost is the sum of the costs accumulated up to that point in the formation of the phrase.

The cost of a phrase is calculated by the lexical head machines. Each move of a head machine contributes to the cost of the phrase an amount that depends on the state of the machine and the identities of the two words w and w'. The word string, or phrase, selected by the method is the lowest-cost phrase that spans the entire word network, that is, from the beginning of the input speech signal to the end of that signal. The cost of deriving a phrase that spans the entire network involves the costs of the machine actions leading to the derivation, together with additional costs for associating machines with words and for the associations between each head word and its dependent words.
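Before turning to the cost parameters, note that the phrase record <w, s, i, j, q, m, c> just defined maps naturally onto a small structure. A minimal Python sketch, with illustrative field names, follows; later sketches in this description reuse it:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class PhraseRecord:
    """Phrase record <w, s, i, j, q, m, c> from the phrase network."""
    w: str              # head word of the (possibly incomplete) phrase
    s: Tuple[str, ...]  # list of output words built up to this point
    i: int              # left position in the word network
    j: int              # right position in the word network
    q: str              # current state of the head machine
    m: str              # name of the lexical head machine building the phrase
    c: float            # running cost accumulated so far
```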
Additional cost parameters include association parameters, which specify the cost C(h(wj, wk)) of the word wj being the head of the word wk, and lexical parameters, which supply the cost C(m, w) of the word w running the machine m. Each pairing of a word and a machine, together with its corresponding lexical parameter, appears as an entry in a lexicon, or dictionary. It will be understood that there may be more than one entry, that is, more than one machine per word, in the lexicon. The reason is that a given word can be used in more than one sense, for example as a pronoun or as a verb.

The cost C(D0, w0) of a sub-phrase derivation D0 with head word w0 is the sum of the lexical cost of choosing a machine m0, the cost of the sequence of machine actions A0 taken by m0 in the derivation, the association parameters for associating w0 with its dependent words w1 ... wn, and the derivation costs of the sub-phrases headed by these dependents, computed recursively:

C(D0, w0) = C(m0, w0) + C(A0, m0) + Σ(1≤m≤n) [ C(h(w0, wm)) + C(Dm, wm) ]

Various cost functions can be used to calculate the cost of a phrase. Usually, the cost function is based on probabilities, where less probable word associations lead to higher costs. In this way the cost reflects long-range associations between words in a string. Cost functions are described in more detail later in this specification.

The method by which the lexical head machines analyze the word network is described in more detail below. In a preferred embodiment, a running cost is calculated for the phrases as they are extended, so that a phrase can be cut off as soon as it becomes evident that it will not be part of the lowest-cost phrase. A hash table is preferably used for this purpose. The entries in the hash table comprise a key <w, i, j, q, m> and a value that is a pointer to a phrase record. The hash table maintains a pointer to the lowest-cost phrase record that spans i to j, headed by w, in state q of machine m. The information constituting the key is referred to as the "complete state," and its cost c as the "complete-state cost." The preferred method for analyzing the word network has a "bottom-up" control structure similar to that of context-free parsing algorithms such as CKY, as described by Younger, and uses data structures similar to those of chart parsing as described by Earley. See Younger, D., "Recognition and Parsing of Context-Free Languages in Time n³," Information and Control, Vol. 10, pp. 189-208, 1967; Earley, J., "An Efficient Context-Free Parsing Algorithm," Comm. of the ACM, Vol. 14, pp. 453-460, 1970. The present method is driven by the lexical head machines for the words in the word network.
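The recursion above translates directly into code. The following sketch assumes that the association parameter and the sub-derivation cost for each dependent have already been computed:

```python
def derivation_cost(lexical_cost, machine_action_cost, dependents):
    """C(D0, w0) = C(m0, w0) + C(A0, m0)
                   + sum over dependents of [ C(h(w0, wm)) + C(Dm, wm) ]

    lexical_cost        -- C(m0, w0): cost of choosing machine m0 for head word w0
    machine_action_cost -- C(A0, m0): summed cost of the actions taken by m0
    dependents          -- list of (association_cost, sub_derivation_cost) pairs,
                           the latter computed by this same recursion
    """
    return (lexical_cost
            + machine_action_cost
            + sum(assoc + sub for assoc, sub in dependents))
```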
FIGURE 4 illustrates an embodiment of a method according to the present invention whereby the plurality of lexical head machines is used to generate a plurality of phrase records from which a best phrase record, that is, a best word string, is chosen. FIGURE 4 thus illustrates a method according to the present invention for carrying out step 22 of FIGURE 1. As indicated in operation block 100, the word network generated by the speech signal processing SSP is received. The method starts with an initialization step, which is carried out by operation blocks 105 to 120, identified collectively by reference numeral 130. The initialization is carried out by adding to a waiting list a set of phrase records <w, [w], i, j, m, q0, c> developed from the word network. Such a phrase record is added for each item <w, i, j, c0> in the word network and each entry (m, w) in the lexicon. Thus, in operation block 105, a lexical head machine corresponding to one of the words in the word network is activated. More specifically, a lexical head machine corresponding to a word w of the word network is retrieved from a lexicon stored in a memory device. The lexical entry corresponding to the word w includes a machine m and a cost c1 = C(m, w). The machine m includes a start action that has a cost c2 = C(start, q0, m). The cost c of each phrase record is the sum of the lexical cost c1, the machine start cost c2 and the word-arc cost c0 assigned by the speech recognizer. All the lexical head machines for each word arc in the word network are activated through the loop formed by blocks 105 to 115.

The remaining operation and decision blocks 140-196 form a loop that consumes items from the waiting list and creates new phrase records. Decision block 140 tests whether the waiting list is empty. If the waiting list is empty, all the low-cost phrase records that can be developed from the input word network have been extended. The phrase network, that is, the collection of phrase records developed by the present methods, is then post-processed to select the best word string, as indicated in operation block 200. This post-processing is described later in this specification. If the waiting list is not empty, processing continues in operation block 145, where a phrase record is removed from the waiting list. In a preferred embodiment of the present method, the cost c of the phrase record is compared, in block 150, with the cost of the lowest-cost phrase, that is, the complete-state cost, in the hash table. If the cost c of the phrase record under consideration ("the present phrase") is not lower than the complete-state cost, the present phrase is discarded, or "cut off." Processing then returns to block 140 to determine whether another phrase record is available. If the present phrase record has a lower cost than the lowest-cost phrase in the hash table, it is added to the phrase network in operation block 155. While block 150 is not a required step of the present method, it improves efficiency because it avoids creating phrase records that would later be discarded. If, after the phrase record is added to the phrase network in operation block 155, the phrase record is adjacent to another phrase, a combination action can be carried out. Thus, decision block 160 tests whether or not there are more phrases to combine. If not, processing loops back to decision block 140.
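The loop of blocks 140-160, together with the combination and pruning steps described next (blocks 180-195), can be summarized in a simplified Python skeleton. The callables are placeholders for operations described in the text, and some bookkeeping, such as removing a superseded complete-state record from the phrase network, is omitted:

```python
import math

def best_phrase_search(initial_records, combine, complete_state):
    """Simplified agenda loop of FIGURE 4, blocks 140-195.

    initial_records   -- phrase records seeded from the word network (blocks 105-120)
    combine(r, net)   -- yields records extended from phrases adjacent to r (block 180)
    complete_state(r) -- the hash key <w, i, j, q, m> used for pruning
    """
    waiting = list(initial_records)       # the waiting list
    table = {}                            # complete state -> lowest cost seen so far
    phrase_network = []
    while waiting:                                            # block 140
        rec = waiting.pop()                                   # block 145
        if rec.c > table.get(complete_state(rec), math.inf):
            continue                                          # block 150: cut off
        phrase_network.append(rec)                            # block 155
        for new in combine(rec, phrase_network):              # blocks 160-180
            key = complete_state(new)
            if new.c < table.get(key, math.inf):              # block 185
                table[key] = new.c                            # block 190
                waiting.append(new)                           # block 195
    return phrase_network
```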
If there are more phrases, a combination operation, performed by the operation blocks collectively identified by reference numeral 180, results in a new record for an extended phrase. The old records remain in the network. Two types of combination are possible: left combination and right combination. In a left combination, the machine of the phrase record on the right performs a left action, as described below. If the network contains a first phrase record <w1, s1, i, k, m1, q1, c1> to the left of a second phrase record <w2, s2, k, j, m2, q2, c2>, m2 includes a left action, taking state q2 to a state q'2, with a cost c3 = C(left, q'2, q2, m2), and m1 includes a stop action with a cost c4 = C(stop, q1, m1), then the combination performed in operation block 165 produces the following extended phrase: <w2, s'2, i, j, m2, _, _>, where s'2 = concatenate(s1, s2). The right combination is the mirror image of the left combination. A new state is established in operation block 170, according to the machine transition. In the previous example the new state is q'2, so that the extended phrase becomes <w2, s'2, i, j, m2, q'2, _>. The cost of the new phrase is determined in operation block 175. The new cost is the sum of the machine-transition cost, the word-association cost, the cost of the consuming phrase, the cost of the consumed phrase and the stop cost of the consumed machine. For the previous example, the new cost c'2 is thus given by c'2 = c1 + c2 + c3 + c4 + C(h(w2, w1)). The extended phrase record then becomes <w2, s'2, i, j, m2, q'2, c'2>.
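The left combination just described can be rendered in code, reusing the PhraseRecord and HeadMachine sketches above. The helpers `machines` (a lookup from machine name to HeadMachine) and `association_cost` (returning the association parameter C(h(head, dependent))) are assumptions made for the sake of the sketch:

```python
def left_combine(left_rec, right_rec, machines, association_cost):
    """Left combination (operation blocks 165-175): the machine m2 of the
    right-hand phrase consumes the adjacent left-hand phrase, whose machine
    m1 takes a stop action.

        <w1,s1,i,k,m1,q1,c1> + <w2,s2,k,j,m2,q2,c2>
            -> <w2, s1+s2, i, j, m2, q2', c1+c2+c3+c4+C(h(w2, w1))>
    """
    assert left_rec.j == right_rec.i                     # phrases must be adjacent
    results = []
    c4 = machines[left_rec.m].stop_cost.get(left_rec.q)  # C(stop, q1, m1)
    if c4 is None:
        return results                                   # m1 cannot stop in q1
    for q2_new, c3 in machines[right_rec.m].left.get(right_rec.q, []):
        new_cost = (left_rec.c + right_rec.c + c3 + c4   # c1 + c2 + c3 + c4
                    + association_cost(right_rec.w, left_rec.w))
        results.append(PhraseRecord(
            w=right_rec.w, s=left_rec.s + right_rec.s,
            i=left_rec.i, j=right_rec.j,
            q=q2_new, m=right_rec.m, c=new_cost))
    return results
```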
For each new phrase record that results from the combination operation 180, the record's cost is compared, in block 185, with the complete-state cost in the hash table. If the cost of the new phrase record is greater than the complete-state cost, processing returns to operation block 160 without adding the new phrase record to the waiting list, so that it is effectively discarded. If the cost of the new phrase record is lower than the complete-state cost, the hash-table entry is updated with a pointer to the new record in step 190, and the old complete-state record is removed from the phrase network. The new low-cost phrase record is then added to the waiting list in step 195, and processing continues at block 160.

After the waiting list has been emptied, processing continues to operation block 200, where the following steps are carried out to choose the word string that has the lowest cost. First, a list of all the phrase-network records <w, s, i, j, q, m, c> that extend from an initial node i ∈ I to a final node j ∈ J is compiled. For each record in the list, the cost of machine m stopping in state q, that is, C(stop, q, m), is added. Then the string s of the record with the lowest total cost is selected. If several of these spanning phrases share the minimum cost, one is chosen at random.

Regarding the cost parameters, the present methods do not require any specific interpretation of the various cost parameters for the machine actions or of the lexical and association costs, apart from the general requirement that lower costs correspond to more plausible strings. Methods are known to those skilled in the art for providing specific cost functions applicable to the present methods. Preferably, the cost function is negated log probability. A method for deriving log-probability costs for the methods of the present invention is described below.

The machine actions, lexical selections and association selections are taken as events in a generative model, specifically a probabilistic model for generating word strings. A set of input speech utterance signals is transcribed from data collected for the particular speech recognition application. The method illustrated in FIGURE 4 for generating phrase records is run over the word networks produced by the speech signal processing, keeping a count of the machine actions, lexical machine selections and word-association choices that lead to the string transcribed for each speech utterance signal. Probabilities are then estimated from these counts using standard statistical methods. For each probability P(e) estimated for an event e, the cost of e is set to -log(P(e)).
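In symbols, the relative-frequency version of that estimate, one of the standard methods the text alludes to (smoothed estimators would serve equally well), is:

```latex
\hat{P}(e) = \frac{\operatorname{count}(e)}{\sum_{e'} \operatorname{count}(e')},
\qquad
C(e) = -\log \hat{P}(e)
```

where the sum runs over the events e' that compete with e, for example the alternative actions available in the same machine state.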
It will be appreciated that other methods for estimating probabilities known to those skilled in the art can be used, such as expectation-maximization. In addition, cost functions other than log probability may be used, such as, without limitation, the probability ratio. The probability ratio is the ratio between the number of times in training that a particular machine action or selection leads to a wrong string and the number of times it leads to choosing the transcribed string.

EXAMPLE

FIGURE 5 shows an exemplary sub-phrase derivation produced by the present methods and apparatus for the string "Please show me the cheapest flights from Boston to Newark." The lexical heads associated with the words in this sentence are shown in the figure. The actual transitions required to recognize the string are illustrated in FIGURE 6 as solid lines; a few of the many other possible transitions not taken in this derivation are illustrated as dotted lines. The notation "->" indicates a transition to the right and the notation "<-" indicates a transition to the left. The words under which the machines appear in the lexicon are shown below their start states, that is, q1, q4, q7, and so on.
The transitions taken by the lexical head machines in recognizing the string "Please show me the cheapest flights from Boston to Newark," as illustrated in FIGURE 6, are described below. Start actions are taken for all the words in the string: "please" at q9, "show" at q1, "me" at q8, "the" at q7, "cheapest" at q16, "from" at q10, "Boston" at q14, "to" at q12 and "Newark" at q15. The machines for the following words take stop actions, since there are no possible transitions for them: "please", "me", "the", "cheapest", "Boston" and "Newark". The machine for the word "from" takes a right transition from q10 to q11, consuming the machine for the word "Boston", and stops, forming the complete phrase "from Boston". The machine for the word "to" takes a right transition from q12 to q13, consuming the machine for the word "Newark", and stops, forming the complete phrase "to Newark". This completes the lowest level of the sub-phrase derivation illustrated in FIGURE 5. The machine for the word "flights" takes a left transition from q4 to q5, consuming the machine for the word "cheapest"; a right transition from q5 back to q5, consuming the phrase "from Boston"; another right transition from q5 back to q5, consuming the phrase "to Newark"; a left transition from q5 to q6, consuming the machine for the word "the"; and stops. This completes the recognition of the phrase "the cheapest flights from Boston to Newark," corresponding to the two lowest levels of the sub-phrase derivation illustrated in FIGURE 5. The machine for the word "show" takes a left transition from q1 back to q1, consuming the machine for the word "please"; a right transition from q1 to q2, consuming the machine for the word "me"; a right transition from q2 to q3, consuming the phrase headed by "flights", that is, "the cheapest flights from Boston to Newark"; and stops. This completes the entire derivation illustrated in FIGURE 5 and the recognition of "Please show me the cheapest flights from Boston to Newark."

It will be understood that the embodiments described herein are illustrative of the principles of this invention and that various modifications may occur to, and be implemented by, those skilled in the art without departing from the scope and spirit of this invention. For example, while the embodiments described herein relate to speech recognition, the present methods can be used in other types of language recognition systems. It is noted that, in relation to this date, the best method known to the applicant for carrying out the aforementioned invention is that which is clear from the present description of the invention.
Having described the invention as above, the content of the following claims is claimed as property:

Claims (13)

1. A method for recognizing language, wherein a signal indicative of the language to be recognized is generated, characterized in that it comprises the steps of: generating strings of candidate words for the signal; and selecting among the candidates using a language model that uses a plurality of finite-state machines, each machine having the ability to recognize a pair of sequences, one sequence scanned to the left and the other scanned to the right, and each machine corresponding to a word in a vocabulary of the language model.
2. The method according to claim 1, characterized in that the step of selecting further comprises using only the finite-state machines corresponding to the words contained in the strings of candidate words.
3. The method according to claim 1, characterized in that the step of generating strings of candidate words further comprises generating the strings of words in the form of a word network.
4. The method according to claim 1, characterized in that the step of selecting further comprises using the finite-state machines to derive phrases from the words contained in the strings of candidate words and calculating a cost for the derived phrases, wherein the cost is related to the degree of correspondence between the derived phrase and the language represented by the signal.
5. The method according to claim 4, characterized in that a lower cost indicates a greater degree of correspondence.
6. The method according to claim 5, characterized in that the step of selecting further comprises determining the lowest-cost phrase.
7. The method according to claim 4, characterized in that the cost of a phrase is based on probabilities, wherein associations of less probable words result in higher costs.
8. The method according to claim 4, characterized in that costs are applied incrementally as phrases are derived.
9. The method according to claim 1, characterized in that the language to be recognized is spoken and the step of generating strings of candidate words further comprises applying an acoustic model.
10. A computer-readable storage medium comprising encoded computer-readable program instructions for use in conjunction with a programmable computer, which instructions cause the computer to select a language string from a plurality of language-string hypotheses, the selected string providing the best correspondence to a signal representative of language, wherein the selection results from the action, over a data structure based on the plurality of language-string hypotheses, of a plurality of finite-state machines each capable of recognizing a pair of sequences, one sequence scanned to the left and the other scanned to the right.
11. The computer-readable storage medium according to claim 10, characterized in that the data structure is a phrase network comprising phrases formed by the plurality of finite-state machines.
12. The computer-readable storage medium according to claim 10, characterized in that the language recognition system is a speech recognition system, the language-string hypotheses are presented in the form of a word network, and the finite-state machines that act correspond to the words in the word network.
13. A method of choosing a word string from a plurality of word-string hypotheses, wherein the word-string hypotheses are derived from an input signal representative of language and the selected word string is the one that best represents the language, characterized in that it comprises the steps of: (a) activating state machines corresponding to the words in the word-string hypotheses, wherein the activated state machines are chosen from a plurality of such state machines that define a lexicon, wherein each of the activated state machines is capable of recognizing a pair of sequences, one sequence scanned to the left and the other scanned to the right, and wherein each state machine is further characterized by an initial state; (b) generating a first plurality of phrase records, wherein a phrase record is generated for each word in the word-string hypotheses and each phrase record is characterized by a word, a state machine, the initial state and a cost; (c) generating a phrase network by forming a data structure comprising the phrase records of step (b); (d) generating a plurality of extended phrase records, wherein an extended phrase record is formed when a phrase record within the phrase network consumes an adjacent phrase record in the phrase network by a state-machine transition, wherein the extended phrase record contains the words of both the consuming phrase record and the adjacent phrase record and is characterized by the state machine of the consuming phrase record, a new state corresponding to the state-machine transition, and a new cost, wherein the new cost is the sum of the costs of the consumed phrase and the consuming phrase, a cost associated with the state-machine transition of the consuming phrase, a cost associated with a stop performed by the consumed state machine, and a cost pertaining to an association between the words in the consumed and consuming phrase records; (e) adding the extended phrase record to the phrase network if the cost of the new extended phrase record is less than a reference phrase-record cost; (f) repeating steps (d) and (e), wherein a phrase record can consume an adjacent phrase record, until all the phrase records have been fully extended, and wherein the reference phrase-record cost is updated by the extended phrase records that are added to the phrase network; and (g) selecting the lowest-cost phrase record that spans the entire input signal.
MXPA/A/1997/002521A 1996-04-10 1997-04-07 Method and apparatus for a better language recognition system MXPA97002521A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08/631,874 (US5870706A) 1996-04-10 1996-04-10 Method and apparatus for an improved language recognition system

Publications (2)

Publication Number Publication Date
MX9702521A MX9702521A (en) 1997-10-31
MXPA97002521A (en) 1998-07-03

