CN111651599A - Method and device for sorting candidate voice recognition results - Google Patents
- Publication number: CN111651599A
- Application number: CN202010475597.0A
- Authority
- CN
- China
- Prior art keywords
- candidate
- word
- result information
- candidate result
- combination
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/226—Validation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the application disclose a method and a device for ranking speech recognition candidate results. The method includes: acquiring candidate result information of speech recognition; combining the acquired candidate result information in pairs to generate at least one candidate combination; identifying the common error words of each candidate combination; replacing the common error words included in each piece of candidate result information in the candidate combination with marker words and replacing the word information of the common error words with marker word information, to obtain a replacement candidate result corresponding to each piece of candidate result information and thereby a combination to be classified; obtaining the relative quality of the two replacement candidate results in each combination to be classified by using a pre-trained binary classification model; and ranking the candidate result information according to the relative quality of the two replacement candidate results in each combination to be classified. By obtaining the relative quality of every two replacement candidate results, a quality ranking over all the candidate result information can be determined, yielding an accurate speech recognition result.
Description
Technical Field
The present application relates to the field of speech recognition technology, and in particular, to a method and an apparatus for ranking speech recognition candidate results.
Background
At present, in the process of speech recognition, the speech information is first converted to obtain a plurality of pieces of candidate result information corresponding to it; the obtained candidate result information is then ranked, and the speech recognition result is finally taken from the optimal candidate result information in the ranking result.
However, existing ranking methods for candidate result information produce inaccurate rankings, so the optimal candidate result information selected from the ranking is itself inaccurate, and the speech recognition result deviates considerably from the speech information.
Disclosure of Invention
In view of this, embodiments of the present application provide a method and an apparatus for ranking speech recognition candidate results, which can rank candidate result information more accurately and obtain a more accurate speech recognition result from the ranking result.
To solve the above problem, the technical solutions provided by the embodiments of the present application are as follows:
A method of ranking speech recognition candidate results, the method comprising:
acquiring candidate result information of speech recognition, wherein each piece of candidate result information comprises a recognized text word sequence and word information of each word in the recognized text word sequence;
combining the recognized candidate result information in pairs to generate at least one candidate combination;
identifying a common error word of the candidate combination, wherein a common error word is an error word that both pieces of candidate result information in the candidate combination contain at the same position;
replacing the common error words included in each piece of candidate result information in the candidate combination with marker words and replacing the word information of the common error words with marker word information, to obtain a replacement candidate result corresponding to each piece of candidate result information and thereby a combination to be classified;
obtaining the relative quality of the two replacement candidate results in each combination to be classified by using a pre-trained binary classification model; and
ranking the candidate result information according to the relative quality of the two replacement candidate results in each combination to be classified.
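The claimed steps can be sketched end to end in Python. This is an illustrative skeleton only, not the patented implementation: `compare` stands in for the pre-trained binary classification model (here it simply prefers the candidate with fewer low-probability words), and candidates are ranked by counting pairwise wins.

```python
from itertools import combinations

# Each candidate: list of (word, probability) pairs, a stand-in for the
# "recognized text word sequence" plus its per-word information.
def rank_candidates(candidates, compare):
    """Rank candidates by number of pairwise wins under `compare`."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):  # pairwise combinations
        better = compare(candidates[i], candidates[j])    # binary decision: 0 or 1
        wins[i if better == 0 else j] += 1
    # Sort candidate indices by descending win count.
    return sorted(range(len(candidates)), key=lambda k: -wins[k])

# Toy comparator: fewer words below the probability threshold is "better".
def compare(a, b, threshold=0.5):
    errs = lambda c: sum(1 for _, p in c if p < threshold)
    return 0 if errs(a) <= errs(b) else 1

cands = [
    [("how", 0.9), ("r", 0.2), ("you", 0.8)],    # one low-probability word
    [("how", 0.9), ("are", 0.7), ("you", 0.8)],  # none
]
order = rank_candidates(cands, compare)
print(order)  # the error-free candidate ranks first
```

In the patented method the comparator would be the binary classification model applied to the masked replacement candidate results rather than a word count.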
In one possible implementation, identifying the common error word of the candidate combination comprises:
obtaining the occurrence probability values of the words included in the two pieces of candidate result information in the candidate combination;
determining the words whose probability values are lower than a threshold as error words, and determining the position at which each error word appears in the corresponding candidate result information; and
comparing the error words included in the two pieces of candidate result information, together with the positions at which they appear, and identifying identical error words located at identical positions as the common error words of the candidate combination.
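A minimal sketch of this identification step, under the assumption that each candidate is given as a list of (word, probability) pairs and that the probability threshold is supplied by the caller:

```python
def common_error_words(cand_a, cand_b, threshold=0.5):
    """Return (position, word) pairs that are low-probability error words
    occurring identically, at the same position, in both candidates."""
    # Error words: probability below the threshold, keyed by position.
    errs_a = {i: w for i, (w, p) in enumerate(cand_a) if p < threshold}
    errs_b = {i: w for i, (w, p) in enumerate(cand_b) if p < threshold}
    # Keep only positions where both candidates have the same error word.
    return [(i, w) for i, w in errs_a.items() if errs_b.get(i) == w]

a = [("turn", 0.9), ("of", 0.3), ("the", 0.8), ("lite", 0.2)]
b = [("turn", 0.9), ("of", 0.3), ("a", 0.7), ("light", 0.9)]
print(common_error_words(a, b))  # [(1, 'of')]
```

"of" is an error word at position 1 in both candidates, so only it qualifies; "lite" is an error word in one candidate alone and is kept for the later quality comparison.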
In one possible implementation, obtaining the relative quality of the two replacement candidate results in each combination to be classified by using the pre-trained binary classification model comprises:
acquiring a first feature vector corresponding to the recognized text word sequence in a target replacement candidate result, where the target replacement candidate result is each replacement candidate result in the combination to be classified;
acquiring a second feature vector corresponding to the word information of each word in the target replacement candidate result;
concatenating the first feature vector and the second feature vector to generate the feature representation of the target replacement candidate result; and
inputting the feature representations of the two replacement candidate results in the combination to be classified into the pre-trained binary classification model to obtain the relative quality of the two replacement candidate results in each combination to be classified.
In one possible implementation, the word information includes one or more of an acoustic model score, a language model score, a duration, and a confidence;
acquiring the second feature vector corresponding to the word information of each word in the target replacement candidate result comprises:
inputting the word information of each word in the target replacement candidate result into a fully connected network to obtain the feature vector of each word; and
concatenating the feature vectors of all words in the target replacement candidate result to generate the second feature vector corresponding to the word information of all words in the target replacement candidate result.
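The two steps above might be sketched as follows, with a single randomly initialized dense layer standing in for the fully connected network; the feature dimensions and the ReLU nonlinearity are illustrative assumptions, not specified by the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT = 4, 8  # 4 word-information features; output size is illustrative
W = rng.standard_normal((D_IN, D_OUT))
b = np.zeros(D_OUT)

def word_feature(info):
    """One fully connected layer with ReLU over a word's information vector:
    [acoustic score, language-model score, duration, confidence]."""
    return np.maximum(info @ W + b, 0.0)

def second_feature_vector(word_infos):
    """Concatenate the per-word feature vectors of the whole sequence."""
    return np.concatenate([word_feature(np.asarray(i)) for i in word_infos])

infos = [[-3.1, -2.4, 0.12, 0.9],   # one row of word information per word
         [-5.0, -1.9, 0.30, 0.4]]
vec = second_feature_vector(infos)
print(vec.shape)  # (16,): 2 words x 8 features each
```

In a real system the dense layer's weights would be learned jointly with the binary classification model rather than drawn at random.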
In one possible implementation, inputting the feature representations of the two replacement candidate results in the combination to be classified into the pre-trained binary classification model to obtain their relative quality comprises:
inputting the feature representations of the two replacement candidate results in the combination to be classified into the encoder part of a transformer model to obtain a first hidden-layer vector output by the encoder; and
inputting the first hidden-layer vector into the pre-trained binary classification model to obtain the relative quality of the two replacement candidate results in each combination to be classified.
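As an illustration of this two-stage pipeline, the sketch below replaces the full transformer encoder with a single untrained self-attention layer followed by mean pooling, and uses a logistic output as the binary classifier; all weights are random, so the returned probability is meaningful only in shape, not in value:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
w_out = rng.standard_normal(2 * D) * 0.1  # binary head over both candidates

def encode(x):
    """One self-attention layer standing in for the transformer encoder;
    mean-pools the token vectors into a single hidden-layer vector."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)        # row-wise softmax
    return (attn @ v).mean(axis=0)                  # first hidden-layer vector

def relative_quality(feats_a, feats_b):
    """Probability that candidate A is better than candidate B (untrained)."""
    h = np.concatenate([encode(feats_a), encode(feats_b)])
    return 1.0 / (1.0 + np.exp(-h @ w_out))         # logistic binary classifier

a = rng.standard_normal((5, D))  # 5 tokens of candidate A's feature representation
b = rng.standard_normal((6, D))
p = relative_quality(a, b)
print(0.0 < p < 1.0)
```

A production implementation would use a multi-layer, multi-head transformer encoder with learned weights; the structure of the computation, encode both candidates, then classify, is what this sketch shows.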
In one possible implementation, the training process of the binary classification model comprises:
acquiring speech sample information and a standard recognition text corresponding to the speech sample information;
performing speech recognition on the speech sample information to obtain training candidate result information corresponding to it, wherein each piece of training candidate result information comprises a training recognized text word sequence and word information of each word in that sequence;
determining the piece of training candidate result information with the highest similarity to the standard recognition text as the standard training candidate result information;
combining the standard training candidate result information with each of the other pieces of training candidate result information to generate at least one combination to be trained; and
training the binary classification model using the combinations to be trained together with labels indicating that, in each combination to be trained, the standard training candidate result information is better than the other training candidate result information.
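The construction of training pairs can be sketched as follows; `SequenceMatcher` similarity is an assumption standing in for whatever similarity measure an implementation chooses:

```python
from difflib import SequenceMatcher

def build_training_pairs(reference, candidates):
    """Pick the candidate most similar to the standard recognition text as the
    'standard' training candidate, then pair it with every other candidate.
    Label 1 means the first element of the pair is the better one."""
    sim = lambda c: SequenceMatcher(None, reference, c).ratio()
    standard = max(candidates, key=sim)
    return [((standard, other), 1) for other in candidates if other is not standard]

ref = "turn off the light"
cands = ["turn off the light", "turn of the lite", "turn off a light"]
pairs = build_training_pairs(ref, cands)
print(len(pairs))  # 2 pairs, each labeled 1
```

The pairs and labels would then feed the encoder-plus-classifier training loop, with common error words between the two members of each pair masked first, as in the inference path.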
In one possible implementation, training the binary classification model using the combinations to be trained and the labels comprises:
acquiring a third feature vector corresponding to the recognized text word sequence in target training candidate result information, where the target training candidate result information is, in turn, the standard training candidate result information and the other training candidate result information in the combination to be trained;
acquiring a fourth feature vector corresponding to the word information of each word in the target training candidate result information;
concatenating the third feature vector and the fourth feature vector to generate the feature representation of the target training candidate result information; and
training the binary classification model using the feature representation of the standard training candidate result information, the feature representation of the other training candidate result information, and the label indicating that the former is better than the latter.
An apparatus for ranking speech recognition candidate results, the apparatus comprising:
an acquisition unit configured to acquire candidate result information of speech recognition, each piece of candidate result information including a recognized text word sequence and word information of each word in the recognized text word sequence;
a combination unit configured to combine the recognized candidate result information in pairs to generate at least one candidate combination;
an identification unit configured to identify a common error word of the candidate combination, where a common error word is an error word that both pieces of candidate result information in the candidate combination contain at the same position;
a replacement unit configured to replace the common error words included in each piece of candidate result information in the candidate combination with marker words and replace the word information of the common error words with marker word information, to obtain a replacement candidate result corresponding to each piece of candidate result information and thereby a combination to be classified;
a quality obtaining unit configured to obtain the relative quality of the two replacement candidate results in each combination to be classified by using a pre-trained binary classification model; and
a ranking unit configured to rank the candidate result information according to the relative quality of the two replacement candidate results in each combination to be classified.
In one possible implementation, the identification unit comprises:
a probability value obtaining subunit configured to obtain the occurrence probability values of the words included in the two pieces of candidate result information in the candidate combination;
a determination subunit configured to determine the words whose probability values are lower than a threshold as error words and determine the position at which each error word appears in the corresponding candidate result information; and
an identification subunit configured to compare the error words included in the two pieces of candidate result information, together with the positions at which they appear, and identify identical error words located at identical positions as the common error words of the candidate combination.
In one possible implementation, the quality obtaining unit comprises:
a first obtaining subunit configured to acquire a first feature vector corresponding to the recognized text word sequence in a target replacement candidate result, where the target replacement candidate result is each replacement candidate result in the combination to be classified;
a second obtaining subunit configured to acquire a second feature vector corresponding to the word information of each word in the target replacement candidate result;
a first concatenation subunit configured to concatenate the first feature vector and the second feature vector to generate the feature representation of the target replacement candidate result; and
a quality obtaining subunit configured to input the feature representations of the two replacement candidate results in the combination to be classified into the pre-trained binary classification model to obtain the relative quality of the two replacement candidate results in each combination to be classified.
In one possible implementation, the word information includes one or more of an acoustic model score, a language model score, a duration, and a confidence;
the second obtaining subunit comprises:
a feature vector obtaining subunit configured to input the word information of each word in the target replacement candidate result into a fully connected network to obtain the feature vector of each word; and
a second concatenation subunit configured to concatenate the feature vectors of all words in the target replacement candidate result to generate the second feature vector corresponding to the word information of all words in the target replacement candidate result.
In one possible implementation, the quality obtaining subunit comprises:
a hidden-layer vector obtaining subunit configured to input the feature representations of the two replacement candidate results in the combination to be classified into the encoder part of a transformer model to obtain a first hidden-layer vector output by the encoder; and
a quality determination subunit configured to input the first hidden-layer vector into the pre-trained binary classification model to obtain the relative quality of the two replacement candidate results in each combination to be classified.
In one possible implementation, the training process of the binary classification model comprises:
acquiring speech sample information and a standard recognition text corresponding to the speech sample information;
performing speech recognition on the speech sample information to obtain training candidate result information corresponding to it, wherein each piece of training candidate result information comprises a training recognized text word sequence and word information of each word in that sequence;
determining the piece of training candidate result information with the highest similarity to the standard recognition text as the standard training candidate result information;
combining the standard training candidate result information with each of the other pieces of training candidate result information to generate at least one combination to be trained; and
training the binary classification model using the combinations to be trained together with labels indicating that, in each combination to be trained, the standard training candidate result information is better than the other training candidate result information.
In one possible implementation, training the binary classification model using the combinations to be trained and the labels comprises:
acquiring a third feature vector corresponding to the recognized text word sequence in target training candidate result information, where the target training candidate result information is, in turn, the standard training candidate result information and the other training candidate result information in the combination to be trained;
acquiring a fourth feature vector corresponding to the word information of each word in the target training candidate result information;
concatenating the third feature vector and the fourth feature vector to generate the feature representation of the target training candidate result information; and
training the binary classification model using the feature representation of the standard training candidate result information, the feature representation of the other training candidate result information, and the label indicating that the former is better than the latter.
An apparatus for ranking speech recognition candidate results, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring candidate result information of speech recognition, wherein each piece of candidate result information comprises a recognized text word sequence and word information of each word in the recognized text word sequence;
combining the recognized candidate result information in pairs to generate at least one candidate combination;
identifying a common error word of the candidate combination, wherein a common error word is an error word that both pieces of candidate result information in the candidate combination contain at the same position;
replacing the common error words included in each piece of candidate result information in the candidate combination with marker words and replacing the word information of the common error words with marker word information, to obtain a replacement candidate result corresponding to each piece of candidate result information and thereby a combination to be classified;
obtaining the relative quality of the two replacement candidate results in each combination to be classified by using a pre-trained binary classification model; and
ranking the candidate result information according to the relative quality of the two replacement candidate results in each combination to be classified.
A computer-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the method of ranking speech recognition candidate results described above.
Accordingly, the embodiments of the present application have the following beneficial effects:
In the method for ranking speech recognition candidate results provided by the embodiments of the present application, candidate result information of speech recognition is first acquired and combined in pairs to generate at least one candidate combination. The common error words of each candidate combination are identified, the common error words included in each piece of candidate result information are replaced with marker words, and the word information of the common error words is replaced with marker word information, generating a combination to be classified that includes two replacement candidate results. The relative quality of the two replacement candidate results in each combination to be classified is then obtained with a pre-trained binary classification model, and finally the candidate result information is ranked according to these pairwise results. On the one hand, by comparing the two replacement candidate results in each combination to be classified with the binary classification model, the relative quality of every two replacement candidate results is obtained, a quality ranking over all the replacement candidate results can be determined, and a more accurate ranking of the candidate result information is obtained, and thus an accurate speech recognition result. On the other hand, replacing the common error words of a candidate combination and their word information reduces the influence of common error words on the quality judgment between replacement candidate results, improving the accuracy of that judgment and making the finally determined speech recognition result more accurate.
Drawings
Fig. 1 is a schematic diagram of an exemplary application scenario of a method for ranking speech recognition candidate results according to an embodiment of the present application;
Fig. 2 is a flowchart of a method for ranking speech recognition candidate results according to an embodiment of the present application;
Fig. 3 is a flowchart of a method for obtaining the relative quality of replacement candidate results according to an embodiment of the present application;
Fig. 4 is a schematic diagram of generating the feature representation of a target replacement candidate result according to an embodiment of the present application;
Fig. 5 is a schematic diagram of training the binary classification model according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an apparatus for ranking speech recognition candidate results according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a client according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the aforementioned objects, features, and advantages of the present application more comprehensible, embodiments of the present application are described in detail below with reference to the accompanying drawings.
To facilitate understanding and explanation of the technical solutions provided by the embodiments of the present application, the background of the present application is first described.
In research on conventional speech recognition, the inventors found that in speech recognition systems based in part on hidden Markov models, the speech information is decoded with an n-gram language model to generate possible candidate result information, i.e., the recognition results to which the speech information may correspond. The obtained candidate result information is then ranked by the n-gram language model, and the optimal candidate result information is taken as the final speech recognition result. However, when ranking candidate result information, the n-gram language model mainly predicts the current word from the preceding words, which does not match the processing goal of ranking multiple candidate results to obtain the optimal one. The ranking produced by the n-gram language model is therefore inaccurate, so the speech recognition result derived from the top-ranked candidate result information is not the most accurate one.
In addition, error words contained in candidate result information may distort the judgment of its quality. An existing language model typically computes the probability of the current word from the preceding words, i.e., from the words at earlier positions in a sentence it computes the probability of the words that may appear at later positions. When a word at an earlier position is wrong, the error propagates: the model may judge a wrong word at a later position to be correct merely because it is strongly associated with the earlier wrong word. Conversely, if a later word is correct but only weakly associated with the earlier wrong word, the model may misjudge the correct word as wrong. The language model's judgment of candidate result information therefore contains errors, the resulting ranking of the candidate result information is inaccurate, and the accuracy of the final speech recognition result suffers.
As an example, suppose the same speech information is recognized, the i-th and j-th pieces of candidate result information are segmented into words, and the resulting word sequences are Wi = [wi1, wi2, wi3, wi4, wi5, wi6, wi7] and Wj = [wj1, wj2, wj3, wj4, wj5, wj6, wj7], where wi3, wj3, and wi6 are error words in the two candidate results. If wi3 and wj3 are identical, they are regarded as a common error word of the two pieces of candidate result information. In terms of recognition quality, Wj should rank better than Wi, because Wi contains two error words, wi3 and wi6. However, if Wi and Wj are ranked with an existing language model, the earlier error word wi3 may cause the model to misjudge the later error word wi6 as correct, so that Wi receives the better ranking.
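The replacement step applied to this example can be sketched as follows; the `<unk>` marker token is an illustrative assumption:

```python
def mask_common_errors(seq_a, seq_b, common_positions, marker="<unk>"):
    """Replace the common error words of both candidates with a marker word,
    so the quality comparison hinges on the remaining differences (e.g. wi6)."""
    mask = lambda seq: [marker if i in common_positions else w
                        for i, w in enumerate(seq)]
    return mask(seq_a), mask(seq_b)

wi = ["wi1", "wi2", "wi3", "wi4", "wi5", "wi6", "wi7"]
wj = ["wj1", "wj2", "wj3", "wj4", "wj5", "wj6", "wj7"]
# wi3 == wj3 is the common error word (position 2, zero-based).
mi, mj = mask_common_errors(wi, wj, {2})
print(mi[2], mj[2])  # <unk> <unk>
```

After masking, the common error word can no longer mislead the model's judgment of wi6, which is the effect the patented replacement step aims for; the corresponding word information would be replaced with marker word information in the same way.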
On this basis, the embodiments of the present application provide a method for ranking speech recognition candidate results, including: acquiring candidate result information of speech recognition, where each piece of candidate result information comprises a recognized text word sequence and word information of each word in the sequence; combining the candidate result information in pairs to generate at least one candidate combination; identifying the common error words of each candidate combination; replacing the common error words included in each piece of candidate result information in the candidate combination with marker words and replacing their word information with marker word information, generating a combination to be classified; obtaining the relative quality of the two replacement candidate results in each combination to be classified by using a pre-trained binary classification model; and ranking the candidate result information according to the relative quality of the two replacement candidate results in each combination to be classified.
On the one hand, the two pieces of candidate result information in each combination to be classified are compared by quality with the pre-trained binary classification model, yielding a relative quality for every pair of replacement candidate results. Pairwise comparison therefore produces a more accurate quality ranking of the replacement candidate results, and in turn a more accurate ranking of the candidate result information. On the other hand, because the common error words of each candidate combination and their word information are replaced, the influence of common error words is reduced when the binary classification model judges the relative quality of the replacement candidate results, which improves the accuracy of that judgment and makes the finally determined speech recognition result more accurate.
To facilitate understanding of the method for ranking speech recognition candidate results provided by the embodiments of the present application, an application scenario of the method is described below with reference to Fig. 1. Fig. 1 is a schematic diagram of an exemplary application scenario of the method according to an embodiment of the present application.
In a speech recognition application, the client 101 sends the speech information to be recognized to the server 102. The server 102 performs speech recognition on the speech information, obtains a plurality of pieces of possible candidate result information corresponding to it, ranks them, selects the optimal candidate result information as the final speech recognition result, and sends that result to the client 101. The server 102 may rank the obtained candidate result information with the ranking method for speech recognition candidate results provided by the embodiments of the present application.
Those skilled in the art will appreciate that the schematic diagram of the application scenario shown in fig. 1 is only one example in which embodiments of the present application may be implemented. The application scope of the embodiments of the present application is not limited in any way by the application scenario.
It is noted that client 101 may be hosted by a terminal, which may be any user equipment now existing, under development, or developed in the future that is capable of interacting with the server through any form of wired and/or wireless connection (e.g., Wi-Fi, LAN, cellular, coaxial cable, etc.), including but not limited to: smart wearable devices, smart phones, non-smart phones, tablets, laptop personal computers, desktop personal computers, minicomputers, midrange computers, mainframe computers, and the like. The embodiments of the present application are not limited in any way in this respect. It should also be noted that the server 102 in the embodiment of the present application may be an example of an existing, developing or future developed device capable of providing the ranking of speech recognition candidate results to the client 101. The embodiments of the present application are not limited in any way in this respect.
In order to facilitate understanding of the technical solutions provided by the embodiments of the present application, a method for ranking speech recognition candidate results provided by the embodiments of the present application will be described below with reference to the accompanying drawings.
Referring to fig. 2, the figure is a flowchart of a method for ranking speech recognition candidate results according to an embodiment of the present application.
The method for ranking the speech recognition candidate results provided by the embodiment of the application comprises the following steps S201-S206:
S201: Obtain candidate result information of speech recognition, where each piece of candidate result information comprises a recognition text word sequence and word information of each word in the recognition text word sequence. The recognition text word sequence included in each piece of candidate result information is a word sequence obtained by segmenting the recognition text corresponding to the candidate result.
After a piece of speech information to be recognized is acquired, the speech information can be subjected to preliminary speech recognition to obtain a plurality of candidate results, so that the optimal candidate result is selected from the obtained plurality of candidate results to serve as the speech recognition result. The candidate result information may include a recognition text obtained after the initial speech recognition and related information of the recognition text.
The candidate result information of the speech recognition may include a recognition text word sequence and word information of each word in the recognition text word sequence. The recognition text word sequence may be a sequence of words obtained by segmenting the recognition text corresponding to the candidate result. The word information of each word may include the acoustic model score, language model score, duration, confidence, and other information of the word itself. The acoustic model score may be a probability score with which the acoustic model maps the features corresponding to the word to the corresponding phonemes. The language model score may be a probability score calculated by the language model for the word at its position in the recognized text sequence. The duration may be the time length of the speech to which the word corresponds. The confidence indicates the degree to which the word is a correct recognition result. In addition, in practical applications, when multiple pieces of candidate result information are obtained by recognition, each piece may further carry a serial number identifier; for example, the serial number identifier of the 1st piece of candidate result information is 1, that of the 2nd piece is 2, and so on.
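For concreteness, the structure of a piece of candidate result information described above can be sketched as a small data model (the class and field names below are illustrative, not taken from the patent):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WordInfo:
    """Word information attached to one word of the recognition text."""
    word: str
    am_score: float    # acoustic model score, probability in [0, 1]
    lm_score: float    # language model score, probability in [0, 1]
    duration: float    # duration expressed as a probability value in [0, 1]
    confidence: float  # degree to which the word is a correct result, [0, 1]

@dataclass
class CandidateResult:
    """One candidate result: serial number identifier plus per-word info."""
    serial_id: int
    words: List[WordInfo]

    def text_words(self) -> List[str]:
        # The recognition text word sequence of this candidate.
        return [w.word for w in self.words]

cand = CandidateResult(1, [WordInfo("turn", 0.9, 0.8, 0.5, 0.95),
                           WordInfo("on", 0.7, 0.9, 0.4, 0.9)])
```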
In the embodiment of the application, the candidate result information of the speech recognition is obtained so that the candidate result information can subsequently be ranked and the speech recognition result can be determined from the optimal candidate result information.
S202: and combining the candidate result information obtained by identification in pairs to generate at least one candidate combination.
The candidate combination is obtained by combining any two candidate result information, and in order to make the obtained ranking result of the plurality of candidate result information more accurate, the plurality of candidate result information obtained by identification can be respectively combined pairwise to obtain all possible candidate combinations.
And when the candidate result information is multiple, the number of candidate combinations obtained by pairwise combination of the multiple candidate result information is more than or equal to one.
It should be noted that, in the embodiment of the present application, a method for pairwise combining candidate result information is not limited. In one possible implementation, the method may include: splicing the recognition text word sequence included in each candidate result information and the word information of each word in the recognition text word sequence; and combining the candidate result information after the splicing operation is finished pairwise to obtain a candidate combination.
Correspondingly, a division flag may be inserted between two candidate result information included in each candidate combination for specifying the range of each candidate result information in the candidate combination. For example, for each candidate combination, a start flag may be inserted at the start position of the candidate combination, and an end flag may be inserted at the end position of each candidate result information included in the candidate combination, whereby the range of each candidate result information in the candidate combination may be clarified.
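A minimal sketch of the pairwise combination of S202, with hypothetical start and end flags marking the range of each candidate inside a combination:

```python
from itertools import combinations

START, END = "<s>", "</s>"  # hypothetical division flags

def make_candidate_combinations(candidates):
    """Combine candidate word sequences pairwise: a start flag opens each
    combination, and an end flag closes each candidate inside it, so the
    range of every candidate in the combination stays recoverable."""
    return [[START] + list(a) + [END] + list(b) + [END]
            for a, b in combinations(candidates, 2)]

cands = [["turn", "on", "light"], ["turn", "on", "night"], ["turn", "and", "light"]]
combos = make_candidate_combinations(cands)  # C(3, 2) = 3 combinations
```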
S203: identifying common error words comprised by the candidate combinations; the common error word is an error word which is common to the two candidate result information included in the candidate combination and has the same position.
A common error word may be caused by the environment in which the speech information was acquired or by a pronunciation error when the speech information was produced, so that the same error appears in each of the multiple pieces of candidate result information obtained by speech recognition. Based on this, when both pieces of candidate result information in a candidate combination contain the common error word, attention to the common error word should be reduced, and attention to the other error words in the candidate result information should be increased when judging relative quality.
The embodiment of the present application provides an implementation manner of identifying common error words included in candidate combinations in S203, please refer to the following detailed description.
S204: and replacing the common error words included in each candidate result information in the candidate combination with the mark words, replacing the word information of the common error words with the mark word information, obtaining a replacement candidate result corresponding to each candidate result information, and obtaining the combination to be classified.
In the embodiment of the application, after the common error word of the two candidate result information in the candidate combination is determined, the common error word in each candidate result information is replaced by the marker word aiming at each candidate result information included in the candidate combination, and the word information of the common error word in each candidate result information is replaced by the marker word information. Therefore, the candidate result information included in the candidate combination is converted into the corresponding replacement candidate result, and the combination to be classified corresponding to the candidate combination is obtained. Each combination to be classified comprises two alternative candidate results.
The marker word may be a word without actual meaning, and the marker word information likewise carries no meaningful content, so as to reduce the influence of the common error word on the other words in the candidate result information. It can be understood that each replacement candidate result corresponds to a piece of candidate result information and may also carry the corresponding serial number identifier; for example, the replacement candidate result corresponding to the candidate result information with serial number identifier 1 also has serial number identifier 1, the one corresponding to serial number identifier 2 also has serial number identifier 2, and so on.
As an example, suppose the two pieces of candidate result information in the candidate combination are W_i = [w_i1, w_i2, w_i3, w_i4, w_i5, w_i6, w_i7] and W_j = [w_j1, w_j2, w_j3, w_j4, w_j5, w_j6, w_j7], where w_i3 and w_i6 are errors present in candidate result information W_i, and w_j3 is an error present in candidate result information W_j. Since w_i3 and w_j3 are identical, i.e. an error common to both pieces of candidate result information and located at the same position, w_i3 and w_j3 are the common error word of the two pieces of candidate result information. Accordingly, w_i3 in candidate result information W_i may be replaced with the marker word MARK and the word information of w_i3 with the word information of MARK; likewise, w_j3 in candidate result information W_j may be replaced with the marker word MARK and the word information of w_j3 with the word information of MARK.
The marker word MARK enables the binary classification model to ignore it in the replacement candidate results when the combination to be classified is input, thereby reducing the influence of the common error word on the other words in the candidate result information. The replacement candidate results obtained after the replacement operation are W_i = [w_i1, w_i2, MARK, w_i4, w_i5, w_i6, w_i7] and W_j = [w_j1, w_j2, MARK, w_j4, w_j5, w_j6, w_j7]. These two replacement candidate results constitute a combination to be classified.
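The replacement of S204 on the W_i / W_j example can be sketched as follows, assuming the error-word positions of each candidate are already known:

```python
MARK = "MARK"  # marker word carrying no actual meaning

def replace_common_errors(seq_a, seq_b, errors_a, errors_b):
    """Replace words that are errors in BOTH candidates at the SAME position
    (and identical there) with the marker word MARK."""
    common = {p for p in errors_a & errors_b if seq_a[p] == seq_b[p]}
    ra = [MARK if p in common else w for p, w in enumerate(seq_a)]
    rb = [MARK if p in common else w for p, w in enumerate(seq_b)]
    return ra, rb

Wi = ["wi1", "wi2", "bad", "wi4", "wi5", "wi6", "wi7"]
Wj = ["wj1", "wj2", "bad", "wj4", "wj5", "wj6", "wj7"]
# position 2 is a shared identical error; position 5 is an error only in Wi,
# so it is kept and the classifier can still attend to it
ra, rb = replace_common_errors(Wi, Wj, errors_a={2, 5}, errors_b={2})
```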
S205: and obtaining the quality of the two alternative candidate results in each to-be-classified combination by using a two-classification model obtained by pre-training.
In the embodiment of the application, the two classification models obtained through pre-training can obtain the quality of two alternative candidate results in the combination to be classified. The two classification models are obtained by pre-training and used for outputting the quality of the replacement candidate result in the combination to be classified according to the input combination to be classified. Through the pre-trained binary classification model, the replacement candidate results can be pertinently compared, so that the replacement candidate results are more accurately sorted, and the voice recognition result can be more accurately determined.
The embodiment of the present application provides an implementation manner of a training method for a binary model, please refer to the following detailed description.
It should be noted that the degree of superiority can be used to reflect the accuracy of the two alternative candidates. The accuracy of the better candidate replacement result is higher and the accuracy of the worse candidate replacement result is lower in the two candidate replacement results. The factors influencing the quality of the candidate replacement result are more, and may include a wrong word in the candidate replacement result or a poor model score in the word information.
In a possible implementation manner, when the replacement candidate results forming the combination to be classified have a precedence order, a "1" may be output through the two classification models to indicate that the replacement candidate result with the precedence order is better, and a "0" may be output to indicate that the replacement candidate result with the precedence order is better.
In addition, the embodiment of the present application further provides an implementation manner of the goodness of the two alternative candidate results in each to-be-classified combination described in S205, please refer to the following detailed description.
S206: and sorting the candidate result information according to the goodness of the two replacement candidate results in each to-be-classified combination.
In the embodiment of the application, after the goodness between the replacement candidate results included in all the combinations to be classified is obtained, all the replacement candidate results can be ranked according to the goodness, so that the goodness sequence of the replacement candidate results is obtained. Because the replacement candidate result and the candidate result information have a corresponding relationship, the ranking result of the candidate result information can be obtained from the ranking result of the replacement candidate result. And finally, selecting the optimal candidate result information as a final voice recognition result according to the sorting of the candidate result information.
The method for ranking the plurality of replacement candidate results is not limited in the embodiment of the application.
In a first possible implementation manner, the relative quality of the two replacement candidate results in any one combination to be classified may be examined first to determine the better and the worse of the two. The relative qualities of these replacement candidate results and of the other replacement candidate results in their respective combinations to be classified are then used to extend this ordering. Comparing the replacement candidate results pairwise by analogy finally yields the ranking of all replacement candidate results.
In a second possible implementation manner, when the number of candidate replacement results is large, the candidate replacement results may be divided first to obtain multiple sets of candidate replacement results. And respectively sequencing the replacement candidate results in each set to obtain the optimal replacement candidate result in each set. And then ranking the optimal replacement candidate results in each set, so as to obtain the optimal replacement candidate results in all the replacement candidate results. The manner of sorting the replacement candidate results in each set and sorting the optimal replacement candidate results in each set may refer to the first implementation manner of sorting the plurality of replacement candidate results described above.
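The grouped ranking of the second implementation can be sketched as a small tournament; the pairwise judgment below is a stand-in for the binary classification model (here simply a score comparison):

```python
def best_of(better, items):
    """Linear scan keeping the current winner: len(items) - 1 comparisons."""
    winner = items[0]
    for item in items[1:]:
        if better(item, winner):
            winner = item
    return winner

def best_by_groups(better, items, group_size=4):
    """Split the candidates into groups, pick each group's best, then compare
    the group winners, finding the overall best with few comparisons."""
    groups = [items[i:i + group_size] for i in range(0, len(items), group_size)]
    winners = [best_of(better, g) for g in groups]
    return best_of(better, winners)

# stand-in for the binary classification model: "better" = higher score
scores = {"c1": 0.2, "c2": 0.9, "c3": 0.5, "c4": 0.7, "c5": 0.4}
better = lambda a, b: scores[a] > scores[b]
best = best_by_groups(better, list(scores), group_size=2)  # -> "c2"
```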
Therefore, the optimal candidate replacement result can be obtained through comparison of fewer times, and the candidate result information corresponding to the optimal candidate replacement result is further obtained, so that the voice recognition result is obtained.
Based on the relevant content of the above steps S201 to S206, in the embodiment of the present application, on one hand, the binary classification model is used to judge the relative quality of the two replacement candidate results in each combination to be classified. Since the binary classification model is trained specifically for comparing the quality of candidate result information, the relative quality it produces for each combination to be classified is more accurate; the ranking of the replacement candidate results therefore yields a more accurate ranking of the candidate result information, and thus an accurate speech recognition result. On the other hand, replacing the common error words of the candidate combination and their word information reduces the influence of the common error words on the quality judgment between the replacement candidate results, improves the accuracy of that judgment, and makes the finally determined speech recognition result more accurate.
In a possible implementation manner, the step S203 of identifying the common error word included in the candidate combination may specifically include the following three steps:
a1: and acquiring the occurrence probability value of each word included in the two candidate result information in the candidate combination.
The occurrence probability value of a word may be: and determining the position of the word in the candidate result information, and determining the probability value of the word appearing at the position in the candidate result information according to the position of the word appearing in a large amount of historical texts. In a possible implementation manner, the occurrence probability values of the words may be calculated through a language model or a neural network model, so as to obtain the occurrence probability values of the words included in the two candidate result information.
A2: and determining the words with the probability values lower than the threshold value in the candidate result information as error words, and determining the position of each error word in the corresponding candidate result information.
When the occurrence probability value of a word is lower than the threshold value, the word has only a small probability of appearing at that position in the candidate result information, so the word should be regarded as an error word. The threshold value may be the minimum of the occurrence probability values of correct words.
In the embodiment of the application, whether the probability value of the occurrence of the word is lower than the threshold value or not is judged, and the word with the probability value lower than the threshold value can be determined as the error word, so that the common error word can be determined in the following process.
A3: and comparing the error words respectively included by the two candidate result information in the candidate combination with the positions of the error words appearing in the corresponding candidate result information, and identifying the same error words positioned at the same positions as the common error words corresponding to the candidate combination.
It should be noted that, after obtaining the error words included in the two candidate result information in the candidate combination and the positions of the error words appearing in the corresponding candidate result information, it is necessary to identify whether the same error word appears at the same position of the two candidate result information. The same position of two pieces of candidate result information may refer to a position where the order of words is the same in the candidate result information. It is understood that the number of words and word lengths of candidate result information obtained for the same piece of speech information may vary. When the number and length of each word in the two candidate result information pieces are the same, that is, the structure of the candidate result information pieces and the number of words of the two candidate result information pieces coincide, the same position may be a position in which the order of the number of words in the candidate result information pieces is the same. When the number of words or the length of words of two candidate result information are different, that is, the structures of the two candidate result information or the number of words of the candidate result information are different, the same position may be a position having the same structural role in the candidate result information.
For example, each of the two candidate result information pieces is composed of the structures of noun 1, verb and noun 2, but the lengths of corresponding noun 1 and noun 2 words in the two candidate result information pieces are not the same, and the word number lengths of the two candidate result information pieces are different. The same position may be a verb position having the same structural role, or a noun 1 position having the same structural role, or a noun 2 position having the same structural role, etc. For another example, one candidate result information is composed of the structures of noun 1, verb, and noun 2, and the other candidate result information is composed of the structures of noun 1, verb, preposition, and noun 2. At this time, the word number lengths of the two candidate result information may be different, and the structures may also be different. The same position may be a position having the same structural role, that is, the corresponding position of noun 1, verb and noun 2.
In a possible implementation manner, the error words in the two candidate result information may be obtained first, and then whether the error words in the candidate result information are the error words commonly owned by the two candidate result information may be determined one by one. If the candidate result information has the commonly owned error word, whether the positions of the commonly owned error words are the same or not is judged, and if the positions of the commonly owned error words are the same, the commonly owned error words of the two candidate result information are determined.
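Steps A1-A3 can be sketched as follows, with a toy probability table standing in for the language model of step A1:

```python
def find_error_positions(words, prob_fn, threshold=0.01):
    """A2: positions whose in-context occurrence probability falls below the
    threshold are treated as error words."""
    return {i for i, w in enumerate(words) if prob_fn(w, i, words) < threshold}

def common_error_positions(words_a, words_b, prob_fn, threshold=0.01):
    """A1-A3: an error word shared by both candidates at the same position."""
    errs_a = find_error_positions(words_a, prob_fn, threshold)
    errs_b = find_error_positions(words_b, prob_fn, threshold)
    return {p for p in errs_a & errs_b if words_a[p] == words_b[p]}

# toy probability table standing in for the trained language model of A1
table = {"the": 0.2, "cat": 0.1, "zxq": 0.001, "sat": 0.1, "mat": 0.05}
prob = lambda w, i, seq: table.get(w, 0.0)
wa = ["the", "zxq", "sat"]
wb = ["the", "zxq", "mat"]
common = common_error_positions(wa, wb, prob)  # -> {1}
```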
The same error word at the same position in both pieces of candidate result information is taken as a common error word, and the common error word is replaced to generate the replacement candidate results. The common error word is thus handled before the relative quality of the two replacement candidate results in the combination to be classified is determined, reducing its influence on that judgment.
After the replacement with the marker word and marker word information, the replacement candidate result corresponding to each piece of candidate result information is obtained, yielding the combination to be classified. The combination to be classified is then used to execute step S205: the binary classification model obtained by pre-training determines the relative quality of the two replacement candidate results in each combination to be classified. By replacing the common error word with the marker word, its influence on the other words can be eliminated, so that a more accurate quality judgment of the candidate result information can be obtained.
In one possible embodiment, since the replacement candidate result includes the recognition text word sequence and the word information for recognizing each word in the text word sequence, the feature of the replacement candidate result can be extracted from the perspective of the word sequence and the perspective of the word information, respectively. Therefore, the determination of the degree of superiority of the replacement candidate result can be made by the word sequence and the word information.
Based on this, the embodiment of the present application further provides an implementation manner for obtaining the quality of the replacement candidate results in the to-be-classified combination, that is, in S205, an implementation manner for obtaining the quality of two replacement candidate results in each to-be-classified combination by using the pre-trained binary classification model. Referring to fig. 3, which is a flowchart of a method for obtaining the goodness of a candidate replacement result according to the embodiment of the present application, specifically, the method may include S301 to S304:
s301: and acquiring a first feature vector corresponding to the recognized text word sequence in the target replacement candidate result, wherein the target replacement candidate result is each replacement candidate result in the combination to be classified.
And respectively taking each replacement candidate result in the combination to be classified as a target replacement candidate result to extract the features. And extracting a corresponding first feature vector from the recognized text word sequence in the target replacement candidate result. The first feature vector may be a word vector of each word corresponding to the recognized text word sequence in the target replacement candidate result. By extracting the first feature vector from the recognized text word sequence, the features of the words of the target replacement candidate result in a semantic angle can be obtained.
In a possible implementation manner, the recognized text word sequence in the target replacement candidate result may be input into the Embedding module, and a word vector corresponding to each word in the recognized text word sequence is output. The Embedding module is used for outputting word vectors corresponding to the words according to the input word sequence, and the output word vectors are characteristic representations of the words in the aspect of semantics.
S302: and acquiring a second feature vector corresponding to the word information of each word in the target replacement candidate result.
And extracting corresponding second feature vectors from the word information of each word in the target replacement candidate result. The second feature vector may be a feature vector of word information, and the word features of the target replacement candidate result in the speech angle may be obtained by extracting the second feature vector from the word information of each word.
The word information may include one or more of the acoustic model score, language model score, duration, confidence, and other information the word itself has. The acoustic model score and the language model score are probability scores of the word obtained through the acoustic model and the language model, with a value range of [0, 1]. The duration is the time length of the pronunciation corresponding to the word; it can be represented by the probability value of that time length in the word duration distribution, with a value range of [0, 1]. The confidence measures the degree to which the word is a correct recognition result and can be represented by a value in the range [0, 1].
The obtaining of the second feature vector corresponding to the word information of each word in the target replacement candidate result may specifically include the following two steps B1-B2:
b1: and inputting the word information of each word in the target replacement candidate result into the full-connection network to obtain the feature vector of each word in the target replacement candidate result.
In the embodiment of the present application, the word information includes one or more of the information that the word itself has, such as the acoustic model score, the speech model score, the duration, the confidence coefficient, and the like. And inputting the word information of each word into the full-connection network to obtain the feature vector of each word, wherein the feature vector corresponds to different information in the word information. It should be noted that the fully connected network may be DNN (Deep Neural Networks).
As an example, when the word information of each word in the target replacement candidate result includes an acoustic model score, a speech model score, a duration and a confidence, the word information of each word in the target replacement candidate result is input into the full-connection network, and feature vectors corresponding to the acoustic model score, the speech model score, the duration and the confidence of each word are respectively obtained.
B2: and splicing the feature vectors of all words in the target replacement candidate result to generate a second feature vector corresponding to the word information of all words in the target replacement candidate result.
In the embodiment of the present application, a feature vector is extracted from the word information of each word; since the word information may contain one or more kinds of information, there may be one or more corresponding feature vectors. Concatenating the feature vectors of all the words yields the second feature vector corresponding to the word information of the words as a whole.
S303: and splicing the first feature vector and the second feature vector to generate feature representation of the target replacement candidate result.
The first feature vector and the second feature vector are feature vectors of the replacement candidate result extracted from different angles; splicing them yields the feature representation corresponding to the target replacement candidate result. It should be noted that, to facilitate splicing, the dimensions of the first feature vector and the second feature vector should be the same.
The obtained feature representation of the target replacement candidate result can then be input into the binary classification model to determine the relative quality of the replacement candidate results.
Referring to fig. 4, the figure is a schematic diagram of generating a feature representation of a target replacement candidate result according to an embodiment of the present application.
The recognized text word sequence in the target replacement candidate result is input into the Embedding module, which outputs the word vector corresponding to each word in the sequence, namely the first feature vector. The word information of each word in the target replacement candidate result, comprising the acoustic model score, language model score, duration and confidence, is input into the fully connected network to obtain the feature vector of each word; the feature vectors of the words are spliced to generate the second feature vector corresponding to the word information of the words. The first feature vector and the second feature vector are then spliced to obtain the feature representation of the target replacement candidate result.
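The feature extraction of fig. 4 can be sketched numerically; the embedding table and the single fully connected (ReLU) layer below are illustrative stand-ins for the Embedding module and the fully connected network, with randomly initialized rather than trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {"turn": 0, "on": 1, "light": 2, "MARK": 3}
EMB_DIM = INFO_DIM = 4

emb_table = rng.normal(size=(len(VOCAB), EMB_DIM))  # stand-in Embedding module
W_fc = rng.normal(size=(4, INFO_DIM))               # stand-in fully connected net
b_fc = np.zeros(INFO_DIM)

def feature_representation(words, word_infos):
    """First feature vector: word embeddings of the recognized word sequence.
    Second feature vector: word info (AM score, LM score, duration, confidence)
    projected through a fully connected layer. The two are spliced per word
    into the candidate's feature representation."""
    first = emb_table[[VOCAB[w] for w in words]]      # (T, EMB_DIM)
    infos = np.asarray(word_infos, dtype=float)       # (T, 4)
    second = np.maximum(infos @ W_fc + b_fc, 0.0)     # ReLU FC layer, (T, INFO_DIM)
    return np.concatenate([first, second], axis=-1)   # (T, EMB_DIM + INFO_DIM)

words = ["turn", "on", "MARK"]
infos = [[0.9, 0.8, 0.5, 0.95], [0.7, 0.9, 0.4, 0.9], [0.0, 0.0, 0.0, 0.0]]
feat = feature_representation(words, infos)
```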
S304: and inputting the feature representation of the two replacement candidate results in the combination to be classified into a two-classification model obtained by pre-training to obtain the quality of the two replacement candidate results in each combination to be classified.
The binary model is obtained by pre-training and is used for determining the quality of two alternative candidate results in the input combination to be classified. The two classification models can represent the quality of the two replacement candidate results in the corresponding combination to be classified according to the characteristics of the two replacement candidate results in the input combination to be classified.
The method includes inputting a feature representation of two candidate replacement results in a combination to be classified into a pre-trained binary classification model to obtain the quality of the two candidate replacement results in each combination to be classified, and specifically includes the following two steps C1-C2:
c1: and representing the characteristics of the two replacement candidate results in the combination to be classified into an encoder part of the input converter model to obtain a first hidden layer vector output by the encoder part of the input converter model.
The encoder part of the Transformer may consist of a self-attention module and a fully connected module. By inputting the feature representations of the two replacement candidate results into the encoder, both feature representations are associated with the first hidden-layer vector output by the encoder, so that the first hidden-layer vector contains information related to the feature representations of both replacement candidate results. This first hidden-layer vector can then be input into the binary classification model to obtain the relative quality of the two replacement candidate results.
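A minimal numerical sketch of such an encoder (a single self-attention layer followed by a fully connected module, with arbitrary random weights; the real encoder's depth, width, and parameters are not specified here) might look like:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # model width (assumed)

# Random stand-in weights for query/key/value projections and the FC module.
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
W_ff = rng.normal(size=(D, D)) / np.sqrt(D)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder(H):
    """H: (seq_len, D) spliced feature representations of both candidates.
    Returns the hidden vector at position 0 (the [CLS] slot)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    # Self-attention: every position attends over both candidates' features.
    attn = softmax(Q @ K.T / np.sqrt(D)) @ V
    out = np.maximum(attn @ W_ff, 0.0)  # fully connected module with ReLU
    return out[0]                       # first hidden-layer vector

H = rng.normal(size=(10, D))
h_cls = encoder(H)
```

Because attention mixes all positions, the returned vector depends on both candidates' features, which is the property the method relies on.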
C2: and inputting the first hidden layer vector into a pre-trained binary model to obtain the quality of two alternative candidate results in each to-be-classified combination.
Because the first hidden-layer vector contains information related to the feature representations of both replacement candidate results, inputting it into the binary classification model yields the relative quality of the two replacement candidate results in the combination to be classified.
The binary classification model may be formed by a fully connected network, and in one possible implementation, the relative quality of the two replacement candidate results may be determined from the model's output. For example, when the two candidate result information in the combination to be classified have a fixed order, an output of "1" may indicate that the first candidate result is better, and an output of "0" may indicate that the second candidate result is better.
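Under the output convention just described, a toy fully connected head could be sketched as follows (the sigmoid head and the 0.5 threshold are illustrative assumptions, not part of the claimed method):

```python
import numpy as np

def classify(h_cls, w, b):
    """Fully connected binary head over the [CLS] hidden vector:
    output 1 means the first candidate in the pair is better,
    0 means the second is better."""
    score = 1.0 / (1.0 + np.exp(-(float(np.dot(h_cls, w)) + b)))  # sigmoid
    return 1 if score >= 0.5 else 0

def better_candidate(label):
    """Map the model's output back to which candidate won."""
    return "first" if label == 1 else "second"

# Toy weights; h_cls @ w = 2.0 - 1.0 = 1.0, sigmoid(1.0) > 0.5.
label = classify(np.array([1.0, -1.0]), np.array([2.0, 1.0]), 0.0)
```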
Based on the above, in the embodiment of the present application, the feature representation of the target replacement candidate result is obtained by acquiring the first feature vector corresponding to its recognized text word sequence and the second feature vector corresponding to the word information of each word, and splicing the two. The feature representations of the two replacement candidate results in a combination to be classified are then input into the binary classification model to obtain their relative quality. Because the first and second feature vectors capture the target replacement candidate result from two different angles, the spliced feature representation allows the binary classification model to judge the relative quality of the two replacement candidate results more accurately.
To facilitate understanding of the above method for obtaining the relative quality of the two replacement candidate results in each combination to be classified using the pre-trained binary classification model, an example is described below.
Referring to fig. 5, a schematic diagram of a binary classification model provided in an embodiment of the present application is shown. Given two pieces of candidate result information Wi = [wi1, wi2, wi3, wi4, wi5, wi6, wi7] and Wj = [wj1, wj2, wj3, wj4, wj5, wj6, wj7], where wi3 and wj3 are the common error word, the common error word is replaced with the marker word MARK. The resulting combination to be classified is W2 = [[CLS], wi1, wi2, MARK, wi4, wi5, wi6, wi7, [SEQ], wj1, wj2, MARK, wj4, wj5, wj6, wj7, [SEQ]]. W2 is input into the Embedding module to obtain the first feature vector corresponding to W2, and into the fully connected network to obtain the second feature vector corresponding to W2. The two vectors are spliced to obtain the feature representation H2 = [h[CLS], hi1, hi2, hMARK, hi4, hi5, hi6, hi7, h[SEQ], hj1, hj2, hMARK, hj4, hj5, hj6, hj7, h[SEQ]]. The feature representation is input into the encoder part of the Transformer model to obtain the hidden vector of [CLS]. Since the hidden vector of [CLS] carries information related to the other feature representations, it is input into the binary classification model, and the relative quality of the two replacement candidate results is obtained from the model's output. Replacing the common error word and its word information with the marker word and marker word information reduces the influence of the common error word on judging which of the two replacement candidate results is better.
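The construction of the combination to be classified in this example can be sketched as follows (token names [CLS], [SEQ], and MARK follow the example; the helper function itself is hypothetical):

```python
def build_pair_sequence(cand_i, cand_j, common_error_positions):
    """Replace shared error words (same word, same position in both
    candidates) with MARK, then splice: [CLS] cand_i [SEQ] cand_j [SEQ]."""
    def masked(cand):
        return [("MARK" if p in common_error_positions else w)
                for p, w in enumerate(cand)]
    return ["[CLS]"] + masked(cand_i) + ["[SEQ]"] + masked(cand_j) + ["[SEQ]"]

# Shortened versions of Wi and Wj from the example; position 2 holds
# the common error word (wi3/wj3 in the figure).
Wi = ["wi1", "wi2", "wi3", "wi4"]
Wj = ["wj1", "wj2", "wj3", "wj4"]
pair = build_pair_sequence(Wi, Wj, {2})
```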
Based on the above, the relative quality of the two replacement candidate results in a combination to be classified can be obtained through the binary classification model, which is obtained by pre-training. In view of this, the embodiment of the present application provides a training method for the binary classification model, which specifically includes the following five steps, D1-D5:
d1: and acquiring voice sample information and a standard identification text corresponding to the voice sample information.
The voice sample information may be relatively standard voice information suitable for use as a voice sample, for example, voice sample information from an acoustic model training set. The standard recognition text corresponding to the voice sample information is acquired in order to determine more accurate standard training candidate result information.
Using voice sample information for speech recognition has two advantages. On the one hand, because the voice sample information is relatively standard, a relatively good recognition result can be obtained, yielding training candidate result information with relatively high accuracy. On the other hand, the voice sample information has a corresponding standard recognition text, which can be used to select the most standard training candidate result information and to label it accordingly, so that a binary classification model with higher performance can be obtained through training.
D2: and performing voice recognition on the voice sample information to obtain training candidate result information corresponding to the voice sample information, wherein the training candidate result information comprises a training recognition text word sequence and word information of each word in the training recognition text word sequence.
By performing speech recognition on the voice sample information, the corresponding training candidate result information can be obtained, including the training recognition text word sequence and the word information of each word in that sequence. The training recognition text word sequence may be the ordered sequence of words constituting the training recognition text in the training candidate result. The word information of each word may include the acoustic model score, the language model score, the duration, the confidence, and other information related to the word itself. The obtained training candidate result information can then form combinations to be trained, which serve as training data for the binary classification model.
The embodiment of the present application does not limit the method of performing speech recognition on the voice sample information to obtain the corresponding training candidate result information; in one possible implementation, a speech recognition system based on a hidden Markov model may be used.
D3: and determining the training candidate result information with the highest similarity to the standard recognition text as the standard training candidate result information.
Speech recognition on the voice sample information may produce multiple pieces of training candidate result information, and the piece with the highest similarity to the standard recognition text is taken as the standard training candidate result information. The similarity may measure how close the training candidate result information is to the standard recognition text in terms of word information, word order, and the like. In the embodiment of the present application, the similarity between the standard recognition text and each piece of training candidate result information may be calculated, and the piece with the highest similarity may be used as the standard training candidate result information.
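As noted, the application does not fix a particular similarity measure; one possible sketch uses the matching-subsequence ratio from Python's standard library to pick the standard training candidate:

```python
from difflib import SequenceMatcher

def pick_standard(candidates, reference):
    """Return the candidate word sequence most similar to the standard
    recognition text. The similarity metric here is an assumption; any
    word-level similarity (e.g. edit distance) would serve the same role."""
    return max(candidates,
               key=lambda c: SequenceMatcher(None, c, reference).ratio())

ref = ["turn", "on", "the", "light"]          # hypothetical standard text
cands = [["turn", "on", "the", "night"],
         ["turn", "on", "the", "light"],
         ["burn", "on", "a", "light"]]
best = pick_standard(cands, ref)
```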
By obtaining the standard training candidate result information, the candidate result information can be labeled according to relative quality, thereby producing the data needed to train the binary classification model.
D4: and respectively combining the standard training candidate result information with other training candidate result information to generate at least one combination to be trained.
The standard training candidate result information is the piece of training candidate result information with the highest similarity to the standard recognition text, and may therefore be treated as ranking higher than the other training candidate result information.
Combining the standard training candidate result information with each of the other training candidate result information yields at least one combination to be trained. In the embodiment of the present application, a combination to be trained can be obtained by splicing the standard training candidate result information with another piece of candidate result information. After the combination to be trained is obtained, a label can be attached indicating that the standard training candidate result information in the combination is the better of the two.
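The pairing and labeling step can be sketched as follows (the convention that label 1 marks the first element of the pair as better is an assumption carried over from the earlier example):

```python
def make_training_pairs(standard, others):
    """Pair the standard candidate with every other candidate; label 1
    means the first (standard) element of the pair is the better one."""
    return [((standard, other), 1) for other in others]

# Hypothetical candidate identifiers standing in for full result records.
pairs = make_training_pairs("cand_a", ["cand_b", "cand_c"])
```

A real pipeline would likely also swap pair order for half the examples (with label 0) so the model does not learn that the first position always wins; the application's text does not address this.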
D5: and training to obtain the two classification models by utilizing the combination to be trained and the label of which the standard training candidate result information is superior to the training candidate result information in the combination to be trained.
The binary classification model is trained using, as training data, the combinations to be trained (obtained by combining the standard training candidate result information with the other training candidate result information) together with labels indicating that the standard training candidate result information is the better of the two. The resulting binary classification model can output the relative quality of the candidate result information in an input combination to be classified. In the embodiment of the present application, the binary classification model may be composed of a fully connected network.
Training the binary classification model using the combinations to be trained and the labels indicating that the standard training candidate result information is superior specifically includes the following four steps, E1-E4:
e1: and acquiring a third feature vector corresponding to the recognition text word sequence in target training candidate result information, wherein the target training candidate result information is standard training candidate result information and training candidate result information in the combination to be trained respectively.
Feature extraction is performed with the standard training candidate result information and the other training candidate result information in the combination to be trained each serving as the target training candidate result information. Extracting features from the recognized text word sequence in the target training candidate result information yields the third feature vector, which may be the word vector of each word in that sequence.
In the embodiment of the application, the third feature vector corresponding to the recognition text word sequence in the target training candidate result information can be obtained by inputting the recognition text word sequence in the target training candidate result information into the Embedding module.
E2: and acquiring a fourth feature vector corresponding to the word information of each word in the target training candidate result information.
The fourth feature vector is obtained by extracting features from the word information of each word, where the word information may include one or more kinds of information about the word itself, such as the acoustic model score, the language model score, the duration, and the confidence.
It should be noted that, since the word information may include one or more kinds of information, in one possible implementation the word information of each word in the target training candidate result information may be input into the fully connected network to obtain a feature vector for each word, extracted from the word information corresponding to that word. The feature vectors of all the words in the target training candidate result information are then spliced to generate the fourth feature vector, i.e., the feature vector corresponding to the word information of all the words.
E3: and splicing the third feature vector and the fourth feature vector to generate feature representation of the target training candidate result information.
The third feature vector and the fourth feature vector are features of the training candidate result information extracted from different angles; splicing them yields the feature representation corresponding to the target training candidate result information.
E4: and training to obtain a binary model by utilizing the characteristic representation of the result information of the standard training candidate in the combination to be trained, the characteristic representation of the result information of the training candidate in the combination to be trained and the label of the result information of the standard training candidate in the combination to be trained, which is superior to the result information of the training candidate.
The binary classification model is trained on a large amount of training data, each instance consisting of the feature representation of the standard training candidate result information in a combination to be trained, the feature representation of the other training candidate result information in that combination, and the label indicating that the standard training candidate result information is superior. The binary classification model may be constructed from a fully connected network. Once trained, it can output the relative quality of the two replacement candidate results in a combination to be classified from their input feature representations.
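For illustration, a fully connected binary head of this kind can be trained with plain logistic regression on toy data; the features, labels, learning rate, and step count below are all stand-ins for the real [CLS] hidden vectors and pairwise labels:

```python
import numpy as np

rng = np.random.default_rng(2)

def train_binary_head(X, y, lr=0.5, steps=500):
    """Logistic-regression head standing in for the fully connected
    binary classifier; X rows play the role of [CLS] hidden vectors,
    y the 'first candidate is better' labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability
        grad = p - y                            # cross-entropy gradient
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy separable data: the first feature alone decides the label.
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(float)
w, b = train_binary_head(X, y)
preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) >= 0.5).astype(float)
acc = (preds == y).mean()
```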
In the embodiment of the present application, training candidate result information corresponding to the voice sample information is obtained by performing speech recognition on the voice sample information. The training candidate result information with the highest similarity to the standard recognition text corresponding to the voice sample information is determined as the standard training candidate result information, which is combined with each of the other training candidate result information to generate at least one combination to be trained. The binary classification model is then trained using the obtained combinations to be trained and the labels indicating that the standard training candidate result information is superior, yielding a model that outputs the relative quality of the two replacement candidate results in a combination to be classified from their input feature representations.
To facilitate understanding of the above method for training the binary classification model, an example is described below.
Voice sample information is obtained and speech recognition is performed on it to obtain the corresponding training candidate result information. The training candidate result information with the highest similarity to the standard recognition text is determined as the standard training candidate result information, which is then combined with each of the other training candidate result information to obtain at least one combination to be trained. For example, one combination to be trained may be obtained by combining the standard training candidate result information Wa = [wa1, wa2, wa3, wa4, wa5, wa6, wa7] with the training candidate result information Wb = [wb1, wb2, wb3, wb4, wb5, wb6, wb7]. When combining, a [CLS] identifier is added at the beginning of the combination to be trained, for subsequent use of [CLS] in training the binary classification model, and [SEQ] is added at the end of Wa and at the end of Wb to mark the ends of the standard training candidate result information and the training candidate result information. The resulting combination to be trained is W3 = [[CLS], wa1, wa2, wa3, wa4, wa5, wa6, wa7, [SEQ], wb1, wb2, wb3, wb4, wb5, wb6, wb7, [SEQ]]. W3 is input into the Embedding module to obtain the third feature vector corresponding to W3, and into the fully connected network to obtain the fourth feature vector corresponding to W3; the two are spliced to obtain the feature representation H3 = [h[CLS], ha1, ha2, ha3, ha4, ha5, ha6, ha7, h[SEQ], hb1, hb2, hb3, hb4, hb5, hb6, hb7, h[SEQ]]. The feature representation is input into the encoder part of the Transformer model to obtain the hidden vector of [CLS], which carries information related to the other feature representations. The obtained [CLS] hidden vector, together with the label indicating that Wa is superior to Wb in the combination to be trained, is used as training data for training the binary classification model.
Based on the above method embodiment, the present application further provides a device for sorting speech recognition candidate results, which will be described below with reference to the accompanying drawings.
Referring to fig. 6, which is a structural diagram of an apparatus for sorting speech recognition candidate results according to an embodiment of the present application, as shown in fig. 6, the apparatus may include:
an acquisition unit 601 configured to acquire candidate result information of speech recognition, each of the candidate result information including a recognition text word sequence and word information of each word in the recognition text word sequence;
a combining unit 602, configured to combine every two candidate result information obtained by identification to generate at least one candidate combination;
a recognition unit 603 configured to recognize a common error word of the candidate combination, where the common error word is an error word that is common to two candidate result information included in the candidate combination and has a same position;
a replacing unit 604, configured to replace a common error word included in each candidate result information in the candidate combination with a flag word, replace word information of the common error word with flag word information, obtain a replacement candidate result corresponding to each candidate result information, and obtain a to-be-classified combination;
a quality obtaining unit 605, configured to obtain, by using a pre-trained binary classification model, the relative quality of the two replacement candidate results in each combination to be classified;
and a sorting unit 606, configured to sort the candidate result information according to the relative quality of the two replacement candidate results in each combination to be classified.
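Once a pairwise comparator is available, the sorting unit's job reduces to ordering candidates by pairwise outcomes; the sketch below uses a hypothetical score-based comparator standing in for the binary classification model:

```python
from functools import cmp_to_key

def rank_candidates(candidates, first_is_better):
    """Sort candidates using a pairwise judgment as a comparator.
    `first_is_better(a, b)` stands in for the binary classification
    model: True when a is judged better than b."""
    cmp = lambda a, b: -1 if first_is_better(a, b) else 1
    return sorted(candidates, key=cmp_to_key(cmp))

# Toy stand-in: the candidate with the higher score "wins" each comparison.
scores = {"A": 0.9, "B": 0.5, "C": 0.7}
ranked = rank_candidates(list(scores), lambda a, b: scores[a] > scores[b])
```

Note that pairwise judgments from a learned model are not guaranteed to be transitive, so a production ranker may need tie-breaking or vote aggregation beyond a plain comparison sort.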
Optionally, the identifying unit 603 includes:
a probability value obtaining subunit, configured to obtain occurrence probability values of words included in two pieces of candidate result information in the candidate combination;
a determining subunit, configured to determine, as an error word, a word with a probability value lower than a threshold value appearing in the candidate result information, and determine a position where each error word appears in the corresponding candidate result information;
and the identifying subunit is used for comparing the error words respectively included in the two candidate result information in the candidate combination with the positions of the error words appearing in the corresponding candidate result information, and identifying the same error words positioned at the same positions as the common error words corresponding to the candidate combination.
Optionally, the quality obtaining unit 605 includes:
a first obtaining subunit, configured to obtain a first feature vector corresponding to the recognition text word sequence in a target replacement candidate result, where the target replacement candidate result is each replacement candidate result in the combination to be classified;
a second obtaining subunit, configured to obtain a second feature vector corresponding to word information of each word in the target replacement candidate result;
a first splicing subunit, configured to splice the first feature vector and the second feature vector to generate a feature representation of the target replacement candidate result;
and a quality obtaining subunit, configured to input the feature representations of the two replacement candidate results in the combination to be classified into the pre-trained binary classification model to obtain the relative quality of the two replacement candidate results in each combination to be classified.
Optionally, the word information includes one or more of an acoustic model score, a language model score, a duration, and a confidence;
the second acquisition subunit includes:
a feature vector obtaining subunit, configured to input word information of each word in the target replacement candidate result into a full-connection network, so as to obtain a feature vector of each word in the target replacement candidate result;
and the second splicing subunit is used for splicing the feature vectors of the words in the target replacement candidate result to generate a second feature vector corresponding to the word information of the words in the target replacement candidate result.
Optionally, the quality obtaining subunit includes:
a hidden-layer vector acquisition subunit, configured to input the feature representations of the two replacement candidate results in the combination to be classified into the encoder part of the Transformer model to obtain a first hidden-layer vector output by the encoder part of the Transformer model;
and a quality determining subunit, configured to input the first hidden-layer vector into the pre-trained binary classification model to obtain the relative quality of the two replacement candidate results in each combination to be classified.
Optionally, the training process of the binary classification model includes:
acquiring voice sample information and a standard identification text corresponding to the voice sample information;
performing voice recognition on the voice sample information to obtain training candidate result information corresponding to the voice sample information, wherein the training candidate result information comprises a training recognition text word sequence and word information of each word in the training recognition text word sequence;
determining the training candidate result information with the highest similarity to the standard recognition text as standard training candidate result information;
respectively combining the standard training candidate result information with other training candidate result information to generate at least one combination to be trained;
and training the binary classification model by using the combinations to be trained and the labels indicating that the standard training candidate result information in each combination is superior to the other training candidate result information.
Optionally, the training of the binary classification model by using the combinations to be trained and the labels indicating that the standard training candidate result information is superior includes:
acquiring a third feature vector corresponding to the recognition text word sequence in target training candidate result information, wherein the target training candidate result information is standard training candidate result information and training candidate result information in the combination to be trained respectively;
acquiring a fourth feature vector corresponding to word information of each word in the target training candidate result information;
splicing the third feature vector and the fourth feature vector to generate feature representation of the target training candidate result information;
and training the binary classification model by using the feature representation of the standard training candidate result information in the combination to be trained, the feature representation of the other training candidate result information in the combination, and the label indicating that the standard training candidate result information is superior.
Fig. 7 shows a block diagram of a client 1200. For example, client 1200 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and so forth.
Referring to fig. 7, client 1200 may include one or more of the following components: processing component 1202, memory 1204, power component 1206, multimedia component 1208, audio component 1210, input/output (I/O) interface 1212, sensor component 1214, and communications component 1216.
The processing component 1202 generally controls overall operation of the client 1200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 1202 may include one or more processors 1220 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 1202 can include one or more modules that facilitate interaction between the processing component 1202 and other components. For example, the processing component 1202 can include a multimedia module to facilitate interaction between the multimedia component 1208 and the processing component 1202.
The memory 1204 is configured to store various types of data to support operations at the client 1200. Examples of such data include instructions for any application or method operating on client 1200, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1204 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 1206 provides power to the various components of the client 1200. The power components 1206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the client 1200.
The multimedia component 1208 includes a screen that provides an output interface between the client 1200 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1208 includes a front camera and/or a rear camera. When the client 1200 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
Audio component 1210 is configured to output and/or input audio signals. For example, audio component 1210 includes a Microphone (MIC) configured to receive external audio signals when client 1200 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 1204 or transmitted via the communication component 1216. In some embodiments, audio assembly 1210 further includes a speaker for outputting audio signals.
The I/O interface 1212 provides an interface between the processing component 1202 and peripheral interface modules, which may be keyboards, click wheels, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 1214 includes one or more sensors for providing state assessments of various aspects of the client 1200. For example, the sensor component 1214 may detect the open/closed state of the client 1200 and the relative positioning of components, such as the display and keypad of the client 1200; the sensor component 1214 may also detect a change in position of the client 1200 or one of its components, the presence or absence of user contact with the client 1200, the orientation or acceleration/deceleration of the client 1200, and a change in its temperature. The sensor component 1214 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 1214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 1214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communications component 1216 is configured to facilitate communications between the client 1200 and other devices in a wired or wireless manner. The client 1200 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1216 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 1216 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the client 1200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components, for performing the following method:
acquiring candidate result information of speech recognition, wherein each piece of candidate result information includes a recognized text word sequence and word information of each word in the recognized text word sequence;
pairwise combining the acquired candidate result information to generate at least one candidate combination;
identifying a common error word of the candidate combination, wherein the common error word is an error word that appears at the same position in both pieces of candidate result information included in the candidate combination;
replacing the common error word included in each piece of candidate result information in the candidate combination with a marker word, and replacing the word information of the common error word with marker word information, to obtain a replacement candidate result corresponding to each piece of candidate result information and thereby a combination to be classified;
determining, by using a pre-trained binary classification model, which of the two replacement candidate results in each combination to be classified is better;
and ranking the candidate result information according to the relative quality of the two replacement candidate results in each combination to be classified.
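The ranking flow described above can be sketched in Python; the function names, data layout, and win-count aggregation below are illustrative assumptions, not part of the disclosure:

```python
from itertools import combinations

def rank_candidates(candidates, find_common_errors, mask_common_errors, compare):
    """Hypothetical sketch of the ranking flow.

    candidates: list of candidate results (word sequence plus per-word info).
    find_common_errors(a, b): returns the common error words of a pair.
    mask_common_errors(c, common): replaces common error words with markers.
    compare(a, b): binary classifier returning True if a is better than b.
    """
    wins = {i: 0 for i in range(len(candidates))}
    # Pairwise combine the candidates, mask shared errors, and compare.
    for i, j in combinations(range(len(candidates)), 2):
        common = find_common_errors(candidates[i], candidates[j])
        a = mask_common_errors(candidates[i], common)
        b = mask_common_errors(candidates[j], common)
        if compare(a, b):
            wins[i] += 1
        else:
            wins[j] += 1
    # Rank by how often each candidate won its pairwise comparisons.
    order = sorted(range(len(candidates)), key=lambda i: wins[i], reverse=True)
    return [candidates[i] for i in order]
```

Aggregating pairwise wins is one plausible way to turn the pairwise "which is better" outputs into a total order; the disclosure itself only states that the candidates are ranked according to the pairwise results.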
Optionally, the identifying a common error word of the candidate combination includes:
obtaining occurrence probability values of the words included in the two pieces of candidate result information in the candidate combination;
determining words whose probability values are lower than a threshold as error words, and determining the position of each error word in the corresponding candidate result information;
and comparing the error words included in the two pieces of candidate result information in the candidate combination, together with the positions at which they appear, and identifying identical error words located at identical positions as the common error words of the candidate combination.
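A minimal sketch of the common-error-word step, assuming each candidate is a list of (word, probability) pairs; the threshold value is a placeholder, not one fixed by the disclosure:

```python
def find_common_error_words(cand_a, cand_b, threshold=0.5):
    """Identify words that are error words in both candidates at the same position.

    A word whose occurrence probability falls below the threshold is
    treated as an error word; a common error word is the same error
    word at the same position in both candidates.
    """
    errors_a = {i: w for i, (w, p) in enumerate(cand_a) if p < threshold}
    errors_b = {i: w for i, (w, p) in enumerate(cand_b) if p < threshold}
    # Keep only positions where both candidates have the same error word.
    return {i: w for i, w in errors_a.items() if errors_b.get(i) == w}
```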
Optionally, the determining, by using a pre-trained binary classification model, which of the two replacement candidate results in each combination to be classified is better includes:
acquiring a first feature vector corresponding to the recognized text word sequence in a target replacement candidate result, wherein the target replacement candidate result is each replacement candidate result in the combination to be classified;
acquiring a second feature vector corresponding to the word information of each word in the target replacement candidate result;
concatenating the first feature vector and the second feature vector to generate a feature representation of the target replacement candidate result;
and inputting the feature representations of the two replacement candidate results in the combination to be classified into the pre-trained binary classification model to determine which of the two replacement candidate results in each combination to be classified is better.
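The feature-construction step, concatenating the text-sequence vector with the spliced per-word vectors, can be sketched as follows (the vector dimensions are illustrative):

```python
import numpy as np

def build_feature_representation(text_vec, word_info_vecs):
    """Build a candidate's feature representation.

    text_vec: the first feature vector, for the recognized text word sequence.
    word_info_vecs: one feature vector per word, derived from its word information.
    """
    # Splice the per-word vectors into the second feature vector...
    second = np.concatenate(word_info_vecs)
    # ...then concatenate the first and second feature vectors.
    return np.concatenate([text_vec, second])
```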
Optionally, the word information includes one or more of an acoustic model score, a language model score, a duration, and a confidence;
the acquiring a second feature vector corresponding to the word information of each word in the target replacement candidate result includes:
inputting the word information of each word in the target replacement candidate result into a fully connected network to obtain a feature vector of each word in the target replacement candidate result;
and concatenating the feature vectors of the words in the target replacement candidate result to generate the second feature vector corresponding to the word information of the words in the target replacement candidate result.
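A sketch of the fully connected mapping from per-word information to feature vectors; the layer sizes, activation, and random weights are placeholders, not values from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)

class WordInfoEncoder:
    """Sketch of the fully connected network that maps each word's
    information (e.g. acoustic model score, language model score,
    duration, confidence) to a feature vector."""

    def __init__(self, in_dim=4, out_dim=8):
        self.W = rng.standard_normal((in_dim, out_dim)) * 0.1
        self.b = np.zeros(out_dim)

    def encode_word(self, info):
        # One fully connected layer with a ReLU non-linearity.
        return np.maximum(0.0, np.asarray(info, dtype=float) @ self.W + self.b)

    def encode_sequence(self, word_infos):
        # Splice the per-word vectors into the second feature vector.
        return np.concatenate([self.encode_word(i) for i in word_infos])
```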
Optionally, the inputting the feature representations of the two replacement candidate results in the combination to be classified into the pre-trained binary classification model to determine which of the two replacement candidate results in each combination to be classified is better includes:
inputting the feature representations of the two replacement candidate results in the combination to be classified into the encoder part of a Transformer model to obtain a first hidden layer vector output by the encoder part of the Transformer model;
and inputting the first hidden layer vector into the pre-trained binary classification model to determine which of the two replacement candidate results in each combination to be classified is better.
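A sketch of the comparison step; here a caller-supplied encoder function stands in for the Transformer encoder, and a logistic layer stands in for the binary classification model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score_pair(feat_a, feat_b, encoder, w, b=0.0):
    """Return P(candidate A is better than candidate B).

    encoder: maps a feature representation to a hidden vector
             (a Transformer encoder in the description; any function here).
    w, b:    weights of the logistic stand-in for the binary classifier.
    """
    # Concatenate the two hidden vectors into the first hidden layer vector.
    h = np.concatenate([encoder(feat_a), encoder(feat_b)])
    return sigmoid(h @ w + b)
```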
Optionally, the training process of the binary classification model includes:
acquiring speech sample information and a standard recognition text corresponding to the speech sample information;
performing speech recognition on the speech sample information to obtain training candidate result information corresponding to the speech sample information, wherein the training candidate result information includes a training recognized text word sequence and word information of each word in the training recognized text word sequence;
determining the training candidate result information with the highest similarity to the standard recognition text as standard training candidate result information;
combining the standard training candidate result information with each piece of the other training candidate result information to generate at least one combination to be trained;
and training the binary classification model by using the combinations to be trained and labels indicating that the standard training candidate result information in each combination to be trained is better than the other training candidate result information.
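The training-data construction can be sketched as follows; difflib's ratio is used as a stand-in similarity metric, since the disclosure does not fix one:

```python
import difflib

def select_standard_candidate(candidates, reference):
    """Pick the candidate most similar to the reference (standard) transcript."""
    def similarity(cand):
        return difflib.SequenceMatcher(None, cand, reference).ratio()
    return max(candidates, key=similarity)

def make_training_pairs(candidates, reference):
    """Pair the standard candidate with every other candidate; the label 1
    records that the standard member of each pair is the better one."""
    standard = select_standard_candidate(candidates, reference)
    return [(standard, other, 1) for other in candidates if other is not standard]
```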
Optionally, the training the binary classification model by using the combinations to be trained and the labels indicating that the standard training candidate result information in each combination to be trained is better than the other training candidate result information includes:
acquiring a third feature vector corresponding to the recognized text word sequence in target training candidate result information, wherein the target training candidate result information is, respectively, the standard training candidate result information and the other training candidate result information in the combination to be trained;
acquiring a fourth feature vector corresponding to the word information of each word in the target training candidate result information;
concatenating the third feature vector and the fourth feature vector to generate a feature representation of the target training candidate result information;
and training the binary classification model by using the feature representation of the standard training candidate result information in the combination to be trained, the feature representation of the other training candidate result information in the combination to be trained, and the label indicating that the standard training candidate result information is the better of the two.
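The final pairwise training step might look like the following sketch, where a logistic model over the concatenated pair of feature representations stands in for the classifier described above:

```python
import numpy as np

def train_binary_model(pairs, dim, lr=0.1, epochs=100):
    """Logistic stand-in for the binary classification model.

    pairs: (standard_features, other_features) tuples; each carries the
           implicit label that the standard candidate is the better one.
    dim:   dimensionality of each feature representation.
    """
    w = np.zeros(2 * dim)
    for _ in range(epochs):
        for std, other in pairs:
            x = np.concatenate([std, other])      # pair representation
            p = 1.0 / (1.0 + np.exp(-(x @ w)))    # P(standard is better)
            w += lr * (1.0 - p) * x               # gradient step toward label 1
    return w
```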
Fig. 8 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1000 may differ considerably in configuration or performance, and may include one or more Central Processing Units (CPUs) 1022 (e.g., one or more processors), memory 1032, and one or more storage media 1030 (e.g., one or more mass storage devices) storing application programs 1042 or data 1044. The memory 1032 and the storage medium 1030 may provide transient or persistent storage. The program stored on the storage medium 1030 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and to execute, on the server 1000, the series of instruction operations in the storage medium 1030.
The server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1056, one or more keyboards, and/or one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
In addition, an embodiment of the present application further provides a computer-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the above method for ranking speech recognition candidate results.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be cross-referenced. Since the system or device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
It should be understood that in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following" or similar expressions refers to any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b, or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be singular or plural.
It is further noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. The terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for ranking speech recognition candidate results, the method comprising:
acquiring candidate result information of speech recognition, wherein each piece of candidate result information comprises a recognized text word sequence and word information of each word in the recognized text word sequence;
pairwise combining the acquired candidate result information to generate at least one candidate combination;
identifying a common error word of the candidate combination, wherein the common error word is an error word that appears at the same position in both pieces of candidate result information comprised in the candidate combination;
replacing the common error word comprised in each piece of candidate result information in the candidate combination with a marker word, and replacing the word information of the common error word with marker word information, to obtain a replacement candidate result corresponding to each piece of candidate result information and thereby a combination to be classified;
determining, by using a pre-trained binary classification model, which of the two replacement candidate results in each combination to be classified is better;
and ranking the candidate result information according to the relative quality of the two replacement candidate results in each combination to be classified.
2. The method according to claim 1, wherein the identifying a common error word of the candidate combination comprises:
obtaining occurrence probability values of the words comprised in the two pieces of candidate result information in the candidate combination;
determining words whose probability values are lower than a threshold as error words, and determining the position of each error word in the corresponding candidate result information;
and comparing the error words comprised in the two pieces of candidate result information in the candidate combination, together with the positions at which they appear, and identifying identical error words located at identical positions as the common error words of the candidate combination.
3. The method according to claim 1 or 2, wherein the determining, by using a pre-trained binary classification model, which of the two replacement candidate results in each combination to be classified is better comprises:
acquiring a first feature vector corresponding to the recognized text word sequence in a target replacement candidate result, wherein the target replacement candidate result is each replacement candidate result in the combination to be classified;
acquiring a second feature vector corresponding to the word information of each word in the target replacement candidate result;
concatenating the first feature vector and the second feature vector to generate a feature representation of the target replacement candidate result;
and inputting the feature representations of the two replacement candidate results in the combination to be classified into the pre-trained binary classification model to determine which of the two replacement candidate results in each combination to be classified is better.
4. The method according to claim 3, wherein the word information comprises one or more of an acoustic model score, a language model score, a duration, and a confidence;
the acquiring a second feature vector corresponding to the word information of each word in the target replacement candidate result comprises:
inputting the word information of each word in the target replacement candidate result into a fully connected network to obtain a feature vector of each word in the target replacement candidate result;
and concatenating the feature vectors of the words in the target replacement candidate result to generate the second feature vector corresponding to the word information of the words in the target replacement candidate result.
5. The method according to claim 3, wherein the inputting the feature representations of the two replacement candidate results in the combination to be classified into the pre-trained binary classification model to determine which of the two replacement candidate results in each combination to be classified is better comprises:
inputting the feature representations of the two replacement candidate results in the combination to be classified into the encoder part of a Transformer model to obtain a first hidden layer vector output by the encoder part of the Transformer model;
and inputting the first hidden layer vector into the pre-trained binary classification model to determine which of the two replacement candidate results in each combination to be classified is better.
6. The method according to claim 1, wherein the training process of the binary classification model comprises:
acquiring speech sample information and a standard recognition text corresponding to the speech sample information;
performing speech recognition on the speech sample information to obtain training candidate result information corresponding to the speech sample information, wherein the training candidate result information comprises a training recognized text word sequence and word information of each word in the training recognized text word sequence;
determining the training candidate result information with the highest similarity to the standard recognition text as standard training candidate result information;
combining the standard training candidate result information with each piece of the other training candidate result information to generate at least one combination to be trained;
and training the binary classification model by using the combinations to be trained and labels indicating that the standard training candidate result information in each combination to be trained is better than the other training candidate result information.
7. The method according to claim 6, wherein the training the binary classification model by using the combinations to be trained and the labels indicating that the standard training candidate result information in each combination to be trained is better than the other training candidate result information comprises:
acquiring a third feature vector corresponding to the recognized text word sequence in target training candidate result information, wherein the target training candidate result information is, respectively, the standard training candidate result information and the other training candidate result information in the combination to be trained;
acquiring a fourth feature vector corresponding to the word information of each word in the target training candidate result information;
concatenating the third feature vector and the fourth feature vector to generate a feature representation of the target training candidate result information;
and training the binary classification model by using the feature representation of the standard training candidate result information in the combination to be trained, the feature representation of the other training candidate result information in the combination to be trained, and the label indicating that the standard training candidate result information is the better of the two.
8. An apparatus for ranking speech recognition candidate results, the apparatus comprising:
an acquisition unit, configured to acquire candidate result information of speech recognition, each piece of candidate result information comprising a recognized text word sequence and word information of each word in the recognized text word sequence;
a combination unit, configured to pairwise combine the acquired candidate result information to generate at least one candidate combination;
an identification unit, configured to identify a common error word of the candidate combination, wherein the common error word is an error word that appears at the same position in both pieces of candidate result information comprised in the candidate combination;
a replacement unit, configured to replace the common error word comprised in each piece of candidate result information in the candidate combination with a marker word, and to replace the word information of the common error word with marker word information, to obtain a replacement candidate result corresponding to each piece of candidate result information and thereby a combination to be classified;
a quality determination unit, configured to determine, by using a pre-trained binary classification model, which of the two replacement candidate results in each combination to be classified is better;
and a ranking unit, configured to rank the candidate result information according to the relative quality of the two replacement candidate results in each combination to be classified.
9. An apparatus for ranking speech recognition candidate results, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring candidate result information of speech recognition, wherein each piece of candidate result information comprises a recognized text word sequence and word information of each word in the recognized text word sequence;
pairwise combining the acquired candidate result information to generate at least one candidate combination;
identifying a common error word of the candidate combination, wherein the common error word is an error word that appears at the same position in both pieces of candidate result information comprised in the candidate combination;
replacing the common error word comprised in each piece of candidate result information in the candidate combination with a marker word, and replacing the word information of the common error word with marker word information, to obtain a replacement candidate result corresponding to each piece of candidate result information and thereby a combination to be classified;
determining, by using a pre-trained binary classification model, which of the two replacement candidate results in each combination to be classified is better;
and ranking the candidate result information according to the relative quality of the two replacement candidate results in each combination to be classified.
10. A computer-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the method for ranking speech recognition candidate results according to any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010475597.0A CN111651599B (en) | 2020-05-29 | 2020-05-29 | Method and device for ordering voice recognition candidate results |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010475597.0A CN111651599B (en) | 2020-05-29 | 2020-05-29 | Method and device for ordering voice recognition candidate results |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111651599A true CN111651599A (en) | 2020-09-11 |
| CN111651599B CN111651599B (en) | 2023-05-26 |
Family
ID=72348639
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010475597.0A Active CN111651599B (en) | 2020-05-29 | 2020-05-29 | Method and device for ordering voice recognition candidate results |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111651599B (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114155843A (en) * | 2021-11-26 | 2022-03-08 | 科大讯飞股份有限公司 | Method, apparatus, device and storage medium for determining a speech recognition result |
| CN114203169A (en) * | 2022-01-26 | 2022-03-18 | 合肥讯飞数码科技有限公司 | Method, device and equipment for determining voice recognition result and storage medium |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170154033A1 (en) * | 2015-11-30 | 2017-06-01 | Samsung Electronics Co., Ltd. | Speech recognition apparatus and method |
| CN108711422A (en) * | 2018-05-14 | 2018-10-26 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, computer readable storage medium and computer equipment |
| CN109243430A (en) * | 2017-07-04 | 2019-01-18 | 北京搜狗科技发展有限公司 | A kind of audio recognition method and device |
| CN109791767A (en) * | 2016-09-30 | 2019-05-21 | 罗伯特·博世有限公司 | System and method for speech recognition |
| CN110765763A (en) * | 2019-09-24 | 2020-02-07 | 金蝶软件(中国)有限公司 | Error correction method and device for speech recognition text, computer equipment and storage medium |
| CN110765244A (en) * | 2019-09-18 | 2020-02-07 | 平安科技(深圳)有限公司 | Method and device for acquiring answering, computer equipment and storage medium |
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170154033A1 (en) * | 2015-11-30 | 2017-06-01 | Samsung Electronics Co., Ltd. | Speech recognition apparatus and method |
| CN109791767A (en) * | 2016-09-30 | 2019-05-21 | 罗伯特·博世有限公司 | System and method for speech recognition |
| CN109243430A (en) * | 2017-07-04 | 2019-01-18 | 北京搜狗科技发展有限公司 | A kind of audio recognition method and device |
| CN108711422A (en) * | 2018-05-14 | 2018-10-26 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, computer readable storage medium and computer equipment |
| CN110765244A (en) * | 2019-09-18 | 2020-02-07 | 平安科技(深圳)有限公司 | Method and device for acquiring answering, computer equipment and storage medium |
| CN110765763A (en) * | 2019-09-24 | 2020-02-07 | 金蝶软件(中国)有限公司 | Error correction method and device for speech recognition text, computer equipment and storage medium |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114155843A (en) * | 2021-11-26 | 2022-03-08 | 科大讯飞股份有限公司 | Method, apparatus, device and storage medium for determining a speech recognition result |
| CN114203169A (en) * | 2022-01-26 | 2022-03-18 | 合肥讯飞数码科技有限公司 | Method, device and equipment for determining voice recognition result and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111651599B (en) | 2023-05-26 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN107436691B (en) | Method, client, server and device for correcting errors of input method | |
| CN107291690B (en) | Punctuation adding method and device and punctuation adding device | |
| CN108399914B (en) | Voice recognition method and device | |
| CN107221330B (en) | Punctuation adding method and device and punctuation adding device | |
| CN111368541B (en) | Named entity identification method and device | |
| CN110781305A (en) | Text classification method and device based on classification model and model training method | |
| JP7116088B2 (en) | Speech information processing method, device, program and recording medium | |
| CN113792207A (en) | Cross-modal retrieval method based on multi-level feature representation alignment | |
| CN108803890B (en) | Input method, input device and input device | |
| CN107564526B (en) | Processing method, apparatus and machine-readable medium | |
| CN114154459A (en) | Speech recognition text processing method, device, electronic device and storage medium | |
| CN112579767A (en) | Search processing method and device for search processing | |
| CN109101505B (en) | Recommendation method, recommendation device and device for recommendation | |
| CN115730073A (en) | Text processing method, device and storage medium | |
| CN111651599B (en) | Method and device for ordering voice recognition candidate results | |
| CN112035651B (en) | Sentence completion method, sentence completion device and computer readable storage medium | |
| CN110069143B (en) | Information error correction preventing method and device and electronic equipment | |
| CN107424612B (en) | Processing method, apparatus and machine-readable medium | |
| CN107422872B (en) | Input method, input device and input device | |
| CN113589954B (en) | Data processing method and device and electronic equipment | |
| CN112987941B (en) | Method and device for generating candidate words | |
| CN111626059A (en) | Information processing method and device | |
| CN111274389A (en) | Information processing method and device, computer equipment and storage medium | |
| WO2024179519A1 (en) | Semantic recognition method and apparatus | |
| CN109388252B (en) | Input method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |