Detailed Description
The following description of the embodiments of the present application is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
Mainstream speech transcription systems in the industry are generally measured by word-level transcription accuracy, for example the COR (Correct Recognition Rate) or ACC (Accuracy) index, to reflect the quality of the speech transcription system. The higher the COR or ACC score, the better the transcription effect of the speech transcription system is considered to be. The prior art considers neither whether the actual semantics of the transcribed text remain consistent with the audio, nor the robustness of the transcription system in various complex noise scenarios. As a result, even when the COR and ACC indexes are high, reading the transcribed text or using it in a downstream task frequently causes deviation in information transfer, or failure of the downstream cascaded task, due to transcription errors.
The COR index strictly counts the character-level consistency between the transcribed text and the reference text; it penalizes deletion and substitution errors but ignores insertion errors. The index is calculated by formula (1):
COR = H / N × 100% (1)
Where N represents the total number of words of the reference text and H represents the number of words correctly transcribed.
The ACC index represents the overall proportion of the correct portion in the transcribed text, while accounting for the effects of insertion, deletion, and substitution errors. The specific calculation formulas are shown in (2) and (3) below.
WER = (S + I + D) / N (2)
ACC = 1 − WER (3)
Where S, I, and D represent the numbers of substitution, insertion, and deletion errors, respectively, and N is again the total number of words of the reference text.
Referring to fig. 1, an example is provided. The transcription results of the same audio produced by three speech transcription systems are denoted REC1, REC2, and REC3, respectively, and their alignment results against the reference text LAB are shown in fig. 1.
The audio reference text has 15 words; each of the three transcription systems has no insertion or deletion errors and exactly one substitution error. According to formulas (1), (2), and (3), the COR and ACC indexes of all three systems are calculated to be 93.33%. According to the prior art, the quality of the three systems therefore cannot be distinguished.
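As a concrete illustration, formulas (1), (2), and (3) can be sketched in a few lines of Python; the inputs below are the fig. 1 example values from the text (15 reference words, one substitution error, hence H = 14):

```python
def cor(h, n):
    """Formula (1): proportion of correctly transcribed words, as a percentage."""
    return h / n * 100

def acc(s, i, d, n):
    """Formulas (2) and (3): ACC = 1 - WER, as a percentage."""
    wer = (s + i + d) / n
    return (1 - wer) * 100

# fig. 1 example: N = 15 reference words, one substitution error (S = 1),
# no insertions or deletions, so H = 14 words are correct.
print(round(cor(14, 15), 2))       # 93.33
print(round(acc(1, 0, 0, 15), 2))  # 93.33
```

The two indexes coincide here because, with no insertion errors, H = N − S − D and the two formulas reduce to the same value.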
But by analyzing the transcription results of the three systems, it can be found that:
The error in REC1 is a word transcription error inside an entity word, which has the greatest impact on correct semantic delivery. The error in REC2 is a filler-word error, with essentially no loss of semantic communication in the present context. The error in REC3 is a verb transcription error, which has some influence on semantic transfer in the current context, but less than the entity-word error in REC1. Therefore, although the transcription accuracy indexes of the three systems are identical, from an objective point of view combined with human subjective perception, REC2 performs best, REC3 second, and REC1 worst. Evaluating a transcription system only by the COR and ACC indexes is thus not robust enough: these indexes are concerned only with literal errors, including insertion, deletion, and substitution errors, and completely ignore how words are used in specific contexts. In some cases, even if the literal errors are few, a local error can make the context inappropriate, so that the information actually conveyed deviates considerably from what humans subjectively perceive.
In addition, noise type and intensity have a significant influence on the accuracy of speech transcription. A robust speech transcription system should produce stable transcribed text for the same audio in various noise scenarios. Current evaluation methods for speech transcription systems do not consider the influence of noise on the system.
In view of the above, the present application provides a method for evaluating a speech transcription system, which can at least solve some of the defects existing in the prior art. The evaluation method of the voice transcription system can be suitable for evaluating the transcription effect of the voice transcription system in various industries and scenes.
The application provides an evaluation system of a voice transcription system (called voice transcription evaluation system for short), which can be deployed on a terminal or a server. For example, the speech transcription evaluation system may be deployed on the same device as the speech transcription system, or both may be deployed on different devices.
Fig. 2 illustrates a process flow of a speech transcription system integrated with a speech transcription evaluation system.
The test audio is transcribed by a speech transcription system (ASR system) to obtain transcribed text (defined as first transcribed text) of the test audio.
Further, a reference text of the test audio is acquired, the reference text being a standard text corresponding to the test audio content. The test audio and the reference text can be obtained through a published data set, or the expert marks the test audio to obtain the reference text. In addition, the reference text may be synthesized by speech, and the synthesized speech may be used as the test audio.
The reference text and the first transcription text are sent to the speech transcription evaluation system, which executes a specific evaluation process: for example, it determines the semantic consistency between the first transcription text and the reference text to obtain a semantic consistency index score, and obtains an evaluation result of the speech transcription system based on that score, for example in the form of an evaluation score, an evaluation grade, or the like.
In some possible implementations, noisy test audio may also be obtained by automatically adding noise to the test audio, as shown by the right branch in fig. 2. The noisy test audio is further transcribed by the ASR system to obtain its transcribed text (defined as the second transcription text). On this basis, the second transcription text can also be sent to the speech transcription evaluation system so that it can calculate a noise robustness index score. The evaluation result of the speech transcription system can then be determined based on the scores of the set evaluation indexes, such as the semantic consistency index score and the noise robustness index score.
Next, from the perspective of a speech transcription evaluation system, the embodiment of the application provides a speech transcription evaluation method. Referring to fig. 3, the method for evaluating the speech transcription system specifically includes the following steps:
Step S100, test audio and corresponding reference text are acquired.
Specifically, when the speech transcription system is tested, a target scene (domain) to be tested can be specified; that is, the test task is to test the transcription effect of the speech transcription system in the target scene.
The test audio acquired in this step may come from an audio test set related to the target scene. For example, if the target scene is a human-machine interaction scene, the test audio may come from a short-audio test set associated with the human-machine interaction scene. As yet another example, where the target scene is a conference scene, the test audio may come from a long-audio test set of the conference transcription class associated with the conference scene.
The reference text corresponding to the test audio is a standard text corresponding to the test audio content. The test audio and the reference text can be obtained through a published data set, or the reference text can be obtained by expert annotation of the test audio. In addition, the reference text may be converted to speech by speech synthesis, and the synthesized speech used as the test audio.
Step S110, a first transcription text of the test audio is obtained, and the first transcription text is obtained by transcribing the test audio through a voice transcription system to be evaluated.
Step S120, semantic consistency between the first transcription text and the reference text is determined to obtain a semantic consistency index score.
In this embodiment, the semantic consistency index is used as an evaluation index for evaluating the effect of the voice transcription system. The semantic consistency index is used for measuring semantic consistency between the first transcription text and the reference text, namely measuring the capability of the voice transcription system for testing audio semantic transfer.
The semantic consistency index score represents the degree of semantic consistency between the first transcription text and the reference text, and the higher the semantic consistency index score, the higher the semantic consistency degree between the first transcription text and the reference text, and the better the transcription effect of the speech transcription system.
Step S130, an evaluation result of the speech transcription system is determined according to the score of the set evaluation index, wherein the set evaluation index at least comprises the semantic consistency index.
The application can predefine one or more evaluation indexes for evaluating the voice transcription system, wherein the evaluation indexes at least comprise semantic consistency indexes.
According to the score of the set evaluation index, the evaluation effect of the voice transcription system can be calculated. The evaluation effect may be embodied in various forms, for example, the evaluation effect of the speech transcription system is represented in the form of an evaluation score, an evaluation level, and the like.
According to the above evaluation method, a semantic consistency index is introduced when evaluating the speech transcription system. It measures the semantic consistency between the first transcription text of the test audio and the reference text, that is, the degree to which the two texts express the same meaning at the semantic level, and thus accurately measures the ability of the speech transcription system to transfer the semantics of the test audio. The evaluation result of the speech transcription system is then determined at least from the semantic consistency index score, which improves the objectivity of the evaluation result. Meanwhile, the semantic consistency index better matches real human reading experience, so the resulting evaluation is closer to what humans actually perceive.
In some possible implementations, the evaluation index of the speech transcription system may only include the semantic consistency index, and then the calculated semantic consistency index score may be directly used as the evaluation score of the speech transcription system.
In other possible implementations, the evaluation indexes of the speech transcription system may further add a recognition rate index, which characterizes the accuracy of word-level transcription, that is, the accuracy with which the test audio is transcribed into characters. On this basis, the method of the application may further comprise:
A recognition rate index score is calculated based on the reference text and the first transcription text.
The evaluation score of the voice transcription system can be comprehensively obtained according to the recognition rate index score and the semantic consistency index score. For example, the recognition rate index score and the semantic consistency index score are weighted and added, and the result is used as an evaluation score T of the voice transcription system:
T=s1×wgt1+s2×wgt2;
wherein s1 represents the recognition rate index score, s2 represents the semantic consistency index score, and wgt1 and wgt2 are two weight values whose sum equals 1; the weights can be flexibly adjusted according to the user's relative emphasis on the recognition rate index and the semantic consistency index.
An alternative calculation process for the recognition rate index score is described in this embodiment.
Specifically, the recognition rate index score may be determined based on the correct recognition rate (COR) index score and/or the accuracy (ACC) index score. For example, the COR index score or the ACC index score alone may be used as the recognition rate index score, or the COR and ACC index scores may be combined to obtain the recognition rate index score. The following formula illustrates one way of calculating the recognition rate index score:
s1=α×COR+β×ACC;
where α and β represent two weight values whose sum equals 1, and COR and ACC represent the two index scores, respectively.
The COR index and the ACC index are each biased towards different applicable scenarios.
The COR index focuses on the proportion of correct characters, ignores insertion errors, and reflects local correctness. It is therefore better suited to long audio, for example conference-recording transcription, long lectures or blog content generation, and tasks requiring statistics of local accuracy (e.g., subtitle generation).
The ACC index penalizes every insertion, deletion, and substitution error, so a transcription achieves a full score only when it is exactly identical to the reference text, ensuring absolute textual accuracy. Short audio typically corresponds to short text (e.g., "turn on air conditioner", "navigate to company"), where sentences are short and every word is semantically critical. Minor errors can lead to serious consequences: for example, "turn off" wrongly transcribed into "turn on", even though only one word is replaced, may completely change the intent. The ACC index is therefore more applicable to short audio, for example voice assistant instructions (e.g., smart home control), key information extraction (e.g., phone numbers, verification codes), and phrase translation or command execution. The ACC index better meets the "zero fault tolerance" requirement of short audio.
According to the application scenes towards which the COR index and the ACC index are respectively biased, the evaluation method can set weights of different sizes for the COR and ACC indexes when testing the performance of the speech transcription system in different scenes. For example:
Under the condition that the test audio is from a short audio test set, a first weight alpha corresponding to the COR index is smaller than a second weight beta corresponding to the ACC index;
in the case that the test audio is derived from a long audio test set, the first weight α corresponding to the COR index is greater than the second weight β corresponding to the ACC index.
The duration of the test audio contained in the short-audio test set is smaller than a first set duration threshold, and the duration of the test audio contained in the long-audio test set is larger than a second set duration threshold. The two thresholds may be the same or different; when they differ, the first set duration threshold is smaller than the second.
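The weight-selection rule above can be sketched as follows. The concrete threshold values (5 s and 60 s) and the weight pairs are illustrative assumptions only; the text prescribes the inequalities, not the numbers:

```python
SHORT_THRESHOLD_S = 5.0   # first set duration threshold (assumed value)
LONG_THRESHOLD_S = 60.0   # second set duration threshold (assumed value)

def cor_acc_weights(duration_s):
    """Return (alpha, beta) for the COR/ACC weighting, with alpha + beta = 1."""
    if duration_s < SHORT_THRESHOLD_S:
        return 0.2, 0.8   # short audio: alpha < beta, emphasize ACC
    if duration_s > LONG_THRESHOLD_S:
        return 0.8, 0.2   # long audio: alpha > beta, emphasize COR
    return 0.5, 0.5       # in between: no preference

print(cor_acc_weights(2.0))    # (0.2, 0.8)
print(cor_acc_weights(600.0))  # (0.8, 0.2)
```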
Referring to fig. 4, a calculation flow of the recognition rate index is described, which specifically includes:
For the test audio, its reference text is obtained, and the first transcription text of the test audio is obtained through the speech transcription system (ASR system).
Further, the reference text and the first transcription text are aligned by edit distance.
A COR index score and an ACC index score are calculated based on the alignment results.
The COR index score and the ACC index score are then weighted and added to obtain the recognition rate score.
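The alignment step can be illustrated with a minimal dynamic-programming sketch over token lists (a standard Levenshtein alignment that counts correct words H, substitutions S, insertions I, and deletions D). This is an illustration only, not necessarily the alignment algorithm used by the actual system:

```python
def align_counts(ref, hyp):
    """Edit-distance alignment of reference vs. hypothesis tokens.

    Returns (H, S, I, D): correct, substituted, inserted, deleted counts.
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimal edit cost aligning ref[:i] with hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrack through the table to classify each aligned position.
    h = s = ins = dele = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                h += 1
            else:
                s += 1
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ins += 1
            j -= 1
        else:
            dele += 1
            i -= 1
    return h, s, ins, dele

# One substitution (b -> x) and one insertion (e):
print(align_counts("a b c d".split(), "a x c d e".split()))  # (3, 1, 1, 0)
```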
An example of edit-distance alignment of the reference text LAB and the first transcription text REC is provided in connection with fig. 5.
The reference text has 16 words in total, i.e., N = 16; 13 words are correctly transcribed, i.e., H = 13; the number of inserted words is 2, i.e., I = 2; the substitution errors number 2, i.e., S = 2; and the deletion errors number 1, i.e., D = 1. According to formulas (1), (2), and (3), COR is calculated to be 81.25% and ACC to be 68.75%. Defining α = 0.2 and β = 0.8, the recognition rate index score s1 = 71.25 is obtained.
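This arithmetic can be verified directly from the counts stated above:

```python
# fig. 5 example: COR and ACC from the alignment counts, then the
# weighted recognition rate score s1 = alpha*COR + beta*ACC.
N, H, I, S, D = 16, 13, 2, 2, 1

cor = H / N * 100                  # formula (1)
acc = (1 - (S + I + D) / N) * 100  # formulas (2) and (3)
alpha, beta = 0.2, 0.8
s1 = alpha * cor + beta * acc

print(round(cor, 2), round(acc, 2), round(s1, 2))  # 81.25 68.75 71.25
```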
In some embodiments of the application, the determination of the semantic consistency index score is introduced.
In one possible implementation, a semantic consistency discriminant model may be pre-trained, which may employ a deep neural network model structure. Training data is first obtained, the training data comprising text pairs, and the text pairs are labeled with semantic consistency scores. Training the semantic consistency discrimination model by using the training data to obtain a trained semantic consistency discrimination model. Furthermore, the first transcription text and the reference text can be processed by using the semantic consistency discrimination model, and the semantic consistency index score of the first transcription text and the reference text can be output.
In another possible implementation, the present embodiment may invoke a large model, outputting the semantic consistency indicator scores of the reference text and the first transcription text with the natural language understanding and processing capabilities of the large model. That is, in this embodiment, the large model may be invoked to instruct the large model to evaluate semantic consistency between the first transcription text and the reference text, and output a semantic consistency index score.
When the large model is called to evaluate the semantic consistency between the first transcription text and the reference text, the first transcription text and the reference text can be spliced with a preset prompt word to form a prompt, which is sent to the large model to obtain the semantic consistency index score output by the large model. The prompt word may specify a way of thinking for the large model, to guide it towards evaluating the semantic consistency index score more accurately. Part of the prompt word content is illustrated below:
In the task of speech transcription, the semantic change of the transcribed text relative to the reference text needs to be evaluated; the output is the degree of semantic change (an integer between 0 and 100; the greater the change, the larger the value).
When judging whether the semantics change, the following steps can be considered:
Step 1: first judge whether, after speech recognition, only repeated characters, spoken filler characters, or redundant expressions of the reference text have been lost, added, or corrected. Such changes basically cause no semantic change (for example, adding or dropping a sentence-final filler particle). If so, output the mark {"no semantic change": 0} and end the judgment.
Step 2: if the condition of Step 1 is not satisfied, then try to determine whether any of the following semantic inconsistencies exist:
(1) "sentence change":
...
Output format:
{
"reason": "<scoring reason>",
"score": ""
}
Please evaluate the following data strictly in the above format:
Reference text [ reference text ]
Phonetic transcription text [ first transcription text ].
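The splicing of the prompt and the parsing of the large model's reply can be sketched as below. The template text and the JSON-parsing convention are illustrative assumptions, and the call to the large model itself (whatever LLM API is actually used) is omitted:

```python
import json

# Template mirroring the tail of the prompt word shown above; the
# scoring instructions are passed in as a single string.
PROMPT_TEMPLATE = """{instructions}
Please evaluate the following data strictly in the above format:
Reference text [{reference}]
Phonetic transcription text [{transcribed}]."""

def build_prompt(instructions, reference, transcribed):
    # Splice the preset prompt word with the two texts to form the prompt.
    return PROMPT_TEMPLATE.format(
        instructions=instructions, reference=reference, transcribed=transcribed)

def parse_score(reply_text):
    # The prompt asks the model to reply as {"reason": ..., "score": ...}.
    reply = json.loads(reply_text)
    return int(reply["score"])

prompt = build_prompt("<scoring instructions here>",
                      "turn off the light", "turn on the light")
print(parse_score('{"reason": "verb changed, intent reversed", "score": "85"}'))  # 85
```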
By invoking the capability of the large model in this way, no pre-trained discrimination model is needed, and a more accurate semantic consistency index score can be generated by means of the natural language understanding and processing capability of the large model.
In some embodiments of the present application, the speech transcription evaluation system may further evaluate the reliability of the semantic consistency index score given by the large model. That is, it may receive an expert's calibration of the semantic consistency index score between the reference text and the first transcription text, compare the expert calibration with the output of the large model, and give a comparison result of the human-machine consistency effect. This comparison result can verify whether the large model can directly replace the expert in evaluating semantic consistency. Before the speech transcription evaluation system is formally applied, this process can verify whether the similarity between the semantic consistency evaluation result given by the large model and the expert-labeled result is high enough (for example, exceeds a set threshold). If the similarity exceeds the threshold, the output of the large model is sufficiently reliable and the system can proceed to the evaluation stage; if not, the output of the large model is unreliable, and the network parameters and prompt words of the large model can be further optimized until its output is reliable enough.
Referring to fig. 6, a process for evaluating the human-computer consistency effect of a speech transcription evaluation system is described. One possible implementation process may include:
s1, obtaining a corresponding first transcription text after the test audio passes through a voice transcription system (ASR system), and obtaining a reference text corresponding to the test audio.
S2, the reference text and the first transcription text of the same test audio form a parallel sentence pair, and a parallel sentence pair set is constructed for the plurality of test audios in the test case.
S3, a large model is called to perform machine semantic consistency index scoring on each parallel sentence pair in the set, finally obtaining a machine scoring vector A whose dimension equals the number of test cases.
S4, the parallel sentence pair set obtained in S2 is provided to a language expert for manual semantic consistency index scoring, finally obtaining a manual scoring vector B of the same dimension.
S5, calculating the similarity between the vector A and the vector B, and obtaining an evaluation result of the consistency of the human-computer semantic scoring.
In this step, the similarity between the two scoring vectors may be measured using the Pearson correlation coefficient.
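A minimal sketch of this similarity computation, i.e., the Pearson correlation coefficient between the machine scoring vector A and the expert scoring vector B:

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length score vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

# Perfectly linearly related machine and expert scores give 1.0:
machine = [90, 70, 50, 30]   # vector A (illustrative values)
expert = [95, 75, 55, 35]    # vector B (illustrative values)
print(round(pearson(machine, expert), 4))  # 1.0
```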
According to the method provided in this embodiment, the expert scoring result and the machine scoring result are compared to obtain an evaluation of human-machine scoring consistency (for the score measuring the semantic consistency of the reference text and the first transcription text). The reliability of the large model can be analyzed with reference to this evaluation, so that when the large model is found to be unreliable, its structure or prompt words can be adjusted in time. The speech transcription evaluation system is applied to a specific evaluation task only after the output of the large model is guaranteed to be sufficiently reliable, thereby ensuring the accuracy of the system's evaluation results.
In some embodiments of the present application, another speech transcription system evaluation method is presented. In this method, a noise robustness index can be further added to the evaluation indexes of the speech transcription system; the noise robustness index measures the stability of the transcribed text of the audio before and after noise is added. The evaluation indexes of the speech transcription system may include both the semantic consistency index and the noise robustness index, or may simultaneously include the recognition rate index, the semantic consistency index, and the noise robustness index.
By adding the noise robustness index to the evaluation index, the noise immunity of the speech transcription system can be evaluated.
Taking the case where the evaluation indexes simultaneously include the recognition rate index, the semantic consistency index, and the noise robustness index, the evaluation score of the speech transcription system can be obtained comprehensively from the three index scores. For example, the recognition rate index score, the semantic consistency index score, and the noise robustness index score are weighted and added, and the result is used as the evaluation score T of the speech transcription system:
T = s1×wgt1 + s2×wgt2 + s3×wgt3.
wherein s1 represents the recognition rate index score, s2 the semantic consistency index score, and s3 the noise robustness index score; wgt1, wgt2, and wgt3 are three weight values whose sum equals 1, and they can be flexibly adjusted according to the user's focus on the three indexes. In general, no single weight exceeds 0.5, so that no single index dominates the evaluation result.
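This weighted combination can be sketched with the two constraints just stated (weights sum to 1, no single weight above 0.5) enforced explicitly; the score and weight values in the example are purely illustrative:

```python
def evaluation_score(s1, s2, s3, wgt1, wgt2, wgt3):
    """Combine the three index scores into the evaluation score T."""
    assert abs(wgt1 + wgt2 + wgt3 - 1.0) < 1e-9, "weights must sum to 1"
    assert max(wgt1, wgt2, wgt3) <= 0.5, "no single index may dominate"
    return s1 * wgt1 + s2 * wgt2 + s3 * wgt3

# Illustrative scores: recognition rate 71.25, semantic consistency 80,
# noise robustness 60, with weights 0.4 / 0.3 / 0.3.
print(round(evaluation_score(71.25, 80.0, 60.0, 0.4, 0.3, 0.3), 2))
```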
According to the speech transcription system evaluation method of this embodiment, two indexes, semantic consistency and noise robustness, are newly added on the basis of the recognition rate index, effectively overcoming the limitation that current evaluation methods only pay attention to word-level alignment. The method comprehensively evaluates the speech transcription system across the dimensions of character form, semantics, and noise immunity of the transcribed text, and its evaluation is more robust than existing schemes. Semantic consistency focuses more on the subjective experience of users, while noise robustness focuses more on the special complex scenarios that speech transcription systems may face. Evaluating the transcription effect from the user's perspective in this way is more reasonable than relying on the recognition rate alone.
In some embodiments of the application, a process for calculating a noise robustness index score is presented:
S1, noise-added test audio is acquired, the noise-added test audio being obtained by subjecting the test audio to noise-adding processing.
The noise type and intensity used when adding noise can be flexibly set, so that noisy audio covering a wider range of scenes can be obtained from the test audio. This shortens the construction time and cost of the noise-added test audio, enriches the diversity of the test set to a certain extent, and improves the robustness of the effect evaluation index of the speech transcription system.
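Controlling noise intensity is commonly done by fixing the signal-to-noise ratio (SNR). The sketch below is a simplified illustration over raw sample lists; a real system would read audio files and draw noise of various types from recorded noise corpora:

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sample list."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def add_noise(clean, noise, snr_db):
    """Mix noise into the clean signal at the requested SNR (in dB).

    The noise is scaled so that rms(clean) / rms(scaled noise)
    corresponds to snr_db; clean and noise are assumed equal length.
    """
    gain = rms(clean) / rms(noise) * 10 ** (-snr_db / 20)
    return [c + gain * n for c, n in zip(clean, noise)]

clean = [1.0, -1.0] * 100   # toy "signal" with rms 1.0
noise = [0.5, -0.5] * 100   # toy "noise" with rms 0.5
noisy = add_noise(clean, noise, snr_db=20.0)  # noise 20 dB below the signal
print(round(rms([s - c for s, c in zip(noisy, clean)]), 3))  # residual noise rms: 0.1
```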
S2, a second transcription text of the noise-added test audio is acquired, the second transcription text being obtained by transcribing the noise-added test audio through the speech transcription system.
S3, determining a noise robustness index score based on the first transcription text, the second transcription text and the reference text.
In some possible implementations, the noise robustness index score may be derived from statistics of the n-gram variations of the second transcription text relative to the first transcription text and the reference text.
Illustratively, SARI index scores may be employed as noise robustness index scores, or other statistical indices may be employed to calculate noise robustness index scores.
Taking the calculation of the SARI index score as an example, the first transcription text is used as the source text, the second transcription text as the generated text, and the SARI index score is calculated based on the source text, the generated text, and the reference text.
SARI (System output Against References and against the Input sentence) is an index for evaluating text generation tasks (such as text simplification and abstract generation). It is mainly used for tasks such as text simplification, and comprehensively scores the keep, delete, and add operations by comparing the n-gram editing operations of the generated text (candidate) against the source text (source) and the reference text (reference).
The core of SARI is to measure the accuracy of the following three types of editing operations:
Keep (KEEP): the generated text correctly retains important content in the source text.
Delete (DELETE): the generated text correctly deletes redundant content in the source text.
Add (ADD): the generated text reasonably adds new content that exists in the reference text.
The final score is the average of the three: SARI = (keep score + delete score + add score) / 3.
One calculation procedure for the SARI index is provided below:
Step 1: n-gram statistics.
The n-gram set of the source text (source): sgrams;
the n-gram set of the generated text (candidate): cgrams;
the n-gram set of the reference text (references): rgramsall.
Step 2: word frequency statistics.
sgramcounter = Counter(sgrams) (source text word frequencies);
cgramcounter = Counter(cgrams) (generated text word frequencies);
rgramcounter = Counter(rgramsall) (reference text word frequencies).
Step 3: calculate the three types of editing operation scores.
(a) Keep (KEEP):
Retained n-grams: n-grams common to the generated text and the source text (the intersection), i.e., keepgramcounter_rep = sgrams ∩ cgrams.
Correctly retained: retained n-grams that appear in the reference text, i.e., keepgramcountergood_rep = keepgramcounter_rep ∩ rgramsall.
(b) Delete (DELETE):
N-grams that should be deleted: n-grams that exist in the source text but are not retained by the generated text, i.e., delgramcounter_rep = sgrams − cgrams.
Correctly deleted: deleted n-grams that do not appear in the reference text, i.e., delgramcountergood_rep = delgramcounter_rep − rgramsall.
(c) Add (ADD):
Newly added n-grams: n-grams that exist in the generated text but not in the source text, i.e., addgramcounter = cgramcounter − sgramcounter.
Correctly added: newly added n-grams that appear in the reference text, i.e., addgramcountergood = addgramcounter ∩ rgramcounter.
This embodiment further provides a SARI index calculation example.
Source text (source): ["the", "cat", "sat"], corresponding sgrams = {1: {'the', 'cat', 'sat'}}.
Generated text (candidate): ["a", "cat", "sat"], corresponding cgrams = {1: {'a', 'cat', 'sat'}}.
Reference text (reference): ["the", "cat", "sat"], corresponding rgrams = {1: {'the', 'cat', 'sat'}}.
Calculation procedure:
Retention (KEEP):
1-grams common to the generated text and the source text: {'cat', 'sat'}.
Correctly retained: {'cat', 'sat'} (both retained 1-grams appear in the reference).
Precision P = 2/2 = 1.0; Recall R = 2/3 ≈ 0.667.
F1 score = (2 × P × R)/(P + R) = 0.8.
Deletion (DELETE):
1-grams to be deleted, i.e., present in the source text but absent from the generated text: {'the'}.
Correctly deleted: { } ('the' exists in the reference and should actually have been retained, so the deletion is not valid).
P = 0/1 = 0; R = 0; F1 score = 0.
Addition (ADD):
Newly added 1-grams, i.e., present in the generated text but absent from the source text: {'a'}.
Correctly added: { } (the newly added 1-gram does not appear in the reference).
P = 0/1 = 0; R = 0; F1 score = 0.
The F1 scores of the three editing operations are averaged to obtain the SARI score:
SARI = (0.8 + 0 + 0)/3 ≈ 0.267.
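The 1-gram calculation above can be reproduced end to end with a short sketch. The F1-based averaging follows this embodiment's description; note that published SARI implementations differ in details (for example, some use precision only for the DELETE operation), and the function names here are illustrative:

```python
from collections import Counter

def f1(correct, proposed, possible):
    """F1 from counts: proposed = size of the operation set produced,
    possible = size of the ideal set, correct = overlap between them."""
    p = correct / proposed if proposed else 0.0
    r = correct / possible if possible else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def sari_1gram(source, candidate, reference):
    s, c, r = Counter(source), Counter(candidate), Counter(reference)

    keep = s & c                                  # retained: source ∩ candidate
    keep_f1 = f1(sum((keep & r).values()), sum(keep.values()),
                 sum((s & r).values()))           # ideal keeps: source ∩ reference

    delete = s - c                                # deleted: source − candidate
    delete_f1 = f1(sum((delete - r).values()), sum(delete.values()),
                   sum((s - r).values()))         # ideal deletes: source − reference

    add = c - s                                   # added: candidate − source
    add_f1 = f1(sum((add & r).values()), sum(add.values()),
                sum((r - s).values()))            # ideal adds: reference − source

    return (keep_f1 + delete_f1 + add_f1) / 3

score = sari_1gram(["the", "cat", "sat"],
                   ["a", "cat", "sat"],
                   ["the", "cat", "sat"])
# ≈ 0.267, matching the worked example above
```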
The SARI score calculation is described above taking 1-grams as an example; the value of n in the n-gram may also take other values. When the n values include a plurality of values, a SARI score can be calculated for each n value, and finally the SARI scores over all n values are averaged as the final SARI score, so that the quality fluctuation of the transcribed text, i.e., the noise robustness of the speech transcription system, can be measured more comprehensively.
The SARI index quantifies the ability of the speech transcription system to maintain semantic consistency under noise interference by analyzing the rationality of the add, delete and keep operations. The higher the SARI index score, the more robust the speech transcription system is to noise.
In some embodiments of the present application, the speech transcription evaluation system may further be able to evaluate the reliability of the machine-evaluated noise robustness index score (for example, the noise robustness index score calculated using the SARI index). That is, the system may receive a calibration result of the noise robustness index given by an expert based on the first transcription text, the second transcription text and the reference text, compare the expert calibration result with the machine evaluation result, and output a comparison result of human-machine consistency. This comparison result can verify whether the machine-evaluated noise robustness index score can directly replace expert calibration. Before the speech transcription evaluation system is formally applied, the machine-evaluated noise robustness index score can be verified through this process by checking whether its similarity to the expert-calibrated result is high enough (for example, whether it exceeds a set threshold). If the similarity exceeds the set threshold, the machine evaluation result is reliable enough and the speech transcription evaluation system can proceed to the evaluation stage; if not, the machine evaluation result is unreliable, and the noise robustness index evaluation algorithm can be further optimized until the evaluation result obtained by the machine based on the optimized algorithm is reliable enough.
Referring to fig. 7, another evaluation flow for the human-machine consistency effect of the speech transcription evaluation system is described. One possible implementation process may include:
S1, passing the test audio through the speech transcription system (ASR system) to obtain a corresponding first transcription text, and obtaining a reference text corresponding to the test audio.
S2, performing noise-adding processing on the test audio to obtain noise-added test audio, and obtaining a second transcription text by passing it through the ASR system.
S3, forming a triplet from the reference text, the first transcription text and the second transcription text of the same test audio, and constructing a triplet set over the plurality of test audios in the test case.
S4, calculating a noise robustness index score for each triplet in the triplet set according to the SARI algorithm, finally obtaining a machine scoring vector C whose dimension equals the number of test audios in the test case.
S5, providing the triplet set obtained in S3 to a language expert for manual noise robustness scoring, finally obtaining a manual scoring vector D of the same dimension.
S6, calculating the similarity between vector C and vector D to obtain an evaluation result of the human-machine consistency of the noise robustness scoring.
In this step, the similarity between the two scoring vectors may be measured using the Pearson correlation coefficient.
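As a sketch of the similarity calculation in step S6, the Pearson correlation coefficient between the machine scoring vector C and the manual scoring vector D can be computed with the standard library alone; the score values below are hypothetical:

```python
from statistics import mean
from math import sqrt

def pearson(c, d):
    """Pearson correlation coefficient between two equal-length score vectors."""
    mc, md = mean(c), mean(d)
    cov = sum((x - mc) * (y - md) for x, y in zip(c, d))
    norm_c = sqrt(sum((x - mc) ** 2 for x in c))
    norm_d = sqrt(sum((y - md) ** 2 for y in d))
    return cov / (norm_c * norm_d)

# Hypothetical machine (C) and expert (D) noise-robustness scores per test audio
C = [0.82, 0.41, 0.67, 0.90]
D = [0.80, 0.45, 0.60, 0.95]
consistency = pearson(C, D)  # a value close to 1 indicates strong agreement
```

The coefficient lies in [−1, 1]; a value above the set threshold indicates that the machine scoring is consistent enough with the expert calibration.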
According to the method provided by this embodiment, the expert scoring result and the machine scoring result are compared to obtain an evaluation result of human-machine scoring consistency for the noise robustness index score. The reliability of the noise robustness index evaluation algorithm (for example, the SARI algorithm) can be analyzed with reference to this evaluation result, so that the algorithm can be adjusted in time when it is found to be unreliable. This ensures that the noise robustness index score output by the machine is sufficiently reliable before the speech transcription evaluation system is applied to a specific evaluation task, thereby guaranteeing the accuracy of the evaluation results of the speech transcription evaluation system.
The speech transcription system evaluation device provided by the embodiment of the application is described below, and the speech transcription system evaluation device described below and the speech transcription system evaluation method described above can be referred to correspondingly.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an evaluation device of a speech transcription system according to an embodiment of the present application.
As shown in fig. 8, the apparatus may include:
a test audio and reference text obtaining unit 11, configured to obtain a test audio and its corresponding reference text;
a first transcription text obtaining unit 12, configured to obtain a first transcription text of the test audio, where the first transcription text is obtained by transcribing the test audio by a speech transcription system to be evaluated;
a semantic consistency index calculating unit 13, configured to determine semantic consistency between the first transcription text and the reference text, and obtain a semantic consistency index score;
and an evaluation result determining unit 14, configured to determine an evaluation result of the speech transcription system according to a score of a set evaluation index, where the set evaluation index includes at least the semantic consistency index.
In a possible implementation, the apparatus of the present application further includes:
a noise-added test audio acquisition unit, configured to acquire noise-added test audio, where the noise-added test audio is obtained by performing noise-adding processing on the test audio;
the second transcription text acquisition unit is used for acquiring a second transcription text of the noise-added test audio, and the second transcription text is obtained by transcribing the noise-added test audio through the voice transcription system;
a noise robustness index calculation unit, configured to determine a noise robustness index score based on the first transcription text, the second transcription text and the reference text, where the noise robustness index measures the stability of the transcription of the audio before and after noise addition. On this basis, the evaluation indices adopted by the evaluation result determining unit further include the noise robustness index.
In a possible implementation, the apparatus of the present application further includes:
a recognition rate index calculation unit, configured to calculate a recognition rate index score according to the reference text and the first transcription text, where the recognition rate index represents the accuracy of speech-to-text transcription. On this basis, the evaluation indices adopted by the evaluation result determining unit further include the recognition rate index.
In a possible implementation, the process of determining the semantic consistency between the first transcription text and the reference text by the semantic consistency index calculating unit to obtain a semantic consistency index score includes:
calling a large model to instruct the large model to evaluate the semantic consistency between the first transcription text and the reference text, and to output a semantic consistency index score.
In a possible implementation, the noise robustness index calculation unit determines a noise robustness index score based on the first transcribed text, the second transcribed text and the reference text, and includes:
taking the first transcription text as the source text and the second transcription text as the generated text, and calculating a SARI index score as the noise robustness index score based on the source text, the generated text and the reference text.
In a possible implementation, the process of the recognition rate index calculating unit calculating a recognition rate index score according to the reference text and the first transcription text includes:
performing edit-distance alignment on the reference text and the first transcription text;
calculating a correct recognition rate (COR) index score and an accuracy rate (ACC) index score based on the alignment result;
and integrating the COR index score and the ACC index score to obtain a recognition rate index score.
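The alignment and scoring steps above can be sketched with a standard dynamic-programming edit distance that counts hits (H), substitutions (S), deletions (D) and insertions (I). COR = H/N follows formula (1); since formulas (2) and (3) are given elsewhere in this application, the ACC formula below uses the common (H − I)/N formulation as an assumption:

```python
def align_counts(ref, hyp):
    """Edit-distance alignment of reference and transcribed word lists.
    Returns (H, S, D, I): hits, substitutions, deletions, insertions."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Trace back through the table to count operation types
    h = s = d = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            if ref[i - 1] == hyp[j - 1]:
                h += 1     # hit: words match
            else:
                s += 1     # substitution
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d += 1         # deletion: reference word missing from transcription
            i -= 1
        else:
            ins += 1       # insertion: extra word in transcription
            j -= 1
    return h, s, d, ins

def cor_acc(ref, hyp):
    h, s, d, ins = align_counts(ref, hyp)
    n = len(ref)
    cor = h / n            # formula (1): COR = H / N
    acc = (h - ins) / n    # common ACC formulation, assumed here
    return cor, acc
```

For the example pair above (reference "the cat sat", transcription "a cat sat"), the alignment yields H = 2, S = 1, so COR = ACC = 2/3.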
In a possible implementation, the process of the recognition rate index calculating unit integrating the COR index score and the ACC index score to obtain a recognition rate index score includes:
performing weighted addition on the COR index score and the ACC index score to obtain the recognition rate index score, where:
in a case that the test audio comes from a short audio test set, the first weight corresponding to the COR index is smaller than the second weight corresponding to the ACC index, the duration of the test audio contained in the short audio test set being smaller than a first set duration threshold;
and in a case that the test audio comes from a long audio test set, the first weight corresponding to the COR index is greater than the second weight corresponding to the ACC index, the duration of the test audio contained in the long audio test set being greater than a second set duration threshold.
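The duration-dependent weighting can be sketched as follows. The concrete weight values and duration thresholds are illustrative assumptions, since the application only specifies that the COR weight is smaller than the ACC weight for short audio and greater for long audio:

```python
def recognition_rate_score(cor, acc, duration_s,
                           short_threshold_s=10.0, long_threshold_s=60.0):
    """Weighted combination of COR and ACC; weights depend on audio duration.

    Thresholds and weight values are illustrative, not specified by the text.
    """
    if duration_s < short_threshold_s:    # short audio test set: favour ACC
        w_cor, w_acc = 0.3, 0.7
    elif duration_s > long_threshold_s:   # long audio test set: favour COR
        w_cor, w_acc = 0.7, 0.3
    else:                                 # in between: equal weights (assumption)
        w_cor, w_acc = 0.5, 0.5
    return w_cor * cor + w_acc * acc
```

For example, with COR = 0.9 and ACC = 0.8, a 5-second clip scores 0.83 while a 2-minute recording scores 0.87 under these assumed weights.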
The embodiment of the application also provides an electronic device. Referring to fig. 9, a schematic diagram of an electronic device suitable for implementing embodiments of the present application is shown. The electronic device in the embodiment of the application can include, but is not limited to, devices such as a mobile phone, a tablet computer, a translator, and a server. The electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in fig. 9, the electronic device may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage means 608 into a random access memory (RAM) 603, to implement the speech transcription system evaluation method of the foregoing embodiments of the present application. While the electronic device is powered on, various programs and data necessary for its operation are also stored in the RAM 603. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 608 including, for example, a memory card, hard disk, etc.; and communication devices 609. The communication devices 609 may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. While fig. 9 shows an electronic device having various means, it should be understood that not all of the illustrated means are required to be implemented or provided; more or fewer means may alternatively be implemented or provided.
The embodiment of the application also provides a computer program product comprising computer-readable instructions which, when run on an electronic device, enable the electronic device to implement any of the speech transcription system evaluation methods provided by the embodiments of the present application.
The embodiment of the application also provides a computer-readable storage medium carrying one or more computer programs which, when executed by an electronic device, enable the electronic device to implement any of the speech transcription system evaluation methods provided by the embodiments of the present application.
It should be further noted that the above-described apparatus embodiments are merely illustrative; the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the application, the connection relations between modules indicate that they have communication connections, which may be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general-purpose hardware, or of course by means of special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, etc. Generally, functions performed by computer programs can easily be implemented by corresponding hardware, and the specific hardware structures for implementing the same function may vary, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software program implementation is the preferred embodiment in most cases. Based on such understanding, the technical solution of the present application, or the part contributing to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk of a computer, comprising several instructions for causing a computer device (which may be a personal computer, a training device, a network device, etc.) to perform the methods according to the embodiments of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) connection. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a training device or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
In the present specification, each embodiment is described in a progressive manner, each embodiment focusing on its differences from the other embodiments; the embodiments may be combined as needed, and the same or similar parts may be referred to each other.