
CN119149675B - A text error correction method, device, computer equipment and storage medium - Google Patents

A text error correction method, device, computer equipment and storage medium

Info

Publication number
CN119149675B
CN119149675B
Authority
CN
China
Prior art keywords
character
error correction
label
trained
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411651413.6A
Other languages
Chinese (zh)
Other versions
CN119149675A
Inventor
吴一超
蔡可妍
潘霖
卞豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Jiangshu Technology Co ltd
Original Assignee
Suzhou Jiangshu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Jiangshu Technology Co ltd
Priority to CN202411651413.6A
Publication of CN119149675A
Application granted
Publication of CN119149675B
Status: Active
Anticipated expiration

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/338Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract


The present invention provides a text error correction method, apparatus, computer equipment and storage medium, which relate to the field of computer technology. The method can determine a secondary truth label corresponding to a primary truth label of a training sample, and train a text error correction model based on the primary truth label and the secondary truth label, so that the probability of the secondary truth label of the training sample determined by the trained text error correction model as a true value result is less than the probability of the primary truth label as a true value result, and greater than the probability of a non-true value label as a true value result, thereby avoiding the overfitting disadvantage caused by a single correct label during training and reducing false detection.

Description

Text error correction method and device, computer equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular relates to a text error correction method, a text error correction device, computer equipment and a storage medium.
Background
Text error correction refers to the process of automatically detecting and correcting errors in text. These errors may include spelling errors, grammar errors, word errors, and the like. The text error correction technology plays an important role in the field of natural language processing, and is widely applied to the fields of document editing, search engines, voice recognition and the like so as to improve text quality and user experience.
Text correction is usually based on a pre-trained model. In the model's training data, each sample has a single, unique correct label, which makes the model prone to overfitting and leads to false detections.
Disclosure of Invention
The embodiment of the disclosure at least provides a text error correction method, a text error correction device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a text error correction method, including:
Constructing a text error correction model to be trained, and acquiring a plurality of training samples of the text error correction model, wherein the training samples are sentences comprising at least one error character, and the correct character corresponding to the error character is a preferred truth value label of the error character;
determining at least one secondary selection truth value tag matching the primary selection truth value tag of the training sample;
Training the text error correction model to be trained by using the training sample carrying the preferred truth value label and the sub-selection truth value label corresponding to the training sample to obtain a trained text error correction model, wherein the trained text error correction model enables the probability of the sub-selection truth value label of the training sample as a truth value result to be smaller than the probability of the preferred truth value label as the truth value result and larger than the probability of the non-truth value label as the truth value result;
and obtaining a text to be corrected, inputting the text to be corrected into the trained text correction model, and obtaining a correction result of the text to be corrected.
In an alternative embodiment, the determining at least one secondary selection truth value tag that matches the preferred truth value tag of the training samples includes:
For any error character, generating a first character sequence containing context information and an undetermined character based on the context information of the error character in the training sample sentence, wherein the undetermined character is a placeholder of the error character in the first character sequence;
searching a plurality of second character sequences matched with the first character sequences from a corpus data set, and determining the occurrence frequency corresponding to each second character sequence;
determining a third character sequence from the plurality of second character sequences according to the occurrence frequency;
and determining that the character matched with the undetermined character in the third character sequence is the sub-selection truth value label.
In an alternative embodiment, the generating the first character sequence containing the context information and the pending character based on the context information of the error character in the training sample sentence includes:
determining a first adjacent character before the position of the error character in the training sample sentence and a second adjacent character after the position of the error character in the training sample sentence;
And generating a first character sequence based on the first adjacent character and the second adjacent character, wherein the first character sequence comprises the first adjacent character, the undetermined character and the second adjacent character which are sequentially arranged.
In an alternative embodiment, the determining a third character sequence from the plurality of second character sequences according to the occurrence frequency includes:
Determining a fourth character sequence with the occurrence frequency greater than or equal to a preset frequency from the second character sequence;
sorting the fourth character sequence according to the occurrence frequency;
and determining the preset number of fourth character sequences with the highest occurrence frequencies as the third character sequences.
In an optional implementation manner, the training the text error correction model to be trained by using the training sample carrying the preferred truth value tag and the sub-selected truth value tag corresponding to the training sample to obtain a trained text error correction model includes:
inputting the training sample into the text error correction model to be trained to obtain the prediction results of a plurality of target characters corresponding to the error characters, which are determined by the text error correction model to be trained, wherein the prediction results represent the probability that the correct characters corresponding to the error characters are the target characters;
Determining a first loss according to a first difference value between probabilities corresponding to the secondary selection truth value label and the preferred truth value label in the prediction result;
Determining a second loss according to a second difference value between probabilities corresponding to the secondary selection truth value labels and probabilities corresponding to other target characters except the secondary selection truth value labels and the primary selection truth value labels in the prediction result;
determining a third loss based on a sum of the first loss and the second loss;
and training the text error correction model to be trained by utilizing the third loss to obtain a trained text error correction model.
In an alternative embodiment, the determining the first loss according to the first difference between the probabilities corresponding to the sub-selection truth labels and the probabilities corresponding to the preferred truth labels in the prediction result includes:
For the probability corresponding to each sub-selection truth value label, determining a first correction value based on the sum of the first difference value corresponding to that probability and a first correction parameter;
determining a second correction value based on a product between the first correction value and a second correction parameter;
and determining the first loss based on the sum of the second correction values of the probabilities corresponding to the sub-selection truth value labels.
In an alternative embodiment, the determining the second loss according to the second difference between probabilities corresponding to the sub-selection truth value label and probabilities corresponding to other target characters except the sub-selection truth value label and the preferred truth value label in the prediction result includes:
For the probability corresponding to each other target character, determining a third correction value based on the sum of the second difference value corresponding to that probability and a third correction parameter;
Determining a fourth correction value based on a product between the third correction value and a fourth correction parameter;
And determining the second loss based on the sum of fourth correction values of probabilities corresponding to the other target characters.
In an optional implementation manner, the training the text error correction model to be trained by using the third loss to obtain a trained text error correction model includes:
And adjusting parameters of the text error correction model to be trained to enable the third loss to be lower than or equal to a target loss threshold value, and obtaining the trained text error correction model.
In a second aspect, an embodiment of the present disclosure further provides a text error correction apparatus, including:
The system comprises a construction module, a text error correction module and a judgment module, wherein the construction module is used for constructing a text error correction model to be trained, and acquiring a plurality of training samples of the text error correction model, wherein the training samples are sentences comprising at least one error character, and the correct character corresponding to the error character is a preferred truth value label of the error character;
a determining module for determining at least one secondary selection truth value tag matching the preferred truth value tag of the training sample;
The training module is used for training the text error correction model to be trained by using the training sample carrying the preferred truth value label and the sub-selection truth value label corresponding to the training sample to obtain a trained text error correction model, wherein the trained text error correction model enables the probability of the sub-selection truth value label of the training sample as a truth value result to be smaller than the probability of the preferred truth value label as the truth value result and larger than the probability of the non-truth value label as the truth value result;
The error correction module is used for acquiring a text to be corrected, inputting the text to be corrected into the trained text error correction model, and obtaining an error correction result of the text to be corrected.
In an alternative embodiment, the determining module is specifically configured to:
For any error character, generating a first character sequence containing context information and an undetermined character based on the context information of the error character in the training sample sentence, wherein the undetermined character is a placeholder of the error character in the first character sequence;
searching a plurality of second character sequences matched with the first character sequences from a corpus data set, and determining the occurrence frequency corresponding to each second character sequence;
determining a third character sequence from the plurality of second character sequences according to the occurrence frequency;
and determining that the character matched with the undetermined character in the third character sequence is the sub-selection truth value label.
In an alternative embodiment, the determining module is specifically configured to:
determining a first adjacent character before the position of the error character in the training sample sentence and a second adjacent character after the position of the error character in the training sample sentence;
And generating a first character sequence based on the first adjacent character and the second adjacent character, wherein the first character sequence comprises the first adjacent character, the undetermined character and the second adjacent character which are sequentially arranged.
In an alternative embodiment, the determining module is specifically configured to:
Determining a fourth character sequence with the occurrence frequency greater than or equal to a preset frequency from the second character sequence;
sorting the fourth character sequence according to the occurrence frequency;
and determining the preset number of fourth character sequences with the highest occurrence frequencies as the third character sequences.
In an alternative embodiment, the training module is specifically configured to:
inputting the training sample into the text error correction model to be trained to obtain the prediction results of a plurality of target characters corresponding to the error characters, which are determined by the text error correction model to be trained, wherein the prediction results represent the probability that the correct characters corresponding to the error characters are the target characters;
Determining a first loss according to a first difference value between probabilities corresponding to the secondary selection truth value label and the preferred truth value label in the prediction result;
Determining a second loss according to a second difference value between probabilities corresponding to the secondary selection truth value labels and probabilities corresponding to other target characters except the secondary selection truth value labels and the primary selection truth value labels in the prediction result;
determining a third loss based on a sum of the first loss and the second loss;
and training the text error correction model to be trained by utilizing the third loss to obtain a trained text error correction model.
In an alternative embodiment, the training module is specifically configured to:
For the probability corresponding to each sub-selection truth value label, determining a first correction value based on the sum of the first difference value corresponding to that probability and a first correction parameter;
determining a second correction value based on a product between the first correction value and a second correction parameter;
and determining the first loss based on the sum of the second correction values of the probabilities corresponding to the sub-selection truth value labels.
In an alternative embodiment, the training module is specifically configured to:
For the probability corresponding to each other target character, determining a third correction value based on the sum of the second difference value corresponding to that probability and a third correction parameter;
Determining a fourth correction value based on a product between the third correction value and a fourth correction parameter;
And determining the second loss based on the sum of fourth correction values of probabilities corresponding to the other target characters.
In an alternative embodiment, the training module is specifically configured to:
And adjusting parameters of the text error correction model to be trained to enable the third loss to be lower than or equal to a target loss threshold value, and obtaining the trained text error correction model.
In a third aspect, an optional implementation manner of the disclosure further provides a computer device, including a processor and a memory, where the memory stores machine-readable instructions executable by the processor, and the processor is configured to execute the machine-readable instructions stored in the memory; when the machine-readable instructions are executed by the processor, the processor performs the steps in the first aspect, or any possible implementation manner of the first aspect.
In a fourth aspect, an alternative implementation of the present disclosure further provides a computer readable storage medium having stored thereon a computer program which when executed performs the steps of the first aspect, or any of the possible implementation manners of the first aspect.
For the description of the effects of the text error correction apparatus, the computer device, and the computer-readable storage medium, reference is made to the description of the text error correction method above, which is not repeated here.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the aspects of the disclosure.
According to the text error correction method, device, computer equipment and storage medium, the secondary selection truth value label corresponding to the preferred truth value label of the training sample can be determined, and the text error correction model is trained based on the preferred truth value label and the secondary selection truth value label, so that the probability that the secondary selection truth value label of the training sample is used as the truth value result and is smaller than the probability that the preferred truth value label is used as the truth value result and is larger than the probability that the non-truth value label is used as the truth value result, the defect of overfitting caused by single correct label in training is avoided, and error detection is reduced.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the embodiments are briefly described below; these drawings are incorporated in and constitute a part of the specification, illustrate embodiments consistent with the present disclosure, and together with the description serve to explain the technical solutions of the present disclosure. It is to be understood that the following drawings illustrate only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope; a person of ordinary skill in the art may obtain other related drawings from them without inventive effort.
FIG. 1 illustrates a flow chart of a text error correction method provided by some embodiments of the present disclosure;
FIG. 2 illustrates a schematic diagram of a text error correction apparatus provided by some embodiments of the present disclosure;
fig. 3 illustrates a schematic diagram of a computer device provided by some embodiments of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the disclosed embodiments generally described and illustrated herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
According to research, in the process of training a text error correction model, each training sample has a unique truth label, which is inconsistent with actual language use, so the model tends to overfit and produce false alarms in actual deployment. For example, in the sample sentence "not matching the actual condition", the truth label corresponding to the erroneous character "brother" (a character visually or phonetically close to the correct one) is "condition", but in the actual language environment there are other correct answers, such as "situation" or "scenario". In general, a deep spelling-correction model sets only "condition" as the truth label during training; after training, such a model very easily "corrects" valid uses of "situation" and "scenario" into "condition", leading to false-positive error detections.
Based on the above study, the disclosure provides a text error correction method, a device, a computer device and a storage medium, which can determine a subselect truth value tag corresponding to a first-choice truth value tag of a training sample, and train a text error correction model based on the first-choice truth value tag and the subselect truth value tag, so that the probability that the subselect truth value tag of the training sample determined by the trained text error correction model is used as a truth value result is smaller than the probability that the first-choice truth value tag is used as the truth value result and is larger than the probability that a non-truth value tag is used as the truth value result, thereby avoiding the defect of overfitting caused by single correct tag during training and reducing false detection.
The above defects are results obtained by the inventors through practice and careful study; therefore, the discovery of the above problems, and the solutions to them proposed below, should all be regarded as the inventors' contributions to the present disclosure.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
For the sake of understanding the present embodiment, a text error correction method disclosed in the embodiments of the present disclosure is first described in detail. The execution subject of the text error correction method provided in the embodiments of the present disclosure is generally a computer device with certain computing capability, for example a terminal device, a server, or another processing device; the terminal device may be a user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like. In some possible implementations, the text error correction method may be implemented by a processor invoking computer-readable instructions stored in a memory.
The text error correction method provided by the embodiment of the present disclosure is described below by taking an execution body as a terminal device as an example.
Referring to fig. 1, a flowchart of a text error correction method according to an embodiment of the disclosure is shown, where the method includes steps S101 to S104, where:
s101, constructing a text error correction model to be trained, and acquiring a plurality of training samples of the text error correction model, wherein the training samples are sentences comprising at least one error character, and the correct character corresponding to the error character is a preferred truth value label of the error character.
In this step, a text error correction model may be constructed. The model may be a conventional machine learning model, such as a Hidden Markov Model (HMM) or a Support Vector Machine (SVM), or a deep learning model, such as a Long Short-Term Memory network (LSTM), a Transformer model, or Bidirectional Encoder Representations from Transformers (BERT).
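For illustration only, a minimal sketch of such a model in Python follows; the disclosure does not prescribe a specific architecture, so the BERT backbone, the pretrained model name and the classifier head below are assumptions:

```python
import torch.nn as nn
from transformers import BertModel

class CorrectionModel(nn.Module):
    """Sketch: encoder plus a per-position classifier over the vocabulary."""

    def __init__(self, pretrained_name: str = "bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(pretrained_name)
        # One logit per vocabulary character at every position of the text.
        self.classifier = nn.Linear(self.encoder.config.hidden_size,
                                    self.encoder.config.vocab_size)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask)
        return self.classifier(hidden.last_hidden_state)  # (batch, seq, vocab)
```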
In general, a deep learning error correction model may use softmax and cross entropy to calculate the loss, and training is then realized by backward gradient propagation to update the parameters of the model.
For example, for the sample sentence "not matching the actual condition", assume that in the dictionary file the code id of "condition" is $t$, the code id of "shape" is $j$, and the score of id $t$ in the predicted logits for the "brother" character (the model's predicted probability values that the correct character corresponding to the error character is each target character) is $z_t$. The loss actually calculated is:

$$\mathrm{loss} = -\log \frac{e^{z_t}}{\sum_{i} e^{z_i}}$$

wherein $i$ indexes the predicted characters corresponding to "brother" and $z_i$ is the score of the $i$-th predicted character. Under this loss, the training goal of the model is to make $z_t$ as large as possible, without considering that the character with id $j$ is equally feasible in the sample sentence, with the result that correct text such as "situation" may also be erroneously corrected to "condition".
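As a concrete illustration of this standard objective (the vocabulary size, the dictionary id and the PyTorch usage below are assumptions invented for the example, not part of the disclosure):

```python
import torch
import torch.nn.functional as F

# Scores z_i over the candidate characters at the error position.
logits = torch.randn(21128)          # assumed vocabulary size
t = 3341                             # hypothetical dictionary id of "condition"

# Softmax cross-entropy: loss = -log( exp(z_t) / sum_i exp(z_i) ).
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([t]))

# Minimizing this pushes z_t above *all* other ids, including ids of
# characters such as "situation" that are also valid in this context.
```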
Therefore, the embodiments of the disclosure provide the concepts of the preferred truth value label and the sub-selection truth value label, and train the text error correction model with both labels, so that the model is prevented from overfitting and the probability of false detection is reduced.
The preferred truth value label may be the label that best fits the wrong character in the training sample; for example, in the training sample "not matching the actual condition" (written with the erroneous "brother" character), the preferred truth value label corresponding to "brother" is determined to be "condition".
For example, the preferred truth label of the wrong character in the training sample may be determined according to the context information in the training sample and the shape, pronunciation, meaning characteristics, and the like of the wrong character.
The training sample can be a text, and can be obtained by deforming a correct text or collecting an error sample.
Alternatively, massive public text data may be counted, and the preferred truth value label of the wrong character determined according to the amount of public data matching the character segments in the training sample.
S102, determining at least one secondary selection truth value label matched with the preferred truth value label of the training sample.
In this step, at least one sub-selection truth value label corresponding to the error character in the training sample may be determined. The number of sub-selection truth value labels corresponding to each error character may be determined according to the actual situation; exemplary values are 2, 3 and 5.
In determining the sub-selection value tags, the determination may be based on a pre-deployed corpus data set. The corpus data set can be deployed according to the text type of which error correction is required. For example, if correction is required for a chinese text, a chinese corpus dataset may be deployed, if correction is required for an english text, an english corpus dataset may be deployed, and if correction is required for a chinese-english mixed text, a chinese-english mixed corpus dataset may be deployed.
The corpus data set can be generated based on massive correct texts, such as novels, papers, messages and the like.
When determining the sub-selection truth value labels, the error character in the training sample is determined first, the context information of the error character in the training sample sentence is determined, and a first character sequence containing the context information and an undetermined character is generated from the context information. The first character sequence may represent the format of the segment of the training sample sentence that contains the error character. The undetermined character may be a placeholder for the error character in the first character sequence; when searching the corpus data set for second character sequences matching the first character sequence, the placeholder can be matched with any character.
After the first character sequence is obtained, a second character sequence matched with the first character sequence can be searched in the corpus data set, and when a second character sequence is searched, the searched second character sequence can be recorded and counted to obtain the occurrence frequency corresponding to the second character sequence.
And then, determining a third character sequence according to the occurrence frequency of each second character sequence, and taking the characters matched with the undetermined characters in the third character sequence as sub-selection truth value labels.
For example, the second character sequence having the frequency higher than the preset frequency may be determined as the third character sequence.
Specifically, the context information may include a first adjacent character before the location of the error character in the training sample sentence, and a second adjacent character after the location of the error character in the training sample sentence. The number of characters of the first adjacent character and the second adjacent character may be determined according to actual situations, for example, may be determined according to the number of characters of the phrase corresponding to the wrong character.
For example, in the training sample sentence "not conforming to the actual condition", the phrase corresponding to the wrong character "brother" is "condition", whose length is 2 characters; the number of wrong characters is 1, so the number of remaining correct characters is 1, and the first adjacent character and the second adjacent character may then each have a length of 1.
Alternatively, in the above example, since the length of "condition" is 2 characters, the length of the first character sequence may be determined to be 3 or 4 characters.
In one possible implementation, the first and second character sequences may be extracted using n-grams. For example, in the above "not matching the actual condition" example, if the first adjacent character and the second adjacent character each have a length of 1, the first character sequence extracted with a 3-gram method may be "condition X not", where "condition" is the first adjacent character, "not" is the second adjacent character, and "X" is the undetermined character. The characters in the first character sequence may be arranged in the order: first adjacent character, undetermined character, second adjacent character.
When an adjacent character comprises a plurality of characters, the order of the characters within it may be kept unchanged; for example, if the first adjacent character is "actual condition" and the second adjacent character is "disagreeable", the first character sequence may be "actual condition X disagreeable".
When the first character sequence is "condition X not", the second character sequences extracted from the corpus data set with 3-grams may include "situation not", "scenario not", and the like.
When determining the third character sequences, fourth character sequences whose occurrence frequencies are greater than or equal to a preset frequency are first determined from the second character sequences; the fourth character sequences are then sorted by occurrence frequency, and the preset number of fourth character sequences with the highest occurrence frequencies are determined as the third character sequences.
For example, suppose the second character sequences "condition not" (the sequence corresponding to the preferred truth value label), "situation not", "scenario not" and "please-view not" have occurrence frequencies 1506021, 1305055, 1131768 and 983579, respectively, the preset frequency is 1000000, and the preset number is 2. "Condition not", being the second character sequence corresponding to the preferred truth value label, is first removed from the second character sequences, and "please-view not", whose occurrence frequency is lower than 1000000, is also removed, yielding the fourth character sequences "situation not" and "scenario not"; the 2 fourth character sequences with the highest occurrence frequencies, "situation not" and "scenario not", are then determined as the third character sequences.
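The candidate-extraction procedure of step S102 can be summarized in a short sketch; the helper name, the pre-counted 3-gram store and the default thresholds are illustrative assumptions rather than the patent's prescribed implementation:

```python
from collections import Counter

def find_secondary_labels(ngram_counts: Counter, left: str, right: str,
                          preferred: str, preset_freq: int = 1_000_000,
                          preset_num: int = 2) -> list:
    """Return sub-selection truth labels for one error character, given
    corpus 3-gram counts and a first character sequence "left X right"."""
    candidates = []
    for seq, freq in ngram_counts.items():
        # A second character sequence matches when its first and last
        # characters equal the adjacent characters; the middle position
        # plays the role of the undetermined character "X".
        if len(seq) == 3 and seq[0] == left and seq[2] == right:
            if seq[1] != preferred and freq >= preset_freq:
                candidates.append((seq[1], freq))
    # Third character sequences: the preset number with highest frequency.
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [ch for ch, _ in candidates[:preset_num]]
```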
And S103, training the text error correction model to be trained by using the training sample carrying the preferred truth value label and the sub-selection truth value label corresponding to the training sample to obtain a trained text error correction model, wherein the trained text error correction model enables the probability of the sub-selection truth value label of the training sample as a truth value result to be smaller than the probability of the preferred truth value label as the truth value result and larger than the probability of the non-truth value label as the truth value result.
In the step, a training sample can be input into a text error correction model to be trained, and through the text error correction model, the prediction results of a plurality of target characters corresponding to the error characters can be determined, and the prediction results can represent the probability that the correct characters corresponding to the error characters are the target characters.
When the text error correction model works, a plurality of target characters can be determined first, then the probability that the correct character corresponding to the error character is the target character is determined, error correction is carried out according to the determined probability, and the correct character corresponding to the error character is judged.
After the prediction results of the target characters corresponding to the error characters are obtained, the probability corresponding to the sub-selection truth value label (the target characters contain the sub-selection truth value label) can be determined from the prediction results, and the first loss is determined according to the first difference value between the probability corresponding to the sub-selection truth value label and the probability corresponding to the preferred truth value label.
Meanwhile, the probabilities corresponding to the other target characters (those other than the sub-selection truth value labels and the preferred truth value label) in the prediction result can be determined, a second difference value between those probabilities and the probabilities corresponding to the sub-selection truth value labels is determined, and a second loss is determined according to the second difference value.
Thereafter, a third loss may be determined based on the sum of the first loss and the second loss.
And finally, training the text error correction model to be trained by utilizing the third loss to obtain a trained text error correction model.
In training with the third loss, the training target may be to minimize the value of the third loss. Since the third loss is determined based on the sum of the first loss and the second loss, the training goal is equivalent to minimizing the sum of the first loss and the second loss, i.e., minimizing the first difference and the second difference as much as possible.
Therefore, when the text error correction model obtained through training works, the probability of the sub-selection truth value label of a training sample as the truth value result is smaller than that of the preferred truth value label and larger than that of the non-truth value labels. The preferred truth value label is preferentially taken as the truth value result, while the sub-selection truth value labels carry a higher importance than the other non-truth value labels, which avoids overfitting and reduces the probability of false detection.
Specifically, the first loss may be determined according to the following steps:
For the probability corresponding to each sub-selection truth value label, determining a first correction value based on the sum of the first difference value corresponding to that probability and a first correction parameter;
determining a second correction value based on a product between the first correction value and a second correction parameter;
and determining the first loss based on the sum of the second correction values of the probabilities corresponding to the sub-selection truth value labels.
In this step, a first correction parameter and a second correction parameter may be introduced to correct the first difference. The first correction parameter can scale the first difference value so as to adjust the importance degree difference between the first selection truth value label and the second selection truth value label, and the second correction parameter can amplify or reduce the first difference value so as to enable the first loss to be closer to the actual needed state.
Accordingly, a third correction parameter and a fourth correction parameter may also be introduced, and the second loss may be determined through the following steps:
For the probability corresponding to each other target character, determining a third correction value based on the sum of the second difference value corresponding to that probability and a third correction parameter;
Determining a fourth correction value based on a product between the third correction value and a fourth correction parameter;
And determining the second loss based on the sum of fourth correction values of probabilities corresponding to the other target characters.
In this step, the third correction parameter can scale the second difference value so as to adjust the difference in importance between the sub-selection truth value labels and the other target characters, and the fourth correction parameter can amplify or reduce the second difference value so that the second loss is closer to the actually required state.
By way of example, the third loss may be determined by the following formula:

$$\mathrm{loss} = \sum_{j \in S} \beta \cdot \max\left(0,\; p_j - p_t + \alpha\right) + \sum_{j \in S} \sum_{i \in O} \delta \cdot \max\left(0,\; p_i - p_j + \gamma\right)$$

wherein loss is the third loss; $p_j$ is the probability corresponding to the $j$-th sub-selection truth value label; $p_t$ is the probability corresponding to the preferred truth value label; $p_j - p_t + \alpha$ is the first correction value, with $\alpha$ the first correction parameter; $\beta$ is the second correction parameter, and $\beta \cdot \max(0,\, p_j - p_t + \alpha)$ is the second correction value; $S$ is the set corresponding to the sub-selection truth value labels; $O$ is the set of other target characters; $p_i$ is the probability corresponding to the $i$-th other target character; $p_i - p_j + \gamma$ is the third correction value, with $\gamma$ the third correction parameter; and $\delta$ is the fourth correction parameter, and $\delta \cdot \max(0,\, p_i - p_j + \gamma)$ is the fourth correction value.
The values of the first, second, third and fourth correction parameters can be determined according to actual conditions. For example, the first correction parameter may be 0.01, the second correction parameter may be 0.9, the third correction parameter may be 0.02, and the fourth correction parameter may be 1.1.
The learning objective of the third loss may be to make the score (i.e., probability) of the preferred truth value label greater than the scores of the sub-selection truth value labels, which in turn are greater than the scores of the other incorrect labels (i.e., the other target characters). Meanwhile, the relative importance of the different learning objectives during model training is adjusted through the hyperparameters (namely the first, second, third and fourth correction parameters).
In the training process, the third loss can be reduced by adjusting the model parameters, and when the third loss is reduced to a certain degree (for example, lower than a preset threshold value), or when the third loss converges (the third loss is basically unchanged after the parameters are adjusted), the training can be stopped, and a trained text error correction model is obtained.
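A non-authoritative sketch of this training loss follows, assuming the hinged form of the formula above and PyTorch tensors; the function name, tensor layout and the max(0, ·) hinge are assumptions layered on the patent's description:

```python
import torch

def third_loss(probs: torch.Tensor, t: int, sub_ids: list,
               alpha: float = 0.01, beta: float = 0.9,
               gamma: float = 0.02, delta: float = 1.1) -> torch.Tensor:
    """probs: predicted probabilities over target characters at one error
    position; t: id of the preferred truth label; sub_ids: ids of the
    sub-selection truth labels."""
    p_t = probs[t]
    p_sub = probs[sub_ids]                                   # (K,)
    other = torch.ones_like(probs, dtype=torch.bool)
    other[[t] + list(sub_ids)] = False
    p_other = probs[other]                                   # (M,)

    # First loss: each sub-label should score below the preferred label
    # by at least the margin alpha; beta weights this term.
    first = beta * torch.clamp(p_sub - p_t + alpha, min=0).sum()

    # Second loss: every other character should score below every
    # sub-label by at least the margin gamma; delta weights this term.
    second = delta * torch.clamp(
        p_other.unsqueeze(0) - p_sub.unsqueeze(1) + gamma, min=0).sum()

    return first + second  # the third loss
```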
S104, acquiring a text to be corrected, and inputting the text to be corrected into the trained text correction model to obtain a correction result of the text to be corrected.
After the training of the text correction model is completed, text error correction can be performed using the trained model: the text to be corrected is input into the trained text correction model, and a correction result output by the model is obtained. The correction result can comprise the corrected text and/or the correct characters corresponding to the error characters in the text to be corrected.
Therefore, the trained text error correction model reduces the probability of overfitting as far as possible while raising the importance of the sub-selection truth value labels: it can still correct the erroneous "brother" character into "condition" in a text to be corrected that reads "not matching the actual condition" written with the wrong character, yet it will not correct "situation" into "condition" in a text that reads "not matching the actual situation", because the importance of "situation" was increased during training, thereby reducing the probability of false detection.
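Purely as a usage illustration of this step (the tokenizer interface shown is a Hugging-Face-style assumption reusing the CorrectionModel sketch above, not something specified by the disclosure):

```python
def correct(model, tokenizer, text: str) -> str:
    # Encode the text to be corrected, run the trained correction model,
    # and take the highest-probability character at every position.
    enc = tokenizer(text, return_tensors="pt")
    logits = model(enc["input_ids"], enc.get("attention_mask"))
    ids = logits.argmax(dim=-1)[0]
    return tokenizer.decode(ids, skip_special_tokens=True)
```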
According to the text error correction method provided by the embodiment of the disclosure, the secondary selection truth value label corresponding to the preferred truth value label of the training sample can be determined, and the text error correction model is trained based on the preferred truth value label and the secondary selection truth value label, so that the probability of the secondary selection truth value label of the training sample determined by the trained text error correction model as a truth value result is smaller than that of the preferred truth value label as the truth value result and is larger than that of the non-truth value label as the truth value result, the defect of overfitting caused by single correct label during training is avoided, and error detection is reduced.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
Based on the same inventive concept, the embodiments of the present disclosure further provide a text error correction device corresponding to the text error correction method, and since the principle of solving the problem by the device in the embodiments of the present disclosure is similar to that of the text error correction method in the embodiments of the present disclosure, the implementation of the device may refer to the implementation of the method, and the repetition is omitted.
Referring to fig. 2, a schematic diagram of a text error correction apparatus according to an embodiment of the disclosure is shown, where the apparatus includes:
A building module 210, configured to build a text error correction model to be trained, and obtain a plurality of training samples of the text error correction model, where the training samples are sentences including at least one error character, and correct characters corresponding to the error character are preferred truth labels of the error character;
a determining module 220 for determining at least one secondary selection truth value tag matching the preferred truth value tag of the training samples;
The training module 230 is configured to train the text error correction model to be trained by using the training sample carrying the preferred truth label and the sub-selection truth label corresponding to the training sample to obtain a trained text error correction model, where the trained text error correction model makes the probability of the sub-selection truth label of the training sample as a truth result smaller than the probability of the preferred truth label as a truth result and greater than the probability of the non-truth label as a truth result;
And the correction module 240 is configured to obtain a text to be corrected, input the text to be corrected to the trained text correction model, and obtain a correction result of the text to be corrected.
According to the text error correction device provided by the embodiment of the disclosure, the secondary selection truth value label corresponding to the preferred truth value label of the training sample can be determined, and the text error correction model is trained based on the preferred truth value label and the secondary selection truth value label, so that the probability of the secondary selection truth value label of the training sample determined by the trained text error correction model as a truth value result is smaller than that of the preferred truth value label as the truth value result and is larger than that of the non-truth value label as the truth value result, the defect of overfitting caused by single correct label during training is avoided, and false detection is reduced.
In an alternative embodiment, the determining module 220 is specifically configured to:
For any error character, generating a first character sequence containing context information and an undetermined character based on the context information of the error character in the training sample sentence, wherein the undetermined character is a placeholder of the error character in the first character sequence;
searching a plurality of second character sequences matched with the first character sequences from a corpus data set, and determining the occurrence frequency corresponding to each second character sequence;
determining a third character sequence from the plurality of second character sequences according to the occurrence frequency;
and determining that the character matched with the undetermined character in the third character sequence is the sub-selection truth value label.
In an alternative embodiment, the determining module 220 is specifically configured to:
determining a first adjacent character before the position of the error character in the training sample sentence and a second adjacent character after the position of the error character in the training sample sentence;
And generating a first character sequence based on the first adjacent character and the second adjacent character, wherein the first character sequence comprises the first adjacent character, the undetermined character and the second adjacent character which are sequentially arranged.
In an alternative embodiment, the determining module 220 is specifically configured to:
Determining a fourth character sequence with the occurrence frequency greater than or equal to a preset frequency from the second character sequence;
sorting the fourth character sequence according to the occurrence frequency;
and determining the preset number of fourth character sequences with the highest occurrence frequencies as the third character sequences.
In an alternative embodiment, the training module 230 is specifically configured to:
inputting the training sample into the text error correction model to be trained to obtain the prediction results of a plurality of target characters corresponding to the error characters, which are determined by the text error correction model to be trained, wherein the prediction results represent the probability that the correct characters corresponding to the error characters are the target characters;
Determining a first loss according to a first difference value between probabilities corresponding to the secondary selection truth value label and the preferred truth value label in the prediction result;
Determining a second loss according to a second difference value between probabilities corresponding to the secondary selection truth value labels and probabilities corresponding to other target characters except the secondary selection truth value labels and the primary selection truth value labels in the prediction result;
determining a third loss based on a sum of the first loss and the second loss;
and training the text error correction model to be trained by utilizing the third loss to obtain a trained text error correction model.
In an alternative embodiment, the training module 230 is specifically configured to:
For the probability corresponding to each sub-selection truth value label, determining a first correction value based on the sum of the first difference value corresponding to that probability and a first correction parameter;
determining a second correction value based on a product between the first correction value and a second correction parameter;
and determining the first loss based on the sum of the second correction values of the probabilities corresponding to the sub-selection truth value labels.
In an alternative embodiment, the training module 230 is specifically configured to:
For the probability corresponding to each other target character, determining a third correction value based on the sum of the second difference value corresponding to that probability and a third correction parameter;
Determining a fourth correction value based on a product between the third correction value and a fourth correction parameter;
And determining the second loss based on the sum of fourth correction values of probabilities corresponding to the other target characters.
In an alternative embodiment, the training module 230 is specifically configured to:
And adjusting parameters of the text error correction model to be trained to enable the third loss to be lower than or equal to a target loss threshold value, and obtaining the trained text error correction model.
The process flow of each module in the apparatus and the interaction flow between the modules may be described with reference to the related descriptions in the above method embodiments, which are not described in detail herein.
The embodiment of the disclosure further provides a computer device, as shown in fig. 3, which is a schematic structural diagram of the computer device provided by the embodiment of the disclosure, including:
a processor 31 and a memory 32, where the memory 32 stores machine-readable instructions executable by the processor 31, and the processor 31 is configured to execute the machine-readable instructions stored in the memory 32; when the machine-readable instructions are executed by the processor 31, the processor 31 performs the following steps:
Constructing a text error correction model to be trained, and acquiring a plurality of training samples of the text error correction model, wherein the training samples are sentences comprising at least one error character, and the correct character corresponding to the error character is a preferred truth value label of the error character;
determining at least one secondary selection truth value tag matching the primary selection truth value tag of the training sample;
Training the text error correction model to be trained by using the training sample carrying the preferred truth value label and the sub-selection truth value label corresponding to the training sample to obtain a trained text error correction model, wherein the trained text error correction model enables the probability of the sub-selection truth value label of the training sample as a truth value result to be smaller than the probability of the preferred truth value label as the truth value result and larger than the probability of the non-truth value label as the truth value result;
and obtaining a text to be corrected, inputting the text to be corrected into the trained text correction model, and obtaining a correction result of the text to be corrected.
In an alternative embodiment, in the instructions executed by the processor 31, the determining at least one secondary selection truth value tag that matches the preferred truth value tag of the training sample includes:
For any error character, generating a first character sequence containing context information and an undetermined character based on the context information of the error character in the training sample sentence, wherein the undetermined character is a placeholder of the error character in the first character sequence;
searching a plurality of second character sequences matched with the first character sequences from a corpus data set, and determining the occurrence frequency corresponding to each second character sequence;
determining a third character sequence from the plurality of second character sequences according to the occurrence frequency;
and determining that the character matched with the undetermined character in the third character sequence is the sub-selection value label.
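For illustration, the corpus search can be sketched as a plain substring scan, assuming the corpus data set is represented as a list of sentences and the undetermined character is a single placeholder symbol; the function name, data representation, and example corpus below are assumptions of this sketch, not part of the disclosed embodiment.

```python
from collections import Counter

def count_second_sequences(first_seq, corpus, placeholder="□"):
    """Count each corpus substring that matches first_seq with the
    placeholder filled by exactly one character; every such substring
    is one second character sequence, mapped to its occurrence frequency."""
    left, right = first_seq.split(placeholder)
    seq_len = len(left) + 1 + len(right)
    counts = Counter()
    for line in corpus:
        for i in range(len(line) - seq_len + 1):
            chunk = line[i:i + seq_len]
            if chunk.startswith(left) and chunk.endswith(right):
                counts[chunk] += 1
    return counts

# count_second_sequences("想□水", ["我想喝水", "他想喝水", "想河水远"])
# -> Counter({"想喝水": 2, "想河水": 1})
```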
In an alternative embodiment, in the instructions executed by the processor 31, generating the first character sequence containing the context information and the undetermined character, based on the context information of the erroneous character in the training sample sentence, includes:
determining a first adjacent character preceding the position of the erroneous character in the training sample sentence, and a second adjacent character following that position;
and generating the first character sequence based on the first adjacent character and the second adjacent character, where the first character sequence includes, in order, the first adjacent character, the undetermined character, and the second adjacent character.
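A minimal sketch of this construction follows, assuming one adjacent character on each side and an arbitrary placeholder symbol; both choices belong to this illustration rather than to the embodiment itself.

```python
PLACEHOLDER = "□"  # any symbol outside the character set can serve as the undetermined character

def build_first_sequence(sentence, err_pos):
    """Return first adjacent character + placeholder + second adjacent
    character; err_pos is assumed not to sit on a sentence boundary."""
    first_adjacent = sentence[err_pos - 1]
    second_adjacent = sentence[err_pos + 1]
    return first_adjacent + PLACEHOLDER + second_adjacent

# build_first_sequence("我想核水", 2) -> "想□水"
```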
In an alternative embodiment, in the instructions executed by the processor 31, determining a third character sequence from the plurality of second character sequences according to the occurrence frequencies includes:
determining, from the second character sequences, the fourth character sequences whose occurrence frequency is greater than or equal to a preset frequency;
sorting the fourth character sequences according to their occurrence frequencies;
and determining the preset number of fourth character sequences with the highest occurrence frequencies as the third character sequences.
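Given the frequency counts from the search step, the selection can be sketched directly from these three sub-steps; the parameter names and default values are illustrative.

```python
def select_third_sequences(counts, preset_freq=2, preset_number=3):
    """Keep sequences whose frequency is at least preset_freq (the
    fourth character sequences), sort them by frequency, and return
    the preset_number most frequent as the third character sequences."""
    fourth = [(seq, n) for seq, n in counts.items() if n >= preset_freq]
    fourth.sort(key=lambda item: item[1], reverse=True)
    return [seq for seq, _ in fourth[:preset_number]]

# With the counts above: select_third_sequences({"想喝水": 2, "想河水": 1}, 1, 1) -> ["想喝水"]
```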
In an alternative embodiment, in the instructions executed by the processor 31, training the text error correction model to be trained, using the training samples carrying the preferred truth value labels and the corresponding secondary truth value labels, to obtain a trained text error correction model includes:
inputting the training sample into the text error correction model to be trained, to obtain prediction results, determined by the model, for a plurality of target characters corresponding to the erroneous character, where a prediction result represents the probability that the correct character corresponding to the erroneous character is the given target character;
determining a first loss according to the first difference between the probability corresponding to each secondary truth value label and the probability corresponding to the preferred truth value label in the prediction results;
determining a second loss according to the second difference between the probabilities corresponding to target characters other than the secondary truth value labels and the preferred truth value label, and the probabilities corresponding to the secondary truth value labels, in the prediction results;
determining a third loss as the sum of the first loss and the second loss;
and training the text error correction model to be trained using the third loss, to obtain the trained text error correction model.
In an alternative embodiment, in the instructions executed by the processor 31, determining the first loss according to the first difference between the probabilities corresponding to the secondary truth value labels and the preferred truth value label in the prediction results includes:
determining, for the probability corresponding to each secondary truth value label, a first correction value for that probability based on the sum of its first difference and a first correction parameter;
determining a second correction value based on a product between the first correction value and a second correction parameter;
and determining the first loss based on the sum of the second correction values of the probabilities corresponding to the secondary truth value labels.
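Read literally, this gives a margin-style penalty on each secondary truth value label. The sketch below takes the first difference as the secondary probability minus the preferred probability; the parameter values, and the absence of a hinge (a max with zero, which often accompanies such margins but is not stated here), are assumptions of this illustration.

```python
def first_loss(probs, preferred, secondary, m1=0.1, w1=1.0):
    """m1 plays the role of the first correction parameter and w1 the
    second correction parameter; both defaults are illustrative guesses."""
    total = 0.0
    for label in secondary:
        first_diff = probs[label] - probs[preferred]  # first difference
        first_corr = first_diff + m1                  # first correction value
        total += w1 * first_corr                      # second correction value, summed
    return total
```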
In an alternative embodiment, in the instructions executed by the processor 31, determining the second loss according to the second difference between the probabilities corresponding to the target characters other than the secondary truth value labels and the preferred truth value label, and the probabilities corresponding to the secondary truth value labels, includes:
determining, for the probability corresponding to each of the other target characters, a third correction value for that target character based on the sum of the second difference corresponding to that probability and a third correction parameter;
determining a fourth correction value based on a product between the third correction value and a fourth correction parameter;
and determining the second loss based on the sum of the fourth correction values of the probabilities corresponding to the other target characters.
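The second loss mirrors the first. One ambiguity is how a single second difference is formed when there are several secondary truth value labels; the sketch below compares each remaining target character against every secondary truth value label, which is one plausible reading rather than the only one.

```python
def second_loss(probs, preferred, secondary, m2=0.1, w2=1.0):
    """m2 stands in for the third correction parameter and w2 for the
    fourth; as with first_loss, the default values are illustrative."""
    others = [c for c in probs if c != preferred and c not in secondary]
    total = 0.0
    for ch in others:
        for label in secondary:
            second_diff = probs[ch] - probs[label]  # second difference
            third_corr = second_diff + m2           # third correction value
            total += w2 * third_corr                # fourth correction value, summed
    return total
```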
In an alternative embodiment, in the instructions executed by the processor 31, training the text error correction model to be trained using the third loss to obtain a trained text error correction model includes:
adjusting the parameters of the text error correction model to be trained so that the third loss is lower than or equal to a target loss threshold, thereby obtaining the trained text error correction model.
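Putting the pieces together, the stopping criterion can be sketched in a framework-agnostic way, reusing the first_loss and second_loss sketches above; predict_probs and update_params stand in for the model's forward pass and one optimizer step, and each sample is assumed to expose sentence, err_pos, preferred, and secondary attributes. All of these interfaces are illustrative.

```python
def train_until_threshold(predict_probs, update_params, samples,
                          loss_threshold=0.05, max_steps=10_000):
    """Repeat until the third loss is at or below the target threshold
    (or a step budget is exhausted, a safeguard added by this sketch)."""
    third_loss = float("inf")
    for _ in range(max_steps):
        third_loss = 0.0
        for s in samples:
            probs = predict_probs(s.sentence, s.err_pos)
            third_loss += first_loss(probs, s.preferred, s.secondary)
            third_loss += second_loss(probs, s.preferred, s.secondary)
        if third_loss <= loss_threshold:
            break
        update_params(third_loss)  # adjust the model parameters
    return third_loss
```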
The memory 32 includes an internal memory 321 and an external memory 322. The internal memory 321 temporarily stores operation data of the processor 31 and data exchanged with the external memory 322, such as a hard disk; the processor 31 exchanges data with the external memory 322 via the internal memory 321.
For the specific execution process of the above instructions, reference may be made to the steps of the text error correction method described in the embodiments of the present disclosure, which are not repeated here.
Embodiments of the present disclosure also provide a computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, the steps of the text error correction method described in the above method embodiments are performed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
Embodiments of the present disclosure further provide a computer program product carrying program code; the instructions included in the program code may be used to perform the steps of the text error correction method described in the foregoing method embodiments, to which reference may be made for details not repeated here.
The above computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, it is embodied as a software product, such as a software development kit (SDK).
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and apparatus described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical functional division, and there may be other divisions in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some communication interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It should be noted that the above embodiments are merely specific implementations of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications, variations, or equivalent substitutions of some of the technical features described in the foregoing embodiments may still be made within the technical scope of the disclosure without departing from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and all such changes shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A text error correction method, comprising:
constructing a text error correction model to be trained, and acquiring a plurality of training samples for the text error correction model, where each training sample is a sentence including at least one erroneous character, and the correct character corresponding to the erroneous character is the preferred truth value label of that erroneous character;
determining at least one secondary truth value label matching the preferred truth value label of the training sample;
training the text error correction model to be trained using the training samples carrying the preferred truth value labels, together with the secondary truth value labels corresponding to the training samples, to obtain a trained text error correction model, where the trained text error correction model makes the probability of a training sample's secondary truth value label being the truth result smaller than that of the preferred truth value label and larger than that of any non-truth-value label;
and acquiring a text to be corrected, and inputting the text to be corrected into the trained text error correction model to obtain a correction result for the text to be corrected;
where determining at least one secondary truth value label matching the preferred truth value label of the training sample includes:
for any erroneous character, generating, based on the context information of the erroneous character in the training sample sentence, a first character sequence containing the context information and an undetermined character, where the undetermined character is a placeholder for the erroneous character in the first character sequence;
searching a corpus data set for a plurality of second character sequences matching the first character sequence, and determining the occurrence frequency of each second character sequence;
determining a third character sequence from the plurality of second character sequences according to the occurrence frequencies;
and determining the character in the third character sequence that matches the undetermined character as the secondary truth value label.
2. The method according to claim 1, wherein generating the first character sequence containing the context information and the undetermined character, based on the context information of the erroneous character in the training sample sentence, includes:
determining a first adjacent character preceding the position of the erroneous character in the training sample sentence, and a second adjacent character following that position;
and generating the first character sequence based on the first adjacent character and the second adjacent character, where the first character sequence includes, in order, the first adjacent character, the undetermined character, and the second adjacent character.
3. The method according to claim 1, wherein determining a third character sequence from the plurality of second character sequences according to the occurrence frequencies includes:
determining, from the second character sequences, the fourth character sequences whose occurrence frequency is greater than or equal to a preset frequency;
sorting the fourth character sequences according to their occurrence frequencies;
and determining the preset number of fourth character sequences with the highest occurrence frequencies as the third character sequences.
4. The method according to claim 1, wherein training the text error correction model to be trained, using the training samples carrying the preferred truth value labels and the corresponding secondary truth value labels, to obtain a trained text error correction model includes:
inputting the training sample into the text error correction model to be trained, to obtain prediction results, determined by the model, for a plurality of target characters corresponding to the erroneous character, where a prediction result represents the probability that the correct character corresponding to the erroneous character is the given target character;
determining a first loss according to the first difference between the probability corresponding to each secondary truth value label and the probability corresponding to the preferred truth value label in the prediction results;
determining a second loss according to the second difference between the probabilities corresponding to target characters other than the secondary truth value labels and the preferred truth value label, and the probabilities corresponding to the secondary truth value labels, in the prediction results;
determining a third loss as the sum of the first loss and the second loss;
and training the text error correction model to be trained using the third loss, to obtain the trained text error correction model.
5. The method according to claim 4, wherein determining the first loss according to the first difference between the probabilities corresponding to the secondary truth value labels and the preferred truth value label in the prediction results includes:
determining, for the probability corresponding to each secondary truth value label, a first correction value for that probability based on the sum of its first difference and a first correction parameter;
determining a second correction value based on a product between the first correction value and a second correction parameter;
and determining the first loss based on the sum of the second correction values of the probabilities corresponding to the secondary truth value labels.
6. The method according to claim 4, wherein determining the second loss according to the second difference between the probabilities corresponding to the target characters other than the secondary truth value labels and the preferred truth value label, and the probabilities corresponding to the secondary truth value labels, includes:
determining, for the probability corresponding to each of the other target characters, a third correction value for that target character based on the sum of the second difference corresponding to that probability and a third correction parameter;
determining a fourth correction value based on a product between the third correction value and a fourth correction parameter;
and determining the second loss based on the sum of the fourth correction values of the probabilities corresponding to the other target characters.
7. The method according to claim 4, wherein training the text error correction model to be trained using the third loss to obtain a trained text error correction model includes:
adjusting the parameters of the text error correction model to be trained so that the third loss is lower than or equal to a target loss threshold, thereby obtaining the trained text error correction model.
8. A text error correction apparatus, comprising:
a construction module, configured to construct a text error correction model to be trained, and to acquire a plurality of training samples for the text error correction model, where each training sample is a sentence including at least one erroneous character, and the correct character corresponding to the erroneous character is the preferred truth value label of that erroneous character;
a determination module, configured to determine at least one secondary truth value label matching the preferred truth value label of the training sample;
a training module, configured to train the text error correction model to be trained using the training samples carrying the preferred truth value labels, together with the secondary truth value labels corresponding to the training samples, to obtain a trained text error correction model, where the trained text error correction model makes the probability of a training sample's secondary truth value label being the truth result smaller than that of the preferred truth value label and larger than that of any non-truth-value label;
and an error correction module, configured to acquire a text to be corrected, and input the text to be corrected into the trained text error correction model to obtain a correction result for the text to be corrected;
where the determination module is specifically configured to:
for any erroneous character, generate, based on the context information of the erroneous character in the training sample sentence, a first character sequence containing the context information and an undetermined character, where the undetermined character is a placeholder for the erroneous character in the first character sequence;
search a corpus data set for a plurality of second character sequences matching the first character sequence, and determine the occurrence frequency of each second character sequence;
determine a third character sequence from the plurality of second character sequences according to the occurrence frequencies;
and determine the character in the third character sequence that matches the undetermined character as the secondary truth value label.
9. A computer device, comprising a processor and a memory, the memory storing machine-readable instructions executable by the processor, the processor being configured to execute the machine-readable instructions stored in the memory; when the machine-readable instructions are executed by the processor, the processor performs the steps of the text error correction method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon; when the computer program is run by a computer device, the computer device performs the steps of the text error correction method according to any one of claims 1 to 7.
CN202411651413.6A 2024-11-19 2024-11-19 A text error correction method, device, computer equipment and storage medium Active CN119149675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411651413.6A CN119149675B (en) 2024-11-19 2024-11-19 A text error correction method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411651413.6A CN119149675B (en) 2024-11-19 2024-11-19 A text error correction method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN119149675A CN119149675A (en) 2024-12-17
CN119149675B true CN119149675B (en) 2025-04-04

Family

ID=93809640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411651413.6A Active CN119149675B (en) 2024-11-19 2024-11-19 A text error correction method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN119149675B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116136957A (en) * 2023-04-18 2023-05-19 之江实验室 A text error correction method, device and medium based on intent consistency
CN116579327A (en) * 2023-07-14 2023-08-11 匀熵智能科技(无锡)有限公司 Text error correction model training method, text error correction method, device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597754B (en) * 2020-12-23 2023-11-21 北京百度网讯科技有限公司 Text error correction methods, devices, electronic equipment and readable storage media
CN112861518B (en) * 2020-12-29 2023-12-01 科大讯飞股份有限公司 Text error correction method and device, storage medium and electronic device
CN114781386B (en) * 2022-05-17 2025-08-05 北京百度网讯科技有限公司 Method, device and electronic device for acquiring text error correction training corpus
CN114861637B (en) * 2022-05-18 2023-06-16 北京百度网讯科技有限公司 Spelling error correction model generation method and device, and spelling error correction method and device
CN116453134A (en) * 2023-03-31 2023-07-18 马上消费金融股份有限公司 Training method of character recognition model, character recognition method and related equipment


Also Published As

Publication number Publication date
CN119149675A (en) 2024-12-17

Similar Documents

Publication Publication Date Title
US10255275B2 (en) Method and system for generation of candidate translations
Liu et al. A broad-coverage normalization system for social media language
CN106537370B (en) Method and system for robust tagging of named entities in the presence of source and translation errors
US11010554B2 (en) Method and device for identifying specific text information
CN101133411B (en) Fault-tolerant romanized input method for non-roman characters
US9176936B2 (en) Transliteration pair matching
CN111462751B (en) Method, apparatus, computer device and storage medium for decoding voice data
CN111639489A (en) Chinese text error correction system, method, device and computer readable storage medium
CN113627158B (en) Chinese spelling error correction method and device based on multiple representations and multiple pre-training models
EP3948849A1 (en) Phoneme-based contextualization for cross-lingual speech recognition in end-to-end models
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
KR20090106937A (en) Spelling Error Correction System and Methods
KR102794379B1 (en) Learning data correction method and apparatus thereof using ensemble score
CN112966496A (en) Chinese error correction method and system based on pinyin characteristic representation
US10657203B2 (en) Predicting probability of occurrence of a string using sequence of vectors
CN116579327B (en) Text error correction model training method, text error correction method, device and storage medium
CN112632956A (en) Text matching method, device, terminal and storage medium
CN111583910A (en) Model updating method and device, electronic equipment and storage medium
JP6718787B2 (en) Japanese speech recognition model learning device and program
CN111626059B (en) Information processing method and device
JP2019159118A (en) Output program, information processing device, and output control method
CN111090720B (en) Hot word adding method and device
CN119149675B (en) A text error correction method, device, computer equipment and storage medium
Saloot et al. Noisy text normalization using an enhanced language model
CN116013278B (en) Speech recognition multi-model result merging method and device based on pinyin alignment algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant