Disclosure of Invention
Embodiments of the present disclosure provide at least a text error correction method, a text error correction apparatus, a computer device, and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a text error correction method, including:
constructing a text error correction model to be trained, and acquiring a plurality of training samples for the text error correction model, wherein each training sample is a sentence comprising at least one erroneous character, and the correct character corresponding to the erroneous character is a preferred truth label of the erroneous character;
determining at least one secondary truth label matching the preferred truth label of the training sample;
training the text error correction model to be trained by using the training sample carrying the preferred truth label and the secondary truth label corresponding to the training sample, to obtain a trained text error correction model, wherein the trained text error correction model makes the probability of the secondary truth label of the training sample being the truth result smaller than the probability of the preferred truth label being the truth result and greater than the probability of a non-truth label being the truth result; and
acquiring a text to be corrected, and inputting the text to be corrected into the trained text error correction model to obtain an error correction result of the text to be corrected.
In an optional embodiment, the determining at least one secondary truth label matching the preferred truth label of the training sample includes:
for any erroneous character, generating a first character sequence containing context information and a pending character based on the context information of the erroneous character in the training sample sentence, wherein the pending character is a placeholder for the erroneous character in the first character sequence;
searching a corpus data set for a plurality of second character sequences matching the first character sequence, and determining an occurrence frequency corresponding to each second character sequence;
determining a third character sequence from the plurality of second character sequences according to the occurrence frequencies; and
determining the character matching the pending character in the third character sequence as the secondary truth label.
In an optional embodiment, the generating a first character sequence containing the context information and a pending character based on the context information of the erroneous character in the training sample sentence includes:
determining a first adjacent character preceding the position of the erroneous character in the training sample sentence and a second adjacent character following the position of the erroneous character in the training sample sentence; and
generating the first character sequence based on the first adjacent character and the second adjacent character, wherein the first character sequence comprises, arranged in order, the first adjacent character, the pending character, and the second adjacent character.
In an optional embodiment, the determining a third character sequence from the plurality of second character sequences according to the occurrence frequencies includes:
determining, from the second character sequences, fourth character sequences whose occurrence frequencies are greater than or equal to a preset frequency;
sorting the fourth character sequences by occurrence frequency; and
determining a preset number of the fourth character sequences with the highest occurrence frequencies as the third character sequences.
In an optional embodiment, the training the text error correction model to be trained by using the training sample carrying the preferred truth label and the secondary truth label corresponding to the training sample to obtain a trained text error correction model includes:
inputting the training sample into the text error correction model to be trained to obtain prediction results, determined by the text error correction model to be trained, for a plurality of target characters corresponding to the erroneous character, wherein the prediction results represent the probability that the correct character corresponding to the erroneous character is each target character;
determining a first loss according to a first difference between the probability corresponding to the secondary truth label and the probability corresponding to the preferred truth label in the prediction results;
determining a second loss according to a second difference between the probability corresponding to the secondary truth label and the probabilities corresponding to other target characters, other than the secondary truth label and the preferred truth label, in the prediction results;
determining a third loss based on a sum of the first loss and the second loss; and
training the text error correction model to be trained by using the third loss to obtain the trained text error correction model.
In an optional embodiment, the determining a first loss according to a first difference between the probability corresponding to the secondary truth label and the probability corresponding to the preferred truth label in the prediction results includes:
for the probability corresponding to each secondary truth label, determining a first correction value of that probability based on a sum of the corresponding first difference and a first correction parameter;
determining a second correction value based on a product of the first correction value and a second correction parameter; and
determining the first loss based on a sum of the second correction values of the probabilities corresponding to the secondary truth labels.
In an optional embodiment, the determining a second loss according to a second difference between the probability corresponding to the secondary truth label and the probabilities corresponding to the other target characters, other than the secondary truth label and the preferred truth label, in the prediction results includes:
for the probability corresponding to each other target character, determining a third correction value corresponding to that other target character based on a sum of the corresponding second difference and a third correction parameter;
determining a fourth correction value based on a product of the third correction value and a fourth correction parameter; and
determining the second loss based on a sum of the fourth correction values of the probabilities corresponding to the other target characters.
In an optional embodiment, the training the text error correction model to be trained by using the third loss to obtain a trained text error correction model includes:
adjusting parameters of the text error correction model to be trained so that the third loss is lower than or equal to a target loss threshold, to obtain the trained text error correction model.
In a second aspect, an embodiment of the present disclosure further provides a text error correction apparatus, including:
a construction module, configured to construct a text error correction model to be trained and acquire a plurality of training samples for the text error correction model, wherein each training sample is a sentence comprising at least one erroneous character, and the correct character corresponding to the erroneous character is a preferred truth label of the erroneous character;
a determining module, configured to determine at least one secondary truth label matching the preferred truth label of the training sample;
a training module, configured to train the text error correction model to be trained by using the training sample carrying the preferred truth label and the secondary truth label corresponding to the training sample, to obtain a trained text error correction model, wherein the trained text error correction model makes the probability of the secondary truth label of the training sample being the truth result smaller than the probability of the preferred truth label being the truth result and greater than the probability of a non-truth label being the truth result; and
an error correction module, configured to acquire a text to be corrected and input the text to be corrected into the trained text error correction model to obtain an error correction result of the text to be corrected.
In an optional embodiment, the determining module is specifically configured to:
for any erroneous character, generate a first character sequence containing context information and a pending character based on the context information of the erroneous character in the training sample sentence, wherein the pending character is a placeholder for the erroneous character in the first character sequence;
search a corpus data set for a plurality of second character sequences matching the first character sequence, and determine an occurrence frequency corresponding to each second character sequence;
determine a third character sequence from the plurality of second character sequences according to the occurrence frequencies; and
determine the character matching the pending character in the third character sequence as the secondary truth label.
In an optional embodiment, the determining module is specifically configured to:
determine a first adjacent character preceding the position of the erroneous character in the training sample sentence and a second adjacent character following the position of the erroneous character in the training sample sentence; and
generate the first character sequence based on the first adjacent character and the second adjacent character, wherein the first character sequence comprises, arranged in order, the first adjacent character, the pending character, and the second adjacent character.
In an optional embodiment, the determining module is specifically configured to:
determine, from the second character sequences, fourth character sequences whose occurrence frequencies are greater than or equal to a preset frequency;
sort the fourth character sequences by occurrence frequency; and
determine a preset number of the fourth character sequences with the highest occurrence frequencies as the third character sequences.
In an optional embodiment, the training module is specifically configured to:
input the training sample into the text error correction model to be trained to obtain prediction results, determined by the text error correction model to be trained, for a plurality of target characters corresponding to the erroneous character, wherein the prediction results represent the probability that the correct character corresponding to the erroneous character is each target character;
determine a first loss according to a first difference between the probability corresponding to the secondary truth label and the probability corresponding to the preferred truth label in the prediction results;
determine a second loss according to a second difference between the probability corresponding to the secondary truth label and the probabilities corresponding to other target characters, other than the secondary truth label and the preferred truth label, in the prediction results;
determine a third loss based on a sum of the first loss and the second loss; and
train the text error correction model to be trained by using the third loss to obtain the trained text error correction model.
In an optional embodiment, the training module is specifically configured to:
for the probability corresponding to each secondary truth label, determine a first correction value of that probability based on a sum of the corresponding first difference and a first correction parameter;
determine a second correction value based on a product of the first correction value and a second correction parameter; and
determine the first loss based on a sum of the second correction values of the probabilities corresponding to the secondary truth labels.
In an optional embodiment, the training module is specifically configured to:
for the probability corresponding to each other target character, determine a third correction value corresponding to that other target character based on a sum of the corresponding second difference and a third correction parameter;
determine a fourth correction value based on a product of the third correction value and a fourth correction parameter; and
determine the second loss based on a sum of the fourth correction values of the probabilities corresponding to the other target characters.
In an optional embodiment, the training module is specifically configured to:
adjust parameters of the text error correction model to be trained so that the third loss is lower than or equal to a target loss threshold, to obtain the trained text error correction model.
In a third aspect, an optional implementation of the present disclosure further provides a computer device, including a processor and a memory, wherein the memory stores machine-readable instructions executable by the processor, and the processor is configured to execute the machine-readable instructions stored in the memory; when the machine-readable instructions are executed by the processor, the processor performs the steps in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, an optional implementation of the present disclosure further provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed, performs the steps in the first aspect or any possible implementation of the first aspect.
For a description of the effects of the text error correction apparatus, the computer device, and the computer-readable storage medium, reference is made to the description of the text error correction method above, which is not repeated here.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the aspects of the disclosure.
According to the text error correction method and apparatus, the computer device, and the storage medium provided by the embodiments of the present disclosure, the secondary truth labels corresponding to the preferred truth label of a training sample can be determined, and the text error correction model is trained based on both the preferred truth label and the secondary truth labels, so that the probability of a secondary truth label of the training sample being the truth result is smaller than the probability of the preferred truth label being the truth result and greater than the probability of a non-truth label being the truth result, thereby avoiding the overfitting caused by a single correct label during training and reducing false detection.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments of the present disclosure, as generally described and illustrated herein, may be arranged and designed in a wide variety of different configurations. Therefore, the following detailed description of the embodiments of the present disclosure is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of the disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.
It has been found through research that, in the process of training a text error correction model, the truth label of the training data is unique, which is inconsistent with actual usage, so the model is prone to overfitting and produces false alarms when actually deployed. For example, in the sample sentence 与实际情兄不符 ("does not match the actual situation", in which 兄 is a typo for 况), the truth label corresponding to the erroneous character 兄 is 况, but in the actual language environment there are other correct answers, such as 形 (as in 情形) and 景 (as in 情景). In general, a deep spelling correction model sets only 况 as the truth label during training, and after training such a model very easily "corrects" 形 and 景 into 况, leading to false detections.
Based on the above research, the present disclosure provides a text error correction method and apparatus, a computer device, and a storage medium, which can determine the secondary truth labels corresponding to the preferred truth label of a training sample and train a text error correction model based on both the preferred truth label and the secondary truth labels, so that the probability, determined by the trained text error correction model, of a secondary truth label being the truth result is smaller than the probability of the preferred truth label being the truth result and greater than the probability of a non-truth label being the truth result, thereby avoiding the overfitting caused by a single correct label during training and reducing false detection.
The defects of the existing scheme described above are findings obtained by the inventors through practice and careful study; therefore, the discovery of the above problems, and the solutions to them proposed below, should all be regarded as the inventors' contributions to the present disclosure.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
To facilitate understanding of the present embodiment, a text error correction method disclosed in the embodiments of the present disclosure is first described in detail. The execution subject of the text error correction method provided by the embodiments of the present disclosure is generally a computer device with certain computing capability, for example a terminal device, a server, or another processing device. The terminal device may be user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like. In some possible implementations, the text error correction method may be implemented by a processor invoking computer-readable instructions stored in a memory.
The text error correction method provided by the embodiments of the present disclosure is described below by taking a terminal device as the execution subject.
Referring to fig. 1, which shows a flowchart of a text error correction method provided by an embodiment of the present disclosure, the method includes steps S101 to S104:
S101: Construct a text error correction model to be trained, and acquire a plurality of training samples for the text error correction model, wherein each training sample is a sentence comprising at least one erroneous character, and the correct character corresponding to the erroneous character is a preferred truth label of the erroneous character.
In this step, a text error correction model may be constructed. The model may be a conventional machine learning model, such as a hidden Markov model (HMM) or a support vector machine (SVM), or a deep learning model, such as a long short-term memory network (LSTM), a Transformer, or a Bidirectional Encoder Representations from Transformers (BERT) model.
In general, a deep learning error correction model may compute its loss with softmax and cross entropy, and then be trained by backpropagating gradients to update the model parameters.
For example, for the sample sentence 与实际情兄不符, suppose that in the dictionary file the code id of 况 is t and the code id of 形 is j, and that the score of id t in the predicted logits for the character 兄 (the model's predicted values for the probability that the correct character corresponding to the erroneous character is each target character) is $z_t$. The actually calculated loss is:
$$\mathrm{loss} = -\log \frac{e^{z_t}}{\sum_i e^{z_i}}$$
where i ranges over the predicted candidate characters for 兄 and $z_i$ is the score of the i-th candidate. Under this loss, the training goal of the model is to make $z_t$ as large as possible, without considering that $z_j$ (the score of 形) is equally plausible in the sample sentence; as a result, correct text such as 情形 may also be "corrected" to 情况.
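To make this baseline concrete, here is a minimal, runnable sketch of the softmax + cross-entropy loss described above; the toy vocabulary, the logit values, and the function name are illustrative assumptions, not taken from the disclosure:

```python
import numpy as np

def cross_entropy_for_position(logits: np.ndarray, t: int) -> float:
    # loss = -log( exp(z_t) / sum_i exp(z_i) ); only the single truth id t matters.
    z = logits - logits.max()                  # stabilize the softmax numerically
    log_softmax = z - np.log(np.exp(z).sum())
    return float(-log_softmax[t])

# Toy vocabulary for the 兄 position: ids 0=况, 1=形, 2=景, 3=绪 (illustrative).
logits = np.array([2.0, 1.9, 1.7, 0.3])
print(cross_entropy_for_position(logits, t=0))  # ~1.04
# Minimizing this loss only pushes z_t (况) up; it never tells the model that
# z_j (形) is also acceptable in the sample sentence, hence the overfitting risk.
```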
Therefore, the embodiments of the present disclosure introduce the concepts of a preferred truth label and a secondary truth label, and train the text error correction model with both, which prevents the model from overfitting and reduces the probability of false detection.
The preferred truth label may be the label that best fits the erroneous character in the training sample; for example, in the training sample 与实际情兄不符, the preferred truth label corresponding to the erroneous character 兄 is determined to be 况.
For example, the preferred truth label of the erroneous character in the training sample may be determined according to the context information in the training sample and the shape, pronunciation, and meaning of the erroneous character.
The training sample may be a text, and may be obtained by perturbing correct text or by collecting erroneous samples.
Alternatively, massive amounts of public text data may be counted, and the preferred truth label of the erroneous character may be determined according to the amount of public data matching the character segments in the training sample.
S102: Determine at least one secondary truth label matching the preferred truth label of the training sample.
In this step, at least one secondary truth label corresponding to the erroneous character in the training sample may be determined. The number of secondary truth labels corresponding to each erroneous character may be set according to the actual situation, for example 2, 3, or 5.
The secondary truth labels may be determined based on a pre-deployed corpus data set, which can be chosen according to the type of text to be corrected. For example, if Chinese text needs to be corrected, a Chinese corpus data set may be deployed; if English text needs to be corrected, an English corpus data set may be deployed; and if mixed Chinese-English text needs to be corrected, a mixed Chinese-English corpus data set may be deployed.
The corpus data set may be generated from massive amounts of correct text, such as novels, papers, messages, and the like.
When determining the secondary truth labels, the erroneous character in the training sample may be determined, the context information of the erroneous character in the training sample sentence may be determined, and a first character sequence containing the context information and a pending character may be generated from the context information. The first character sequence may represent the pattern of the segment of the training sample sentence that contains the erroneous character. The pending character may be a placeholder for the erroneous character in the first character sequence; when the corpus data set is searched for second character sequences matching the first character sequence, the placeholder can match any character.
After the first character sequence is obtained, second character sequences matching the first character sequence can be searched for in the corpus data set; each time a second character sequence is found, it can be recorded and counted to obtain the occurrence frequency corresponding to that second character sequence.
Then, third character sequences are determined according to the occurrence frequency of each second character sequence, and the characters matching the pending character in the third character sequences are taken as the secondary truth labels.
For example, the second character sequences whose occurrence frequency is higher than a preset frequency may be determined as the third character sequences.
Specifically, the context information may include a first adjacent character preceding the position of the erroneous character in the training sample sentence and a second adjacent character following it. The number of characters in the first adjacent character and the second adjacent character may be determined according to the actual situation, for example according to the number of characters in the phrase corresponding to the erroneous character.
For example, in the training sample sentence 与实际情兄不符, the phrase corresponding to the erroneous character 兄 is 情况, which contains 2 characters; 1 of them is the erroneous character, so the number of remaining correct characters is 1, and the first adjacent character and the second adjacent character may each be 1 character long.
Alternatively, in the above example, since 情况 contains 2 characters, the length of the first character sequence may be set to 3 or 4 characters.
In one possible implementation, the first and second character sequences may be extracted using n-grams. For example, in the above example 与实际情兄不符, if the first adjacent character and the second adjacent character are each 1 character long, the first character sequence extracted with 3-grams may be 情X不, where 情 is the first adjacent character, 不 is the second adjacent character, and X is the pending character. The characters in the first character sequence are arranged in the order: first adjacent character, pending character, second adjacent character.
When the adjacent characters comprise a plurality of characters, the order of the characters within them is kept unchanged; for example, if the first adjacent characters are 实际情 and the second adjacent characters are 不符, the first character sequence may be 实际情X不符.
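As an illustration of this construction, here is a minimal sketch of building the first character sequence from the sentence context; the function name, the use of X as the placeholder, and the default window sizes are illustrative assumptions:

```python
def make_first_sequence(sentence: str, err_pos: int,
                        left: int = 1, right: int = 1,
                        placeholder: str = "X") -> str:
    # First adjacent character(s), then the pending character (placeholder),
    # then the second adjacent character(s), in that order.
    first_adjacent = sentence[max(0, err_pos - left):err_pos]
    second_adjacent = sentence[err_pos + 1:err_pos + 1 + right]
    return first_adjacent + placeholder + second_adjacent

# 兄 is the erroneous character at index 4 of 与实际情兄不符.
print(make_first_sequence("与实际情兄不符", err_pos=4))                   # 情X不
print(make_first_sequence("与实际情兄不符", err_pos=4, left=3, right=2))  # 实际情X不符
```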
When the first character sequence is 情X不, the second character sequences extracted from the corpus data set with 3-grams may include 情况不, 情形不, 情绪不, and the like.
When determining the third character sequences, fourth character sequences whose occurrence frequencies are greater than or equal to the preset frequency are determined from the second character sequences, the fourth character sequences are sorted by occurrence frequency, and the preset number of fourth character sequences with the highest occurrence frequencies are determined as the third character sequences.
For example, suppose the occurrence frequencies of the second character sequences 情况不, 情形不, 情景不, and 情绪不 are 1506021, 1305055, 1131768, and 983579 respectively, the preset frequency is 1000000, and the preset number is 2. Since 情况不 is the second character sequence corresponding to the preferred truth label, it is removed from the second character sequences first, and 情绪不, whose occurrence frequency is lower than 1000000, is also removed, yielding the fourth character sequences 情形不 and 情景不. The top 2 fourth character sequences by occurrence frequency, 情形不 and 情景不, are then determined as the third character sequences.
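Combining the search, counting, and filtering steps above, the following is a minimal sketch of mining the secondary truth labels from a corpus; the toy corpus, the function name, and the parameter values are illustrative assumptions (a real corpus data set would be far larger, with frequencies on the scale of the example above):

```python
from collections import Counter

def mine_secondary_labels(corpus, first_seq, preferred, preset_freq, preset_num):
    x = first_seq.index("X")                    # position of the pending character
    n = len(first_seq)
    counts = Counter()
    for text in corpus:                         # collect matching second sequences
        for i in range(len(text) - n + 1):
            window = text[i:i + n]
            if window[:x] == first_seq[:x] and window[x + 1:] == first_seq[x + 1:]:
                counts[window] += 1             # the placeholder X matches any character
    preferred_seq = first_seq.replace("X", preferred)
    fourth = [(seq, c) for seq, c in counts.items()
              if seq != preferred_seq and c >= preset_freq]   # frequency filter
    fourth.sort(key=lambda item: item[1], reverse=True)       # sort by frequency
    third = fourth[:preset_num]                 # top `preset_num` fourth sequences
    return [seq[x] for seq, _ in third]         # characters filling X -> labels

corpus = ["情况不符", "情形不同", "情形不符", "情景不错", "情绪不高"]
print(mine_secondary_labels(corpus, "情X不", preferred="况",
                            preset_freq=1, preset_num=2))     # ['形', '景']
```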
S103: Train the text error correction model to be trained by using the training sample carrying the preferred truth label and the secondary truth label corresponding to the training sample, to obtain a trained text error correction model, wherein the trained text error correction model makes the probability of the secondary truth label of the training sample being the truth result smaller than the probability of the preferred truth label being the truth result and greater than the probability of a non-truth label being the truth result.
In this step, the training sample may be input into the text error correction model to be trained, and the model determines prediction results for a plurality of target characters corresponding to the erroneous character; the prediction results may represent the probability that the correct character corresponding to the erroneous character is each target character.
When the text error correction model operates, it may first determine a plurality of target characters, then determine the probability that the correct character corresponding to the erroneous character is each target character, and perform error correction according to the determined probabilities to decide the correct character corresponding to the erroneous character.
After the prediction results of the target characters corresponding to the erroneous character are obtained, the probability corresponding to each secondary truth label (the target characters include the secondary truth labels) can be determined from the prediction results, and the first loss is determined according to the first difference between the probability corresponding to the secondary truth label and the probability corresponding to the preferred truth label.
Meanwhile, the probabilities corresponding to the other target characters, other than the secondary truth labels and the preferred truth label, in the prediction results can be determined, the second difference between these probabilities and the probability corresponding to the secondary truth label is determined, and the second loss is determined according to the second difference.
A third loss may then be determined based on the sum of the first loss and the second loss.
Finally, the text error correction model to be trained is trained using the third loss to obtain the trained text error correction model.
When training with the third loss, the training target may be to minimize the value of the third loss. Since the third loss is determined based on the sum of the first loss and the second loss, this is equivalent to minimizing the sum of the first loss and the second loss, that is, making the first difference and the second difference as small as possible.
Therefore, when the trained text error correction model operates, the probability of the secondary truth label of the training sample being the truth result is smaller than the probability of the preferred truth label being the truth result and greater than the probability of a non-truth label being the truth result: the preferred truth label is preferentially taken as the truth result, while the secondary truth labels carry more weight than the other non-truth labels, which avoids overfitting and reduces the probability of false detection.
Specifically, the first loss may be determined according to the following steps:
for the probability corresponding to each secondary truth label, determining a first correction value of that probability based on a sum of the corresponding first difference and a first correction parameter;
determining a second correction value based on a product of the first correction value and a second correction parameter; and
determining the first loss based on a sum of the second correction values of the probabilities corresponding to the secondary truth labels.
In these steps, a first correction parameter and a second correction parameter may be introduced to correct the first difference. The first correction parameter adjusts the first difference so as to control the difference in importance between the preferred truth label and the secondary truth labels, and the second correction parameter amplifies or attenuates the first difference so that the first loss is closer to the actually required state.
Correspondingly, a third correction parameter and a fourth correction parameter may also be introduced, and the second loss may be determined according to the following steps:
for the probability corresponding to each other target character, determining a third correction value corresponding to that other target character based on a sum of the corresponding second difference and a third correction parameter;
determining a fourth correction value based on a product of the third correction value and a fourth correction parameter; and
determining the second loss based on a sum of the fourth correction values of the probabilities corresponding to the other target characters.
In these steps, the third correction parameter adjusts the second difference so as to control the difference in importance between the secondary truth labels and the other target characters, and the fourth correction parameter amplifies or attenuates the second difference so that the second loss is closer to the actually required state.
By way of example, the third loss may be determined by the following equation:
$$\mathrm{loss} = \sum_{j \in S} \beta \,(p_j - p_t + \alpha) + \sum_{j \in S} \sum_{i \in O} \delta \,(p_i - p_j + \gamma)$$
wherein loss is the third loss; $p_j$ is the probability corresponding to the j-th secondary truth label; $p_t$ is the probability corresponding to the preferred truth label; $\alpha$ is the first correction parameter, so that $p_j - p_t + \alpha$ is the first correction value and $\beta(p_j - p_t + \alpha)$, with $\beta$ the second correction parameter, is the second correction value; $S$ is the set of secondary truth labels; $O$ is the set of other target characters; $p_i$ is the probability corresponding to the i-th other target character; and $\gamma$ and $\delta$ are the third and fourth correction parameters, so that $p_i - p_j + \gamma$ is the third correction value and $\delta(p_i - p_j + \gamma)$ is the fourth correction value.
The values of the first, second, third, and fourth correction parameters can be determined according to the actual situation. For example, the first correction parameter may be 0.01, the second correction parameter 0.9, the third correction parameter 0.02, and the fourth correction parameter 1.1.
The learning objective of the third loss may be to make the score (i.e., the probability) of the preferred truth label greater than the scores of the secondary truth labels, and the scores of the secondary truth labels greater than the scores of the other incorrect labels (i.e., the other target characters), while the hyperparameters (namely the first, second, third, and fourth correction parameters) adjust the relative importance of the different learning objectives during model learning.
During training, the third loss can be reduced by adjusting the model parameters. When the third loss drops to a certain level (for example, below a preset threshold), or when the third loss converges (it remains essentially unchanged after further parameter adjustment), training can be stopped, and the trained text error correction model is obtained.
S104: Acquire a text to be corrected, and input the text to be corrected into the trained text error correction model to obtain an error correction result of the text to be corrected.
After training of the text error correction model is completed, text error correction can be performed with the trained model: the text to be corrected is input into the trained model, and the error correction result output by the model is obtained. The error correction result may include the corrected text and/or the correct characters corresponding to the erroneous characters in the text to be corrected.
The trained text error correction model thus suppresses overfitting as far as possible and raises the importance of the secondary truth labels: it can still correct 兄 into 况 in the text to be corrected 与实际情兄不符, but it will not "correct" 形 into 况 in the correct text 与实际情形不符, because the importance of 形 was increased during training, thereby reducing the probability of false detection.
According to the text error correction method provided by the embodiments of the present disclosure, the secondary truth labels corresponding to the preferred truth label of a training sample can be determined, and the text error correction model is trained based on both the preferred truth label and the secondary truth labels, so that the probability, determined by the trained text error correction model, of a secondary truth label of the training sample being the truth result is smaller than the probability of the preferred truth label being the truth result and greater than the probability of a non-truth label being the truth result, thereby avoiding the overfitting caused by a single correct label during training and reducing false detection.
It will be appreciated by those skilled in the art that, in the methods of the above specific embodiments, the written order of the steps does not imply a strict execution order; the actual execution order of the steps should be determined by their functions and possible inherent logic.
Based on the same inventive concept, the embodiments of the present disclosure further provide a text error correction apparatus corresponding to the text error correction method. Since the principle by which the apparatus in the embodiments of the present disclosure solves the problem is similar to that of the text error correction method described above, the implementation of the apparatus may refer to the implementation of the method, and repeated description is omitted.
Referring to fig. 2, a schematic diagram of a text error correction apparatus according to an embodiment of the disclosure is shown, where the apparatus includes:
a construction module 210, configured to construct a text error correction model to be trained and acquire a plurality of training samples for the text error correction model, wherein each training sample is a sentence comprising at least one erroneous character, and the correct character corresponding to the erroneous character is a preferred truth label of the erroneous character;
a determining module 220, configured to determine at least one secondary truth label matching the preferred truth label of the training sample;
a training module 230, configured to train the text error correction model to be trained by using the training sample carrying the preferred truth label and the secondary truth label corresponding to the training sample, to obtain a trained text error correction model, wherein the trained text error correction model makes the probability of the secondary truth label of the training sample being the truth result smaller than the probability of the preferred truth label being the truth result and greater than the probability of a non-truth label being the truth result; and
an error correction module 240, configured to acquire a text to be corrected and input the text to be corrected into the trained text error correction model to obtain an error correction result of the text to be corrected.
According to the text error correction apparatus provided by the embodiments of the present disclosure, the secondary truth labels corresponding to the preferred truth label of a training sample can be determined, and the text error correction model is trained based on both the preferred truth label and the secondary truth labels, so that the probability, determined by the trained text error correction model, of a secondary truth label of the training sample being the truth result is smaller than the probability of the preferred truth label being the truth result and greater than the probability of a non-truth label being the truth result, thereby avoiding the overfitting caused by a single correct label during training and reducing false detection.
In an optional embodiment, the determining module 220 is specifically configured to:
for any erroneous character, generate a first character sequence containing context information and a pending character based on the context information of the erroneous character in the training sample sentence, wherein the pending character is a placeholder for the erroneous character in the first character sequence;
search a corpus data set for a plurality of second character sequences matching the first character sequence, and determine an occurrence frequency corresponding to each second character sequence;
determine a third character sequence from the plurality of second character sequences according to the occurrence frequencies; and
determine the character matching the pending character in the third character sequence as the secondary truth label.
In an optional embodiment, the determining module 220 is specifically configured to:
determine a first adjacent character preceding the position of the erroneous character in the training sample sentence and a second adjacent character following the position of the erroneous character in the training sample sentence; and
generate the first character sequence based on the first adjacent character and the second adjacent character, wherein the first character sequence comprises, arranged in order, the first adjacent character, the pending character, and the second adjacent character.
In an optional embodiment, the determining module 220 is specifically configured to:
determine, from the second character sequences, fourth character sequences whose occurrence frequencies are greater than or equal to a preset frequency;
sort the fourth character sequences by occurrence frequency; and
determine a preset number of the fourth character sequences with the highest occurrence frequencies as the third character sequences.
In an optional embodiment, the training module 230 is specifically configured to:
input the training sample into the text error correction model to be trained to obtain prediction results, determined by the text error correction model to be trained, for a plurality of target characters corresponding to the erroneous character, wherein the prediction results represent the probability that the correct character corresponding to the erroneous character is each target character;
determine a first loss according to a first difference between the probability corresponding to the secondary truth label and the probability corresponding to the preferred truth label in the prediction results;
determine a second loss according to a second difference between the probability corresponding to the secondary truth label and the probabilities corresponding to other target characters, other than the secondary truth label and the preferred truth label, in the prediction results;
determine a third loss based on a sum of the first loss and the second loss; and
train the text error correction model to be trained by using the third loss to obtain the trained text error correction model.
In an optional embodiment, the training module 230 is specifically configured to:
for the probability corresponding to each secondary truth label, determine a first correction value of that probability based on a sum of the corresponding first difference and a first correction parameter;
determine a second correction value based on a product of the first correction value and a second correction parameter; and
determine the first loss based on a sum of the second correction values of the probabilities corresponding to the secondary truth labels.
In an optional embodiment, the training module 230 is specifically configured to:
for the probability corresponding to each other target character, determine a third correction value corresponding to that other target character based on a sum of the corresponding second difference and a third correction parameter;
determine a fourth correction value based on a product of the third correction value and a fourth correction parameter; and
determine the second loss based on a sum of the fourth correction values of the probabilities corresponding to the other target characters.
In an optional embodiment, the training module 230 is specifically configured to:
adjust parameters of the text error correction model to be trained so that the third loss is lower than or equal to a target loss threshold, to obtain the trained text error correction model.
For the processing flow of each module in the apparatus and the interaction flow between the modules, reference may be made to the relevant descriptions in the above method embodiments, which are not detailed here.
An embodiment of the present disclosure further provides a computer device. As shown in fig. 3, which is a schematic structural diagram of the computer device provided by the embodiment of the present disclosure, the computer device includes:
a processor 31 and a memory 32, the memory 32 storing machine-readable instructions executable by the processor 31, the processor 31 being configured to execute the machine-readable instructions stored in the memory 32; when the machine-readable instructions are executed by the processor 31, the processor 31 performs the following steps:
constructing a text error correction model to be trained, and acquiring a plurality of training samples for the text error correction model, wherein each training sample is a sentence comprising at least one erroneous character, and the correct character corresponding to the erroneous character is a preferred truth label of the erroneous character;
determining at least one secondary truth label matching the preferred truth label of the training sample;
training the text error correction model to be trained by using the training sample carrying the preferred truth label and the secondary truth label corresponding to the training sample, to obtain a trained text error correction model, wherein the trained text error correction model makes the probability of the secondary truth label of the training sample being the truth result smaller than the probability of the preferred truth label being the truth result and greater than the probability of a non-truth label being the truth result; and
acquiring a text to be corrected, and inputting the text to be corrected into the trained text error correction model to obtain an error correction result of the text to be corrected.
In an optional embodiment, in the instructions executed by the processor 31, the determining at least one secondary truth label matching the preferred truth label of the training sample includes:
for any erroneous character, generating a first character sequence containing context information and a pending character based on the context information of the erroneous character in the training sample sentence, wherein the pending character is a placeholder for the erroneous character in the first character sequence;
searching a corpus data set for a plurality of second character sequences matching the first character sequence, and determining an occurrence frequency corresponding to each second character sequence;
determining a third character sequence from the plurality of second character sequences according to the occurrence frequencies; and
determining the character matching the pending character in the third character sequence as the secondary truth label.
In an optional embodiment, in the instructions executed by the processor 31, the generating a first character sequence containing the context information and a pending character based on the context information of the erroneous character in the training sample sentence includes:
determining a first adjacent character preceding the position of the erroneous character in the training sample sentence and a second adjacent character following the position of the erroneous character in the training sample sentence; and
generating the first character sequence based on the first adjacent character and the second adjacent character, wherein the first character sequence comprises, arranged in order, the first adjacent character, the pending character, and the second adjacent character.
In an optional embodiment, in the instructions executed by the processor 31, the determining a third character sequence from the plurality of second character sequences according to the occurrence frequencies includes:
determining, from the second character sequences, fourth character sequences whose occurrence frequencies are greater than or equal to a preset frequency;
sorting the fourth character sequences by occurrence frequency; and
determining a preset number of the fourth character sequences with the highest occurrence frequencies as the third character sequences.
In an optional embodiment, in the instructions executed by the processor 31, the training the text error correction model to be trained by using the training sample carrying the preferred truth label and the secondary truth label corresponding to the training sample to obtain a trained text error correction model includes:
inputting the training sample into the text error correction model to be trained to obtain prediction results, determined by the text error correction model to be trained, for a plurality of target characters corresponding to the erroneous character, wherein the prediction results represent the probability that the correct character corresponding to the erroneous character is each target character;
determining a first loss according to a first difference between the probability corresponding to the secondary truth label and the probability corresponding to the preferred truth label in the prediction results;
determining a second loss according to a second difference between the probability corresponding to the secondary truth label and the probabilities corresponding to other target characters, other than the secondary truth label and the preferred truth label, in the prediction results;
determining a third loss based on a sum of the first loss and the second loss; and
training the text error correction model to be trained by using the third loss to obtain the trained text error correction model.
In an optional embodiment, in the instructions executed by the processor 31, the determining a first loss according to a first difference between the probability corresponding to the secondary truth label and the probability corresponding to the preferred truth label in the prediction results includes:
for the probability corresponding to each secondary truth label, determining a first correction value of that probability based on a sum of the corresponding first difference and a first correction parameter;
determining a second correction value based on a product of the first correction value and a second correction parameter; and
determining the first loss based on a sum of the second correction values of the probabilities corresponding to the secondary truth labels.
In an optional embodiment, in the instructions executed by the processor 31, the determining a second loss according to a second difference between the probability corresponding to the secondary truth label and the probabilities corresponding to the other target characters, other than the secondary truth label and the preferred truth label, in the prediction results includes:
for the probability corresponding to each other target character, determining a third correction value corresponding to that other target character based on a sum of the corresponding second difference and a third correction parameter;
determining a fourth correction value based on a product of the third correction value and a fourth correction parameter; and
determining the second loss based on a sum of the fourth correction values of the probabilities corresponding to the other target characters.
In an optional embodiment, in the instructions executed by the processor 31, the training the text error correction model to be trained by using the third loss to obtain a trained text error correction model includes:
adjusting parameters of the text error correction model to be trained so that the third loss is lower than or equal to a target loss threshold, to obtain the trained text error correction model.
The memory 32 includes an internal memory 321 and an external memory 322. The internal memory 321 is used for temporarily storing operation data for the processor 31 and data exchanged with the external memory 322, such as a hard disk; the processor 31 exchanges data with the external memory 322 via the internal memory 321.
For the specific execution process of the above instructions, reference may be made to the steps of the text error correction method described in the embodiments of the present disclosure, which are not repeated here.
The embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the text error correction method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
Embodiments of the present disclosure further provide a computer program product carrying program code, and the instructions included in the program code may be used to perform the steps of the text error correction method described in the above method embodiments; for details, reference may be made to the above method embodiments, which are not repeated here.
The computer program product may be implemented by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other divisions in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other medium capable of storing program code.
It should be noted that the above embodiments are merely specific implementations of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed herein, the technical solutions recorded in the foregoing embodiments may still be modified or some of their technical features may be replaced by equivalents; such modifications or substitutions do not remove the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all be covered within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.