CN112183072B - Text error correction method and device, electronic equipment and readable storage medium - Google Patents
- Publication number
- CN112183072B (CN202011110293.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- word
- error correction
- corrected
- correction
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
An embodiment of the invention provides a text error correction method, a text error correction device, electronic equipment and a readable storage medium. The method comprises: obtaining a text to be corrected; sequentially performing shape near word correction (correction of visually similar characters) and common word correction on the text to be corrected to obtain a first corrected text; and performing common word correction on the text to be corrected to obtain a second corrected text. The confusion degree (perplexity) of the first corrected text and of the second corrected text is then obtained, and the corrected text with the lowest confusion degree is determined as the corrected text of the text to be corrected. In this way, text with many errors and complex error types can be accurately corrected.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a text error correction method, a text error correction device, an electronic device, and a readable storage medium.
Background
Currently, to facilitate analysis of video resources, OCR (Optical Character Recognition) is often performed on the speech or news headline information in the video resources, so that the text contained in the video resources can be recognized. OCR is a technology that directly converts text in a picture into editable text.
In the course of implementing the present invention, the inventors found that text obtained by OCR often contains many recognition errors and that the error types are complex. For example, there are shape near word recognition errors, recognition errors that follow no regular pattern, common word errors, and the like. At present, however, the text to be corrected can only be subjected to contextual semantic recognition by combining a natural language processing algorithm with a neural network algorithm, and the text is then corrected according to the semantic recognition result. This correction approach cannot accurately correct text with many errors and complex error types.
Disclosure of Invention
The embodiment of the invention aims to provide a text error correction method, a text error correction device, electronic equipment and a readable storage medium, so that texts with more error contents and complex error types can be accurately corrected. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a text error correction method, including:
obtaining a text to be corrected;
sequentially performing shape near word correction and common word correction on the text to be corrected to obtain a first corrected text;
performing common word correction on the text to be corrected to obtain a second correction text;
Obtaining the confusion degree of the first correction text and the second correction text, and determining the correction text with the lowest confusion degree as the correction text of the text to be corrected.
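The four steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `correct_text`, `first_pipeline`, `second_pipeline` and `perplexity` are hypothetical, and the confusion degree is assumed to be a language-model perplexity supplied by the caller as a scoring function.

```python
from typing import Callable

Pipeline = Callable[[str], str]    # hypothetical: maps a text to a corrected text
Scorer = Callable[[str], float]    # hypothetical: returns the confusion degree of a text


def correct_text(text: str, first_pipeline: Pipeline,
                 second_pipeline: Pipeline, perplexity: Scorer) -> str:
    """Run both correction routes on the same input and keep the better one."""
    # Route 1: shape near word correction followed by common word correction.
    first_corrected = first_pipeline(text)
    # Route 2: common word correction applied directly to the input text.
    second_corrected = second_pipeline(text)
    # The corrected text with the lowest confusion degree is the final result.
    return min((first_corrected, second_corrected), key=perplexity)
```

Both routes start from the original text to be corrected, so the second route is not affected by any replacement made in the first.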
In one possible implementation manner, the performing shape near word correction and common word correction on the text to be corrected in sequence to obtain a first corrected text includes:
forward correcting the text to be corrected according to the sequence from the first word to the last word in the text to be corrected through a preset shape near word list;
performing reverse error correction on the text to be corrected according to the sequence from the last word to the first word in the text to be corrected through the preset shape near word list;
determining a near word error correction text according to an error correction result obtained by forward error correction and an error correction result obtained by reverse error correction;
forward correcting the near-word correction text according to the sequence from the first word to the last word in the near-word correction text through a preset common word list;
performing reverse correction on the near-word correction text according to the sequence from the last word to the first word in the near-word correction text through the preset common word list;
And determining the first correction text according to the error correction result obtained by forward error correction and the error correction result obtained by reverse error correction of the near-word error correction text.
In one possible implementation manner, the performing the common word correction on the text to be corrected to obtain a second corrected text includes:
correcting the text to be corrected by sequentially passing through the first common word list, the second common word list and the third common word list to obtain a second corrected text; the number of words included in each common word in the first common word list is a first number, the number of words included in each common word in the second common word list is a second number, the number of words included in each common word in the third common word list is a third number, the first number is greater than the second number, and the second number is greater than the third number; and/or,
and correcting the text to be corrected through a preset entity class common word list to obtain a second corrected text.
In one possible implementation manner, the correcting the text to be corrected by sequentially passing through a first common word list, a second common word list and a third common word list to obtain a second corrected text, which includes:
Word segmentation operation is carried out on the text to be corrected;
matching the segmentation words included in the text to be corrected with the common words in the first common word list, and if the segmentation words are matched with the common words in the first common word list, updating the text to be corrected through the matched common words;
judging whether the difference value between the confusion degree of the text to be corrected and the updated confusion degree of the text to be corrected is larger than or equal to a first correction threshold value, if so, taking the updated text to be corrected as a first common word correction text; if not, the text to be corrected is used as a first common word correction text;
matching the word segmentation included in the first common word error correction text with the common words in a second common word list, and if the word segmentation included in the first common word error correction text is matched with the common words in the second common word list, updating the first common word error correction text through the matched common words;
judging whether the difference value between the confusion degree of the first frequently used word error correction text and the confusion degree of the updated first frequently used word error correction text is larger than or equal to a second error correction threshold value, and if so, taking the updated first frequently used word error correction text as a second frequently used word error correction text; if not, the first common word error correction text is used as a second common word error correction text;
matching the word segmentation included in the second common word error correction text with the common words in a third common word list, and if the word segmentation included in the second common word error correction text is matched with the common words in the third common word list, updating the second common word error correction text through the matched common words;
judging whether the difference value between the confusion degree of the second common word error correction text and the confusion degree of the updated second common word error correction text is larger than or equal to a third error correction threshold value, and if so, taking the updated second common word error correction text as the second correction text; if not, the second common word error correction text is used as the second correction text; wherein the third error correction threshold is greater than the first error correction threshold and the third error correction threshold is greater than the second error correction threshold.
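The cascade described above can be sketched as a loop over the three lists, where an update is kept only if it lowers the confusion degree by at least that list's threshold. The sketch makes several assumptions that the text does not fix: each common word list is modelled as a dictionary from a suspect segment to its common word, the difference is taken as the confusion degree before the update minus the confusion degree after it, and `segment`, `join` and `perplexity` are hypothetical caller-supplied functions.

```python
from typing import Callable, Dict, List, Sequence


def cascade_correct(
    text: str,
    word_lists: Sequence[Dict[str, str]],
    thresholds: Sequence[float],
    perplexity: Callable[[str], float],
    segment: Callable[[str], List[str]] = str.split,
    join: Callable[[List[str]], str] = " ".join,
) -> str:
    """Apply the common word lists in order, longest words first."""
    current = text
    for word_list, threshold in zip(word_lists, thresholds):
        # Word segmentation, then replace every segment that matches the list.
        tokens = segment(current)
        updated = join([word_list.get(token, token) for token in tokens])
        # Keep the update only if the confusion degree drops by the threshold.
        if perplexity(current) - perplexity(updated) >= threshold:
            current = updated
    return current
```

Because the third threshold is the largest, replacements from the list of shortest words need the strongest evidence before they are accepted.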
In one possible implementation manner, correcting the text to be corrected by a preset entity class common word list to obtain a second corrected text, including:
word segmentation operation is carried out on the text to be corrected;
matching the segmentation words included in the text to be corrected with the common word list of the preset entity category, and if the segmentation words are matched with the common words in the common word list of the preset entity category, updating the text to be corrected through the matched common words;
Judging whether the difference value between the confusion degree of the text to be corrected and the confusion degree of the updated text to be corrected is larger than or equal to a fourth correction threshold value; if yes, taking the updated text to be corrected as the second correction text; and if not, taking the text to be corrected as the second correction text.
In one possible implementation manner, the forward error correction performed on the text to be corrected through the preset shape near word list according to the sequence from the first word to the last word in the text to be corrected includes:
searching the preset shape near word list for shape near words of the first word in the text to be corrected;
respectively replacing the first word in the text to be corrected with each of its shape near words to obtain a plurality of first-level replacement texts;
calculating the confusion degree of each of the first-level replacement texts, and selecting, from the plurality of first-level replacement texts and in ascending order of confusion degree, a specified number of first-level replacement texts for which the difference value between their confusion degree and the confusion degree of the text to be corrected is greater than or equal to a fifth correction threshold;
for each of the selected specified number of first-level replacement texts, searching the preset shape near word list for shape near words of the second word in that first-level replacement text, respectively replacing the second word with each of its shape near words to obtain a plurality of second-level replacement texts corresponding to that first-level replacement text, and calculating the confusion degree of each of the second-level replacement texts;
selecting, from the plurality of second-level replacement texts and in ascending order of confusion degree, the specified number of second-level replacement texts for which the difference value between their confusion degree and the confusion degree of the text to be corrected is greater than the fifth correction threshold;
processing the selected second-level replacement texts, and the replacement texts of each subsequent level, in the same manner as the specified number of first-level replacement texts, until shape near word replacement has been performed on the last word of the text to be corrected, thereby obtaining the specified number of forward error correction texts;
and the reverse error correction performed on the text to be corrected through the preset shape near word list according to the sequence from the last word to the first word in the text to be corrected includes:
searching the preset shape near word list for shape near words of the last word in the text to be corrected;
respectively replacing the last word in the text to be corrected with each of its shape near words to obtain a plurality of N-th level replacement texts;
calculating the confusion degree of each of the N-th level replacement texts, and selecting, from the plurality of N-th level replacement texts and in ascending order of confusion degree, a specified number of N-th level replacement texts for which the difference value between their confusion degree and the confusion degree of the text to be corrected is greater than or equal to the fifth correction threshold;
for each of the selected specified number of N-th level replacement texts, searching the preset shape near word list for shape near words of the penultimate word in that N-th level replacement text, respectively replacing the penultimate word with each of its shape near words to obtain a plurality of (N-1)-th level replacement texts corresponding to that N-th level replacement text, and calculating the confusion degree of each of the (N-1)-th level replacement texts;
selecting, from the plurality of (N-1)-th level replacement texts and in ascending order of confusion degree, the specified number of (N-1)-th level replacement texts for which the difference value between their confusion degree and the confusion degree of the text to be corrected is greater than the fifth correction threshold;
processing the selected (N-1)-th level replacement texts, and the replacement texts of each subsequent level, in the same manner as the specified number of N-th level replacement texts, until shape near word replacement has been performed on the first word of the text to be corrected, thereby obtaining the specified number of reverse error correction texts;
and the determining of the near-word error correction text according to the error correction result obtained by forward error correction and the error correction result obtained by reverse error correction includes:
searching the specified number of forward error correction texts for error correction texts that are identical to error correction texts in the specified number of reverse error correction texts, and taking the found error correction text with the minimum confusion degree as the near-word error correction text;
and if no such identical error correction text is found, taking the text to be corrected as the near-word error correction text.
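The forward pass, the reverse pass and the final selection read like a beam search run in two directions, which can be sketched as below. This is only an illustration under stated assumptions: each "word" is treated as a single character, the shape near word list is a dictionary from a character to its look-alikes, the difference value is the confusion degree of the text to be corrected minus that of the replacement text, and texts already in the beam are carried forward unchanged when a position has no accepted replacement (a case the text does not spell out). All function and parameter names are hypothetical.

```python
from typing import Callable, Dict, List


def directional_search(
    text: str,
    similar: Dict[str, List[str]],
    perplexity: Callable[[str], float],
    beam_width: int,
    threshold: float,
    reverse: bool = False,
) -> List[str]:
    """Replace characters position by position, keeping the best few texts."""
    base = perplexity(text)
    beam = [text]
    order = range(len(text) - 1, -1, -1) if reverse else range(len(text))
    for i in order:
        pool = set(beam)  # assumption: surviving texts may pass through unchanged
        for candidate in beam:
            for replacement in similar.get(candidate[i], []):
                replaced = candidate[:i] + replacement + candidate[i + 1:]
                # Accept a replacement only if it lowers the confusion degree
                # of the original text by at least the threshold.
                if base - perplexity(replaced) >= threshold:
                    pool.add(replaced)
        # Ascending confusion degree; the text itself breaks ties deterministically.
        beam = sorted(pool, key=lambda t: (perplexity(t), t))[:beam_width]
    return beam


def merge_directions(text: str, forward: List[str], backward: List[str],
                     perplexity: Callable[[str], float]) -> str:
    """Pick the lowest-scoring text found in both directions, else the input."""
    shared = set(forward) & set(backward)
    if not shared:
        return text
    return min(shared, key=lambda t: (perplexity(t), t))
```

Requiring a candidate to appear in both the forward and the reverse results acts as an agreement check: a replacement that only looks good when the text is read in one direction is discarded.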
In one possible implementation manner, the forward error correction performed on the near-word error correction text through the preset common word list according to the sequence from the first word to the last word in the near-word error correction text includes:
searching the preset common word list for common words matched with the first word in the near-word error correction text;
respectively replacing the first word in the near-word error correction text with each common word matched with the first word to obtain a plurality of first-level replacement texts;
calculating the confusion degree of each of the first-level replacement texts, and selecting, from the plurality of first-level replacement texts and in ascending order of confusion degree, a specified number of first-level replacement texts for which the difference value between their confusion degree and the confusion degree of the near-word error correction text is greater than or equal to a sixth error correction threshold;
for each of the selected specified number of first-level replacement texts, searching the preset common word list for common words matched with the second word in that first-level replacement text, respectively replacing the second word with each common word matched with the second word to obtain a plurality of second-level replacement texts corresponding to that first-level replacement text, and calculating the confusion degree of each of the second-level replacement texts;
selecting, from the plurality of second-level replacement texts and in ascending order of confusion degree, the specified number of second-level replacement texts for which the difference value between their confusion degree and the confusion degree of the near-word error correction text is greater than or equal to the sixth error correction threshold;
processing the selected second-level replacement texts, and the replacement texts of each subsequent level, in the same manner as the specified number of first-level replacement texts, until common word replacement has been performed on the last word of the near-word error correction text, thereby obtaining the specified number of forward error correction texts;
and the reverse error correction performed on the near-word error correction text through the preset common word list according to the sequence from the last word to the first word in the near-word error correction text includes:
searching the preset common word list for common words matched with the last word in the near-word error correction text;
respectively replacing the last word in the near-word error correction text with each common word matched with the last word to obtain a plurality of N-th level replacement texts;
calculating the confusion degree of each of the N-th level replacement texts, and selecting, from the plurality of N-th level replacement texts and in ascending order of confusion degree, a specified number of N-th level replacement texts for which the difference value between their confusion degree and the confusion degree of the near-word error correction text is greater than or equal to the sixth error correction threshold;
for each of the selected specified number of N-th level replacement texts, searching the preset common word list for common words matched with the penultimate word in that N-th level replacement text, respectively replacing the penultimate word with each common word matched with the penultimate word to obtain a plurality of (N-1)-th level replacement texts corresponding to that N-th level replacement text, and calculating the confusion degree of each of the (N-1)-th level replacement texts;
selecting, from the plurality of (N-1)-th level replacement texts and in ascending order of confusion degree, the specified number of (N-1)-th level replacement texts for which the difference value between their confusion degree and the confusion degree of the near-word error correction text is greater than or equal to the sixth error correction threshold;
processing the selected (N-1)-th level replacement texts, and the replacement texts of each subsequent level, in the same manner as the specified number of N-th level replacement texts, until common word replacement has been performed on the first word of the near-word error correction text, thereby obtaining the specified number of reverse error correction texts;
and the determining of the first corrected text according to the error correction result obtained by forward error correction and the error correction result obtained by reverse error correction of the near-word error correction text includes:
searching the specified number of forward error correction texts for error correction texts that are identical to error correction texts in the specified number of reverse error correction texts, and taking the found error correction text with the minimum confusion degree as the first corrected text;
and if no such identical error correction text is found, taking the near-word error correction text as the first corrected text.
In a second aspect, an embodiment of the present invention provides a text error correction apparatus, including:
the obtaining module is used for obtaining the text to be corrected;
the first correction module is used for sequentially performing shape near word correction and common word correction on the text to be corrected to obtain a first correction text;
The second error correction module is used for correcting the common words of the text to be corrected to obtain a second corrected text;
and the determining module is used for obtaining the confusion degree of the first correction text and the second correction text, and determining the correction text with the lowest confusion degree as the correction text of the text to be corrected.
In one possible implementation manner, the first error correction module is specifically configured to:
forward correcting the text to be corrected according to the sequence from the first word to the last word in the text to be corrected through a preset shape near word list;
performing reverse error correction on the text to be corrected according to the sequence from the last word to the first word in the text to be corrected through the preset shape near word list;
determining a near word error correction text according to an error correction result obtained by forward error correction and an error correction result obtained by reverse error correction;
forward correcting the near-word correction text according to the sequence from the first word to the last word in the near-word correction text through a preset common word list;
performing reverse correction on the near-word correction text according to the sequence from the last word to the first word in the near-word correction text through the preset common word list;
And determining the first correction text according to the error correction result obtained by forward error correction and the error correction result obtained by reverse error correction of the near-word error correction text.
In one possible implementation manner, the second error correction module is specifically configured to:
correcting the text to be corrected by sequentially passing through the first common word list, the second common word list and the third common word list to obtain a second corrected text; the number of words included in each common word in the first common word list is a first number, the number of words included in each common word in the second common word list is a second number, the number of words included in each common word in the third common word list is a third number, the first number is greater than the second number, and the second number is greater than the third number; and/or,
and correcting the text to be corrected through a preset entity class common word list to obtain a second corrected text.
In one possible implementation manner, the second error correction module is specifically configured to:
word segmentation operation is carried out on the text to be corrected;
matching the segmentation words included in the text to be corrected with the common words in the first common word list, and if the segmentation words are matched with the common words in the first common word list, updating the text to be corrected through the matched common words;
Judging whether the difference value between the confusion degree of the text to be corrected and the updated confusion degree of the text to be corrected is larger than or equal to a first correction threshold value, if so, taking the updated text to be corrected as a first common word correction text; if not, the text to be corrected is used as a first common word correction text;
matching the word segmentation included in the first common word error correction text with the common words in a second common word list, and if the word segmentation included in the first common word error correction text is matched with the common words in the second common word list, updating the first common word error correction text through the matched common words;
judging whether the difference value between the confusion degree of the first frequently used word error correction text and the confusion degree of the updated first frequently used word error correction text is larger than or equal to a second error correction threshold value, and if so, taking the updated first frequently used word error correction text as a second frequently used word error correction text; if not, the first common word error correction text is used as a second common word error correction text;
matching the word segmentation included in the second common word error correction text with the common words in a third common word list, and if the word segmentation included in the second common word error correction text is matched with the common words in the third common word list, updating the second common word error correction text through the matched common words;
Judging whether the difference value between the confusion degree of the second common word error correction text and the confusion degree of the updated second common word error correction text is larger than or equal to a third error correction threshold value, and if so, taking the updated second common word error correction text as the second correction text; if not, the second common word error correction text is used as the second correction text; wherein the third error correction threshold is greater than the first error correction threshold and the third error correction threshold is greater than the second error correction threshold.
In one possible implementation manner, the second error correction module is specifically configured to:
word segmentation operation is carried out on the text to be corrected;
matching the segmentation words included in the text to be corrected with the common word list of the preset entity category, and if the segmentation words are matched with the common words in the common word list of the preset entity category, updating the text to be corrected through the matched common words;
judging whether the difference value between the confusion degree of the text to be corrected and the confusion degree of the updated text to be corrected is larger than or equal to a fourth correction threshold value; if yes, taking the updated text to be corrected as the second correction text; and if not, taking the text to be corrected as the second correction text.
In one possible implementation manner, the first error correction module is specifically configured to:
searching the preset shape near word list for shape near words of the first word in the text to be corrected;
respectively replacing the first word in the text to be corrected with each of its shape near words to obtain a plurality of first-level replacement texts;
calculating the confusion degree of each of the first-level replacement texts, and selecting, from the plurality of first-level replacement texts and in ascending order of confusion degree, a specified number of first-level replacement texts for which the difference value between their confusion degree and the confusion degree of the text to be corrected is greater than or equal to a fifth correction threshold;
for each of the selected specified number of first-level replacement texts, searching the preset shape near word list for shape near words of the second word in that first-level replacement text, respectively replacing the second word with each of its shape near words to obtain a plurality of second-level replacement texts corresponding to that first-level replacement text, and calculating the confusion degree of each of the second-level replacement texts;
selecting, from the plurality of second-level replacement texts and in ascending order of confusion degree, the specified number of second-level replacement texts for which the difference value between their confusion degree and the confusion degree of the text to be corrected is greater than the fifth correction threshold;
processing the selected second-level replacement texts, and the replacement texts of each subsequent level, in the same manner as the specified number of first-level replacement texts, until shape near word replacement has been performed on the last word of the text to be corrected, thereby obtaining the specified number of forward error correction texts;
the first error correction module is specifically further configured to:
searching the shape near word of the last word in the text to be corrected from the preset shape near word list;
respectively replacing the last word in the text to be corrected with each corresponding shape near word to obtain a plurality of N-th level replacement texts;
calculating the confusion degree of each Nth-level replacement text in the plurality of Nth-level replacement texts, and selecting the Nth-level replacement texts with the difference value between the specified number of confusion degrees and the confusion degree of the text to be corrected being greater than or equal to a fifth correction threshold value from the plurality of Nth-level replacement texts according to the order of the confusion degree from small to large;
For each N-th level replacement text in the selected appointed number of N-th level replacement texts, searching a shape near word of the penultimate word in the N-th level replacement text from the preset shape near word list, respectively replacing the penultimate word in the N-th level replacement text with each shape near word corresponding to the penultimate word to obtain a plurality of N-1-th level replacement texts corresponding to the N-th level replacement text, and calculating the confusion degree of each N-1-th level replacement text in the plurality of N-1-th level replacement texts;
selecting, from the plurality of N-1-th level replacement texts and in order of confusion degree from small to large, the specified number of N-1-th level replacement texts for which the difference between the confusion degree of the text to be corrected and that of the N-1-th level replacement text is greater than the fifth correction threshold value;
processing the selected N-1-th level replacement texts and each level of replacement text obtained subsequently according to the processing mode of the specified number of N-th level replacement texts until the first word of the text to be corrected has undergone shape near word replacement, thereby obtaining the specified number of reverse correction texts;
the first error correction module is specifically further configured to:
searching for error correction texts that appear in both the specified number of forward error correction texts and the specified number of reverse error correction texts, and taking the error correction text with the minimum confusion degree among the found error correction texts as the near-word error correction text;
and if no identical error correction text is found in the specified number of forward error correction texts and the specified number of reverse error correction texts, taking the text to be corrected as the near-word error correction text.
In one possible implementation manner, the first error correction module is specifically configured to:
searching a common word matched with the first word in the near word error correction text from the preset common word list;
respectively replacing a first word in the near-shape word error correction text with each common word matched with the first word to obtain a plurality of first-stage replacement texts;
calculating the confusion degree of each first-level replacement text in the plurality of first-level replacement texts, and selecting a first-level replacement text with the difference value between a specified number of confusion degrees and the confusion degree of the near-word error correction text being greater than or equal to a sixth error correction threshold value from the plurality of first-level replacement texts according to the order of the confusion degree from small to large;
searching for a common word matched with a second word in the first-stage replacement text from the preset common word list aiming at each first-stage replacement text in the selected appointed number of first-stage replacement texts, respectively replacing the second word in the first-stage replacement text with each common word matched with the second word to obtain a plurality of second-stage replacement texts corresponding to the first-stage replacement text, and calculating the confusion degree of each second-stage replacement text in the plurality of second-stage replacement texts;
Selecting second-level replacement texts with the difference value of the specified number of confusion degrees and the confusion degree of the near-word error correction text being greater than or equal to the sixth error correction threshold from the second-level replacement texts according to the sequence of the confusion degrees from small to large;
processing the selected second-level replacement text and each level of replacement text obtained subsequently according to a processing mode of the appointed number of first-level replacement texts until the last word of the near word error correction text is replaced by a common word, so as to obtain the appointed number of forward error correction texts;
the first error correction module is specifically further configured to:
searching a common word matched with the last word in the shape near word error correction text from the preset common word list;
respectively replacing the last word in the near-shape word error correction text with each common word matched with the last word to obtain a plurality of N-th level replacement texts;
calculating the confusion degree of each Nth-level replacement text in the plurality of Nth-level replacement texts, and selecting the Nth-level replacement texts with the difference value between the specified number of confusion degrees and the confusion degree of the near-word error correction text being greater than or equal to a sixth error correction threshold value from the plurality of Nth-level replacement texts according to the order of the confusion degree from small to large;
searching, for each N-th level replacement text in the selected specified number of N-th level replacement texts, a common word matched with the penultimate word in the N-th level replacement text from the preset common word list, respectively replacing the penultimate word in the N-th level replacement text with each common word matched with the penultimate word to obtain a plurality of N-1-th level replacement texts corresponding to the N-th level replacement text, and calculating the confusion degree of each N-1-th level replacement text in the plurality of N-1-th level replacement texts;
selecting N-1 level replacement texts with the difference value of the specified number of confusion degrees and the confusion degree of the near-word error correction text being greater than or equal to the sixth error correction threshold from the N-1 level replacement texts according to the sequence of the confusion degrees from small to large;
processing the selected N-1 level replacement text and each level of replacement text obtained subsequently according to a processing mode of the appointed number of N level replacement texts until the first word of the near word error correction text is replaced by a common word, and obtaining the appointed number of reverse error correction texts;
the first error correction module is specifically further configured to:
searching for error correction texts that appear in both the specified number of forward error correction texts and the specified number of reverse error correction texts, and taking the error correction text with the minimum confusion degree among the found error correction texts as the first correction text;
and if no identical error correction text is found in the specified number of forward error correction texts and the specified number of reverse error correction texts, taking the near word error correction text as the first correction text.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of the first aspects when executing a program stored on a memory.
In a fourth aspect, an embodiment of the present invention further provides a readable storage medium, in which a computer program is stored, which when executed by a processor implements the method steps of any of the first aspects.
In a fifth aspect, an embodiment of the present invention provides a computer program product comprising instructions which, when run on an electronic device, cause the electronic device to perform the method steps of any of the first aspects.
With the above technical solution, shape near word correction and then common word correction are performed on the text to be corrected to obtain the first corrected text, and common word correction alone is performed on the text to be corrected to obtain the second corrected text. The first path handles text obtained by OCR recognition that contains both shape near word recognition errors and common word recognition errors, while the second path handles text obtained by OCR recognition that contains common word recognition errors. In addition, in the embodiment of the application, whichever of the first corrected text and the second corrected text has the lowest confusion degree is used as the corrected text of the text to be corrected, which further improves the accuracy of text correction.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flowchart of a text error correction method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for determining a first corrected text according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for performing shape near word forward error correction in a text error correction method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for performing shape near word reverse error correction in a text error correction method according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for performing common word forward error correction in a text error correction method according to an embodiment of the present invention;
FIG. 6 is a flowchart of a method for performing common word reverse error correction in a text error correction method according to an embodiment of the present invention;
FIG. 7 is a flowchart of a method for determining a second corrected text according to an embodiment of the present invention;
FIG. 8 is a flowchart of another method for determining a second corrected text according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a text error correction model according to an embodiment of the present invention;
FIG. 10 is an exemplary schematic diagram of a text error correction method according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a text error correction device according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
In order to solve the problem that a text error correction mode in the related art cannot accurately correct texts with more error contents and complex error types, the embodiment of the invention provides a text error correction method, a text error correction device, electronic equipment and a readable storage medium.
The text error correction method provided by the embodiment of the invention is first described below.
Fig. 1 is a flowchart of a text error correction method according to an embodiment of the present invention, where the method may be applied to an electronic device. As shown in fig. 1, the text error correction method may include the steps of:
S101, obtaining a text to be corrected.
S102, performing shape near word correction and common word correction on the text to be corrected in sequence to obtain a first corrected text.
S103, performing common word correction on the text to be corrected to obtain a second correction text.
S104, obtaining the confusion degree of the first correction text and the second correction text, and determining the correction text with the lowest confusion degree as the correction text of the text to be corrected.
With the above technical solution, shape near word correction and then common word correction are performed on the text to be corrected to obtain the first corrected text, and common word correction alone is performed on the text to be corrected to obtain the second corrected text. The first path handles text obtained by OCR recognition that contains both shape near word recognition errors and common word recognition errors, while the second path handles text obtained by OCR recognition that contains common word recognition errors. In addition, in the embodiment of the application, whichever of the first corrected text and the second corrected text has the lowest confusion degree is used as the corrected text of the text to be corrected, which further improves the accuracy of text correction.
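The overall flow of S101 to S104 can be sketched as follows. This is a minimal illustration only: the function name `correct_text` and the three callables passed to it are hypothetical and stand in for the two correction paths and the confusion degree calculation described in this document.

```python
from typing import Callable


def correct_text(
    text: str,
    first_path: Callable[[str], str],    # S102: shape near word correction, then common word correction
    second_path: Callable[[str], str],   # S103: common word correction only
    perplexity: Callable[[str], float],  # confusion degree of a text
) -> str:
    """Return whichever corrected text has the lowest confusion degree (S104)."""
    first_corrected = first_path(text)
    second_corrected = second_path(text)
    # On a tie, min() keeps the first argument, i.e. the first corrected text.
    return min((first_corrected, second_corrected), key=perplexity)
```

Passing the two paths in as callables keeps the selection step independent of how each path is implemented.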
For S101 described above, the text to be corrected in the embodiment of the present invention may be text obtained by OCR recognition. For example, the text to be corrected may be "the cell room elimination channel is blocked". Of course, the text to be corrected in the embodiment of the present invention may be any text that needs to be corrected, which is not limited in the embodiment of the present invention.
With respect to S104 described above, in one implementation, the present embodiment may calculate the confusion degree (perplexity) of a text through an n-gram model. The n-gram model is a language model that obtains the confusion degree of a text by counting and calculating conditional probabilities over a large-scale corpus. Specifically, it counts the conditional probability that the (n+1)-th word occurs given that the preceding n words in the text have occurred. For a text, the higher the confusion degree, the less reasonable the text is, that is, the greater the likelihood that the text contains erroneous words. The lower the confusion degree, the more reasonable the text is, that is, the lower the likelihood that the text contains erroneous words.
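As an illustration of how a confusion degree can be computed, the following sketch trains a character-level bigram model (n-gram with a one-word history) on a tiny toy corpus and scores a text with it. The add-one smoothing, the start and end markers, and the function names are assumptions made for this example; the embodiment does not specify them.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams over sentences padded with start/end markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(text, unigrams, bigrams):
    """Confusion degree of `text` under an add-one smoothed bigram model."""
    tokens = ["<s>"] + list(text) + ["</s>"]
    vocab_size = len(unigrams)
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Conditional probability of the current word given the previous one.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    # Geometric mean of the inverse probabilities over all transitions.
    return math.exp(-log_prob / (len(tokens) - 1))


unigrams, bigrams = train_bigram(["the cat sat", "the cat ran", "the dog sat"])
clean = perplexity("the cat sat", unigrams, bigrams)
noisy = perplexity("thx cat sat", unigrams, bigrams)
print(clean < noisy)
```

The text containing the unseen word receives the higher confusion degree, which is the signal the correction steps rely on.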
In one embodiment of the present application, as shown in fig. 2, the step S102 of performing shape near word correction and common word correction on the text to be corrected in sequence to obtain a first corrected text may be specifically implemented as the following steps:
S1021, performing forward error correction on the text to be corrected through a preset shape near word list, in order from the first word to the last word in the text to be corrected.
The preset shape near word list may include the shape near words of a plurality of pre-selected words. For example, the shape near words of the word "ok" may include "what" and "river". Optionally, the preset shape near word list may be generated from information used in OCR recognition, such as stroke order, strokes, structure, and radicals, but is not limited thereto.
S1022, performing reverse correction on the text to be corrected according to the sequence from the last word to the first word in the text to be corrected through a preset shape near word list.
The forward error correction and reverse error correction of the text to be corrected in S1021 and S1022 may be implemented by a bidirectional beam-search algorithm, and the specific error correction process will be described below.
S1023, determining the correction text of the near word according to the correction result obtained by forward correction and the correction result obtained by reverse correction.
S1024, forward error correction is carried out on the near-word error correction text according to the sequence from the first word to the last word in the near-word error correction text through a preset common word list.
The preset common word list includes words used with high frequency, and may also be called a full word list. For example, the preset common word list may include commonly used words such as "I", "you", and "he". In addition to shape near word errors, the result of OCR recognition may include some irregular erroneous words, and these erroneous words are usually commonly used Chinese characters. Therefore, some commonly used Chinese characters can be selected through word frequency statistics and used as the preset common word list of each word, and the same preset common word list can be used for all Chinese characters.
S1025, performing reverse correction on the near word correction text according to the sequence from the last word to the first word in the near word correction text through a preset common word list.
The forward error correction and reverse error correction of the near word error correction text in S1024 and S1025 may also be implemented through a bidirectional beam-search, and a specific error correction process will be described below.
S1026, determining a first correction text according to the error correction result obtained by forward error correction and the error correction result obtained by reverse error correction on the near-word error correction text.
By adopting the technical scheme, the text to be corrected can be subjected to bidirectional correction through the preset shape near word list to obtain the shape near word correction text, and then the shape near word correction text is subjected to bidirectional correction through the preset common word list to obtain the first correction text. Bidirectional correction improves correction accuracy for the following reason. Suppose a word in the text consists of two single words and the latter is an erroneous word. If error correction is performed only in the front-to-back direction, the former word may be modified first under the influence of the latter erroneous word, resulting in an erroneous modification. Conversely, if the former of the two words is an erroneous word and error correction is performed only in the back-to-front direction, the latter word may be modified first, again resulting in an erroneous modification. In the embodiment of the application, a bidirectional error correction mode is adopted, so that in at least one direction the erroneous word is more likely to be corrected properly, which increases the possibility of correctly recalling the erroneous word and improves the error correction precision.
The error correction process referred to in fig. 2 is described in detail below.
As shown in fig. 3, the step S1021 of forward error correction is performed on the text to be error corrected according to the sequence from the first word to the last word in the text to be error corrected through the preset shape near word list, and may specifically be implemented as the following steps:
S10211, searching a shape near word of a first word in the text to be corrected from a preset shape near word list.
S10212, respectively replacing the first word in the text to be corrected with each corresponding shape near word to obtain a plurality of first-stage replacement texts.
For example, if the first word in the text to be corrected has 10 near-words, the first word in the text to be corrected is replaced by the 10 near-words, so that 10 first-level replacement texts can be obtained.
S10213, calculating the confusion degree of each first-level replacement text in the plurality of first-level replacement texts, and selecting the first-level replacement texts with the difference value between the specified number of confusion degrees and the confusion degree of the text to be corrected being greater than or equal to a fifth correction threshold value from the plurality of first-level replacement texts according to the sequence of the confusion degree from small to large.
The lower the confusion degree, the more reasonable the text, so the purpose of correcting the text is to reduce the confusion degree of the text. If the confusion degree of a first-level replacement text is lower than that of the text to be corrected, the first-level replacement text is more reasonable than the text to be corrected, which indicates that an erroneous word in the text to be corrected has been corrected.
In order to avoid changing correct words in the text to be corrected into incorrect words, and to ensure that the rationality of the text obtained after correction is higher than that of the text before correction, the fifth correction threshold is set, and if the difference obtained by subtracting the confusion degree of one of the first-stage replacement texts from the confusion degree of the text to be corrected is greater than the fifth correction threshold, the first-stage replacement text can be temporarily reserved. The fifth error correction threshold is a loose error correction threshold, that is, the value of the fifth error correction threshold is smaller.
Further, from among the first-level replacement texts satisfying the above-described fifth error correction threshold limit, a specified number of first-level replacement texts having the lowest degree of confusion may be selected to execute the next step. The specified number may be set according to actual conditions.
S10214, for each first-stage replacement text in the selected specified number of first-stage replacement texts, searching a shape near word of a second word in the first-stage replacement text from a preset shape near word list, replacing the second word in the first-stage replacement text with each shape near word corresponding to the second word, obtaining a plurality of second-stage replacement texts corresponding to the first-stage replacement text, and calculating the confusion degree of each second-stage replacement text in the plurality of second-stage replacement texts.
S10215, selecting, from the plurality of second-level replacement texts and in order of confusion degree from small to large, the specified number of second-level replacement texts for which the difference between the confusion degree of the text to be corrected and that of the second-level replacement text is greater than the fifth correction threshold value.
The method for selecting the second level of replacement text is the same as the method for selecting the first level of replacement text in the above steps, and reference is made to the description related to the above, and details are not repeated here.
S10216, processing the selected second-level replacement texts and each level of replacement text obtained subsequently according to the processing mode of the specified number of first-level replacement texts until the last word of the text to be corrected has undergone shape near word replacement, thereby obtaining the specified number of forward correction texts.
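Steps S10211 to S10216 amount to a beam search over word positions. The sketch below is one possible reading rather than the definitive procedure. It assumes that each word is a single character, that `substitutes` is a hypothetical table mapping a word to its shape near words, and that texts already selected remain candidates at the next position so that a word may be left unchanged (the description does not state this explicitly). The parameter `reverse` is also an assumption of this sketch.

```python
from typing import Callable, Dict, List, Sequence


def beam_correct(
    text: str,
    substitutes: Dict[str, Sequence[str]],  # hypothetical shape near word table
    perplexity: Callable[[str], float],     # confusion degree of a text
    threshold: float,                       # e.g. the fifth correction threshold
    beam_width: int,                        # the "specified number"
    reverse: bool = False,
) -> List[str]:
    base = perplexity(text)
    beam = [text]
    positions = range(len(text) - 1, -1, -1) if reverse else range(len(text))
    for i in positions:
        # Assumption: texts already in the beam stay in the running, so a
        # word may also be left unchanged at this position.
        pool = set(beam)
        for current in beam:
            for replacement in substitutes.get(current[i], ()):
                candidate = current[:i] + replacement + current[i + 1:]
                # Keep a candidate only if it lowers the confusion degree of
                # the original text by at least the threshold.
                if base - perplexity(candidate) >= threshold:
                    pool.add(candidate)
        # Lowest confusion degree first; ties are broken alphabetically.
        beam = sorted(pool, key=lambda t: (perplexity(t), t))[:beam_width]
    return beam
```

Calling the function with `reverse=True` visits the positions from the last word to the first, which corresponds to the reverse pass.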
As shown in fig. 4, similar to the flow of fig. 3, the step S1022 of performing reverse error correction on the text to be corrected through the preset shape near word list, in order from the last word to the first word in the text to be corrected, specifically includes the following steps:
S10221, searching the shape near word of the last word in the text to be corrected from a preset shape near word list.
S10222, respectively replacing the last word in the text to be corrected with each shape near word corresponding to the last word to obtain a plurality of N-th level replacement texts.
S10223, calculating the confusion degree of each Nth-level replacement text in the plurality of Nth-level replacement texts, and selecting the Nth-level replacement texts with the difference value between the specified number of confusion degrees and the confusion degree of the text to be corrected being greater than or equal to a fifth correction threshold value from the plurality of Nth-level replacement texts according to the order of the confusion degree from small to large.
S10224, searching a shape near word of the penultimate word in the N-th level replacement text from a preset shape near word list aiming at each N-th level replacement text in the selected appointed number, respectively replacing the penultimate word in the N-th level replacement text with each shape near word corresponding to the penultimate word to obtain a plurality of N-1-th level replacement texts corresponding to the N-th level replacement text, and calculating the confusion degree of each N-1-th level replacement text in the N-1-th level replacement texts.
S10225, selecting, from the plurality of N-1-th level replacement texts and in order of confusion degree from small to large, the specified number of N-1-th level replacement texts for which the difference between the confusion degree of the text to be corrected and that of the N-1-th level replacement text is greater than the fifth correction threshold value.
S10226, processing the selected N-1-th level replacement texts and each level of replacement text obtained subsequently according to the processing mode of the specified number of N-th level replacement texts until the first word of the text to be corrected has undergone shape near word replacement, thereby obtaining the specified number of reverse error correction texts.
After obtaining the specified number of forward error correction texts and the specified number of reverse error correction texts, S1023, determining the near word error correction text according to the error correction result obtained by forward error correction and the error correction result obtained by reverse error correction, may be specifically implemented as follows:
searching the same error correction texts in the specified number of forward error correction texts and the specified number of reverse error correction texts, and taking the error correction text with the minimum confusion degree in the searched error correction texts as a near word error correction text;
if the same error correction text is not found in the specified number of forward error correction text and the specified number of reverse error correction text, the text to be corrected is taken as the near word correction text.
In the embodiment of the application, when forward error correction and reverse error correction of the text to be corrected yield the same error correction text, that text is more likely to be correct, so the error correction accuracy is improved to a certain extent. Otherwise, if no identical error correction text is obtained after the forward error correction and the reverse error correction are respectively performed, this indicates that errors may exist in both the forward and the reverse error correction results. In that case the text to be corrected as it was before shape near word correction is retained, namely the original text to be corrected is used as the result of shape near word correction, so as to ensure the error correction precision in the subsequent process.
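The merging rule of S1023 can be sketched as follows. The function name `merge_bidirectional` and its parameters are hypothetical; `perplexity` stands for whatever confusion degree calculation is in use.

```python
from typing import Callable, Iterable


def merge_bidirectional(
    forward: Iterable[str],
    backward: Iterable[str],
    original: str,
    perplexity: Callable[[str], float],
) -> str:
    """Pick the lowest-confusion text found by both passes, else keep the original."""
    common = set(forward) & set(backward)
    if not common:
        # The two passes disagree, so the text before correction is retained.
        return original
    # Ties on confusion degree are broken alphabetically to stay deterministic.
    return min(common, key=lambda t: (perplexity(t), t))
```

The same rule applies to S1026, with the near word error correction text taking the place of `original`.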
After the near-word error correction text is obtained, common word error correction can be further performed on the basis of the near-word error correction text. As shown in fig. 5, the step S1024 of performing forward error correction on the near-word error correction text through the preset common word list, in order from the first word to the last word in the near-word error correction text, specifically includes the following steps:
S10241, searching a common word matched with the first word in the near word error correction text from a preset common word list.
S10242, replacing the first word in the near word error correction text with each common word matched with the first word, and obtaining a plurality of first-stage replacement texts.
S10243, calculating the confusion degree of each first-level replacement text in the plurality of first-level replacement texts, and selecting the first-level replacement texts with the difference value between the specified number of confusion degrees and the confusion degree of the near-word error correction text being greater than or equal to a sixth error correction threshold value from the plurality of first-level replacement texts according to the order of the confusion degree from small to large.
The sixth error correction threshold may be a strict error correction threshold, that is, the sixth error correction threshold has a larger value. The sixth error correction threshold is greater than the fifth error correction threshold.
S10244, searching for a common word matched with a second word in the first-stage replacement text from a preset common word list aiming at each first-stage replacement text in the selected specified number of first-stage replacement texts, replacing the second word in the first-stage replacement text with each common word matched with the second word, obtaining a plurality of second-stage replacement texts corresponding to the first-stage replacement text, and calculating the confusion degree of each second-stage replacement text in the plurality of second-stage replacement texts.
S10245, selecting a second-level replacement text with the difference value of the specified number of confusion degrees and the confusion degrees of the near-word error correction text being larger than or equal to a sixth error correction threshold value from the second-level replacement texts according to the sequence of the confusion degrees from small to large.
S10246, processing the selected second-level replacement text and each level of replacement text obtained subsequently according to a processing mode of the specified number of first-level replacement texts until the last word of the near word error correction text is replaced by the common word, and obtaining the specified number of forward error correction texts.
As shown in fig. 6, similar to the flow of fig. 5, the step S1025 of performing reverse error correction on the near-word error correction text through the preset common word list, in order from the last word to the first word in the near-word error correction text, specifically includes the following steps:
S10251, searching a common word matched with the last word in the near word error correction text from a preset common word list.
S10252, respectively replacing the last word in the near word error correction text with each common word matched with the last word to obtain a plurality of N-th level replacement texts.
S10253, calculating the confusion degree of each Nth-level replacement text in the plurality of Nth-level replacement texts, and selecting, from the plurality of Nth-level replacement texts and in order of confusion degree from small to large, the specified number of Nth-level replacement texts for which the difference between the confusion degree of the near-word error correction text and that of the Nth-level replacement text is greater than or equal to the sixth error correction threshold value.
S10254, for each N-th level replacement text in the selected specified number of N-th level replacement texts, searching a common word matched with the penultimate word in the N-th level replacement text from the preset common word list, respectively replacing the penultimate word in the N-th level replacement text with each common word matched with the penultimate word to obtain a plurality of N-1-th level replacement texts corresponding to the N-th level replacement text, and calculating the confusion degree of each N-1-th level replacement text in the plurality of N-1-th level replacement texts.
S10255, selecting N-1 level replacement texts with the difference value of the specified number of confusion degrees and the confusion degree of the near-word error correction text being larger than or equal to a sixth error correction threshold value from the N-1 level replacement texts according to the sequence of the confusion degrees from small to large.
S10256, processing the selected N-1 level replacement text and each level of replacement text obtained subsequently according to a processing mode of the appointed number of N level replacement texts until the first word of the near word error correction text is replaced by the common word, and obtaining the appointed number of reverse error correction texts.
After obtaining the specified number of forward error correction texts and the specified number of reverse error correction texts, the step S1026 of determining the first correction text according to the error correction result obtained by forward error correction on the near word error correction text and the error correction result obtained by reverse error correction may be specifically implemented as follows:
searching for error correction texts that appear in both the specified number of forward error correction texts and the specified number of reverse error correction texts, and taking the one with the lowest confusion degree among them as the first correction text;
and if no error correction text appears in both the specified number of forward error correction texts and the specified number of reverse error correction texts, taking the near-word error correction text as the first correction text.
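The selection rule just described can be sketched in a few lines of Python. This is only an illustration: the function name `choose_first_correction`, the score table and the toy strings are hypothetical, and the confusion degree is supplied as an arbitrary callable.

```python
from typing import Callable, Iterable

def choose_first_correction(forward: Iterable[str],
                            reverse: Iterable[str],
                            fallback: str,
                            perplexity: Callable[[str], float]) -> str:
    """Pick the lowest-scoring text found by both passes, else return the fallback."""
    shared = set(forward) & set(reverse)
    if not shared:
        # No text was produced by both directions: keep the uncorrected input.
        return fallback
    # sorted() only makes tie-breaking deterministic.
    return min(sorted(shared), key=perplexity)

# Hypothetical confusion degrees for three toy texts.
scores = {"abcd": 1.0, "abcx": 4.0, "obcd": 4.0}
picked = choose_first_correction(["abcd", "abcx"], ["abcd", "abcx", "obcd"],
                                 "abcx", scores.__getitem__)
unchanged = choose_first_correction(["abcd"], ["obcd"], "abcx", scores.__getitem__)
```

Requiring agreement between the two directions acts as a consistency check: a replacement is accepted only when it is reached both when scanning from the start and when scanning from the end.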
The method of error correction of the common words described in fig. 5 and 6 is similar to the method of error correction of the near words described in fig. 3 and 4, and reference is made to the description of fig. 3 and 4 above for relevant points.
In another embodiment of the present application, the step S103 of performing common word correction on the text to be corrected to obtain the second correction text may be implemented in any of the following three modes:
Mode one: correcting the text to be corrected with the first common word list, the second common word list and the third common word list in sequence to obtain the second correction text.
The number of words included in each common word in the first common word list is at least a first number, the number of words included in each common word in the second common word list is a second number, the number of words included in each common word in the third common word list is a third number, the first number is larger than the second number, and the second number is larger than the third number.
For example, the first number is 4, the second number is 3, and the third number is 2.
As shown in fig. 7, mode one specifically includes the following steps:
S701, performing a word segmentation operation on the text to be corrected.
In one embodiment, the word segmentation operation can be performed on the text to be corrected by a sub-word segmentation method. The sub-word segmentation method is a statistics-based word segmentation method that uses the BPE (Byte Pair Encoding) algorithm to collect high-frequency phrases from a large-scale corpus. When a text is segmented with this method, words that can be combined into a high-frequency phrase are preferentially kept together. A word that ends up as a separate single-word token is therefore likely to be a wrong word, because a wrong word can rarely be combined with its neighbouring words into a high-frequency phrase. The sub-word segmentation method thus narrows the range in which wrong words must be located, which improves both error correction precision and error correction recall.
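To make the intuition concrete, the sketch below learns a handful of BPE-style merges from a toy corpus and then flags tokens that remain single characters after segmentation. It is a simplified, character-level illustration rather than the segmenter used in the embodiment; the function names, the toy corpus and the rule "a length-one token is suspicious" are assumptions made for this example.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]

def merge_pair(seq: List[str], pair: Pair) -> List[str]:
    """Fuse every adjacent occurrence of `pair` into a single token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(corpus: List[str], num_merges: int) -> List[Pair]:
    """Repeatedly merge the most frequent adjacent pair (the core BPE loop)."""
    seqs = [list(s) for s in corpus]
    merges: List[Pair] = []
    for _ in range(num_merges):
        counts: Counter = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        # Most frequent pair; ties broken by the pair itself for determinism.
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges

def segment(text: str, merges: List[Pair]) -> List[str]:
    """Apply the learned merges, in order, to a new text."""
    seq = list(text)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

# Toy corpus in which "ab" and "cd" are the high-frequency units.
merges = learn_merges(["abcd", "abcd", "cdab"], num_merges=2)
tokens = segment("abxcd", merges)
suspects = [tok for tok in tokens if len(tok) == 1]   # isolated characters
```

Here the intruding character "x" cannot be merged with either neighbour, so it is the only token left isolated and the only position that later correction steps would need to examine.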
S702, matching the segmentation words included in the text to be corrected with the common words in the first common word list, and if the segmentation words are matched with the common words in the first common word list, updating the text to be corrected through the matched common words.
Wherein each of the commonly used words in the first list of commonly used words comprises a first number of words. Alternatively, each of the common words in the first common word list includes a number of words that is the first number or more.
For example, the first number is 4, and the common words included in the first common word list are four-word words. Suppose the text to be corrected is "can the district room elimination channel be blocked?". By matching the text to be corrected against the first common word list, it can be determined that "room elimination channel" matches "fire fighting channel" in the first common word list, so the text to be corrected can be updated to "can the district fire fighting channel be blocked?".
S703, judging whether the difference between the confusion degree of the text to be corrected and the confusion degree of the updated text to be corrected is greater than or equal to a first error correction threshold. If yes, S704 is performed; if not, S705 is performed.
The first error correction threshold is a loose error correction threshold, that is, its value is relatively small.
S704, taking the updated text to be corrected as a first common word error correction text.
S705, taking the text to be corrected as the first common word error correction text.
S706, matching the segmentation words included in the first common word error correction text with the common words in the second common word list, and if the segmentation words are matched with the common words in the second common word list, updating the first common word error correction text through the matched common words.
S707, judging whether the difference between the confusion degree of the first common word error correction text and the confusion degree of the updated first common word error correction text is greater than or equal to a second error correction threshold. If yes, S708 is performed; if not, S709 is performed.
S708, taking the updated first common word error correction text as a second common word error correction text.
S709, taking the first common word error correction text as the second common word error correction text.
S710, matching the word segments included in the second common word error correction text with the common words in the third common word list, and if a word segment matches a common word in the third common word list, updating the second common word error correction text with the matched common word.
S711, judging whether the difference value between the confusion degree of the second common word error correction text and the confusion degree of the updated second common word error correction text is larger than or equal to a third error correction threshold value. If yes, then execute S712; if not, S713 is performed.
The third error correction threshold is a strict error correction threshold, that is, the value of the third error correction threshold is larger. The third error correction threshold is greater than the first error correction threshold and the third error correction threshold is greater than the second error correction threshold.
S712, taking the updated second common word error correction text as the second correction text.
S713, taking the second common word error correction text as the second correction text.
After the second correction text is obtained, the correction text with the lowest confusion degree in the first correction text and the second correction text can be determined to be the correction text of the text to be corrected.
With this technical solution, the word segments included in the text to be corrected are corrected with the first common word list, the second common word list and the third common word list in sequence, so that the text is corrected separately for common words of different lengths; the correction process is more fine-grained and the correction precision is improved.
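The flow S701 to S713 can be read as a cascade in which each stage proposes an update and keeps it only if the confusion degree drops by at least that stage's threshold. The Python sketch below illustrates this reading under several assumptions that are not taken from the source: fixed-length sliding windows stand in for real word segmentation, a window "matches" a common word when both have the same length and differ in exactly one character, and `toy_perplexity` is a made-up scoring function. All names and values are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

def near_match(window: str, word: str) -> bool:
    # Hypothetical matching rule: same length and exactly one differing character.
    return len(window) == len(word) and sum(a != b for a, b in zip(window, word)) == 1

def apply_list(text: str, words: Sequence[str], length: int) -> str:
    """Replace each length-`length` window that nearly matches a listed common word."""
    i = 0
    while i + length <= len(text):
        hit = next((w for w in words if near_match(text[i:i + length], w)), None)
        if hit is None:
            i += 1
        else:
            text = text[:i] + hit + text[i + length:]
            i += length
    return text

Stage = Tuple[Sequence[str], int, float]   # (common word list, word length, threshold)

def cascade(text: str, stages: List[Stage], perplexity: Callable[[str], float]) -> str:
    current = text
    for words, length, threshold in stages:
        updated = apply_list(current, words, length)
        # Keep the update only if it lowers the confusion degree by at least the threshold.
        if perplexity(current) - perplexity(updated) >= threshold:
            current = updated
    return current

KNOWN = ("fire", "the", "ok")

def toy_perplexity(text: str) -> float:
    # Toy score: every known word present in the text lowers the value by 2.
    return 10.0 - 2.0 * sum(word in text for word in KNOWN)

# Longest words first; the last stage uses the strictest (largest) threshold.
stages = [(["fire"], 4, 0.5), (["the"], 3, 1.0), (["ok"], 2, 5.0)]
result = cascade("fixe thx oz", stages, toy_perplexity)
```

In this toy run the four-character and three-character stages are accepted, while the two-character stage proposes "ok" for "oz" but is rejected because its improvement of 2 is below the strict threshold of 5. This mirrors the idea that short words match easily and therefore need stronger evidence before a replacement is kept.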
Mode two: correcting the text to be corrected with a preset entity class common word list to obtain the second correction text.
The preset entity class may be set according to the actual scene; for example, the preset entity class may be place names, movie names, celebrity names, and the like.
As shown in fig. 8, mode two specifically includes the following steps:
S801, performing a word segmentation operation on the text to be corrected.
S802, matching the segmentation words included in the text to be corrected with a preset entity class common word list, and if the segmentation words are matched with the common words in the preset entity class common word list, updating the text to be corrected through the matched common words.
S803, judging whether the difference value between the confusion degree of the text to be corrected and the confusion degree of the updated text to be corrected is larger than or equal to a fourth correction threshold value. If yes, executing S804; if not, S805 is performed.
S804, taking the updated text to be corrected as a second correction text.
S805, taking the text to be corrected as a second correction text.
If the difference between the confusion degree of the text to be corrected and the confusion degree of the updated text to be corrected is smaller than the fourth correction threshold value, the correction effect of the updated text to be corrected is not ideal, so that the updated text to be corrected is not reserved, and the text to be corrected before updating is reserved.
After the second correction text is obtained, the correction text with the lowest confusion degree in the first correction text and the second correction text can be determined to be the correction text of the text to be corrected.
By adopting the technical scheme, the text to be corrected can be corrected through the common word list of the preset entity class, for example, the text to be corrected is corrected through the place name list and the film name list, and the correction precision can be improved.
Mode three: correcting the text to be corrected with the first common word list, the second common word list and the third common word list in sequence to obtain one second correction text, and also correcting the text to be corrected with the preset entity class common word list to obtain another second correction text.
The second correction texts obtained in mode three thus include both the second correction text obtained in mode one and the second correction text obtained in mode two.
After the first correction text and the two second correction texts are obtained, the correction text with the lowest confusion degree in the first correction text and the two second correction texts can be determined to be the correction text of the text to be corrected.
Taking mode three as an example, the text correction method in the embodiment of the present application may be implemented by the text correction model shown in fig. 9, where the text correction model includes a word-level error correction module, a common word error correction module, a preset entity class common word error correction module, and an n-gram module.
The word-level error correction module comprises a near word error correction sub-module and a common word error correction sub-module, wherein the near word error correction sub-module is used for realizing the method flow shown in the figures 3 and 4, and the common word error correction sub-module is used for realizing the method flow shown in the figures 5 and 6.
The common word error correction module is used for implementing the method flow shown in fig. 7. Taking as an example the case in which the first common word list, the second common word list and the third common word list in fig. 7 contain four-word words, three-word words and two-word words respectively, the common word error correction module comprises a four-word error correction sub-module, a three-word error correction sub-module and a two-word error correction sub-module.
The preset entity class common word error correction module is used for implementing the method flow shown in fig. 8.
The word-level error correction module, the common word error correction module and the preset entity class common word error correction module may be arranged in parallel: the word-level error correction module outputs the first correction text, the common word error correction module outputs the second correction text obtained in S712 or S713, and the preset entity class common word error correction module outputs the second correction text obtained in S804 or S805.
The n-gram module may calculate the confusion degree of the first correction text and the two second correction texts, and take the correction text with the lowest confusion degree as the correction text of the text to be corrected, so as to output the correction text of the text to be corrected.
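The source does not say which n-gram order or smoothing the n-gram module uses. As a minimal sketch, assuming a character-level bigram model with add-one smoothing, the confusion degree (perplexity) of a text and the choice of the lowest-scoring candidate could look as follows; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

Model = Tuple[Counter, Counter, int]   # (context counts, bigram counts, vocabulary size)

def train_bigram(corpus: List[str]) -> Model:
    contexts, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in corpus:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])              # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)

def perplexity(text: str, model: Model) -> float:
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + list(text) + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing so unseen bigrams keep a non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    # Perplexity = exp of the negative average log-probability per predicted token.
    return math.exp(-log_prob / (len(tokens) - 1))

def pick_lowest(candidates: List[str], model: Model) -> str:
    return min(candidates, key=lambda text: perplexity(text, model))

model = train_bigram(["abcd", "abcd"])
best = pick_lowest(["abxd", "abcd", "axcd"], model)
```

A text whose character sequence was seen in training receives a higher probability and therefore a lower perplexity, which is why the module can use this value to rank the correction texts produced by the other modules.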
The text correction flow of the text correction model is described below taking fig. 10 as an example. As shown in fig. 10, the text to be corrected is "can the district room elimination channel be blocked?", and it is input into the word-level error correction module, the common word error correction module and the preset entity class common word error correction module respectively.
The shape near word error correction sub-module in the word-level error correction module performs shape near word error correction on the text to be corrected "can the district room elimination channel be blocked?", modifying it into "why is the district room elimination channel blocked?". This text is then transmitted to the common word error correction sub-module, which performs common word error correction and modifies it into "why is the district fire fighting channel blocked?"; this result is input to the n-gram module as the first correction text.
The four-word error correction sub-module in the common word error correction module performs four-word error correction on the text to be corrected "can the district room elimination channel be blocked?", modifying it into "can the district fire fighting channel be blocked?". This text is then transmitted to the three-word error correction sub-module, whose error correction leaves it unchanged. It is then transmitted to the two-word error correction sub-module, whose two-word error correction modifies it into "why is the district fire fighting channel thrown?"; this result is input to the n-gram module as one second correction text.
Taking the case in which the preset entity class common words are place names as an example, the preset entity class common word error correction module is correspondingly a place name error correction module. After the place name error correction module performs place name error correction on the text to be corrected "can the district room elimination channel be blocked?", the text is unchanged, and "can the district room elimination channel be blocked?" is input to the n-gram module as the other second correction text.
The n-gram module calculates the confusion degrees of "why is the district fire fighting channel blocked?", "why is the district fire fighting channel thrown?" and "can the district room elimination channel be blocked?". The confusion degree of "why is the district fire fighting channel blocked?" is found to be the lowest, so it is taken as the correction text of the text to be corrected.
With the above text correction model, errors that may exist in the same text to be corrected are detected and modified at multiple levels (namely, the character level and the word level) through the combination of multiple error correction modules. In this way, as many of the errors in the text to be corrected as possible are corrected, that is, recalled, so that recall is improved. Because the correction text with the lowest confusion degree among the correction texts output by the sub-models is taken as the final output, the text to be corrected is corrected more accurately and wrong words are recalled more accurately, which improves correction precision. Through the combined correction of multiple sub-models, text obtained by OCR recognition, which contains many errors of complex types, can be corrected as completely and accurately as possible.
Compared with a single neural network model, the embodiment of the invention can better correct complicated wrong-word situations in the text to be corrected through its finer-grained sub-modules.
Based on the same inventive concept, the embodiment of the present invention further provides a text error correction apparatus, as shown in fig. 11, including:
an obtaining module 1101, configured to obtain text to be corrected;
the first error correction module 1102 is configured to perform shape near word error correction and common word error correction on the text to be corrected in sequence, so as to obtain a first corrected text;
the second correction module 1103 is configured to perform common word correction on the text to be corrected to obtain a second corrected text;
a determining module 1104, configured to obtain the confusion degrees of the first correction text and the second correction text, and determine the correction text with the lowest confusion degree as the correction text of the text to be corrected.
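As a rough illustration of how these modules fit together, the sketch below wires one first error correction module and any number of second error correction modules in parallel and lets a determining step keep the candidate with the lowest confusion degree. The function `correct_text`, the stand-in modules and the scoring lambda are all hypothetical placeholders, not the modules of fig. 11.

```python
from typing import Callable, List

Corrector = Callable[[str], str]

def correct_text(text: str,
                 first_module: Corrector,
                 second_modules: List[Corrector],
                 perplexity: Callable[[str], float]) -> str:
    """Run all correction modules on the same input and keep the best candidate."""
    candidates = [first_module(text)] + [module(text) for module in second_modules]
    # min() returns the first minimum, so the first correction text wins ties.
    return min(candidates, key=perplexity)

# Stand-in modules: one fixes the toy error, the other leaves the text unchanged.
fix_x = lambda t: t.replace("x", "c")
identity = lambda t: t
# Stand-in confusion degree: number of remaining "x" characters.
count_x = lambda t: float(t.count("x"))

final_text = correct_text("abxd", fix_x, [identity], count_x)
```

Because every module receives the original text rather than another module's output, a poor correction from one module cannot contaminate the others; it simply loses the final comparison.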
Optionally, the first error correction module 1102 is specifically configured to:
performing forward error correction on the text to be corrected, in order from the first word to the last word of the text to be corrected, through a preset shape near word list;
performing reverse error correction on the text to be corrected, in order from the last word to the first word of the text to be corrected, through the preset shape near word list;
determining a shape near word error correction text according to the error correction result obtained by the forward error correction and the error correction result obtained by the reverse error correction;
performing forward error correction on the shape near word error correction text, in order from its first word to its last word, through a preset common word list;
performing reverse error correction on the shape near word error correction text, in order from its last word to its first word, through the preset common word list;
and determining the first correction text according to the error correction result obtained by the forward error correction and the error correction result obtained by the reverse error correction of the shape near word error correction text.
Optionally, the second error correction module 1103 is specifically configured to:
correcting the text to be corrected through the first common word list, the second common word list and the third common word list in sequence to obtain a second correction text, wherein the number of words included in each common word in the first common word list is a first number, the number of words included in each common word in the second common word list is a second number, the number of words included in each common word in the third common word list is a third number, the first number is larger than the second number, and the second number is larger than the third number; and/or,
correcting the text to be corrected through a preset entity class common word list to obtain a second correction text.
Optionally, the second error correction module 1103 is specifically configured to:
word segmentation operation is carried out on the text to be corrected;
matching the word segmentation included in the text to be corrected with the common words in the first common word list, and if the word segmentation included in the text to be corrected is matched with the common words in the first common word list, updating the text to be corrected through the matched common words;
judging whether the difference value between the confusion degree of the text to be corrected and the updated confusion degree of the text to be corrected is larger than or equal to a first correction threshold value, if so, taking the updated text to be corrected as a first common word correction text; if not, the text to be corrected is used as a first common word correction text;
matching the word segmentation included in the first common word error correction text with the common words in the second common word list, and if the word segmentation included in the first common word error correction text is matched with the common words in the second common word list, updating the first common word error correction text through the matched common words;
judging whether the difference value between the confusion degree of the first frequently used word error correction text and the confusion degree of the updated first frequently used word error correction text is larger than or equal to a second error correction threshold value, if so, taking the updated first frequently used word error correction text as a second frequently used word error correction text; if not, the first common word error correction text is used as a second common word error correction text;
matching the word segments included in the second common word error correction text with the common words in the third common word list, and if a word segment included in the second common word error correction text matches a common word in the third common word list, updating the second common word error correction text with the matched common word;
judging whether the difference value between the confusion degree of the second common word error correction text and the confusion degree of the updated second common word error correction text is larger than or equal to a third error correction threshold value, and if so, taking the updated second common word error correction text as a second correction text; if not, the second common word correction text is used as a second correction text; wherein the third error correction threshold is greater than the first error correction threshold and the third error correction threshold is greater than the second error correction threshold.
Optionally, the second error correction module 1103 is specifically configured to:
word segmentation operation is carried out on the text to be corrected;
matching the segmentation words included in the text to be corrected with a preset entity class common word list, and if the segmentation words are matched with the common words in the preset entity class common word list, updating the text to be corrected through the matched common words;
judging whether the difference value between the confusion degree of the text to be corrected and the updated confusion degree of the text to be corrected is larger than or equal to a fourth correction threshold value; if yes, taking the updated text to be corrected as a second correction text; if not, the text to be corrected is taken as a second correction text.
Optionally, the first error correction module 1102 is specifically configured to:
searching a shape near word of a first word in the text to be corrected from a preset shape near word list;
respectively replacing a first word in the text to be corrected with each corresponding shape near word to obtain a plurality of first-stage replacement texts;
calculating the confusion degree of each first-stage replacement text among the plurality of first-stage replacement texts, and selecting from them, in ascending order of confusion degree, a specified number of first-stage replacement texts whose confusion-degree difference from the text to be corrected is greater than or equal to a fifth error correction threshold;
for each first-stage replacement text in a selected specified number of first-stage replacement texts, searching a shape near word of a second word in the first-stage replacement text from a preset shape near word list, respectively replacing the second word in the first-stage replacement text with each shape near word corresponding to the second word to obtain a plurality of second-stage replacement texts corresponding to the first-stage replacement text, and calculating the confusion degree of each second-stage replacement text in the plurality of second-stage replacement texts;
selecting from the plurality of second-stage replacement texts, in ascending order of confusion degree, a specified number of second-stage replacement texts whose confusion-degree difference from the text to be corrected is greater than the fifth error correction threshold;
processing the selected second-stage replacement texts, and each stage of replacement texts obtained subsequently, in the same way as the specified number of first-stage replacement texts, until the last word of the text to be corrected has undergone shape near word replacement, thereby obtaining the specified number of forward error correction texts;
the first error correction module 1102 is specifically further configured to:
searching the shape near word of the last word in the text to be corrected from a preset shape near word list;
respectively replacing the last word in the text to be corrected with each corresponding shape near word to obtain a plurality of N-th level replacement texts;
calculating the confusion degree of each N-th level replacement text among the plurality of N-th level replacement texts, and selecting from them, in ascending order of confusion degree, a specified number of N-th level replacement texts whose confusion-degree difference from the text to be corrected is greater than or equal to the fifth error correction threshold;
for each N-th level replacement text in the selected appointed number of N-th level replacement texts, searching a shape near word of the penultimate word in the N-th level replacement text from a preset shape near word list, respectively replacing the penultimate word in the N-th level replacement text with each shape near word corresponding to the penultimate word to obtain a plurality of N-1-th level replacement texts corresponding to the N-th level replacement text, and calculating the confusion degree of each N-1-th level replacement text in the plurality of N-1-th level replacement texts;
selecting from the plurality of (N-1)-th level replacement texts, in ascending order of confusion degree, a specified number of (N-1)-th level replacement texts whose confusion-degree difference from the text to be corrected is greater than the fifth error correction threshold;
processing the selected (N-1)-th level replacement texts, and each level of replacement texts obtained subsequently, in the same way as the specified number of N-th level replacement texts, until the first word of the text to be corrected has undergone shape near word replacement, thereby obtaining the specified number of reverse error correction texts;
the first error correction module 1102 is specifically further configured to:
searching the same error correction texts in the specified number of forward error correction texts and the specified number of reverse error correction texts, and taking the error correction text with the minimum confusion degree in the searched error correction texts as a near word error correction text;
if the same error correction text is not found in the specified number of forward error correction text and the specified number of reverse error correction text, the text to be corrected is taken as the near word correction text.
Optionally, the first error correction module 1102 is specifically configured to:
searching a common word matched with a first word in the shape near word error correction text from a preset common word list;
respectively replacing a first word in the shape near word error correction text with each common word matched with the first word to obtain a plurality of first-stage replacement texts;
calculating the confusion degree of each first-stage replacement text among the plurality of first-stage replacement texts, and selecting from them, in ascending order of confusion degree, a specified number of first-stage replacement texts whose confusion-degree difference from the shape near word error correction text is greater than or equal to a sixth error correction threshold;
for each first-stage replacement text in a selected specified number of first-stage replacement texts, searching for a common word matched with a second word in the first-stage replacement text from a preset common word list, respectively replacing the second word in the first-stage replacement text with each common word matched with the second word to obtain a plurality of second-stage replacement texts corresponding to the first-stage replacement text, and calculating the confusion degree of each second-stage replacement text in the plurality of second-stage replacement texts;
selecting from the plurality of second-stage replacement texts, in ascending order of confusion degree, a specified number of second-stage replacement texts whose confusion-degree difference from the shape near word error correction text is greater than or equal to the sixth error correction threshold;
processing the selected second-level replacement text and each level of replacement text obtained subsequently according to a processing mode of the appointed number of first-level replacement texts until the last word of the near word error correction text is replaced by the common word, and obtaining the appointed number of forward error correction texts;
The first error correction module 1102 is specifically further configured to:
searching a common word matched with the last word in the shape near word error correction text from a preset common word list;
respectively replacing the last word in the shape near word error correction text with each common word matched with the last word to obtain a plurality of N-th level replacement texts;
calculating the confusion degree of each N-th level replacement text among the plurality of N-th level replacement texts, and selecting from them, in ascending order of confusion degree, a specified number of N-th level replacement texts whose confusion-degree difference from the near-word error correction text is greater than or equal to the sixth error correction threshold;
for each N-th level replacement text among the selected specified number of N-th level replacement texts, searching a preset common word list for common words matching the penultimate word of that N-th level replacement text, replacing the penultimate word with each common word matching it in turn to obtain a plurality of (N-1)-th level replacement texts corresponding to that N-th level replacement text, and calculating the confusion degree of each of these (N-1)-th level replacement texts;
selecting from the plurality of (N-1)-th level replacement texts, in ascending order of confusion degree, a specified number of (N-1)-th level replacement texts whose confusion-degree difference from the near-word error correction text is greater than or equal to the sixth error correction threshold;
processing the selected (N-1)-th level replacement texts, and each level of replacement texts obtained subsequently, in the same way as the specified number of N-th level replacement texts, until the first word of the near-word error correction text has been replaced with common words, thereby obtaining the specified number of reverse error correction texts;
the first error correction module 1102 is specifically further configured to:
searching for error correction texts that appear in both the specified number of forward error correction texts and the specified number of reverse error correction texts, and taking the one with the minimum confusion degree among them as the first error correction text;
and if no error correction text appears in both the specified number of forward error correction texts and the specified number of reverse error correction texts, taking the near-word error correction text as the first error correction text.
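As a non-limiting illustration, the selection performed by the first error correction module on the forward and reverse candidates can be sketched in Python as follows. The function name `pick_first_corrected_text` and the `perplexity` callable (standing in for whatever language model supplies the confusion degree) are hypothetical and are not taken from the embodiment; the alphabetical tie-breaking is likewise an assumption.

```python
from typing import Callable, Sequence


def pick_first_corrected_text(
    forward: Sequence[str],
    reverse: Sequence[str],
    near_word_text: str,
    perplexity: Callable[[str], float],
) -> str:
    """Return the lowest-perplexity text found in both candidate lists.

    Falls back to the near-word error correction text when the forward
    and reverse candidate lists have no text in common.
    """
    shared = set(forward) & set(reverse)
    if not shared:
        return near_word_text
    # sorted() makes the result deterministic when two texts tie (assumption).
    return min(sorted(shared), key=perplexity)
```

The fallback branch mirrors the case in which the two search directions disagree completely, so the text produced by the preceding shape-near word stage is kept unchanged.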
Corresponding to the method embodiments, an embodiment of the present invention further provides an electronic device. Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 12, the electronic device may include a processor 1201, a communication interface 1202, a memory 1203, and a communication bus 1204, where the processor 1201, the communication interface 1202, and the memory 1203 communicate with one another through the communication bus 1204;
the memory 1203 is configured to store a computer program;
the processor 1201 is configured to implement the method steps of any of the text error correction method embodiments described above when executing the program stored in the memory 1203.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented in the figure by a single bold line, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Corresponding to the above method embodiments, an embodiment of the present invention further provides a readable storage medium in which a computer program is stored; when executed by a processor, the computer program implements the method steps of any of the text error correction method embodiments described above. The readable storage medium is a computer-readable storage medium.
An embodiment of the present invention further provides a computer program product comprising instructions which, when run on an electronic device, cause the electronic device to perform the method steps of any of the text error correction method embodiments described above.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.
It is noted that relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, electronic device, readable storage medium, and computer program product embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding description of the method embodiments.
The foregoing description covers only the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (9)
1. A text error correction method, comprising:
obtaining a text to be corrected;
sequentially performing shape-near word correction and common word correction on the text to be corrected to obtain a first corrected text;
performing common word correction on the text to be corrected to obtain a second corrected text;
obtaining the confusion degrees of the first corrected text and the second corrected text, and determining the corrected text with the lowest confusion degree as the corrected text of the text to be corrected;
wherein sequentially performing shape-near word correction and common word correction on the text to be corrected to obtain the first corrected text comprises:
performing forward error correction on the text to be corrected, in order from the first word to the last word of the text to be corrected, by means of a preset shape-near word list;
performing reverse error correction on the text to be corrected, in order from the last word to the first word of the text to be corrected, by means of the preset shape-near word list;
determining a near-word error correction text according to the error correction result obtained by the forward error correction and the error correction result obtained by the reverse error correction;
performing forward error correction on the near-word error correction text, in order from the first word to the last word of the near-word error correction text, by means of a preset common word list;
performing reverse error correction on the near-word error correction text, in order from the last word to the first word of the near-word error correction text, by means of the preset common word list;
and determining the first corrected text according to the error correction result obtained by the forward error correction and the error correction result obtained by the reverse error correction of the near-word error correction text.
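As a non-limiting illustration, the top-level flow of claim 1 (two correction paths followed by a choice based on confusion degree) can be sketched in Python as follows. The names `correct_text`, `shape_then_common`, `common_only`, and `perplexity` are hypothetical placeholders for the two correction paths and for the language-model scoring; none of them comes from the patent text.

```python
from typing import Callable


def correct_text(
    text: str,
    shape_then_common: Callable[[str], str],
    common_only: Callable[[str], str],
    perplexity: Callable[[str], float],
) -> str:
    """Run both correction paths and keep the lower-perplexity result."""
    # Path 1: shape-near word correction followed by common word correction.
    first = shape_then_common(text)
    # Path 2: common word correction applied directly to the input.
    second = common_only(text)
    # min() returns `first` on a tie; tie handling is an assumption.
    return min((first, second), key=perplexity)
```

Passing the two paths in as callables keeps the sketch independent of how each path is realized.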
2. The method of claim 1, wherein performing common word correction on the text to be corrected to obtain the second corrected text comprises:
correcting the text to be corrected by passing it sequentially through a first common word list, a second common word list, and a third common word list to obtain the second corrected text; wherein the number of words included in each common word in the first common word list is a first number, the number of words included in each common word in the second common word list is a second number, the number of words included in each common word in the third common word list is a third number, the first number is greater than the second number, and the second number is greater than the third number; and/or,
correcting the text to be corrected by means of a preset entity category common word list to obtain the second corrected text.
3. The method according to claim 2, wherein correcting the text to be corrected by passing it sequentially through the first common word list, the second common word list, and the third common word list to obtain the second corrected text comprises:
performing a word segmentation operation on the text to be corrected;
matching the segmented words included in the text to be corrected with the common words in the first common word list, and if a segmented word matches a common word in the first common word list, updating the text to be corrected with the matched common word;
judging whether the difference between the confusion degree of the text to be corrected and the confusion degree of the updated text to be corrected is greater than or equal to a first error correction threshold; if so, taking the updated text to be corrected as a first common word error correction text; if not, taking the text to be corrected as the first common word error correction text;
matching the segmented words included in the first common word error correction text with the common words in the second common word list, and if a segmented word included in the first common word error correction text matches a common word in the second common word list, updating the first common word error correction text with the matched common word;
judging whether the difference between the confusion degree of the first common word error correction text and the confusion degree of the updated first common word error correction text is greater than or equal to a second error correction threshold; if so, taking the updated first common word error correction text as a second common word error correction text; if not, taking the first common word error correction text as the second common word error correction text;
matching the segmented words included in the second common word error correction text with the common words in the third common word list, and if a segmented word included in the second common word error correction text matches a common word in the third common word list, updating the second common word error correction text with the matched common word;
judging whether the difference between the confusion degree of the second common word error correction text and the confusion degree of the updated second common word error correction text is greater than or equal to a third error correction threshold; if so, taking the updated second common word error correction text as the second corrected text; if not, taking the second common word error correction text as the second corrected text; wherein the third error correction threshold is greater than the first error correction threshold, and the third error correction threshold is greater than the second error correction threshold.
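As a non-limiting illustration, the three-stage cascade of claim 3 can be sketched in Python as follows. Several simplifying assumptions are made and none of them comes from the claim: each common word list is modelled as a dictionary mapping a segmented word to its matched common word, the text is represented by its already segmented tokens, `perplexity` is a hypothetical callable returning the confusion degree of a token sequence, and the function names are illustrative only.

```python
from typing import Callable, List, Mapping, Sequence, Tuple

Perplexity = Callable[[Sequence[str]], float]


def apply_word_list(
    tokens: Sequence[str],
    word_list: Mapping[str, str],
    threshold: float,
    perplexity: Perplexity,
) -> List[str]:
    """Apply one common word list; keep the update only if it helps enough."""
    updated = [word_list.get(tok, tok) for tok in tokens]
    if updated == list(tokens):
        return list(tokens)  # nothing matched at this stage
    # Accept the update when the confusion degree drops by >= threshold.
    gain = perplexity(tokens) - perplexity(updated)
    return updated if gain >= threshold else list(tokens)


def cascade(
    tokens: Sequence[str],
    stages: Sequence[Tuple[Mapping[str, str], float]],
    perplexity: Perplexity,
) -> List[str]:
    """Run the (word list, threshold) stages in order, e.g. first to third."""
    current = list(tokens)
    for word_list, threshold in stages:
        current = apply_word_list(current, word_list, threshold, perplexity)
    return current
```

Under these assumptions, giving the last stage the largest threshold reflects the claim's requirement that the third error correction threshold exceed the first and second.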
4. The method of claim 2, wherein correcting the text to be corrected by means of the preset entity category common word list to obtain the second corrected text comprises:
performing a word segmentation operation on the text to be corrected;
matching the segmented words included in the text to be corrected with the common words in the preset entity category common word list, and if a segmented word matches a common word in the preset entity category common word list, updating the text to be corrected with the matched common word;
judging whether the difference between the confusion degree of the text to be corrected and the confusion degree of the updated text to be corrected is greater than or equal to a fourth error correction threshold; if so, taking the updated text to be corrected as the second corrected text; and if not, taking the text to be corrected as the second corrected text.
5. The method according to claim 1, wherein performing forward error correction on the text to be corrected, in order from the first word to the last word of the text to be corrected, by means of the preset shape-near word list comprises:
searching the preset shape-near word list for shape-near words of the first word in the text to be corrected;
respectively replacing the first word in the text to be corrected with each of its shape-near words to obtain a plurality of first-level replacement texts;
calculating the confusion degree of each of the plurality of first-level replacement texts, and selecting from them, in ascending order of confusion degree, a specified number of first-level replacement texts whose confusion degree differs from that of the text to be corrected by a value greater than or equal to a fifth error correction threshold;
for each of the selected specified number of first-level replacement texts, searching the preset shape-near word list for shape-near words of the second word in that first-level replacement text, respectively replacing the second word with each of its shape-near words to obtain a plurality of second-level replacement texts corresponding to that first-level replacement text, and calculating the confusion degree of each of the plurality of second-level replacement texts;
selecting from the plurality of second-level replacement texts, in ascending order of confusion degree, a specified number of second-level replacement texts whose confusion degree differs from that of the text to be corrected by a value greater than the fifth error correction threshold;
processing the selected second-level replacement texts, and each subsequent level of replacement texts, in the same manner as the specified number of first-level replacement texts, until the last word of the text to be corrected has undergone shape-near word replacement, to obtain the specified number of forward error correction texts;
wherein performing reverse error correction on the text to be corrected, in order from the last word to the first word of the text to be corrected, by means of the preset shape-near word list comprises:
searching the preset shape-near word list for shape-near words of the last word in the text to be corrected;
respectively replacing the last word in the text to be corrected with each of its shape-near words to obtain a plurality of Nth-level replacement texts;
calculating the confusion degree of each of the plurality of Nth-level replacement texts, and selecting from them, in ascending order of confusion degree, a specified number of Nth-level replacement texts whose confusion degree differs from that of the text to be corrected by a value greater than or equal to the fifth error correction threshold;
for each of the selected specified number of Nth-level replacement texts, searching the preset shape-near word list for shape-near words of the penultimate word in that Nth-level replacement text, respectively replacing the penultimate word with each of its shape-near words to obtain a plurality of (N-1)th-level replacement texts corresponding to that Nth-level replacement text, and calculating the confusion degree of each of the plurality of (N-1)th-level replacement texts;
selecting from the plurality of (N-1)th-level replacement texts, in ascending order of confusion degree, a specified number of (N-1)th-level replacement texts whose confusion degree differs from that of the text to be corrected by a value greater than the fifth error correction threshold;
processing the selected (N-1)th-level replacement texts, and each subsequent level of replacement texts, in the same manner as the specified number of Nth-level replacement texts, until the first word of the text to be corrected has undergone shape-near word replacement, to obtain the specified number of reverse error correction texts;
wherein determining the near-word error correction text according to the error correction result obtained by the forward error correction and the error correction result obtained by the reverse error correction comprises:
searching for error correction texts that appear in both the specified number of forward error correction texts and the specified number of reverse error correction texts, and taking the one with the minimum confusion degree among them as the near-word error correction text;
and if no such error correction text is found, taking the text to be corrected as the near-word error correction text.
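As a non-limiting illustration, the level-by-level replacement search of claim 5 can be sketched in Python as follows; it covers only the generation of forward or reverse error correction texts, not the final comparison between them. The sketch rests on assumptions that are not stated in the claim: the shape-near word list is modelled as a mapping from a character to its shape-near characters, the "difference" is taken as the confusion degree of the original text minus that of the replacement text, and when no replacement qualifies at a level the texts of the previous level are carried forward unchanged. The names `directional_candidates`, `near_chars`, `beam_width`, and `perplexity` are hypothetical.

```python
from typing import Callable, List, Mapping, Sequence


def directional_candidates(
    text: str,
    near_chars: Mapping[str, Sequence[str]],
    perplexity: Callable[[str], float],
    beam_width: int,
    threshold: float,
    reverse: bool = False,
) -> List[str]:
    """Replace one position per level, keeping at most beam_width texts.

    beam_width plays the role of the claim's "specified number".
    """
    base = perplexity(text)
    beam = [text]
    positions = range(len(text) - 1, -1, -1) if reverse else range(len(text))
    for i in positions:
        # Build every replacement text for position i from the current beam.
        level = {
            cand[:i] + repl + cand[i + 1:]
            for cand in beam
            for repl in near_chars.get(cand[i], ())
        }
        # Ascending perplexity; ties are broken alphabetically (assumption).
        scored = sorted((perplexity(c), c) for c in level)
        # Keep texts that improve on the original text by >= threshold.
        kept = [c for ppl, c in scored if base - ppl >= threshold][:beam_width]
        # Assumption: if nothing qualifies, carry the previous level forward.
        if kept:
            beam = kept
    return beam
```

Calling the function once with `reverse=False` and once with `reverse=True` yields the two candidate sets that the remainder of the claim compares.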
6. The method of claim 1, wherein performing forward error correction on the near-word error correction text, in order from the first word to the last word of the near-word error correction text, by means of the preset common word list comprises:
searching the preset common word list for common words matched with the first word in the near-word error correction text;
respectively replacing the first word in the near-word error correction text with each common word matched with that first word to obtain a plurality of first-level replacement texts;
calculating the confusion degree of each of the plurality of first-level replacement texts, and selecting from them, in ascending order of confusion degree, a specified number of first-level replacement texts whose confusion degree differs from that of the near-word error correction text by a value greater than or equal to a sixth error correction threshold;
for each of the selected specified number of first-level replacement texts, searching the preset common word list for common words matched with the second word in that first-level replacement text, respectively replacing the second word with each matched common word to obtain a plurality of second-level replacement texts corresponding to that first-level replacement text, and calculating the confusion degree of each of the plurality of second-level replacement texts;
selecting from the plurality of second-level replacement texts, in ascending order of confusion degree, a specified number of second-level replacement texts whose confusion degree differs from that of the near-word error correction text by a value greater than or equal to the sixth error correction threshold;
processing the selected second-level replacement texts, and each subsequent level of replacement texts, in the same manner as the specified number of first-level replacement texts, until the last word of the near-word error correction text has been replaced with a common word, to obtain the specified number of forward error correction texts;
wherein performing reverse error correction on the near-word error correction text, in order from the last word to the first word of the near-word error correction text, by means of the preset common word list comprises:
searching the preset common word list for common words matched with the last word in the near-word error correction text;
respectively replacing the last word in the near-word error correction text with each common word matched with that last word to obtain a plurality of Nth-level replacement texts;
calculating the confusion degree of each of the plurality of Nth-level replacement texts, and selecting from them, in ascending order of confusion degree, a specified number of Nth-level replacement texts whose confusion degree differs from that of the near-word error correction text by a value greater than or equal to the sixth error correction threshold;
for each of the selected specified number of Nth-level replacement texts, searching the preset common word list for common words matched with the penultimate word in that Nth-level replacement text, respectively replacing the penultimate word with each matched common word to obtain a plurality of (N-1)th-level replacement texts corresponding to that Nth-level replacement text, and calculating the confusion degree of each of the plurality of (N-1)th-level replacement texts;
selecting from the plurality of (N-1)th-level replacement texts, in ascending order of confusion degree, a specified number of (N-1)th-level replacement texts whose confusion degree differs from that of the near-word error correction text by a value greater than or equal to the sixth error correction threshold;
processing the selected (N-1)th-level replacement texts, and each subsequent level of replacement texts, in the same manner as the specified number of Nth-level replacement texts, until the first word of the near-word error correction text has been replaced with a common word, to obtain the specified number of reverse error correction texts;
wherein determining the first corrected text according to the error correction result obtained by the forward error correction and the error correction result obtained by the reverse error correction of the near-word error correction text comprises:
searching for error correction texts that appear in both the specified number of forward error correction texts and the specified number of reverse error correction texts, and taking the one with the minimum confusion degree among them as the first corrected text;
and if no error correction text appears in both the specified number of forward error correction texts and the specified number of reverse error correction texts, taking the near-word error correction text as the first corrected text.
7. A text error correction apparatus, comprising:
an obtaining module, configured to obtain a text to be corrected;
a first error correction module, configured to sequentially perform shape-near word correction and common word correction on the text to be corrected to obtain a first corrected text;
a second error correction module, configured to perform common word correction on the text to be corrected to obtain a second corrected text;
a determining module, configured to obtain the confusion degrees of the first corrected text and the second corrected text, and determine the corrected text with the lowest confusion degree as the corrected text of the text to be corrected;
wherein the first error correction module is specifically configured to perform:
performing forward error correction on the text to be corrected, in order from the first word to the last word of the text to be corrected, by means of a preset shape-near word list;
performing reverse error correction on the text to be corrected, in order from the last word to the first word of the text to be corrected, by means of the preset shape-near word list;
determining a near-word error correction text according to the error correction result obtained by the forward error correction and the error correction result obtained by the reverse error correction;
performing forward error correction on the near-word error correction text, in order from the first word to the last word of the near-word error correction text, by means of a preset common word list;
performing reverse error correction on the near-word error correction text, in order from the last word to the first word of the near-word error correction text, by means of the preset common word list;
and determining the first corrected text according to the error correction result obtained by the forward error correction and the error correction result obtained by the reverse error correction of the near-word error correction text.
8. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method steps of any one of claims 1-6 when executing the program stored in the memory.
9. A readable storage medium, characterized in that the readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011110293.0A CN112183072B (en) | 2020-10-16 | 2020-10-16 | Text error correction method and device, electronic equipment and readable storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112183072A CN112183072A (en) | 2021-01-05 |
| CN112183072B true CN112183072B (en) | 2023-07-21 |
Family ID: 73951229
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011110293.0A Active CN112183072B (en) | 2020-10-16 | 2020-10-16 | Text error correction method and device, electronic equipment and readable storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112183072B (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112818108B (en) * | 2021-02-24 | 2023-10-13 | 中国人民大学 | Text semantic misinterpretation chatbot based on similar characters and its data processing method |
| CN113051896B (en) * | 2021-04-23 | 2023-08-18 | 百度在线网络技术(北京)有限公司 | Method and device for correcting text, electronic equipment and storage medium |
| CN113903048A (en) * | 2021-10-15 | 2022-01-07 | 北京同城必应科技有限公司 | Bill recognition text error correction method used in express delivery field |
| CN114077832A (en) * | 2021-11-19 | 2022-02-22 | 中国建设银行股份有限公司 | Chinese text error correction method, device, electronic device and readable storage medium |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7516404B1 (en) * | 2003-06-02 | 2009-04-07 | Colby Steven M | Text correction |
| US9589049B1 (en) * | 2015-12-10 | 2017-03-07 | International Business Machines Corporation | Correcting natural language processing annotators in a question answering system |
| CN109086266A (en) * | 2018-07-02 | 2018-12-25 | 昆明理工大学 | A kind of error detection of text nearly word form and proofreading method |
| CN109344387A (en) * | 2018-08-01 | 2019-02-15 | 北京奇艺世纪科技有限公司 | The generation method of nearly word form dictionary, device and nearly word form error correction method, device |
| CN110457688A (en) * | 2019-07-23 | 2019-11-15 | 广州视源电子科技股份有限公司 | Error correction processing method and device, storage medium and processor |
| JP2020034704A (en) * | 2018-08-29 | 2020-03-05 | 富士通株式会社 | Text generation device, text generation program, and text generation method |
| CN111144101A (en) * | 2019-12-26 | 2020-05-12 | 北大方正集团有限公司 | Wrongly written character processing method and device |
| CN111428494A (en) * | 2020-03-11 | 2020-07-17 | 中国平安人寿保险股份有限公司 | Intelligent error correction method, device and equipment for proper nouns and storage medium |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8494835B2 (en) * | 2008-12-02 | 2013-07-23 | Electronics And Telecommunications Research Institute | Post-editing apparatus and method for correcting translation errors |
| US9779080B2 (en) * | 2012-07-09 | 2017-10-03 | International Business Machines Corporation | Text auto-correction via N-grams |
| US10733377B2 (en) * | 2013-08-06 | 2020-08-04 | Lenovo (Singapore) Pte. Ltd. | Indicating automatically corrected words |
| US9904672B2 (en) * | 2015-06-30 | 2018-02-27 | Facebook, Inc. | Machine-translation based corrections |
| US10795938B2 (en) * | 2017-03-13 | 2020-10-06 | Target Brands, Inc. | Spell checker |
| KR102329127B1 (en) * | 2017-04-11 | 2021-11-22 | 삼성전자주식회사 | Apparatus and method for converting dialect into standard language |
Non-Patent Citations (1)
| Title |
|---|
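| A parallel query error correction method supporting mixed languages; 颛悦; 熊锦华; 马宏远; 程舒杨; 程学旗; Journal of Chinese Information Processing (中文信息学报), No. 02; full text * |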
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112183072B (en) | | Text error correction method and device, electronic equipment and readable storage medium |
| CN107704625B (en) | | Method and device for field matching |
| WO2022121178A1 (en) | | Training method and apparatus and recognition method and apparatus for text error correction model, and computer device |
| WO2022121251A1 (en) | | Method and apparatus for training text processing model, computer device and storage medium |
| US10282420B2 (en) | | Evaluation element recognition method, evaluation element recognition apparatus, and evaluation element recognition system |
| WO2018120889A1 (en) | | Input sentence error correction method and device, electronic device, and medium |
| CN113626608B (en) | | Semantic-enhancement relationship extraction method and device, computer equipment and storage medium |
| CN111767713A (en) | | Keyword extraction method and device, electronic equipment and storage medium |
| CN110210028A (en) | | For domain feature words extracting method, device, equipment and the medium of speech translation text |
| CN114792089B (en) | | Method, apparatus and program product for managing a computer system |
| CN112445912A (en) | | Fault log classification method, system, device and medium |
| CN109492217B (en) | | Word segmentation method based on machine learning and terminal equipment |
| US20230252225A1 (en) | | Automatic Text Summarisation Post-processing for Removal of Erroneous Sentences |
| WO2014073206A1 (en) | | Information-processing device and information-processing method |
| CN111326144A (en) | | Voice data processing method, device, medium and computing equipment |
| CN115101072A (en) | | Voice recognition processing method and device |
| CN112949290A (en) | | Text error correction method and device and communication equipment |
| CN113239683A (en) | | Method, system and medium for correcting Chinese text errors |
| CN112835798B (en) | | Clustering learning method, testing step clustering method and related devices |
| CN116956835A (en) | | Document generation method based on pre-training language model |
| CN114254706B (en) | | Sequence recognition model training method and device, electronic equipment and storage medium |
| CN110334104B (en) | | List updating method and device, electronic equipment and storage medium |
| CN111563391A (en) | | Machine translation method and device and electronic equipment |
| CN109062888B (en) | | Self-correcting method for input of wrong text |
| CN113743409B (en) | | A text recognition method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |