CN111259652B - Bilingual corpus sentence alignment method and device, readable storage medium and computer equipment - Google Patents
- Publication number
- CN111259652B (application CN202010084543.1A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- corpus
- language
- word segmentation
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Machine Translation (AREA)
Abstract
The application relates to a bilingual corpus sentence alignment method, a bilingual corpus sentence alignment device, a computer-readable storage medium and computer equipment. The method comprises the following steps: obtaining a parallel text to be aligned, together with the language type of its original text and the language type of its translated text; preprocessing the parallel text to be aligned to obtain parallel sentence pairs to be aligned; calling, from a monolingual word segmentation model group trained by the SentencePiece algorithm, the monolingual word segmentation models corresponding to the language types of the original text and the translated text, and performing word segmentation processing to obtain a sentence fragment group of the original text to be aligned and a sentence fragment group of the translated text to be aligned; performing format processing on the two sentence fragment groups according to a preset format processing mode to obtain a bilingual sentence pair group; and calling a sentence alignment tool to perform sentence alignment processing on the bilingual sentence pair group according to a bilingual dictionary, obtaining a sentence-aligned parallel corpus. Because the monolingual word segmentation model of each language is trained by the SentencePiece algorithm, the code coupling degree, the maintenance difficulty and the maintenance cost are reduced.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a bilingual corpus sentence alignment method, a bilingual corpus sentence alignment device, a computer readable storage medium, and a computer device.
Background
When sentence-level alignment is performed on a bilingual parallel corpus that is already aligned at the chapter level, one feasible method is to judge the degree of similarity between sentences of the two languages by using sentence length information and vocabulary information.
For example, if the lengths of two sentences differ significantly, their similarity is low, and they are unlikely to be a parallel sentence pair. If two sentences both contain the same number or the same letter string, their similarity is higher, and they are more likely to be a parallel sentence pair. Likewise, when two sentences contain words expressing the same concept in the two languages, their similarity is also higher, for example an English sentence containing "frame" and a Chinese sentence containing the Chinese word for "frame". Based on this alignment logic, the general processing flow is to tokenize the sentences of the two languages, where tokenization is equivalent to word segmentation, that is, splitting a continuous sentence into words, and to provide a pre-generated or extracted bilingual dictionary as auxiliary information for alignment. If no bilingual dictionary is available, the corpus can first be aligned by the sentence length method, a bilingual dictionary can then be extracted from the initially aligned corpus, and that dictionary can be used for a second alignment.
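The length and shared-number heuristics described above can be illustrated with a minimal sketch. The function names, the equal weighting and the use of raw character counts are assumptions made only for illustration; a practical aligner would also normalize for the typical length ratio between the two languages.

```python
import re


def length_score(src: str, tgt: str) -> float:
    """Ratio of the shorter length to the longer one, in [0, 1]."""
    if not src or not tgt:
        return 0.0
    return min(len(src), len(tgt)) / max(len(src), len(tgt))


def shared_number_score(src: str, tgt: str) -> float:
    """Fraction of distinct digit strings that occur in both sentences."""
    src_nums = set(re.findall(r"\d+", src))
    tgt_nums = set(re.findall(r"\d+", tgt))
    if not src_nums and not tgt_nums:
        return 0.0
    return len(src_nums & tgt_nums) / len(src_nums | tgt_nums)


def similarity(src: str, tgt: str, w_len: float = 0.5, w_num: float = 0.5) -> float:
    """Weighted combination of both heuristics (weights are illustrative)."""
    return w_len * length_score(src, tgt) + w_num * shared_number_score(src, tgt)
```

A higher value of `similarity` only suggests that two sentences are more likely to be a parallel pair; dictionary information, as described above, would be added on top of it.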
However, when corpora of multiple languages are to be aligned, a word segmentation tool for each language has to be deployed on the server in order to extract bilingual dictionaries for the different languages and to align the corpora. Taking Python as an example, jieba may be used for Chinese, mecab for Japanese, and the ko extension of mecab for Korean. Different word segmentation tools not only depend on different running environments (for example, mecab requires additional C++ support, and the ko extension of mecab runs only under Python 3.7), but each also needs to load its own dictionary files. As a result, the coupling degree of the code increases greatly, and the maintenance cost is high.
Disclosure of Invention
Based on this, it is necessary to provide a bilingual corpus sentence alignment method, device, computer-readable storage medium and computer equipment, aiming at the problem of high maintenance cost of bilingual corpus sentence alignment.
A bilingual corpus sentence alignment method comprises the following steps:
acquiring a parallel text to be aligned, and the language type of the original text and the language type of the translated text in the parallel text to be aligned;
preprocessing the parallel text to be aligned to obtain parallel sentence pairs to be aligned;
calling a monolingual word segmentation model corresponding to the language type of the original text from a monolingual word segmentation model group, and performing word segmentation processing on the original text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the original text to be aligned;
calling a monolingual word segmentation model corresponding to the language type of the translated text from the monolingual word segmentation model group, and performing word segmentation processing on the translated text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the translated text to be aligned;
performing format processing on the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned according to a preset format processing mode to obtain a bilingual sentence pair group;
obtaining, based on the preset format processing mode, a bilingual dictionary corresponding to the language type of the original text and the language type of the translated text;
calling a sentence alignment tool, and performing sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain a sentence-aligned parallel corpus;
wherein the training mode of the monolingual word segmentation model comprises the following steps:
acquiring monolingual data corresponding to the language type of the monolingual word segmentation model to be trained;
preprocessing the monolingual data to obtain monolingual data samples;
and training on the monolingual data samples through the SentencePiece algorithm to obtain the monolingual word segmentation model.
In one embodiment, the training method of the bilingual dictionary comprises the following steps:
obtaining, from a sentence-aligned parallel corpus database, sentence-aligned parallel corpus samples corresponding to the language types of the bilingual dictionary to be trained, wherein the language types of the bilingual dictionary to be trained comprise the language type of the original-text corpus and the language type of the translated-text corpus;
preprocessing the sentence-aligned parallel corpus samples to obtain sentence-aligned parallel corpus pairs;
calling a monolingual word segmentation model corresponding to the language type of the original-text corpus from the monolingual word segmentation model group, and performing word segmentation processing on the original-text corpus in the sentence-aligned parallel corpus pairs to obtain a sentence fragment group of the sample original text;
calling a monolingual word segmentation model corresponding to the language type of the translated-text corpus from the monolingual word segmentation model group, and performing word segmentation processing on the translated-text corpus in the sentence-aligned parallel corpus pairs to obtain a sentence fragment group of the sample translated text;
performing format processing on the sentence fragment group of the sample original text and the sentence fragment group of the sample translated text according to the preset format processing mode to obtain a bilingual sentence pair sample group;
and aligning the bilingual sentence pair sample group through a bilingual word pair extraction algorithm to obtain the bilingual dictionary.
In one embodiment, the preset format processing manner includes:
obtaining a sentence fragment group to be formatted;
and detecting underscore characters in the sentence fragment group, and removing the detected underscore characters from the sentence fragment group.
In one embodiment, the preset format processing manner includes:
acquiring a sentence fragment group to be formatted and its corresponding language type;
determining, according to the language type of the sentence fragment group, whether the sentence fragment group is a format processing object;
and when the sentence fragment group is a format processing object, detecting underscore characters in the sentence fragment group and removing the detected underscore characters from the sentence fragment group.
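The language-dependent format processing described in this embodiment can be sketched as follows. The sketch assumes that the underscore character refers to the "▁" (U+2581) marker that SentencePiece places at word boundaries; the function name and the set of language codes treated as format processing objects are hypothetical, since the text does not enumerate them.

```python
SPM_MARKER = "\u2581"  # "▁", the boundary marker emitted by SentencePiece

# Hypothetical set of language types treated as format processing objects.
FORMAT_TARGET_LANGUAGES = {"zh", "ja"}


def format_fragments(fragments, lang, targets=FORMAT_TARGET_LANGUAGES):
    """Remove the marker from every fragment when `lang` is a processing object."""
    if lang not in targets:
        return list(fragments)
    cleaned = (fragment.replace(SPM_MARKER, "") for fragment in fragments)
    # Fragments that consisted only of the marker become empty and are dropped.
    return [fragment for fragment in cleaned if fragment]
```

For example, `format_fragments(["▁我", "在", "北京"], "zh")` yields the fragments without the marker, while a fragment group in a language outside the target set is returned unchanged.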
In one embodiment, after the step of calling the sentence alignment tool and performing sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain the sentence-aligned parallel corpus, the method further includes:
filtering the sentence-aligned parallel corpus based on preset filtering conditions to obtain a filtered sentence-aligned parallel corpus.
In one embodiment, the preset filtering condition includes at least one of the following conditions:
analyzing whether empty sentences exist in the sentence-aligned parallel corpus, and filtering out the empty sentences;
filtering out, according to a preset value, sentences in the sentence-aligned parallel corpus whose score is smaller than the preset value;
filtering out, according to the language type of the original text and the language type of the translated text, sentences in the sentence-aligned parallel corpus whose language types are inconsistent with them;
and filtering out, according to a number-equality feature, sentences in the sentence-aligned parallel corpus that do not satisfy the number-equality feature.
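The four filtering conditions above can be combined as in the following sketch. The `AlignedPair` structure, the `detect_lang` callback and the interpretation of the number-equality feature as "both sentences contain the same digit strings" are assumptions for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass
class AlignedPair:
    source: str   # original-text sentence
    target: str   # translated-text sentence
    score: float  # alignment score reported by the sentence alignment tool


def numbers_match(src: str, tgt: str) -> bool:
    """Number-equality check: both sides contain the same digit strings."""
    return sorted(re.findall(r"\d+", src)) == sorted(re.findall(r"\d+", tgt))


def filter_pairs(pairs, min_score, detect_lang=None, src_lang=None, tgt_lang=None):
    """Keep only the pairs that pass every enabled filtering condition."""
    kept = []
    for pair in pairs:
        if not pair.source.strip() or not pair.target.strip():
            continue  # empty sentence
        if pair.score < min_score:
            continue  # score below the preset value
        if detect_lang is not None and (
            detect_lang(pair.source) != src_lang
            or detect_lang(pair.target) != tgt_lang
        ):
            continue  # language type inconsistent with the expected one
        if not numbers_match(pair.source, pair.target):
            continue  # numbers differ between the two sides
        kept.append(pair)
    return kept
```

Because the embodiment requires only at least one of the conditions, the language check is optional here and is skipped when no `detect_lang` callback is supplied.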
In one embodiment, the method further comprises:
adding the sentence-aligned parallel corpus to the sentence-aligned parallel corpus database.
A bilingual corpus alignment apparatus comprising:
the parallel text acquisition module is used for acquiring a parallel text to be aligned, and the language type of the original text and the language type of the translated text in the parallel text to be aligned;
the preprocessing module is used for preprocessing the parallel text to be aligned to obtain parallel sentence pairs to be aligned;
the first word segmentation processing module is used for calling a monolingual word segmentation model corresponding to the language type of the original text from the monolingual word segmentation model group, and performing word segmentation processing on the original text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the original text to be aligned;
the second word segmentation processing module is used for calling a monolingual word segmentation model corresponding to the language type of the translated text from the monolingual word segmentation model group, and performing word segmentation processing on the translated text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the translated text to be aligned;
the format processing module is used for performing format processing on the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned according to a preset format processing mode to obtain a bilingual sentence pair group;
the bilingual dictionary acquisition module is used for acquiring, based on the preset format processing mode, a bilingual dictionary corresponding to the language type of the original text and the language type of the translated text;
the sentence alignment processing module is used for calling a sentence alignment tool, and performing sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain a sentence-aligned parallel corpus;
wherein the training mode of the monolingual word segmentation model comprises the following steps:
acquiring monolingual data corresponding to the language type of the monolingual word segmentation model to be trained;
preprocessing the monolingual data to obtain monolingual data samples;
and training on the monolingual data samples through the SentencePiece algorithm to obtain the monolingual word segmentation model.
A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method.
According to the bilingual corpus sentence alignment method, the bilingual corpus sentence alignment device, the computer-readable storage medium and the computer equipment, a parallel text to be aligned is acquired together with the language type of its original text and the language type of its translated text; the parallel text to be aligned is preprocessed to obtain parallel sentence pairs to be aligned; from a monolingual word segmentation model group trained by the SentencePiece algorithm, the monolingual word segmentation model corresponding to the language type of the original text is called to perform word segmentation processing on the original text in the parallel sentence pairs to be aligned, obtaining a sentence fragment group of the original text to be aligned, and the monolingual word segmentation model corresponding to the language type of the translated text is called to perform word segmentation processing on the translated text in the parallel sentence pairs to be aligned, obtaining a sentence fragment group of the translated text to be aligned; format processing is performed on the two sentence fragment groups according to a preset format processing mode to obtain a bilingual sentence pair group, which improves the sentence alignment accuracy of the sentence alignment tool; a bilingual dictionary corresponding to the language type of the original text and the language type of the translated text is obtained based on the preset format processing mode; and a sentence alignment tool is called to perform sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary, obtaining a sentence-aligned parallel corpus. Because the monolingual word segmentation model of each language is trained by the SentencePiece algorithm, a single processing flow can handle bilingual corpus sentence alignment for all required languages. This greatly reduces design difficulty and code complexity, lowers the code coupling degree, the maintenance difficulty and the maintenance cost, and, through the format processing of sentence fragments, improves sentence alignment accuracy so that the resulting sentence-aligned parallel corpus is more accurate.
Drawings
FIG. 1 is an application environment diagram of a bilingual corpus sentence alignment method in one embodiment;
FIG. 2 is a flow chart of a bilingual corpus sentence alignment method in one embodiment;
FIG. 3 is a flow diagram of training of a monolingual word segmentation model in one embodiment;
FIG. 4 is a flow diagram of training of a bilingual dictionary in one embodiment;
FIG. 5 is an application diagram of a bilingual corpus sentence alignment method in one embodiment;
FIG. 6 is a flowchart of a bilingual corpus sentence alignment method in one embodiment;
FIG. 7 is a block diagram of a bilingual corpus alignment apparatus in one embodiment;
FIG. 8 is a block diagram of a bilingual corpus alignment apparatus according to another embodiment;
FIG. 9 is a block diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
FIG. 1 is a diagram of an application environment for a bilingual corpus sentence alignment method in one embodiment. The application environment involves either the terminal 110 alone or the terminal 110 and the server 120, which are connected through a network. When only the terminal 110 is involved, the terminal 110 acquires the parallel text to be aligned and the language type of the original text and the language type of the translated text in it; preprocesses the parallel text; calls the monolingual word segmentation models and performs word segmentation processing on the original text and the translated text to obtain sentence fragment groups; performs format processing on the sentence fragment groups to obtain a bilingual sentence pair group; and calls a sentence alignment tool to perform sentence alignment processing on the bilingual sentence pair group based on the bilingual dictionary, obtaining a sentence-aligned parallel corpus. When both the terminal 110 and the server 120 are involved, the server 120 obtains the parallel text to be aligned sent by the terminal 110, together with the language type of the original text and the language type of the translated text in it; the server 120 then preprocesses the parallel text, calls the monolingual word segmentation models, and performs word segmentation processing on the original text and the translated text to obtain sentence fragment groups; performs format processing on the sentence fragment groups to obtain a bilingual sentence pair group; and calls a sentence alignment tool to perform sentence alignment processing on the bilingual sentence pair group based on the bilingual dictionary, obtaining a sentence-aligned parallel corpus. The terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
As shown in fig. 2, in one embodiment, a bilingual corpus sentence alignment method is provided. The present embodiment is mainly exemplified by the application of the method to the terminal 110 in fig. 1. Referring to fig. 2, the bilingual corpus alignment method specifically includes the following steps:
Step S220, obtaining the parallel text to be aligned and the language type of the original text and the language type of the translated text in the parallel text to be aligned.
The parallel text to be aligned is a bilingual text composed of an original text requiring sentence alignment and its corresponding translated text. It may be a chapter-level parallel text, and chapter-level parallel texts can be obtained from various channels, for example by crawling websites, from public datasets, or by manual translation. The original text can be either of the two texts in the parallel text to be aligned; once one text is determined to be the original text, the other is the translated text. The language type of the original text and the language type of the translated text may be any two of Chinese, English, Japanese, Korean, Spanish, Hindi, Vietnamese, French, Russian, German, Arabic, Italian, Portuguese, Turkish, Thai, Malay, and the like.
The language type of the original text and the language type of the translated text are determined by the languages used in the parallel text to be aligned. For example, if the parallel text to be aligned consists of a Chinese text and an English text that have the same meaning and are translations of each other, the Chinese text can be used as the original text and the English text as the translated text, in which case the language type of the original text is Chinese and the language type of the translated text is English; alternatively, the English text can be used as the original text and the Chinese text as the translated text, in which case the language type of the original text is English and the language type of the translated text is Chinese. The language types can be obtained from the user's language selection on the terminal or by detecting the parallel text.
In one embodiment, the language type of the original text and the language type of the translated text are determined by performing language type detection on the acquired parallel text to be aligned with a trained language detection model. The language detection model can be obtained by training based on a naive Bayes algorithm. Automatically detecting the language types of the original text and the translated text improves work efficiency.
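A naive Bayes language detector of the kind mentioned in this embodiment can be sketched as a multinomial model over character bigrams with add-one smoothing. The class name, the choice of bigram features and the toy training data in the usage note are assumptions; the text does not specify how the language detection model is built beyond the naive Bayes algorithm.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Character n-grams of the lower-cased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageDetector:
    """Multinomial naive Bayes over character n-grams (illustrative sketch)."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.sample_counts = Counter()            # language -> number of samples
        self.vocabulary = set()                   # all n-grams seen in training

    def fit(self, samples):
        """Train on an iterable of (text, language) pairs."""
        for text, lang in samples:
            grams = char_ngrams(text, self.n)
            self.ngram_counts[lang].update(grams)
            self.sample_counts[lang] += 1
            self.vocabulary.update(grams)
        return self

    def predict(self, text):
        """Return the language with the highest posterior log-probability."""
        total_samples = sum(self.sample_counts.values())
        vocab_size = len(self.vocabulary)
        grams = char_ngrams(text, self.n)
        best_lang, best_logp = None, -math.inf
        for lang, counts in self.ngram_counts.items():
            total = sum(counts.values())
            logp = math.log(self.sample_counts[lang] / total_samples)
            for gram in grams:
                # Add-one (Laplace) smoothing for unseen n-grams.
                logp += math.log((counts[gram] + 1) / (total + vocab_size))
            if logp > best_logp:
                best_lang, best_logp = lang, logp
        return best_lang
```

In use, the detector would be fitted once on labelled monolingual samples and then applied separately to the original text and the translated text; with only a handful of training sentences the result is unreliable, so a realistic model needs far more data per language.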
Step S240, preprocessing the parallel text to be aligned to obtain parallel sentence pairs to be aligned.
The preprocessing may include sentence splitting: the original text and the translated text of the parallel text to be aligned are each split into a plurality of sentences by an existing sentence splitting method, yielding the parallel sentence pairs to be aligned. The preprocessing may further include sentence removal, cleaning of illegal characters, removal of poor-quality corpus, full-width/half-width conversion, and the like. Illegal characters are, for example, control characters and emoticons; they can be looked up in the Unicode character table and deleted. Removing poor-quality corpus from the parallel text includes: removing corpus that consists of jumbled numbers and punctuation and clearly does not follow everyday usage, which can be filtered by the proportion of numbers and punctuation in a sentence; and removing corpus whose language is inconsistent, that is, text in language B mixed into corpus of language A, for which tools such as language identification can be used. Full-width/half-width conversion converts punctuation marks, numbers and the like in the original text and the translated text; for example, when English punctuation marks are used in Chinese text, they are converted into Chinese punctuation marks. After sentence splitting, cleaning of illegal characters, removal of poor-quality corpus, full-width/half-width conversion and similar processing, the parallel sentence pairs to be aligned are obtained.
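The cleaning, splitting and width-conversion steps of this preprocessing can be sketched with the standard library alone. The function names, the emoji code point range, the punctuation map and the sentence-end pattern are simplifying assumptions rather than the method's actual rules.

```python
import re
import unicodedata

# Half-width punctuation converted to full-width for Chinese text (assumed subset;
# "." is left out so that decimal numbers are not damaged).
_HALF_TO_FULL_PUNCT = str.maketrans(
    {",": "，", "?": "？", "!": "！", ":": "：", ";": "；", "(": "（", ")": "）"}
)

# A sentence ends at CJK or ASCII terminators, at "." before whitespace, or at the end.
_SENTENCE = re.compile(r".+?(?:[。！？!?]+|\.(?=\s|$)|$)", re.S)


def clean_illegal_chars(text: str) -> str:
    """Drop control/format characters and a common emoji range (assumed range)."""
    kept = []
    for ch in text:
        if unicodedata.category(ch) in ("Cc", "Cf") and not ch.isspace():
            continue
        if 0x1F300 <= ord(ch) <= 0x1FAFF:
            continue
        kept.append(ch)
    return "".join(kept)


def split_sentences(text: str) -> list:
    """Split text into sentences and discard empty pieces."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def normalize_width(text: str, lang: str) -> str:
    """Convert full-width digits to ASCII; for Chinese, use full-width punctuation."""
    text = "".join(chr(ord(c) - 0xFEE0) if "０" <= c <= "９" else c for c in text)
    return text.translate(_HALF_TO_FULL_PUNCT) if lang == "zh" else text


def preprocess(text: str, lang: str) -> list:
    """Clean, split and normalize one side of the parallel text."""
    return [normalize_width(s, lang) for s in split_sentences(clean_illegal_chars(text))]
```

Applying `preprocess` to the original text and to the translated text separately gives the two sentence lists from which the parallel sentence pairs to be aligned are formed.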
Step S260, calling a monolingual word segmentation model corresponding to the language type of the original text from the monolingual word segmentation model group, and performing word segmentation processing on the original text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the original text to be aligned.
The monolingual word segmentation model group comprises monolingual word segmentation models for Chinese, English, Japanese, Korean, Spanish, Hindi, Vietnamese, French, Russian, German, Arabic, Italian, Portuguese, Turkish, Thai, Malay and other languages. Each monolingual word segmentation model is a word segmentation model of the corresponding language type, obtained by training on monolingual data of that language type. The monolingual word segmentation model corresponding to the language type of the original text is called from the monolingual word segmentation model group; for example, when the language type of the original text in the parallel sentence pairs is Chinese, the Chinese monolingual word segmentation model is called. The called model is run, and word segmentation processing is performed on the original text in the parallel sentence pairs to be aligned according to the word segmentation vocabulary of the model: the words in the original text are segmented into sentence fragments, which form the sentence fragment group of the original text to be aligned. For example, when 'I am in Beijing Tengxue' is input to the monolingual word segmentation model, the output sentence fragment group may be 'I am in Beijing Tengxue'.
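Selecting a model from the monolingual word segmentation model group by language type can be sketched as a small registry that loads each model on first use. The class name, the language codes and the model file paths are hypothetical; the default loader assumes the `sentencepiece` Python package and its `SentencePieceProcessor(model_file=...)` and `encode(..., out_type=str)` calls.

```python
def load_sentencepiece(model_path):
    """Default loader; requires the third-party `sentencepiece` package."""
    import sentencepiece as spm
    return spm.SentencePieceProcessor(model_file=model_path)


class MonolingualModelGroup:
    """Maps a language type to its segmentation model (illustrative sketch)."""

    def __init__(self, model_paths, loader=load_sentencepiece):
        self._paths = dict(model_paths)  # e.g. {"zh": "models/zh.model"}
        self._loader = loader
        self._cache = {}

    def segment(self, lang, sentence):
        """Segment `sentence` with the model of language type `lang`."""
        if lang not in self._paths:
            raise KeyError(f"no monolingual model registered for {lang!r}")
        if lang not in self._cache:
            # Load lazily so that only the models actually needed are in memory.
            self._cache[lang] = self._loader(self._paths[lang])
        return self._cache[lang].encode(sentence, out_type=str)
```

Because every language goes through the same `segment` call, the alignment flow itself contains no language-specific tokenizer code, which is the coupling reduction the method aims at.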
In one embodiment, as shown in fig. 3, the training manner of the monolingual word segmentation model includes:
Step S262, obtaining the monolingual data corresponding to the language type of the monolingual word segmentation model to be trained.
Monolingual data is text containing only one language. Monolingual data of the corresponding language type can be crawled from various websites, and the language type of the monolingual word segmentation model to be trained can be any one of Chinese, English, Japanese, Korean, Spanish, Hindi, Vietnamese, French, Russian, German, Arabic, Italian, Portuguese, Turkish, Thai, Malay and the like. For any common language, whether widely or less widely spoken, the available monolingual corpus is practically unlimited, so the amount of monolingual data to collect can be chosen according to the required accuracy.
Step S264, preprocessing the monolingual data to obtain monolingual data samples.
The preprocessing includes cleaning of illegal characters, removal of poor-quality corpus, full-width/half-width conversion, and the like. Illegal characters in the monolingual data, such as control characters and emoticons, can be looked up in the Unicode character table and deleted. Removing poor-quality corpus from the monolingual data includes: removing corpus that consists of jumbled numbers and punctuation and clearly does not follow everyday usage, which can be filtered by the proportion of numbers and punctuation in a sentence; and removing corpus whose language is inconsistent, that is, text in language B mixed into corpus of language A, for which tools such as language identification can be used. Full-width/half-width conversion converts punctuation marks, numbers and the like; for example, when English punctuation marks are used in Chinese monolingual data, they are converted into Chinese punctuation marks. After sentence splitting, cleaning of illegal characters, removal of poor-quality corpus, full-width/half-width conversion and similar processing, the monolingual data samples are obtained.
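The proportion-based removal of poor-quality corpus can be sketched as below. The function names and the 0.5 threshold are assumptions chosen for illustration; the text only states that the proportion of numbers and punctuation in a sentence is used.

```python
import unicodedata


def noise_ratio(sentence: str) -> float:
    """Share of non-space characters that are digits or punctuation."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 1.0  # treat an empty sentence as pure noise
    noisy = sum(
        1 for c in chars
        if c.isdigit() or unicodedata.category(c).startswith("P")
    )
    return noisy / len(chars)


def is_poor_corpus(sentence: str, max_ratio: float = 0.5) -> bool:
    """Flag a sentence whose noise ratio exceeds the (assumed) threshold."""
    return noise_ratio(sentence) > max_ratio
```

Sentences for which `is_poor_corpus` returns true would be dropped before the monolingual data samples are passed on to training.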
And step S268, training a monolingual word segmentation model based on the monolingual data sample through a Sentence piece algorithm to obtain a monolingual word segmentation model corresponding to the language type.
The sentence piece algorithm is a word segmentation algorithm. Training a monolingual word segmentation model based on monolingual data through a Sentence piece algorithm, such as: using a Chinese monolingual data sample to carry out a monolingual word segmentation model by adopting a Sentence piece algorithm, and obtaining a Chinese monolingual word segmentation model and a word segmentation vocabulary of the monolingual word segmentation model; and carrying out a single-word segmentation model by using an English single-word data sample by adopting a Sentence piece algorithm, so as to obtain the English single-word segmentation model and a word segmentation vocabulary of the single-word segmentation model.
In one embodiment, the step of training the monolingual word segmentation model on the monolingual data samples by means of the SentencePiece algorithm includes: randomly initializing a sufficiently large vocabulary from the monolingual data sample; then executing the following steps in a loop until the word segmentation vocabulary reaches the specified size: (1) with the vocabulary fixed, optimizing the word probabilities p using the EM algorithm; (2) for each word in the vocabulary, calculating the loss that removing the word would cause; (3) removing the 20% of words whose removal causes the least loss, while retaining all single characters to avoid OOV (Out Of Vocabulary, meaning items that are not in the vocabulary; retaining all single characters prevents items absent from the vocabulary from appearing in the text). The coverage parameter of the vocabulary can be adjusted to balance the efficiency problem caused by vocabulary capacity against the accuracy problem caused by the frequency of unknown tokens (a token may be a word, a word fragment, or a sentence fragment, depending on the segmentation mode). For example, for languages such as Chinese and Japanese, whose constituent characters are too numerous to enumerate exhaustively, the vocabulary coverage parameter is set to 0.9995, and for other languages it is set to 1. The size of the word segmentation vocabulary can be enlarged appropriately for the main translation languages involved (such as Chinese and English), which determines the final vocabulary size parameter. For example, the vocabulary size is set to 48000 for Chinese and English and to 32000 for the remaining languages.
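The pruning part of the loop above, steps (2) and (3), can be sketched as follows. This is a simplified illustration, not the SentencePiece implementation: the per-piece losses are supplied by the caller and kept fixed, whereas the real procedure re-estimates probabilities with EM and recomputes the losses in every round. The function names are hypothetical, and treating the always-retained entries as the single-character pieces is an assumption of this sketch.

```python
from typing import Dict


def prune_round(losses: Dict[str, float], drop_ratio: float = 0.2) -> Dict[str, float]:
    """Remove the drop_ratio share of pieces whose removal costs the least.

    Single-character pieces are never removed, so every character that
    appears in the vocabulary can still be encoded afterwards.
    """
    n_drop = int(len(losses) * drop_ratio)
    # Only multi-character pieces are candidates for removal, cheapest first.
    candidates = sorted((p for p in losses if len(p) > 1), key=lambda p: losses[p])
    dropped = set(candidates[:n_drop])
    return {p: loss for p, loss in losses.items() if p not in dropped}


def prune_to_size(losses: Dict[str, float], target_size: int) -> Dict[str, float]:
    """Repeat pruning rounds until the vocabulary is no larger than target_size."""
    vocab = dict(losses)
    while len(vocab) > target_size:
        pruned = prune_round(vocab)
        if len(pruned) == len(vocab):  # nothing left that may be removed
            break
        vocab = pruned
    return vocab
```

Because each round removes a fixed share of the current vocabulary, the vocabulary shrinks geometrically, and the last round may end slightly below the target size.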
With this method, monolingual word segmentation models for multiple languages can be trained using the SentencePiece algorithm alone. Sentence alignment is therefore no longer constrained by the need for a different word segmentation tool for each language (for example, jieba, a Chinese word segmentation tool, for Chinese; mecab, a Japanese word segmentation tool, for Japanese; and the ko extension of mecab for Korean), and the problem of depending on different operating environments (for example, mecab requires additional C++ support, and the ko extension of mecab runs only under Python 3.7) is solved.
Step S280, calling the monolingual word segmentation model corresponding to the language type of the translated text from the monolingual word segmentation model group, and performing word segmentation on the translated text in the parallel sentence pair to be aligned to obtain a sentence fragment group of the translated text to be aligned.
The monolingual word segmentation model group includes monolingual word segmentation models for Chinese, English, Japanese, Korean, Spanish, Hindi, Vietnamese, French, Russian, German, Arabic, Italian, Portuguese, Turkish, Thai, Malay, and other languages. Each monolingual word segmentation model is obtained by training on monolingual data of the corresponding language type; the construction method is not repeated here. The monolingual word segmentation model corresponding to the language type of the translated text is called from the monolingual word segmentation model group; for example, when the language type of the translated text in the parallel sentence pair to be aligned is English, the English monolingual word segmentation model is called. The called model is run, and word segmentation is performed on the translated text in the parallel sentence pair to be aligned according to the model's word segmentation vocabulary, splitting the words in the translated text into sentence fragments; these sentence fragments form the sentence fragment group of the translated text to be aligned.
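The lookup of a model by language type can be pictured as a simple registry. In the sketch below, the model group is a dictionary keyed by language code, and each "model" is a stand-in greedy longest-match segmenter built from a tiny vocabulary. This is not how SentencePiece segments text internally; the language codes, function names, and vocabularies are all hypothetical and serve only to show the calling pattern and the U+2581 marker that replaces spaces.

```python
from typing import Callable, Dict, List, Set

Segmenter = Callable[[str], List[str]]
MARK = "\u2581"  # marker used in place of spaces and at the start of the text


def make_greedy_segmenter(vocab: Set[str]) -> Segmenter:
    """Build a stand-in segmenter: longest vocabulary match, else one character."""
    max_len = max(map(len, vocab))

    def segment(text: str) -> List[str]:
        text = MARK + text.replace(" ", MARK)
        pieces, i = [], 0
        while i < len(text):
            # Try the longest candidate first; a single character always succeeds.
            for j in range(min(len(text), i + max_len), i, -1):
                if text[i:j] in vocab or j == i + 1:
                    pieces.append(text[i:j])
                    i = j
                    break
        return pieces

    return segment


def segment_by_language(model_group: Dict[str, Segmenter], lang: str, text: str) -> List[str]:
    """Call the model registered for the given language type."""
    return model_group[lang](text)
```

With an English entry whose vocabulary contains the marked forms of "I", "love", and "you", the sentence "I love you" yields three marked fragments; with a Chinese entry, the first fragment of "我爱你" carries the marker while the others do not, which is the situation the format processing step addresses.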
Step S300, performing format processing on the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned according to a preset format processing mode to obtain a bilingual sentence pair group.
The bilingual sentence pair group consists of the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned after format processing. Format processing prevents the sentence fragment groups from retaining the symbol "▁" (the character numbered 2581 in the Unicode character table), which the monolingual word segmentation model uses to mark the beginning of a text and which appears as an underline before a word fragment (the underline replaces a space when the model performs word segmentation). For languages that do not normally contain spaces, such as Chinese and Japanese, this marker can cause the same word to correspond to two different tokens; for example, after word segmentation "我" ("I") may correspond to both "▁我" and "我", which reduces the accuracy of sentence alignment between the sentence fragment group of the original text to be aligned and that of the translated text to be aligned.
In one embodiment, the preset format processing mode may be: obtaining the sentence fragment group to be format-processed; detecting the underline symbols in the sentence fragment group; and removing the detected underline symbols from the sentence fragment group.
Underline symbols are detected separately in the sentence fragment group of the original text to be aligned and in the sentence fragment group of the translated text to be aligned, and the detected underline symbols are removed from the respective group. The underline symbols can be located by searching the sentence fragment group and then replaced or deleted. Removing the detected underline symbols from both groups yields the bilingual sentence pair group, which improves the accuracy of sentence alignment of the bilingual sentence pair group.
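A minimal sketch of this unconditional format processing is given below. It assumes each sentence fragment group is a list of strings and that the underline symbol is the U+2581 character; the function name is hypothetical, and dropping fragments that become empty is a choice made for this sketch.

```python
from typing import List

MARK = "\u2581"  # the underline symbol left behind by word segmentation


def strip_marks(fragments: List[str]) -> List[str]:
    """Delete the underline symbol from every fragment and drop empty results."""
    cleaned = (fragment.replace(MARK, "") for fragment in fragments)
    return [fragment for fragment in cleaned if fragment]
```

Applying the same function to both the original-text group and the translated-text group gives the bilingual sentence pair group.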
In one embodiment, the preset format processing mode may alternatively be: obtaining the sentence fragment group to be format-processed and its language type; determining, according to the language type of the sentence fragment group, whether the sentence fragment group is a format processing object; and, when the sentence fragment group is a format processing object, detecting the underline symbols in the sentence fragment group and removing the detected underline symbols from the sentence fragment group.
The sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned are each taken as a sentence fragment group to be format-processed. Whether the sentence fragment group of the original text to be aligned is a format processing object is determined according to its language type; when it is, the underline symbols in it are detected and removed. Likewise, whether the sentence fragment group of the translated text to be aligned is a format processing object is determined according to its language type; when it is, the underline symbols in it are detected and removed.
Whether a language type is a format processing object may be determined by whether the language itself normally contains spaces. Language types that do not normally contain spaces are format processing objects, for example Chinese, Japanese, Korean, and Thai; language types that normally contain spaces are not, for example English and French. The underline symbols can be located by searching the sentence fragment groups that are format processing objects and then replaced or deleted. Detecting the underline symbols in those sentence fragment groups and removing them yields the bilingual sentence pair group, which improves the accuracy of sentence alignment of the bilingual sentence pair group.
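The language-dependent variant can be sketched as follows. The set of format processing objects mirrors the languages named in the text; the two-letter language codes and the function name are assumptions introduced for illustration.

```python
from typing import List

MARK = "\u2581"  # the underline symbol left behind by word segmentation

# Languages that, per the text, do not normally contain spaces.
FORMAT_PROCESSING_OBJECTS = {"zh", "ja", "ko", "th"}


def format_fragments(fragments: List[str], lang: str) -> List[str]:
    """Remove the underline symbol only for languages that are format processing objects."""
    if lang not in FORMAT_PROCESSING_OBJECTS:
        return list(fragments)  # e.g. English, French: left untouched
    cleaned = (fragment.replace(MARK, "") for fragment in fragments)
    return [fragment for fragment in cleaned if fragment]
```

Keeping the marker for space-delimited languages preserves the distinction between a word-initial fragment and a word-internal one, which carries real information in those languages.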
Step S320, obtaining a bilingual dictionary that is based on the preset format processing mode and corresponds to the language type of the original text and the language type of the translated text.
The two language types of the obtained bilingual dictionary correspond to the language type of the original text and the language type of the translated text, and the sentence fragments in the bilingual dictionary have been processed in the same preset format processing mode. Each pair of language types has a corresponding bilingual dictionary. For example, when the language types of the original text and the translated text are Chinese and English, the obtained bilingual dictionary is a Chinese-English dictionary; when they are Chinese and Korean, the obtained bilingual dictionary is a Chinese-Korean dictionary; and so on.
In one embodiment, as shown in FIG. 4, the bilingual dictionary is constructed by steps S322 to S332:
Step S322, obtaining, from the parallel corpus database, sentence-pair parallel corpus samples corresponding to the language types of the bilingual dictionary to be trained, wherein the language types of the bilingual dictionary to be trained include the language type of the original-text corpus and the language type of the translated-text corpus.
The parallel corpus database is a database that stores sentence-pair parallel corpora for each language pair, for example Chinese-English, Chinese-Japanese, and Korean-English sentence-pair parallel corpora. The bilingual dictionary to be trained is the bilingual dictionary that needs to be trained; its language types include the language type of the original-text corpus and the language type of the translated-text corpus, and may be any two of Chinese, English, Japanese, Korean, Spanish, Hindi, Vietnamese, French, Russian, German, Arabic, Italian, Portuguese, Turkish, Thai, Malay, and the like. A sentence-pair parallel corpus consists of sentences in two languages that are translations of each other. The sentence-pair parallel corpus samples are the sentence-pair parallel corpora used to train the bilingual dictionary to be trained, and their number can be determined according to the required construction accuracy.
Sentence-pair parallel corpus samples corresponding to the language types of the bilingual dictionary to be trained are obtained from the parallel corpus database. For example, when a Chinese-English bilingual dictionary needs to be trained, Chinese-English sentence-pair parallel corpora are obtained from the parallel corpus database as the sentence-pair parallel corpus samples.
Step S324, preprocessing the sentence-pair parallel corpus samples to obtain sentence-pair parallel corpus pairs.
The preprocessing includes cleaning illegal characters, removing poor-quality corpus, full-width/half-width conversion, and the like. Illegal characters, such as control characters and emoticons, are cleaned from both the original-text corpus and the translated-text corpus of the sentence-pair parallel corpus samples; they can be located by looking them up in the Unicode character table and then deleted. Removing poor-quality corpus from both the original-text corpus and the translated-text corpus includes: removing corpus that consists largely of disordered numbers and punctuation and clearly does not conform to everyday usage, which can be filtered according to the proportion of numbers and punctuation in a sentence; and removing corpus whose language is inconsistent, that is, language B corpus mixed into language A corpus, for which tools such as language identification can be used. Full-width/half-width conversion converts punctuation marks, numbers, and the like in both the original-text corpus and the translated-text corpus; for example, where English punctuation marks appear in Chinese text, they are converted into Chinese punctuation marks. After cleaning of illegal characters, removal of poor-quality corpus, full-width/half-width conversion, and the like are performed on the original-text corpus and the translated-text corpus of the sentence-pair parallel corpus samples, the sentence-pair parallel corpus pairs are obtained.
Step S326, calling the monolingual word segmentation model corresponding to the language type of the original-text corpus from the monolingual word segmentation model group, and performing word segmentation on the original-text corpus in the sentence-pair parallel corpus pair to obtain a sentence fragment group of the sample original text.
The monolingual word segmentation model group includes monolingual word segmentation models for Chinese, English, Japanese, Korean, Spanish, Hindi, Vietnamese, French, Russian, German, Arabic, Italian, Portuguese, Turkish, Thai, Malay, and other languages. Each monolingual word segmentation model is obtained by training on monolingual data of the corresponding language type; the construction method is not repeated here. The monolingual word segmentation model corresponding to the language type of the original-text corpus is called from the monolingual word segmentation model group; for example, when the language type of the original-text corpus in the sentence-pair parallel corpus pair is Chinese, the Chinese monolingual word segmentation model is called. The called model is run, and word segmentation is performed on the original-text corpus in the sentence-pair parallel corpus pair according to the model's word segmentation vocabulary, splitting the words in the original-text corpus into sentence fragments; these sentence fragments form the sentence fragment group of the sample original text.
Step S328, calling the monolingual word segmentation model corresponding to the language type of the translated-text corpus from the monolingual word segmentation model group, and performing word segmentation on the translated-text corpus in the sentence-pair parallel corpus pair to obtain a sentence fragment group of the sample translated text.
The monolingual word segmentation model group is as described above, and its construction method is not repeated here. The monolingual word segmentation model corresponding to the language type of the translated-text corpus is called from the monolingual word segmentation model group; for example, when the language type of the translated-text corpus in the sentence-pair parallel corpus pair is English, the English monolingual word segmentation model is called. The called model is run, and word segmentation is performed on the translated-text corpus in the sentence-pair parallel corpus pair according to the model's word segmentation vocabulary, splitting the words in the translated-text corpus into sentence fragments; these sentence fragments form the sentence fragment group of the sample translated text.
Step S330, performing format processing on the sentence fragment group of the sample original text and the sentence fragment group of the sample translated text according to the preset format processing mode to obtain a bilingual sentence pair sample group.
The bilingual sentence pair sample group consists of the sentence fragment group of the sample original text and the sentence fragment group of the sample translated text after format processing. Format processing prevents the sentence fragment groups from retaining the symbol "▁" (the character numbered 2581 in the Unicode character table), which the monolingual word segmentation model uses to mark the beginning of a text and which appears as an underline before a word fragment (the underline replaces a space when the model performs word segmentation). For languages that do not normally contain spaces, such as Chinese and Japanese, this marker can cause the same word to correspond to two different tokens; for example, after word segmentation "我" ("I") may correspond to both "▁我" and "我". As a result, word-pair translation probabilities could not be counted accurately when the bilingual dictionary is subsequently constructed, and the dictionary would be of little practical use.
In one embodiment, the preset format processing mode may be: obtaining the sentence fragment group to be format-processed; detecting the underline symbols in the sentence fragment group; and removing the detected underline symbols from the sentence fragment group.
Underline symbols are detected separately in the sentence fragment group of the sample original text and in the sentence fragment group of the sample translated text, and the detected underline symbols are removed from the respective group. The underline symbols can be located by searching the sentence fragment group and then replaced or deleted. Removing the detected underline symbols from both groups yields the bilingual sentence pair sample group. This avoids the problem that word-pair translation probabilities cannot be counted accurately when the bilingual dictionary is subsequently constructed, and thereby further improves the accuracy of sentence alignment based on the bilingual dictionary.
In one embodiment, the preset format processing mode may alternatively be: obtaining the sentence fragment group to be format-processed and its language type; determining, according to the language type of the sentence fragment group, whether the sentence fragment group is a format processing object; and, when the sentence fragment group is a format processing object, detecting the underline symbols in the sentence fragment group and removing the detected underline symbols from the sentence fragment group.
The sentence fragment group of the sample original text and the sentence fragment group of the sample translated text are each taken as a sentence fragment group to be format-processed. Whether the sentence fragment group of the sample original text is a format processing object is determined according to its language type; when it is, the underline symbols in it are detected and removed. Likewise, whether the sentence fragment group of the sample translated text is a format processing object is determined according to its language type; when it is, the underline symbols in it are detected and removed.
Whether a language type is a format processing object may be determined by whether the language itself normally contains spaces. Language types that do not normally contain spaces are format processing objects, for example Chinese, Japanese, Korean, and Thai; language types that normally contain spaces are not, for example English and French. The underline symbols can be located by searching the sentence fragment groups that are format processing objects and then replaced or deleted. Detecting the underline symbols in those sentence fragment groups and removing them yields the bilingual sentence pair sample group. This avoids the problem that word-pair translation probabilities cannot be counted accurately when the bilingual dictionary is subsequently constructed, and thereby further improves the accuracy of sentence alignment using the bilingual dictionary.
Step S332, performing alignment on the bilingual sentence pair sample group by means of a bilingual word pair extraction algorithm to obtain the bilingual dictionary.
The bilingual word pair extraction algorithm may be the FastAlign algorithm, the HunDict algorithm, or the like. The language types of the bilingual dictionary correspond to those of the sentence-pair parallel corpus samples. For example, when the samples are Chinese-English sentence-pair parallel corpora, a Chinese-English bilingual dictionary is obtained; when the samples are Chinese-Japanese sentence-pair parallel corpora, a Chinese-Japanese bilingual dictionary is obtained; and so on.
The FastAlign algorithm is used to align the bilingual sentence pair sample group and outputs a conditional probability matrix, that is, the bilingual dictionary. Specifically, this is a table giving the probability that word a in language A is translated into word b in language B; for example, the probability that English "I" is translated into Chinese "我" is 0.95. After the conditional probability matrix is output, word pairs with a low probability of occurrence are removed according to an absolute threshold (FastAlign itself uses a relative threshold), which can be set to -9.0, and word pairs that do not meet the language requirement are cleaned out, making the bilingual dictionary more accurate.
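The absolute-threshold filtering of the output table can be sketched as below. The input is assumed to be an iterable of (source word, target word, score) triples; the negative default threshold suggests that the scores are on a logarithmic scale, but that reading, like the function name and the sample values in the usage note, is an assumption of this sketch rather than something stated in the text.

```python
from typing import Dict, Iterable, Tuple

WordPair = Tuple[str, str]


def filter_dictionary(
    entries: Iterable[Tuple[str, str, float]], threshold: float = -9.0
) -> Dict[WordPair, float]:
    """Keep only the word pairs whose score reaches the absolute threshold."""
    return {(src, tgt): score for src, tgt, score in entries if score >= threshold}
```

For instance, given the hypothetical entries ("I", "我", -0.05) and ("I", "的", -12.3), only the first pair survives with the default threshold. An absolute cut-off treats every source word the same way, whereas a relative one would keep the best few translations of each word regardless of how unlikely they are.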
Step S340, calling a sentence alignment tool and performing sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain the sentence-pair parallel corpus.
The sentence alignment tool may be the HunAlign tool. On the basis of the corresponding format-processed bilingual dictionary among the multilingual bilingual dictionaries, the sentence alignment tool computes scores by modeling features of the bilingual sentence pair group, such as sentence length and the number of mutually translated words between two sentences, searches for the specific sentence alignment relation, and obtains and outputs the alignment result with the maximum score. The output alignment result includes the aligned sentence pairs and their corresponding scores; the aligned sentence pairs constitute the sentence-pair parallel corpus.
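To make the idea of scoring by sentence length and mutually translated words concrete, a toy score is sketched below. It is not HunAlign's actual scoring formula: the equal weighting of the two features, the function name, and the representation of the dictionary as a set of (source, target) word pairs are all assumptions chosen for illustration.

```python
from typing import List, Set, Tuple


def toy_pair_score(src: List[str], tgt: List[str], dictionary: Set[Tuple[str, str]]) -> float:
    """Score a candidate sentence pair in [0, 1] from two simple features."""
    if not src or not tgt:
        return 0.0
    # Share of source fragments that have at least one dictionary translation in the target.
    matched = sum(1 for s in src if any((s, t) in dictionary for t in tgt))
    match_ratio = matched / len(src)
    # Similarity of the two sentence lengths, measured in fragments.
    length_ratio = min(len(src), len(tgt)) / max(len(src), len(tgt))
    return 0.5 * match_ratio + 0.5 * length_ratio
```

An aligner would evaluate such a score for many candidate pairings and keep the overall assignment with the highest total, which is why fragments in the text and entries in the dictionary need to be formatted identically.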
According to the above bilingual corpus sentence alignment method, the parallel text to be aligned and the language types of the original text and the translated text in it are obtained; the parallel text to be aligned is preprocessed to obtain the parallel sentence pair to be aligned; the monolingual word segmentation model corresponding to the language type of the original text is called from the monolingual word segmentation model group trained by the SentencePiece algorithm, and word segmentation is performed on the original text in the parallel sentence pair to be aligned to obtain the sentence fragment group of the original text to be aligned; the monolingual word segmentation model corresponding to the language type of the translated text is called from the monolingual word segmentation model group, and word segmentation is performed on the translated text in the parallel sentence pair to be aligned to obtain the sentence fragment group of the translated text to be aligned; format processing is performed on both sentence fragment groups according to the preset format processing mode to obtain the bilingual sentence pair group, which improves the sentence alignment accuracy of the sentence alignment tool; the bilingual dictionary that is based on the preset format processing mode and corresponds to the language type of the original text and the language type of the translated text is obtained; and the sentence alignment tool is called to perform sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain the sentence-pair parallel corpus.
Because the monolingual word segmentation models of all languages are trained by the SentencePiece algorithm, a single processing flow can handle bilingual corpus alignment for all required languages, which greatly reduces design difficulty and code complexity, lowers code coupling and maintenance difficulty, and reduces maintenance cost. In addition, the sentence fragments are format-processed and sentence alignment uses a bilingual dictionary processed in the same format, which improves the accuracy of word and sentence alignment and makes the resulting sentence-pair parallel corpus more accurate.
In one embodiment, after the step of calling the sentence alignment tool and performing sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain the sentence-pair parallel corpus, the method further includes: filtering the sentence-pair parallel corpus based on preset filtering conditions to obtain the filtered sentence-pair parallel corpus.
The preset filtering conditions include: (1) Checking whether empty sentences exist in the sentence-pair parallel corpus and filtering them out, an empty sentence being a sentence that has no corresponding sentence. (2) Filtering out, according to a preset value, sentence pairs in the sentence-pair parallel corpus whose score is smaller than the preset value. Sentence pairs with lower scores are removed according to the HunAlign score; this score reflects the alignment quality of the sentences and is also where the effect of improving the accuracy of the bilingual dictionary shows. Two processing strategies are available here. If a relatively large amount of filtering follows (for example, further filtering based on a machine translation engine), a relatively loose threshold such as 0.1 or 0.2 can be chosen, which removes only the clearly misaligned sentence pairs so as to pass as many sentence pairs as possible to the later filtering. Because HunAlign by default assigns an alignment score of 0.3 to two sentences whose token count is 1 even when they are not aligned, sentence pairs with a score of 0.3 can additionally be removed to reduce possible errors. If little filtering follows, a relatively conservative threshold such as 0.8 can be chosen, so that most of the remaining sentence pairs are highly accurate, at the cost of a correspondingly lower recall of sentence pairs.
(3) Filtering out, according to the language type of the original text and the language type of the translated text, sentences in the sentence-pair parallel corpus whose language does not match. For example, when a Chinese-English sentence-pair parallel corpus is to be obtained, language identification is performed on the sentences in the corpus, and the sentences that are not Chinese or English are filtered out. (4) Filtering out, according to number consistency, sentence pairs whose numbers do not match: sentence pairs in which the two sentences contain different numbers are removed, which further cleans the parallel sentences.
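Filtering conditions (1), (2), and (4) can be sketched as a single predicate, shown below. Condition (3) is omitted because language identification requires an external tool. The class and function names, the default threshold of 0.2 (the loose strategy), and the regular expression used to extract numbers are assumptions made for this sketch.

```python
import math
import re
from typing import List, NamedTuple


class AlignedPair(NamedTuple):
    source: str
    target: str
    score: float


_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_match(source: str, target: str) -> bool:
    """True when both sentences contain the same multiset of numbers."""
    return sorted(_NUMBER.findall(source)) == sorted(_NUMBER.findall(target))


def keep_pair(pair: AlignedPair, min_score: float = 0.2) -> bool:
    """Apply the empty-sentence, score, and number-consistency conditions."""
    if not pair.source.strip() or not pair.target.strip():
        return False  # (1) one side has no corresponding sentence
    if pair.score < min_score:
        return False  # (2) alignment score below the preset value
    if math.isclose(pair.score, 0.3):
        return False  # (2) default score of an unaligned one-token pair
    return numbers_match(pair.source, pair.target)  # (4)


def filter_pairs(pairs: List[AlignedPair], min_score: float = 0.2) -> List[AlignedPair]:
    """Return the pairs that satisfy every condition."""
    return [pair for pair in pairs if keep_pair(pair, min_score)]
```

Passing `min_score=0.8` instead corresponds to the conservative strategy, which keeps fewer but more reliable pairs.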
In one embodiment, the method further comprises: adding the sentence-aligned parallel corpus to the sentence-aligned parallel corpus library.
After sentence alignment is performed on the parallel text to be aligned, given the language type of its original text and the language type of its translated text, the sentence-aligned parallel corpus for that language pair is obtained. This corpus can be added to the sentence-aligned parallel corpus library and used as sentence-aligned parallel corpus samples for training the bilingual dictionary. Training the bilingual dictionary further on these samples yields a more accurate bilingual dictionary.
In one embodiment, referring to fig. 5, the bilingual corpus sentence alignment method is applied to obtain sentence-aligned parallel corpora for multiple language pairs and to populate a sentence-aligned parallel corpus library. The corpora in the library can serve as the samples required for training, validating, or testing a machine translation model, so as to build multilingual translation products such as translation software, simultaneous interpretation software, and education software, including but not limited to multilingual machine translation models covering languages such as Japanese and Korean.
Referring to fig. 6, before bilingual corpus sentence alignment is performed, a monolingual word segmentation model group containing the monolingual word segmentation models of the various languages is obtained by training. Each monolingual word segmentation model is trained as follows: acquire monolingual data corresponding to the language type of the monolingual word segmentation model to be trained; preprocess the monolingual data to obtain monolingual data samples; and train on the monolingual data samples with the SentencePiece algorithm to obtain the monolingual word segmentation model.
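A minimal Python sketch of the per-language preparation step is given below. The specific preprocessing (Unicode NFKC normalization, whitespace collapsing, removal of empty and duplicate lines) is an assumption, since this passage does not say what the preprocessing consists of. The helper names, the `seg_<lang>` model prefix, and the default vocabulary size are likewise hypothetical.

```python
import unicodedata
from typing import Dict, Iterable, List


def preprocess_monolingual(lines: Iterable[str]) -> List[str]:
    """Normalize, collapse whitespace, and drop empty or duplicate lines (assumed steps)."""
    seen = set()
    samples: List[str] = []
    for line in lines:
        text = " ".join(unicodedata.normalize("NFKC", line).split())
        if text and text not in seen:
            seen.add(text)
            samples.append(text)
    return samples


def trainer_kwargs(lang: str, corpus_path: str, vocab_size: int = 32000) -> Dict[str, object]:
    """Build keyword arguments for training one monolingual model per language."""
    return {
        "input": corpus_path,            # file with one preprocessed sentence per line
        "model_prefix": f"seg_{lang}",   # hypothetical naming scheme for the model group
        "vocab_size": vocab_size,
        "model_type": "unigram",
    }
```

If the `sentencepiece` Python package is installed, these arguments could be passed as `sentencepiece.SentencePieceTrainer.train(**trainer_kwargs("ja", "ja.txt"))`, giving one model file per language in the model group.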
The bilingual dictionary for each language pair is obtained by the following training method: obtain, from the sentence-aligned parallel corpus library, sentence-aligned parallel corpus samples corresponding to the language types of the bilingual dictionary to be trained, where these language types comprise the language type of the original-text corpus and the language type of the translated-text corpus; preprocess the sentence-aligned parallel corpus samples to obtain parallel corpus pairs; invoke, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the original-text corpus, and segment the original-text corpus in the parallel corpus pairs to obtain the sentence fragment group of the sample original text; invoke, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the translated-text corpus, and segment the translated-text corpus in the parallel corpus pairs to obtain the sentence fragment group of the sample translated text; apply the preset format processing mode to the sentence fragment group of the sample original text and the sentence fragment group of the sample translated text to obtain a bilingual sentence pair sample group; and align the bilingual sentence pair sample group with the FastAlign algorithm to obtain the bilingual dictionary.
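The last two steps can be illustrated with a short Python sketch. It assumes the fast_align tool is run externally: its input is one `source ||| target` line per pair, and its output is one line of `i-j` links per pair, where `i` and `j` index the source and target pieces. How the dictionary is actually derived from the alignments is not specified in this passage; choosing the most frequently aligned target piece for each source piece is an illustrative assumption, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def to_fast_align_line(src_pieces: List[str], tgt_pieces: List[str]) -> str:
    """Join one segmented pair in the 'source ||| target' input format."""
    return f"{' '.join(src_pieces)} ||| {' '.join(tgt_pieces)}"


def build_dictionary(
    pairs: Iterable[Tuple[List[str], List[str]]],
    alignments: Iterable[str],
    min_count: int = 1,
) -> Dict[str, str]:
    """Map each source piece to its most frequently aligned target piece."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for (src, tgt), line in zip(pairs, alignments):
        for link in line.split():
            i, j = map(int, link.split("-"))  # i: source index, j: target index
            counts[src[i]][tgt[j]] += 1
    dictionary: Dict[str, str] = {}
    for src_piece, tgt_counts in counts.items():
        tgt_piece, n = tgt_counts.most_common(1)[0]
        if n >= min_count:  # discard rarely observed links
            dictionary[src_piece] = tgt_piece
    return dictionary
```

Raising `min_count` trades dictionary coverage for precision, which matters because dictionary quality feeds directly into the sentence alignment score.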
When bilingual corpus sentence alignment is performed, the procedure is as follows: acquire the parallel text to be aligned, together with the language type of the original text and the language type of the translated text in that parallel text; preprocess the parallel text to be aligned to obtain parallel sentence pairs to be aligned; invoke, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the original text, and segment the original text in the parallel sentence pairs to be aligned to obtain the sentence fragment group of the original text to be aligned; invoke, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the translated text, and segment the translated text in the parallel sentence pairs to be aligned to obtain the sentence fragment group of the translated text to be aligned; apply the preset format processing mode to the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned to obtain a bilingual sentence pair group; obtain, based on the preset format processing mode, the bilingual dictionary corresponding to the language type of the original text and the language type of the translated text; and invoke a sentence alignment tool to perform sentence alignment on the bilingual sentence pair group according to the bilingual dictionary, obtaining the sentence-aligned parallel corpus.
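If the sentence alignment tool is HunAlign, as the score discussion in the filtering section suggests, the hand-off to and from the tool might look like the Python sketch below. It assumes HunAlign's dictionary file format, in which each line holds the target phrase first and the source phrase second, separated by ` @ `, and its text output mode, in which each line holds the source segment, the target segment, and a score separated by tabs. The function names are hypothetical, and running the tool itself is left out.

```python
from typing import Dict, Iterable, List, Tuple


def to_hunalign_dic_lines(dictionary: Dict[str, str]) -> List[str]:
    """Serialize a source->target dictionary as 'target @ source' lines."""
    return [f"{tgt} @ {src}" for src, tgt in dictionary.items()]


def parse_hunalign_text(lines: Iterable[str]) -> List[Tuple[str, str, float]]:
    """Parse tab-separated 'source<TAB>target<TAB>score' lines; skip malformed ones."""
    rows: List[Tuple[str, str, float]] = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        src, tgt, score = parts
        try:
            value = float(score)
        except ValueError:
            continue
        # An empty side means the tool found no counterpart for that segment.
        rows.append((src.strip(), tgt.strip(), value))
    return rows
```

The parsed `(source, target, score)` rows are exactly what a later filtering stage needs to drop empty sentences and low-scoring pairs.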
By using a SentencePiece-based monolingual word segmentation model for each language, and by running the bilingual word pair extraction algorithm and sentence alignment at sentence-fragment granularity, the extraction and alignment steps are no longer constrained by the different word segmentation tools of the various languages. Low-resource languages with few dedicated tools can be handled in the same way, and a single processing flow can produce sentence-aligned parallel corpora for all required languages at once. This greatly simplifies the system design and the code, and reduces code coupling and maintenance difficulty. When the bilingual dictionary is trained, format processing is applied to the sentence fragment groups so that they better suit the dictionary extraction flow; the word co-occurrence statistics and the conditional probability calculations become more accurate, which improves the accuracy of the bilingual dictionary. When bilingual corpus sentence alignment is performed, the same format processing is applied to the sentence fragment groups, which further improves the accuracy of the sentence alignment tool and therefore of the resulting sentence-aligned parallel corpus.
Fig. 2 is a flow chart of a bilingual corpus sentence alignment method in an embodiment. It should be understood that, although the steps in the flowchart of fig. 2 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times; nor are these sub-steps or stages necessarily performed in sequence, and they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
Referring to fig. 7, a bilingual corpus sentence alignment device includes: a parallel text acquisition module 310, a preprocessing module 320, a first word segmentation processing module 330, a second word segmentation processing module 340, a format processing module 350, a bilingual dictionary acquisition module 360, and a sentence alignment processing module 370.
The parallel text acquisition module 310 is configured to acquire the parallel text to be aligned, together with the language type of the original text and the language type of the translated text in the parallel text to be aligned.
The preprocessing module 320 is configured to preprocess the parallel text to be aligned to obtain parallel sentence pairs to be aligned.
The first word segmentation processing module 330 is configured to invoke, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the original text, and to segment the original text in the parallel sentence pairs to be aligned to obtain the sentence fragment group of the original text to be aligned.
The second word segmentation processing module 340 is configured to invoke, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the translated text, and to segment the translated text in the parallel sentence pairs to be aligned to obtain the sentence fragment group of the translated text to be aligned.
The format processing module 350 is configured to apply a preset format processing mode to the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned, so as to obtain a bilingual sentence pair group.
The bilingual dictionary acquisition module 360 is configured to obtain, based on the preset format processing mode, the bilingual dictionary corresponding to the language type of the original text and the language type of the translated text.
The sentence alignment processing module 370 is configured to invoke a sentence alignment tool and perform sentence alignment on the bilingual sentence pair group according to the bilingual dictionary, so as to obtain a sentence-aligned parallel corpus.
The training mode of the monolingual word segmentation model comprises the following steps:
acquiring monolingual data corresponding to the language type of the monolingual word segmentation model to be trained; preprocessing the monolingual data to obtain monolingual data samples; and training on the monolingual data samples with the SentencePiece algorithm to obtain the monolingual word segmentation model.
Referring to fig. 8, in one embodiment, the bilingual corpus sentence alignment device further includes a bilingual dictionary training module 380, configured to: obtain, from the sentence-aligned parallel corpus library, sentence-aligned parallel corpus samples corresponding to the language types of the bilingual dictionary to be trained, where these language types comprise the language type of the original-text corpus and the language type of the translated-text corpus; preprocess the sentence-aligned parallel corpus samples to obtain parallel corpus pairs; invoke, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the original-text corpus, and segment the original-text corpus in the parallel corpus pairs to obtain the sentence fragment group of the sample original text; invoke, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the translated-text corpus, and segment the translated-text corpus in the parallel corpus pairs to obtain the sentence fragment group of the sample translated text; apply format processing to the sentence fragment group of the sample original text and the sentence fragment group of the sample translated text to obtain a bilingual sentence pair sample group; and align the bilingual sentence pair sample group with a bilingual word pair extraction algorithm to obtain the bilingual dictionary.
In one embodiment, the format processing module 350 is further configured to: obtain a sentence fragment group to be formatted; detect the underline characters in the sentence fragment group; and remove the detected underline characters from the sentence fragment group.
In one embodiment, the format processing module 350 is further configured to: obtain a sentence fragment group to be formatted and its corresponding language type; determine, according to the language type of the sentence fragment group, whether the sentence fragment group is a format processing object; and, when the sentence fragment group is a format processing object, detect the underline characters in the sentence fragment group and remove the detected underline characters from the sentence fragment group.
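Both variants of the format processing can be sketched in Python as below. The sketch assumes that the "underline" refers to the whitespace marker that SentencePiece prepends to pieces (U+2581, which looks like an underscore) and, to be safe, also strips the ASCII underscore. The function name, the `target_langs` parameter, and the decision to drop pieces that become empty are illustrative choices, not taken from the source.

```python
from typing import Iterable, List, Optional, Set, Tuple

SP_MARKER = "\u2581"  # SentencePiece whitespace marker, rendered like an underscore


def strip_markers(
    pieces: Iterable[str],
    lang: Optional[str] = None,
    target_langs: Optional[Set[str]] = None,  # None: process every language
    markers: Tuple[str, ...] = (SP_MARKER, "_"),
) -> List[str]:
    """Remove underline markers from a sentence fragment group.

    When `target_langs` is given, only fragment groups whose language is in
    that set are treated as format processing objects; others pass through.
    """
    if target_langs is not None and lang not in target_langs:
        return list(pieces)
    cleaned: List[str] = []
    for piece in pieces:
        for marker in markers:
            piece = piece.replace(marker, "")
        if piece:  # drop pieces that consisted only of markers
            cleaned.append(piece)
    return cleaned
```

For example, `strip_markers(["\u2581我", "喜欢"], lang="zh", target_langs={"zh", "ja"})` returns `["我", "喜欢"]`, while an English fragment group passed with the same `target_langs` is returned unchanged.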
In one embodiment, the bilingual corpus sentence alignment device further includes a corpus filtering module 390, configured to filter the sentence-aligned parallel corpus based on preset filtering conditions, so as to obtain a filtered sentence-aligned parallel corpus.
In one embodiment, the corpus filtering module 390 is further configured to: check whether empty sentences exist in the sentence-aligned parallel corpus and filter them out; filter out of the sentence-aligned parallel corpus the sentence pairs whose score is below a preset value; filter out of the sentence-aligned parallel corpus, according to the language type of the original text and the language type of the translated text, the sentences whose language does not match; and filter out of the sentence-aligned parallel corpus the sentence pairs that do not satisfy the number-consistency feature.
In one embodiment, the bilingual corpus sentence alignment device further includes an adding module 400, configured to add the sentence-aligned parallel corpus to the sentence-aligned parallel corpus library.
FIG. 9 illustrates an internal block diagram of a computer device in one embodiment. The computer device may specifically be the terminal 110 (or the server 120) in fig. 1. As shown in fig. 9, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system, and may also store a computer program that, when executed by the processor, causes the processor to implement the bilingual corpus sentence alignment method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the bilingual corpus sentence alignment method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen; keys, a trackball, or a touch pad arranged on the housing of the computer device; or an external keyboard, touch pad, mouse, or the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 9 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements are applied, and that a particular computer device may include more or fewer components than shown in fig. 9, or may combine certain components, or have a different arrangement of components.
In one embodiment, the bilingual corpus sentence alignment device provided by the present application may be implemented in the form of a computer program, which may be executed on a computer device as shown in fig. 9. The memory of the computer device may store the program modules constituting the bilingual corpus sentence alignment device, for example, the parallel text acquisition module 310, the preprocessing module 320, the first word segmentation processing module 330, the second word segmentation processing module 340, the format processing module 350, the bilingual dictionary acquisition module 360, and the sentence alignment processing module 370 shown in fig. 7. The computer program constituted by these program modules causes the processor to execute the steps of the bilingual corpus sentence alignment method of the embodiments of the present application described in this specification.
For example, the computer device shown in fig. 9 may execute step S220 through the parallel text acquisition module 310 in the bilingual corpus sentence alignment device shown in fig. 7. The computer device may perform step S240 through the preprocessing module 320. The computer device may perform step S260 through the first word segmentation processing module 330. The computer device may perform step S280 through the second word segmentation processing module 340. The computer device may perform step S300 through the format processing module 350. The computer device may perform step S320 through the bilingual dictionary acquisition module 360. The computer device may perform step S340 through the sentence alignment processing module 370.
In one embodiment, a computer device is provided that includes a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the bilingual corpus alignment method described above. The step of the bilingual corpus sentence alignment method herein may be a step in the bilingual corpus sentence alignment method of the above-described respective embodiments.
In one embodiment, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the bilingual corpus alignment method described above. The step of the bilingual corpus sentence alignment method herein may be a step in the bilingual corpus sentence alignment method of the above-described respective embodiments.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The foregoing examples illustrate only a few embodiments of the application. They are described specifically and in detail, but should not therefore be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (16)
1. A bilingual corpus sentence alignment method, characterized by comprising the following steps:
acquiring a parallel text to be aligned, together with the language type of the original text and the language type of the translated text in the parallel text to be aligned;
preprocessing the parallel text to be aligned to obtain parallel sentence pairs to be aligned;
invoking, from a monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the original text, and performing word segmentation processing on the original text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the original text to be aligned;
invoking, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the translated text, and performing word segmentation processing on the translated text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the translated text to be aligned;
performing format processing on the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned according to a preset format processing mode to obtain a bilingual sentence pair group;
obtaining, based on the preset format processing mode, a bilingual dictionary corresponding to the language type of the original text and the language type of the translated text; wherein the language types of the bilingual dictionary comprise the language type of the original text and the language type of the translated text, and the sentence fragments in the bilingual dictionary are processed with the preset format processing mode in the same manner as the original text to be aligned and the translated text to be aligned;
invoking a sentence alignment tool, and performing sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain a sentence-aligned parallel corpus;
wherein the training mode of the monolingual word segmentation model comprises the following steps:
acquiring monolingual data corresponding to the language type of the monolingual word segmentation model to be trained;
preprocessing the monolingual data to obtain monolingual data samples;
and training on the monolingual data samples with the SentencePiece algorithm to obtain the monolingual word segmentation model.
2. The method of claim 1, wherein the training mode of the bilingual dictionary comprises:
obtaining, from a sentence-aligned parallel corpus library, sentence-aligned parallel corpus samples corresponding to the language types of a bilingual dictionary to be trained, wherein the language types of the bilingual dictionary to be trained comprise the language type of the original-text corpus and the language type of the translated-text corpus;
preprocessing the sentence-aligned parallel corpus samples to obtain parallel corpus pairs;
invoking, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the original-text corpus, and performing word segmentation on the original-text corpus in the parallel corpus pairs to obtain a sentence fragment group of a sample original text;
invoking, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the translated-text corpus, and performing word segmentation on the translated-text corpus in the parallel corpus pairs to obtain a sentence fragment group of a sample translated text;
performing format processing on the sentence fragment group of the sample original text and the sentence fragment group of the sample translated text according to the preset format processing mode to obtain a bilingual sentence pair sample group;
and aligning the bilingual sentence pair sample group through a bilingual word pair extraction algorithm to obtain the bilingual dictionary.
3. The method according to claim 1 or 2, wherein the preset format processing mode comprises:
obtaining a sentence fragment group to be formatted;
and detecting the underline characters in the sentence fragment group, and removing the detected underline characters from the sentence fragment group.
4. The method according to claim 1 or 2, wherein the preset format processing mode comprises:
acquiring a sentence fragment group to be formatted and its corresponding language type;
determining, according to the language type of the sentence fragment group, whether the sentence fragment group is a format processing object;
and, when the sentence fragment group is a format processing object, detecting the underline characters in the sentence fragment group and removing the detected underline characters from the sentence fragment group.
5. The method of claim 1, wherein, after the step of invoking a sentence alignment tool and performing sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain a sentence-aligned parallel corpus, the method further comprises:
filtering the sentence-aligned parallel corpus based on preset filtering conditions to obtain a filtered sentence-aligned parallel corpus.
6. The method of claim 5, wherein the preset filtering conditions include at least one of:
checking whether empty sentences exist in the sentence-aligned parallel corpus, and filtering out the empty sentences in the sentence-aligned parallel corpus;
filtering out, according to a preset value, the sentence pairs in the sentence-aligned parallel corpus whose score is smaller than the preset value;
filtering out, according to the language type of the original text and the language type of the translated text, the sentences in the sentence-aligned parallel corpus whose language type does not match;
and filtering out, according to a number-consistency feature, the sentence pairs in the sentence-aligned parallel corpus that do not satisfy the number-consistency feature.
7. The method as recited in claim 2, further comprising:
adding the sentence-aligned parallel corpus to the sentence-aligned parallel corpus library.
8. A bilingual corpus sentence alignment apparatus, comprising:
a parallel text acquisition module, configured to acquire a parallel text to be aligned, together with the language type of the original text and the language type of the translated text in the parallel text to be aligned;
a preprocessing module, configured to preprocess the parallel text to be aligned to obtain parallel sentence pairs to be aligned;
a first word segmentation processing module, configured to invoke, from a monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the original text, and to perform word segmentation processing on the original text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the original text to be aligned;
a second word segmentation processing module, configured to invoke, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the translated text, and to perform word segmentation processing on the translated text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the translated text to be aligned;
a format processing module, configured to perform format processing on the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned according to a preset format processing mode to obtain a bilingual sentence pair group;
a bilingual dictionary acquisition module, configured to obtain, based on the preset format processing mode, a bilingual dictionary corresponding to the language type of the original text and the language type of the translated text; wherein the language types of the bilingual dictionary comprise the language type of the original text and the language type of the translated text, and the sentence fragments in the bilingual dictionary are processed with the preset format processing mode in the same manner as the original text to be aligned and the translated text to be aligned;
a sentence alignment processing module, configured to invoke a sentence alignment tool and perform sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain a sentence-aligned parallel corpus;
wherein the training mode of the monolingual word segmentation model comprises the following steps:
acquiring monolingual data corresponding to the language type of the monolingual word segmentation model to be trained;
preprocessing the monolingual data to obtain monolingual data samples;
and training on the monolingual data samples with the SentencePiece algorithm to obtain the monolingual word segmentation model.
9. The apparatus of claim 8, further comprising a bilingual dictionary training module configured to perform:
obtaining, from a sentence-aligned parallel corpus library, sentence-aligned parallel corpus samples corresponding to the language types of a bilingual dictionary to be trained, wherein the language types of the bilingual dictionary to be trained comprise the language type of the original-text corpus and the language type of the translated-text corpus;
preprocessing the sentence-aligned parallel corpus samples to obtain parallel corpus pairs;
invoking, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the original-text corpus, and performing word segmentation on the original-text corpus in the parallel corpus pairs to obtain a sentence fragment group of a sample original text;
invoking, from the monolingual word segmentation model group, the monolingual word segmentation model corresponding to the language type of the translated-text corpus, and performing word segmentation on the translated-text corpus in the parallel corpus pairs to obtain a sentence fragment group of a sample translated text;
performing format processing on the sentence fragment group of the sample original text and the sentence fragment group of the sample translated text according to the preset format processing mode to obtain a bilingual sentence pair sample group;
and aligning the bilingual sentence pair sample group through a bilingual word pair extraction algorithm to obtain the bilingual dictionary.
10. The apparatus according to claim 8 or 9, wherein the format processing module is specifically configured to:
obtaining sentence fragment groups to be formatted;
and detecting the underline character in the sentence fragment group, and removing the detected underline character from the sentence fragment group.
11. The apparatus according to claim 8 or 9, wherein the format processing module is specifically configured to:
acquiring sentence fragment groups to be formatted and corresponding language types;
determining whether the sentence fragment group belongs to a format processing object according to the language type of the sentence fragment group;
and detecting the underline character in the sentence fragment group when the sentence fragment group belongs to a format processing object, and removing the detected underline character from the sentence fragment group.
12. The apparatus of claim 8, further comprising a corpus filtering module configured to:
and filtering the sentence pair flush line corpus based on preset filtering conditions to obtain filtered sentence pair flush line corpus.
13. The apparatus of claim 12, wherein the corpus filtering module is specifically configured to:
analyze whether empty sentences exist in the sentence-aligned parallel corpus, and filter out the empty sentences;
filter out sentences in the sentence-aligned parallel corpus whose scores are smaller than a preset value;
filter out sentences in the sentence-aligned parallel corpus whose language types are inconsistent with the language type of the original text and the language type of the translated text;
and filter out sentences in the sentence-aligned parallel corpus that do not satisfy the numeric (digital) equality feature.
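The four filtering conditions of claim 13 can be sketched as a single pass over scored sentence pairs. This is an illustrative sketch under stated assumptions: the `SentencePair` type, the `min_score` parameter, and the `detect_lang` callback are hypothetical names, the numeric equality feature is interpreted here as "both sides contain the same multiset of digit strings", and `toy_detect_lang` is a deliberately crude placeholder for a real language identifier.

```python
import re
from typing import Callable, Iterable, List, NamedTuple


class SentencePair(NamedTuple):
    source: str   # original-text sentence
    target: str   # translated-text sentence
    score: float  # alignment score produced upstream


_DIGITS = re.compile(r"\d+")


def numbers_match(source: str, target: str) -> bool:
    """Assumed reading of the numeric equality feature: same digit strings."""
    return sorted(_DIGITS.findall(source)) == sorted(_DIGITS.findall(target))


def filter_pairs(
    pairs: Iterable[SentencePair],
    src_lang: str,
    tgt_lang: str,
    detect_lang: Callable[[str], str],
    min_score: float,
) -> List[SentencePair]:
    kept = []
    for pair in pairs:
        if not pair.source.strip() or not pair.target.strip():
            continue  # empty sentence on either side
        if pair.score < min_score:
            continue  # score smaller than the preset value
        if detect_lang(pair.source) != src_lang or detect_lang(pair.target) != tgt_lang:
            continue  # language type inconsistent with the expected pair
        if not numbers_match(pair.source, pair.target):
            continue  # numbers differ between the two sides
        kept.append(pair)
    return kept


def toy_detect_lang(text: str) -> str:
    """Toy detector: 'zh' if any CJK unified ideograph is present, else 'en'."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"


pairs = [
    SentencePair("我有3个苹果", "I have 3 apples", 0.9),
    SentencePair("", "Empty", 0.9),
    SentencePair("你好", "Hello", 0.1),
    SentencePair("Hello", "Hello", 0.9),
    SentencePair("我有3个苹果", "I have 4 apples", 0.9),
]
kept = filter_pairs(pairs, "zh", "en", toy_detect_lang, min_score=0.5)
```

The checks are ordered from cheapest to most expensive so that trivially bad pairs are discarded before language detection runs.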
14. The apparatus of claim 9, further comprising an adding module configured to:
add the sentence-aligned parallel corpus into the sentence-aligned parallel corpus.
15. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of any one of claims 1 to 7.
16. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010084543.1A CN111259652B (en) | 2020-02-10 | 2020-02-10 | Bilingual corpus sentence alignment method and device, readable storage medium and computer equipment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111259652A (en) | 2020-06-09 |
| CN111259652B (en) | 2023-08-15 |
Family
ID=70949238
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010084543.1A Active CN111259652B (en) | 2020-02-10 | 2020-02-10 | Bilingual corpus sentence alignment method and device, readable storage medium and computer equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111259652B (en) |
Families Citing this family (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111723587A (en) * | 2020-06-23 | 2020-09-29 | 桂林电子科技大学 | A Chinese-Thai entity alignment method for cross-language knowledge graph |
| CN111753556B (en) * | 2020-06-24 | 2022-01-04 | 掌阅科技股份有限公司 | Bilingual comparison reading method, terminal and computer storage medium |
| CN112347757A (en) * | 2020-10-12 | 2021-02-09 | 四川语言桥信息技术有限公司 | Parallel corpus alignment method, device, equipment and storage medium |
| CN112446224B (en) * | 2020-12-07 | 2024-12-10 | 北京彩云环太平洋科技有限公司 | Parallel corpus processing method, device, equipment and computer readable storage medium |
| CN114118112A (en) * | 2021-12-02 | 2022-03-01 | 江苏省舜禹信息技术有限公司 | A merging method for bilingual merging documents |
| CN115033753A (en) * | 2022-06-17 | 2022-09-09 | 北京金山数字娱乐科技有限公司 | Training corpus construction method, text processing method and device |
| CN115587599B (en) * | 2022-09-16 | 2023-07-14 | 粤港澳大湾区数字经济研究院(福田) | Quality detection method and device for machine translation corpus |
| CN115422922A (en) * | 2022-09-26 | 2022-12-02 | 火星语盟(深圳)科技有限公司 | Device and method for acquiring and storing parallel corpus |
| CN115587590A (en) * | 2022-10-13 | 2023-01-10 | 北京金山数字娱乐科技有限公司 | Training corpus construction method, translation model training method, translation method |
| CN115658838B (en) * | 2022-11-18 | 2023-04-07 | 山东省地图院 | Map set data generation method and device, electronic equipment and storage medium |
| CN116028741B (en) * | 2022-12-23 | 2025-08-05 | 华中科技大学 | Method, system, device and medium for aligning bilingual text paragraphs within a single web page |
| CN116306518B (en) * | 2023-02-09 | 2025-09-12 | 网易(杭州)网络有限公司 | Method, device, equipment and storage medium for generating poetry |
| CN117474510B (en) * | 2023-12-25 | 2024-11-26 | 彩讯科技股份有限公司 | A spam filtering method based on feature selection |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101667177A (en) * | 2009-09-23 | 2010-03-10 | 清华大学 | Method and device for aligning bilingual text |
| CN109684648A (en) * | 2019-01-14 | 2019-04-26 | 浙江大学 | A kind of Chinese automatic translating method at all times of multiple features fusion |
| CN110210041A (en) * | 2019-05-23 | 2019-09-06 | 北京百度网讯科技有限公司 | The neat method, device and equipment of intertranslation sentence pair |
| CN110263350A (en) * | 2019-03-08 | 2019-09-20 | 腾讯科技(深圳)有限公司 | Model training method, device, computer readable storage medium and computer equipment |
| CN110263349A (en) * | 2019-03-08 | 2019-09-20 | 腾讯科技(深圳)有限公司 | Corpus assessment models training method, device, storage medium and computer equipment |
| CN110334360A (en) * | 2019-07-08 | 2019-10-15 | 腾讯科技(深圳)有限公司 | Machine translation method and device, electronic equipment and storage medium |
| CN110362820A (en) * | 2019-06-17 | 2019-10-22 | 昆明理工大学 | A kind of bilingual parallel sentence extraction method of old man based on Bi-LSTM algorithm |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111259652A (en) | 2020-06-09 |
Similar Documents
| Publication | Title |
|---|---|
| CN111259652B (en) | Bilingual corpus sentence alignment method and device, readable storage medium and computer equipment |
| KR101435265B1 (en) | Method for disambiguating multiple readings in language conversion |
| CN107341143B (en) | Sentence continuity judgment method and device and electronic equipment |
| KR101500617B1 (en) | Method and system for Context-sensitive Spelling Correction Rules using Korean WordNet |
| US20130197896A1 (en) | Resolving out-of-vocabulary words during machine translation |
| KR101509727B1 (en) | Apparatus for creating alignment corpus based on unsupervised alignment and method thereof, and apparatus for performing morphological analysis of non-canonical text using the alignment corpus and method thereof |
| CN102262621A (en) | Device and method for checking translated text |
| US8880391B2 (en) | Natural language processing apparatus, natural language processing method, natural language processing program, and computer-readable recording medium storing natural language processing program |
| US10120843B2 (en) | Generation of parsable data for deep parsing |
| US11663408B1 (en) | OCR error correction |
| Simoes et al. | Language identification: a neural network approach |
| US20250363302A1 (en) | Mapping entities in unstructured text documents via entity correction and entity resolution |
| JP6626917B2 (en) | Readability evaluation method and system based on English syllable calculation method |
| Mon et al. | SymSpell4Burmese: Symmetric delete spelling correction algorithm (SymSpell) for burmese spelling checking |
| US20150106698A1 (en) | Systems and methods to segment text for layout and rendering |
| Lehal et al. | Sangam: A Perso-Arabic to Indic script machine transliteration model |
| CN109960812B (en) | Language processing method and device |
| CN104933030A (en) | Uygur language spelling examination method and device |
| Hocking et al. | Optical character recognition for South African languages |
| JPS59165179A (en) | Dictionary look-up system |
| HK40024736A (en) | Bilingual corpus sentence alignment method and apparatus, readable storage medium and computer device |
| HK40024736B (en) | Bilingual corpus sentence alignment method and apparatus, readable storage medium and computer device |
| Lu et al. | Language model for Mongolian polyphone proofreading |
| KR100910275B1 (en) | Method and apparatus for automatic extraction of tuning fork band pairs in bilingual documents |
| Anuradha et al. | Estimating the effects of text genre, image resolution and algorithmic complexity needed for sinhala optical character recognition |
Legal Events
| Code | Title | Description |
|---|---|---|
| PB01 | Publication | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40024736; Country of ref document: HK |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |