Disclosure of Invention
The invention provides an intelligent translation method and system based on big data, which effectively connect the parts of speech of words in an English text with their contexts, thereby improving the accuracy of subsequent translation of the English text.
In order to solve the above technical problems, the present invention provides an intelligent translation method based on big data, which is executed by a computer and includes:
acquiring initial English text data and dividing sentences to obtain English sentences and English words;
counting the occurrence frequency of the English words, and calculating to obtain a first vocabulary weight;
calculating the semantic association degree of the English word, and calculating a second vocabulary weight according to the first vocabulary weight and the semantic association degree;
inputting the English words into a machine learning model to calculate an offset value, calculating an offset weight according to the offset value and a preset knowledge graph, and calculating a third vocabulary weight according to the offset weight and the second vocabulary weight;
calculating part-of-speech confidence of the word through a preset part-of-speech probability, and calculating to obtain a final vocabulary weight according to the part-of-speech confidence and the third vocabulary weight;
generating a text vector of the English sentence according to the final vocabulary weight;
and inputting the text vector to a preset translation decoder to generate a translation result of the target language.
As an optional implementation manner, the counting of the occurrence frequency of the English words and the calculating of the first vocabulary weight include:
counting the occurrence frequency of the English words;
According to the occurrence frequency, calculating the word liveness, wherein the calculation formula is as follows:
a_j = 1 / f_j
wherein a_j is the word liveness and f_j is the frequency of occurrence;
According to the word liveness, calculating to obtain a first vocabulary weight, wherein the calculation formula is as follows:
w_first = a_j · α
wherein w_first is the first vocabulary weight and α is a liveness adjustment parameter with value range [0.1, 10].
As an optional implementation manner, the calculating of the semantic association degree of the English word and the calculating of the second vocabulary weight according to the first vocabulary weight and the semantic association degree include:
Calculating the semantic association degree of the English word, wherein the calculation formula is as follows:
r_j = (1/m) · Σ_{k=1}^{m} sim(v_j, v_k)
wherein r_j is the semantic association degree of the jth English word, sim(v_j, v_k) represents the cosine similarity between v_j and v_k, v_j is the jth English word, v_k is the kth English word, and m is the total number of English words;
And calculating a second vocabulary weight of the English word according to the semantic association degree, wherein the calculation formula is as follows:
w_second = w_first · (1 + r_j)
wherein w_second is the second vocabulary weight and w_first is the first vocabulary weight.
As an optional implementation manner, the step of inputting the English words into a machine learning model to calculate offset values, calculating an offset weight according to the offset values and a preset knowledge graph, and calculating a third vocabulary weight according to the offset weight and the second vocabulary weight includes:
inputting the English words into a machine learning model to calculate the offset value of each word;
calculating an offset weight according to the offset values and a preset knowledge graph, wherein the preset knowledge graph comprises the characteristic importance weights and the number of features of the English words, and the offset weight is calculated as follows:
θ = (Σ_{i=1}^{n} δ_i · K_i) / (Σ_{i=1}^{n} K_i)
wherein θ is the offset weight, δ_i is the offset value of the ith feature, K_i is the corresponding characteristic importance weight with value range [0, 1], and n is the number of features;
And combining a preset knowledge graph to obtain a third vocabulary weight, wherein the third vocabulary weight is calculated in the following manner:
w_third = w_second · (θ · β + (1 - θ) · γ)
wherein w_third is the third vocabulary weight, β is a first offset adjustment parameter with value range [0.5, 1.5], and γ is a second offset adjustment parameter with value range [0.5, 2.0].
As an optional implementation manner, calculating the part-of-speech confidence of the word through the preset part-of-speech probability, and calculating to obtain the final vocabulary weight according to the part-of-speech confidence and the third vocabulary weight, including:
Calculating part-of-speech confidence of a word through preset part-of-speech probabilities, wherein the preset part-of-speech probabilities comprise part-of-speech occurrence probabilities and part-of-speech total probabilities, and the part-of-speech confidence is calculated according to the formula:
conf_pos = F_pos / F_total
wherein conf_pos is the part-of-speech confidence, F_pos is the part-of-speech occurrence probability, and F_total is the total part-of-speech probability, representing the sum of the part-of-speech occurrence probabilities of all words in an English sentence;
and calculating according to the part-of-speech confidence to obtain a final vocabulary weight, wherein the calculation formula is as follows:
w_final = w_third · (1 + conf_pos)
wherein w_final is the final vocabulary weight and w_third is the third vocabulary weight.
As an optional implementation manner, the generating of the text vector of the English sentence according to the final vocabulary weight includes:
calculating the sentence weight of the English sentence from the final vocabulary weights, wherein the calculation formula is as follows:
S_i = (1/n_i) · Σ_{j=1}^{n_i} w_final,j
wherein S_i is the sentence weight of the ith English sentence, w_final,j is the final vocabulary weight of the jth word in the ith English sentence, and n_i is the total number of words in the ith English sentence, representing the sentence length;
and generating a text vector through normalization over the sentence length and the text length, wherein the calculation formula is as follows:
V_text = (1/p) · Σ_{i=1}^{p} S_i
wherein V_text is the text vector and p is the total number of sentences in the text, representing the text length.
As an optional implementation manner, the inputting of the text vector into a preset translation decoder to generate a translation result of the target language includes:
according to the received text vector, calculating an initial context representation through an attention mechanism, generating a probability distribution of the first word by the decoder, and selecting the word with the highest probability to add to an initially empty target language text sequence;
calculating the correlation between the currently input text vector and the generated target language text sequence through the attention mechanism to obtain a weighted context representation, so as to update the current context representation;
based on the current context representation, generating, by the decoder, the next word and a probability distribution representing its selection probability;
selecting the word with the highest probability from the probability distribution and adding it to the target language text sequence;
and when the generated word is an ending symbol, the decoder stops generating and outputs the final target language text.
In a second aspect, the present invention also provides an intelligent translation system based on big data, including:
The data acquisition module is used for acquiring initial English text data and dividing sentences to obtain English sentences and English words;
the frequency calculation module is used for counting the occurrence frequency of the English words and calculating a first vocabulary weight;
the semantic association module is used for calculating the semantic association of the English word and calculating a second vocabulary weight according to the first vocabulary weight and the semantic association;
The knowledge graph fusion module is used for inputting the English words into a machine learning model to calculate an offset value, calculating an offset weight according to the offset value and a preset knowledge graph, and calculating a third vocabulary weight according to the offset weight and the second vocabulary weight;
The part-of-speech confidence optimization module is used for calculating the part-of-speech confidence of the word through the preset part-of-speech probability and calculating to obtain a final vocabulary weight according to the part-of-speech confidence and the third vocabulary weight;
The text vector generation module is used for generating the text vector of the English sentence according to the final vocabulary weight;
And the translation generation module is used for generating a translation result of the target language based on the text vector.
In a third aspect, the present invention also provides an electronic device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor implements the big data based intelligent translation method of any of the above when executing the computer program.
In a fourth aspect, the present invention further provides a computer readable storage medium, where the computer readable storage medium includes a stored computer program, and when the computer program runs, the device on which the computer readable storage medium is located is controlled to execute the intelligent translation method based on big data according to any one of the above.
Compared with the prior art, the invention has the following beneficial effects:
The invention converts English text into target language text that is contextually and semantically accurate through a series of calculation steps, aiming to enhance the translation system's understanding of the source text and the quality of its translations. By considering the contextual association degree of the vocabulary, translation accuracy is improved. Part-of-speech variations and ambiguities are better handled through the computation of part-of-speech confidence. Additional semantic information is provided by the knowledge graph, which enhances the semantic understanding of the translation. The whole process is automatic, which improves translation efficiency.
According to the intelligent English translation method based on big data, pre-translation preprocessing is performed on the English text in the vocabulary dimension and the sentence dimension, and the parts of speech of words in the English text are effectively connected with their contexts, thereby improving the accuracy of subsequent translation. Through steps such as encoder processing and vocabulary set formation, the text vector of the English text can be obtained, and the English text is translated according to the text vector, which improves translation efficiency and accuracy.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to solve the above technical problems, referring to fig. 1, a first embodiment of the present invention provides an intelligent translation method based on big data, which includes the following steps:
S11, acquiring initial English text data and dividing sentences to obtain English sentences and English words;
S12, counting the occurrence frequency of the English words, and calculating to obtain first vocabulary weight;
S13, calculating the semantic association degree of the English words, and calculating a second vocabulary weight according to the first vocabulary weight and the semantic association degree;
S14, inputting the English words into a machine learning model to calculate an offset value, calculating an offset weight according to the offset value and a preset knowledge graph, and calculating a third vocabulary weight according to the offset weight and the second vocabulary weight;
S15, calculating part-of-speech confidence of the word through a preset part-of-speech probability, and calculating a final vocabulary weight according to the part-of-speech confidence and the third vocabulary weight;
S16, generating text vectors of the English sentences according to the final vocabulary weights;
S17, inputting the text vector to a preset translation decoder, and generating a translation result of the target language.
According to the intelligent English translation method based on big data, pre-translation preprocessing is performed on the English text in the vocabulary dimension and the sentence dimension, and the parts of speech of words in the English text are effectively connected with their contexts, thereby improving the accuracy of subsequent translation. Through steps such as encoder processing and vocabulary set formation, the text vector of the English text can be obtained, and the English text is translated according to the text vector, which improves translation efficiency and accuracy.
In step S11, it is necessary to acquire the initial English text data and perform sentence division to obtain English sentences and English words.
First, a large amount of English text data may be obtained from an English corpus. The corpus covers multiple domains such as literature, news, and science and technology, contains English text amounting to billions of words, and thus ensures the diversity and comprehensiveness of the corpus. After the corpus data are obtained, preprocessing is needed: non-English characters, numbers, punctuation, and the like are removed from the text using regular expressions and similar methods, and all letters are converted to lower case. Preprocessing yields text data composed of pure English words.
Next, sentence division is performed on the text data using a rule-based method: the text is cut into individual sentences by identifying terminal punctuation marks such as periods, question marks, and exclamation marks. On this basis, word segmentation is performed on the divided sentence data. Commonly used English word segmentation algorithms include dictionary-based methods, such as the maximum matching algorithm and the minimum word count algorithm, and statistical machine learning methods, such as conditional random fields (CRF) and hidden Markov models (HMM). In this embodiment, a dictionary-based maximum matching algorithm is adopted: forward and reverse maximum matching segmentation is performed on each sentence using a pre-built English dictionary to obtain the English word sequence contained in each sentence. On the basis of word segmentation, the word data are also normalized: words are converted into stem or root forms, and stop words such as "the", "a", and "an" are removed to obtain normalized English word data.
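For illustration, a minimal Python sketch of this preprocessing stage is given below. It splits sentences on terminal punctuation and uses a tiny stop-word list and a crude suffix-stripping normalizer; these stand in for the dictionary-based maximum matching segmentation and full stemming described above, and all names are assumptions for the example.

```python
import re

STOP_WORDS = {"the", "a", "an"}  # assumed minimal stop-word list

def split_sentences(text: str) -> list[str]:
    # Cut the text into sentences at periods, question marks, and exclamation marks.
    return [s.strip() for s in re.split(r"[.?!]", text) if s.strip()]

def normalize_word(word: str) -> str:
    # Crude stand-in for stemming: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[list[str]]:
    # Remove non-English characters, lowercase, then split into sentences and words.
    text = re.sub(r"[^A-Za-z.?! ]+", " ", text).lower()
    return [
        [normalize_word(w) for w in s.split() if w not in STOP_WORDS]
        for s in split_sentences(text)
    ]

print(preprocess("The quick brown fox jumps over the lazy dog. It runs!"))
```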
Finally, the sentence data and the corresponding word data are mapped to each other to form sentence-word correspondence data. The data were verified and statistically analyzed: among 1,000,000 processed sentences, the average sentence length is 20 words, the longest sentence contains 150 words, the shortest sentence contains only 1 word, and the sentence lengths follow a normal distribution. Meanwhile, the data contain 500,000 distinct English words; the most frequent word is "the", which occurs 1,000,000 times, while 200,000 words occur only once. Through this series of processing, high-quality English sentence and English word data are finally obtained.
In step S12, the counting of the occurrence frequency of the English words and the calculating of the first vocabulary weight include:
counting the occurrence frequency of the English words;
According to the occurrence frequency, calculating the word liveness, wherein the calculation formula is as follows:
a_j = 1 / f_j
wherein a_j is the word liveness and f_j is the frequency of occurrence;
According to the word liveness, calculating to obtain a first vocabulary weight, wherein the calculation formula is as follows:
w_first = a_j · α
wherein w_first is the first vocabulary weight and α is a liveness adjustment parameter with value range [0.1, 10].
Note that, in this example, the liveness of a word is defined as the reciprocal of its frequency of occurrence. This means that the lower the frequency of occurrence of a word, the higher its liveness. This may seem counterintuitive, since high-frequency words are usually regarded as more "active"; in the context of translation, however, this definition is intended to emphasize words that are uncommon in a particular text but carry specific meaning. The formula then adjusts the word liveness through the liveness adjustment parameter to obtain the first vocabulary weight, which reflects the importance of the word in the particular text. The adjustment parameter allows the system to tune word weights according to the specific needs of the translation task, which increases the flexibility of the method and lets it adapt to different translation scenarios and requirements. This approach emphasizes the importance of words in a particular context rather than relying solely on their frequencies in a large corpus. By considering the liveness and weight of words, the translation system can capture the intention and emotion of the original text more accurately, thereby improving the naturalness and accuracy of the translation.
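A minimal sketch of this step in Python, assuming α = 1.0 as an illustrative value for the liveness adjustment parameter (function names are ours, not the patent's):

```python
from collections import Counter

def first_vocab_weights(words: list[str], alpha: float = 1.0) -> dict[str, float]:
    # alpha is the liveness adjustment parameter, value range [0.1, 10].
    freq = Counter(words)  # f_j: occurrence frequency of each word
    # a_j = 1 / f_j; w_first = a_j * alpha
    return {w: (1.0 / f) * alpha for w, f in freq.items()}

words = ["quick", "brown", "fox", "quick"]
print(first_vocab_weights(words))  # "quick" occurs twice, so its weight is halved
```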
In step S13, calculating the semantic association degree of the English words and calculating a second vocabulary weight according to the first vocabulary weight and the semantic association degree includes:
Calculating the semantic association degree of the English word, wherein the calculation formula is as follows:
r_j = (1/m) · Σ_{k=1}^{m} sim(v_j, v_k)
wherein r_j is the semantic association degree of the jth English word, sim(v_j, v_k) represents the cosine similarity between v_j and v_k, v_j is the jth English word, v_k is the kth English word, and m is the total number of English words;
And calculating a second vocabulary weight of the English word according to the semantic association degree, wherein the calculation formula is as follows:
w_second = w_first · (1 + r_j)
wherein w_second is the second vocabulary weight and w_first is the first vocabulary weight.
It should be noted that cosine similarity is a method for measuring the similarity between text vectors: it reflects the similarity of two vectors in direction, regardless of their magnitudes. This approach can capture semantic relationships between words rather than relying on word frequency statistics alone. The formula calculates the second vocabulary weight by adding a term related to the semantic association degree r_j on top of the first vocabulary weight, thus taking into account both the statistical properties of the words (through w_first) and their semantic properties (through r_j), thereby providing the translation system with richer information.
In this embodiment, the translation system's semantic understanding of the text is enhanced by considering the semantic relations among words, and combining statistical and semantic information allows the intention of the original text to be captured more accurately, improving translation accuracy. For words with multiple meanings, this approach can determine the most appropriate translation based on the semantic relevance of the context, thus handling ambiguities better.
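A sketch of this computation using numpy; the toy word vectors below are assumed purely for illustration (in practice they would come from a trained embedding model), and r_j is taken as the mean cosine similarity of word j against all m words, per the formula above:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # sim(v_j, v_k): cosine of the angle between the two vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def second_vocab_weights(vectors: np.ndarray, w_first: np.ndarray) -> np.ndarray:
    m = len(vectors)
    # r_j = (1/m) * sum_k sim(v_j, v_k)
    r = np.array([sum(cosine_sim(vectors[j], vectors[k]) for k in range(m)) / m
                  for j in range(m)])
    return w_first * (1.0 + r)  # w_second = w_first * (1 + r_j)

vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])  # toy embeddings
print(second_vocab_weights(vecs, np.array([0.5, 1.0, 0.25])))
```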
In step S14, the step of inputting the English words into a machine learning model to calculate offset values, calculating an offset weight according to the offset values and a preset knowledge graph, and calculating a third vocabulary weight according to the offset weight and the second vocabulary weight includes:
inputting the English words into a machine learning model to calculate the offset value of each word;
calculating an offset weight according to the offset values and a preset knowledge graph, wherein the preset knowledge graph comprises the characteristic importance weights and the number of features of the English words, and the offset weight is calculated as follows:
θ = (Σ_{i=1}^{n} δ_i · K_i) / (Σ_{i=1}^{n} K_i)
wherein θ is the offset weight, δ_i is the offset value of the ith feature, K_i is the corresponding characteristic importance weight with value range [0, 1], and n is the number of features;
And combining a preset knowledge graph to obtain a third vocabulary weight, wherein the third vocabulary weight is calculated in the following manner:
w_third = w_second · (θ · β + (1 - θ) · γ)
wherein w_third is the third vocabulary weight, β is a first offset adjustment parameter with value range [0.5, 1.5], and γ is a second offset adjustment parameter with value range [0.5, 2.0].
Wherein, in this example, constructing the preset knowledge graph is a key step, which involves collecting and organizing entity and relationship information related to translation so as to provide additional semantic information and context understanding during the translation process. First, natural language text data are collected, preprocessed, cleaned, and labeled to prepare data for constructing the knowledge graph. Entities and relationships are extracted from the text data using natural language processing techniques (e.g., named entity recognition and relationship extraction). Then, entity-relation-entity (ERG) triples are constructed: the extracted entities and relationships are organized into triples to form the knowledge graph. These triples contain the types of entities, the relationships between entities, and related attributes or features. Characteristic importance weights are then determined, one for each entity and relationship in the knowledge graph; these may be determined through statistical analysis or machine learning models. For example, statistical methods such as correlation analysis and the chi-square test can be used to evaluate the relationship between different features and target variables (e.g., translation quality or information retrieval accuracy). From these analyses, it can be determined which features correlate strongly with the target variable, and they are assigned high weights accordingly. Alternatively, machine learning models such as support vector machines or random forests can be built to predict the importance of entities. During model training, feature importance weights may be determined from the importance scores of the features; for example, a random forest model provides importance scores that can be used to adjust the weights of entities in the knowledge graph.
The offset weight θ is calculated by multiplying each offset value δ_i by its characteristic importance weight K_i, summing these products, and dividing by the sum of all the characteristic importance weights. This method adjusts the influence of the offset values according to the characteristic importance of the word, thereby reflecting the meaning of the word in a specific context more accurately. Meanwhile, the third vocabulary weight is calculated by combining the second vocabulary weight with the offset weight; the first and second offset adjustment parameters allow the system to tune word weights according to the specific requirements of the translation task, which increases the flexibility of the method. This embodiment enhances the translation system's understanding of the text context by considering the offset values of words from the machine learning model together with the characteristic importance from the knowledge graph. Combining semantic association degree, offset values, and characteristic importance allows the intention of the original text to be captured more accurately, improving the naturalness and accuracy of the translation. In particular, for words with multiple meanings or domain-specific terms, the most appropriate translation can be determined according to the importance of their characteristics, thus better handling ambiguities and technical terminology.
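The two formulas of this step fit in a few lines; in the sketch below the offset values δ_i and importance weights K_i are assumed to be given (in practice δ_i would come from the machine learning model and K_i from the knowledge graph):

```python
import numpy as np

def offset_weight(delta: np.ndarray, k: np.ndarray) -> float:
    # theta = sum(delta_i * K_i) / sum(K_i), with each K_i in [0, 1]
    return float((delta * k).sum() / k.sum())

def third_vocab_weight(w_second: float, theta: float,
                       beta: float = 1.0, gamma: float = 1.0) -> float:
    # w_third = w_second * (theta * beta + (1 - theta) * gamma);
    # beta in [0.5, 1.5] and gamma in [0.5, 2.0] are the offset adjustment parameters.
    return w_second * (theta * beta + (1.0 - theta) * gamma)

delta = np.array([0.2, 0.6, 0.4])  # assumed offset values from the model
k = np.array([0.9, 0.5, 0.3])      # assumed characteristic importance weights
theta = offset_weight(delta, k)
print(theta, third_vocab_weight(1.2, theta, beta=1.2, gamma=0.8))
```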
In step S15, calculating a part-of-speech confidence of the word according to the preset part-of-speech probability, and calculating a final vocabulary weight according to the part-of-speech confidence and the third vocabulary weight, including:
Calculating part-of-speech confidence of a word through preset part-of-speech probabilities, wherein the preset part-of-speech probabilities comprise part-of-speech occurrence probabilities and part-of-speech total probabilities, and the part-of-speech confidence is calculated according to the formula:
conf_pos = F_pos / F_total
wherein conf_pos is the part-of-speech confidence, F_pos is the part-of-speech occurrence probability, and F_total is the total part-of-speech probability, representing the sum of the part-of-speech occurrence probabilities of all words in an English sentence;
and calculating according to the part-of-speech confidence to obtain a final vocabulary weight, wherein the calculation formula is as follows:
w_final = w_third · (1 + conf_pos)
wherein w_final is the final vocabulary weight and w_third is the third vocabulary weight.
It is worth noting that by calculating the part-of-speech confidence as the occurrence probability of a particular part of speech divided by the total probability over all parts of speech, the degree to which a word belongs to a particular part of speech can be quantified, providing the translation system with information about the word's part of speech. Meanwhile, the final vocabulary weight is calculated by adding a term related to the part-of-speech confidence on top of the third vocabulary weight, combining the semantic, statistical, and part-of-speech information of the words and thus providing the translation system with a more comprehensive vocabulary weight. For complex linguistic phenomena, such as part-of-speech variation and word ambiguity, this approach allows finer processing and better accommodates these phenomena. Considering part-of-speech information in the calculation of the final vocabulary weight optimizes the translation result so that it better conforms to the grammar and expression habits of the target language.
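A short sketch of this step, assuming the per-word part-of-speech occurrence probabilities are already available (e.g., from a tagger's statistics over the corpus):

```python
def final_vocab_weights(w_third: list[float], f_pos: list[float]) -> list[float]:
    # F_total: sum of the part-of-speech occurrence probabilities in the sentence
    f_total = sum(f_pos)
    # conf_pos = F_pos / F_total; w_final = w_third * (1 + conf_pos)
    return [w * (1.0 + f / f_total) for w, f in zip(w_third, f_pos)]

w_third = [0.9, 1.1, 0.4]  # third vocabulary weights of a sentence's words
f_pos = [0.6, 0.3, 0.1]    # assumed part-of-speech occurrence probabilities
print(final_vocab_weights(w_third, f_pos))
```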
In step S16, generating a text vector of the english sentence according to the final vocabulary weight, including:
calculating the sentence weight of the English sentence from the final vocabulary weights, wherein the calculation formula is as follows:
S_i = (1/n_i) · Σ_{j=1}^{n_i} w_final,j
wherein S_i is the sentence weight of the ith English sentence, w_final,j is the final vocabulary weight of the jth word in the ith English sentence, and n_i is the total number of words in the ith English sentence, representing the sentence length;
and generating a text vector through normalization over the sentence length and the text length, wherein the calculation formula is as follows:
V_text = (1/p) · Σ_{i=1}^{p} S_i
wherein V_text is the text vector and p is the total number of sentences in the text, representing the text length.
It should be noted that the sentence weight is calculated by summing the final vocabulary weights of all the words in the sentence and dividing by the total number of words, which comprehensively accounts for the importance of each word in the sentence and yields an overall weight for the sentence. The text vector is then generated by summing the weights of all sentences and dividing by the total number of sentences; through this normalization, the text vector reflects the central tendency of the whole text while reducing the influence of individual sentence lengths.
In the above formulas, the representation of the entire text is obtained by averaging the sentence weights, thereby taking into account the importance of each sentence in the text; the weights of all sentences are treated equally when calculating the average, regardless of their lengths. The final vocabulary weight of each word in the sentence is comprehensively considered, which helps capture the semantic information of the sentence. By normalizing over the sentence length and the text length, the text vector reflects the semantic features of the entire text more fairly, rather than being determined solely by the number or length of sentences.
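Both normalizations are simple averages, as the following sketch shows (the per-sentence weight lists are assumed inputs):

```python
def sentence_weight(w_final: list[float]) -> float:
    # S_i = (1 / n_i) * sum_j w_final,j
    return sum(w_final) / len(w_final)

def text_vector(sentences: list[list[float]]) -> float:
    # V_text = (1 / p) * sum_i S_i, averaging over the p sentences of the text
    return sum(sentence_weight(ws) for ws in sentences) / len(sentences)

weights = [[0.9, 1.4, 0.7], [1.1, 0.5]]  # final vocabulary weights for two sentences
print(text_vector(weights))
```

Note that under this formulation V_text reduces to a single scalar score for the whole text; a practical decoder would presumably consume it alongside richer per-word representations.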
In step S17, the text vector is input to a preset translation decoder, and a translation result of the target language is generated, including:
according to the received text vector, calculating an initial context representation through an attention mechanism, generating a probability distribution of the first word by the decoder, and selecting the word with the highest probability to add to an initially empty target language text sequence;
calculating the correlation between the currently input text vector and the generated target language text sequence through the attention mechanism to obtain a weighted context representation, so as to update the current context representation;
based on the current context representation, generating, by the decoder, the next word and a probability distribution representing its selection probability;
selecting the word with the highest probability from the probability distribution and adding it to the target language text sequence;
and when the generated word is an ending symbol, the decoder stops generating and outputs the final target language text.
In one implementation, the context representation is first calculated. According to the received text vector, calculating an initial context representation through an attention mechanism, generating a probability distribution of a first word by a decoder, selecting a word with the highest probability to be added into a blank target language text sequence, and calculating the correlation between the currently input text vector and the generated target language text sequence through the attention mechanism to obtain a weighted context representation so as to update the current context representation.
Illustratively, the decoder receives a text vector (or context vector) converted from the source language text; this vector is calculated from the word weights of the source language sentence and contains the overall semantic information of the sentence. The sequence of target language words generated so far is encoded into a vector representing the progress and context of the current translation. The encoding may use a Transformer architecture, in which the generated word sequence is encoded with a self-attention mechanism that captures global dependencies in the sequence. For example, assume the currently generated target language word sequence is [w1, w2, …, wn]; encoding it with the self-attention mechanism of the Transformer architecture yields an encoded vector sequence [E1, E2, …, En]. The correlation between the input text vector and the encoded vector sequence [E1, E2, …, En] is then computed with an attention mechanism to obtain a weighted context representation. This context representation is input to the output layer of the decoder (e.g., a softmax layer) to obtain the probability distribution of the next word.
This representation is key to the decoder's understanding of both the text generated so far and the original text vector. Next, the decoder generates the next word and its probability distribution based on the current context representation; this distribution reflects the probability of each word occurring given the context. The word with the highest probability is then selected from the distribution and added to the target language text sequence. This selection process is the core of how the decoder generates the translation, determining its fluency and accuracy. Finally, the ending symbol is checked: when the word generated by the decoder is an ending symbol (such as a period or question mark), generation stops and the final target language text is output. In this embodiment, the decoder takes context information into account when generating each word, which helps produce translations that better fit the context. By selecting words through the probability distribution, the decoder can better handle ambiguous words, improving translation accuracy, and it can adapt to different translation requirements, generating translations with different styles and tones. In addition, the whole translation process is automatic and requires no manual intervention, improving translation efficiency.
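A schematic greedy-decoding loop matching these steps is sketched below; the toy vocabulary, the attention stand-in, and the decoder_step stand-in are assumptions for illustration, not a real trained model:

```python
import numpy as np

VOCAB = ["<eos>", "敏捷", "棕色", "狐狸", "跳过", "懒", "狗"]  # toy Chinese vocabulary

def attention(text_vec: np.ndarray, generated: list[int]) -> np.ndarray:
    # Stand-in for the attention step: mix the source text vector with a
    # simple encoding of the generated sequence to update the context.
    return text_vec + np.full_like(text_vec, len(generated) * 0.1)

def decoder_step(context: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Stand-in for the decoder output layer: a softmax over the vocabulary.
    logits = rng.normal(size=len(VOCAB)) + context.mean()
    e = np.exp(logits - logits.max())
    return e / e.sum()

def greedy_decode(text_vec: np.ndarray, max_len: int = 20) -> list[str]:
    rng = np.random.default_rng(0)
    generated: list[int] = []
    for _ in range(max_len):
        context = attention(text_vec, generated)  # update context representation
        probs = decoder_step(context, rng)        # probability distribution over words
        word_id = int(probs.argmax())             # select the highest-probability word
        if VOCAB[word_id] == "<eos>":             # stop at the ending symbol
            break
        generated.append(word_id)
    return [VOCAB[i] for i in generated]

print(greedy_decode(np.array([0.3, 0.7])))
```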
Illustratively, assume there is an English sentence "The quick brown fox jumps over the lazy dog" to be translated into Chinese. First, the text vector of the sentence is input into the translation decoder. Context representation calculation: the decoder computes the initial context representation. Word generation: the decoder generates a first word, say the Chinese word corresponding to "quick", and gives a probability distribution over the following words. Word selection and addition: the decoder selects the highest-probability word, say the Chinese word corresponding to "brown", and adds it to the Chinese translation sequence. The decoder repeats this process until the translation of the entire sentence has been generated. Ending the translation: when the ending symbol is encountered, the decoder stops generating and outputs the complete Chinese translation of the sentence.
To facilitate an understanding of the present invention, the following is a specific example of a translation process:
Assume the original English text is:
"The quick brown fox jumps over the lazy dog."
The translation process is as follows:
Step 1, data acquisition and sentence division: the system first acquires the English text and divides it into sentences and words.
Step 2, word frequency statistics and first vocabulary weight: the occurrence frequency of each word is counted, the word liveness is calculated (liveness is defined as the reciprocal of the occurrence frequency), and the first vocabulary weight is calculated from the liveness and the liveness adjustment parameter α.
Step 3, semantic association degree and second vocabulary weight: the semantic association degree between words is calculated using cosine similarity, and the second vocabulary weight is calculated in combination with the first vocabulary weight.
Step 4, offset value and third vocabulary weight: the words are input into the machine learning model to calculate offset values, the offset weight is calculated in combination with the characteristic importance weights in the knowledge graph, and the third vocabulary weight is calculated from the offset weight and the offset adjustment parameters.
Step 5, part-of-speech confidence and final vocabulary weight: the part-of-speech confidence is calculated from the part-of-speech occurrence probability and the total part-of-speech probability, and is combined with the third vocabulary weight to obtain the final vocabulary weight.
Step 6, text vector generation: the sentence weight of each sentence is calculated from the final vocabulary weights, and the text vector is generated through normalization over the sentence length and the text length.
Step 7, translation decoding: the text vector is input into the translation decoder, which generates the next word and its probability distribution from the current context representation and adds the highest-probability word to the target language text sequence. This process repeats until an ending symbol is generated, and the final translation result is output (an end-to-end sketch follows the example below).
The hypothetical translation result is the Chinese translation of "The quick brown fox jumps over the lazy dog."
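Tying the per-step sketches above together, an end-to-end driver might look as follows; it assumes the functions from the earlier sketches are defined in the same module, and the embeddings, offsets, and part-of-speech probabilities are placeholder values, so the output is illustrative only:

```python
import numpy as np

def translate(text: str) -> list[str]:
    scores = []
    for words in preprocess(text):                  # step 1
        w1 = first_vocab_weights(words, alpha=1.0)  # step 2
        vecs = np.eye(len(words))                   # toy one-hot embeddings
        w2 = second_vocab_weights(vecs, np.array([w1[w] for w in words]))  # step 3
        theta = offset_weight(np.full(len(words), 0.5),                    # step 4
                              np.full(len(words), 0.8))
        w3 = [third_vocab_weight(float(w), theta) for w in w2]
        f_pos = [1.0 / len(words)] * len(words)     # uniform POS probabilities
        scores.append(final_vocab_weights(w3, f_pos))  # step 5
    v_text = text_vector(scores)                    # step 6
    return greedy_decode(np.array([v_text, v_text]))  # step 7

print(translate("The quick brown fox jumps over the lazy dog."))
```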
The present embodiment converts English text into target language text that is contextually and semantically accurate through a series of calculation steps, aiming to enhance the translation system's understanding of the source text and the quality of its translations. By considering the contextual association degree of the vocabulary, translation accuracy is improved. Part-of-speech variations and ambiguities are better handled through the computation of part-of-speech confidence. Additional semantic information is provided by the knowledge graph, which enhances the semantic understanding of the translation. The whole process is automatic, which improves translation efficiency.
In summary, an intelligent English translation method based on big data is provided: by preprocessing the English text before translation in the vocabulary dimension and the sentence dimension, the parts of speech of words in the English text are effectively connected with their contexts, thereby improving the accuracy of subsequent translation. Through steps such as encoder processing and vocabulary set formation, the text vector of the English text can be obtained, and the English text is translated according to the text vector, which improves translation efficiency and accuracy.
Referring to fig. 2, a second embodiment of the present invention provides an intelligent translation system based on big data, including:
The data acquisition module is used for acquiring initial English text data and dividing sentences to obtain English sentences and English words;
the frequency calculation module is used for counting the occurrence frequency of the English words and calculating a first vocabulary weight;
the semantic association module is used for calculating the semantic association of the English word and calculating a second vocabulary weight according to the first vocabulary weight and the semantic association;
The knowledge graph fusion module is used for inputting the English words into a machine learning model to calculate an offset value, calculating an offset weight according to the offset value and a preset knowledge graph, and calculating a third vocabulary weight according to the offset weight and the second vocabulary weight;
The part-of-speech confidence optimization module is used for calculating the part-of-speech confidence of the word through the preset part-of-speech probability and calculating to obtain a final vocabulary weight according to the part-of-speech confidence and the third vocabulary weight;
The text vector generation module is used for generating the text vector of the English sentence according to the final vocabulary weight;
And the translation generation module is used for generating a translation result of the target language based on the text vector.
Preferably, the frequency calculation module is configured to:
counting the occurrence frequency of the English words;
According to the occurrence frequency, calculating the word liveness, wherein the calculation formula is as follows:
a_j = 1 / f_j
wherein a_j is the word liveness and f_j is the frequency of occurrence;
According to the word liveness, calculating to obtain a first vocabulary weight, wherein the calculation formula is as follows:
w_first = a_j · α
wherein w_first is the first vocabulary weight and α is a liveness adjustment parameter.
Preferably, the semantic association module is configured to:
Calculating the semantic association degree of the English word, wherein the calculation formula is as follows:
r_j = (1/m) · Σ_{k=1}^{m} sim(v_j, v_k)
wherein r_j is the semantic association degree of the jth English word, sim(v_j, v_k) represents the cosine similarity between v_j and v_k, v_j is the jth English word, v_k is the kth English word, and m is the total number of English words;
And calculating a second vocabulary weight of the English word according to the semantic association degree, wherein the calculation formula is as follows:
w_second = w_first · (1 + r_j)
wherein w_second is the second vocabulary weight and w_first is the first vocabulary weight.
Preferably, the knowledge graph fusion module is configured to:
Inputting the English words into a machine learning model to calculate to obtain offset values of the words;
calculating an offset weight according to the offset values and a preset knowledge graph, wherein the preset knowledge graph comprises the characteristic importance weights and the number of features of the English words, and the offset weight is calculated as follows:
θ = (Σ_{i=1}^{n} δ_i · K_i) / (Σ_{i=1}^{n} K_i)
wherein θ is the offset weight, δ_i is the offset value of the ith feature, K_i is the corresponding characteristic importance weight, and n is the number of features;
And combining a preset knowledge graph to obtain a third vocabulary weight, wherein the third vocabulary weight is calculated in the following manner:
w_third = w_second · (θ · β + (1 - θ) · γ)
wherein w_third is the third vocabulary weight, β is a first offset adjustment parameter, and γ is a second offset adjustment parameter.
Preferably, the part-of-speech confidence optimization module is configured to:
Calculating part-of-speech confidence of a word through preset part-of-speech probabilities, wherein the preset part-of-speech probabilities comprise part-of-speech occurrence probabilities and part-of-speech total probabilities, and the part-of-speech confidence is calculated according to the formula:
conf_pos = F_pos / F_total
wherein conf_pos is the part-of-speech confidence, F_pos is the part-of-speech occurrence probability, and F_total is the total part-of-speech probability;
and calculating according to the part-of-speech confidence to obtain a final vocabulary weight, wherein the calculation formula is as follows:
w_final = w_third · (1 + conf_pos)
wherein w_final is the final vocabulary weight and w_third is the third vocabulary weight.
Preferably, the text vector generation module is configured to:
calculating the sentence weight of the English sentence from the final vocabulary weights, wherein the calculation formula is as follows:
S_i = (1/n_i) · Σ_{j=1}^{n_i} w_final,j
wherein S_i is the sentence weight of the ith English sentence, w_final,j is the final vocabulary weight of the jth word in the ith English sentence, and n_i is the total number of words in the ith English sentence, representing the sentence length;
and generating a text vector through normalization over the sentence length and the text length, wherein the calculation formula is as follows:
V_text = (1/p) · Σ_{i=1}^{p} S_i
wherein V_text is the text vector and p is the total number of sentences in the text, representing the text length.
Preferably, the translation generating module is configured to:
according to the received text vector, calculating an initial context representation through an attention mechanism, generating a probability distribution of the first word by the decoder, and selecting the word with the highest probability to add to an initially empty target language text sequence;
calculating the correlation between the currently input text vector and the generated target language text sequence through the attention mechanism to obtain a weighted context representation, so as to update the current context representation;
based on the current context representation, generating, by the decoder, the next word and a probability distribution representing its selection probability;
selecting the word with the highest probability from the probability distribution and adding it to the target language text sequence;
and when the generated word is an ending symbol, the decoder stops generating and outputs the final target language text.
In summary, an intelligent English translation method based on big data is provided: by preprocessing the English text before translation in the vocabulary dimension and the sentence dimension, the parts of speech of words in the English text are effectively connected with their contexts, thereby improving the accuracy of subsequent translation. Through steps such as encoder processing and vocabulary set formation, the text vector of the English text can be obtained, and the English text is translated according to the text vector, which improves translation efficiency and accuracy.
It should be noted that, the big data based intelligent translation system provided by the embodiment of the present invention is used for executing all the flow steps of the big data based intelligent translation method in the above embodiment, and the working principles and beneficial effects of the two correspond one to one, so that the description is omitted.
The embodiment of the invention also provides an electronic device. The electronic device comprises a processor, a memory, and a computer program stored in the memory and executable on the processor, such as an intelligent translation program based on big data. When the processor executes the computer program, the steps of the above embodiments of the intelligent translation method based on big data are implemented, for example, step S11 shown in fig. 1. Alternatively, when executing the computer program, the processor implements the functions of the modules/units in the system embodiments described above, e.g., the translation generation module.
The computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specified functions, the instruction segments being used to describe the execution of the computer program in the electronic device.
The electronic equipment can be a desktop computer, a notebook computer, a palm computer, an intelligent tablet and other computing equipment. The electronic device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that the above components are merely examples of electronic devices and are not limiting of electronic devices, and may include more or fewer components than those described above, or may combine certain components, or different components, e.g., the electronic devices may also include input-output devices, network access devices, buses, etc.
The processor may be a central processing unit (CPU), another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the electronic device and connects the various parts of the overall electronic device using various interfaces and lines.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the electronic device by running or executing the computer program and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area, which may store an operating system and application programs required for at least one function (such as a sound playing function or an image playing function), and a data storage area, which may store data created according to the use of the device (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The integrated modules/units of the electronic device, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the method of the above embodiment by instructing the relevant hardware through a computer program, which may be stored in a computer readable storage medium; when executed by a processor, the computer program implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately adjusted according to the requirements of legislation and patent practice in the relevant jurisdictions; for example, in certain jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that the system embodiments described above are merely illustrative, and that the units described as separate units may or may not be physically separate, and that units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the system embodiment of the present invention, the connection relationship between the modules represents that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not to be construed as limiting the scope of the invention. It should be noted that any modifications, equivalent substitutions, improvements, etc. made by those skilled in the art without departing from the spirit and principles of the present invention are intended to be included in the scope of the present invention.