Disclosure of Invention
In view of the foregoing, it is desirable to provide a source coding and decoding method and apparatus based on context semantics.
A method for source coding and decoding based on context semantics, the method comprising:
at the encoding end:
Acquiring a word frequency value of each word in a preset training corpus.
Sorting the words of each part-of-speech class in the training corpus within that class according to the word frequency values, to obtain an in-part-of-speech ranking value of each word.
Grouping the words with the same ranking value in each part-of-speech class into the same leaf node, obtaining a weight value of the leaf node as the sum of the word frequency values of the words belonging to the leaf node, and establishing an optimal binary tree model.
Allocating a non-repeating prefix code to each leaf node to obtain encoded data of a corpus to be encoded.
At the decoding end:
Obtaining a word sequence set corresponding to the encoded data according to the optimal binary tree model.
Processing the word sequence set with a preset context association model to obtain corresponding decoding result data.
In one embodiment, the step of sorting the words of each part-of-speech class in the training corpus within that class according to the word frequency values to obtain the in-part-of-speech ranking value of each word includes:
Classifying the words in the training corpus according to their parts of speech to obtain corresponding part-of-speech classes, the part-of-speech classes including a noun class, a verb class, an adjective class, an adverb class, and a conjunction class.
Within each part-of-speech class, arranging the words in descending order of word frequency value to obtain a word sequence, and obtaining the in-part-of-speech ranking value of each word from the word sequence.
In one embodiment, the optimal binary tree model is established by:
Acquiring a first leaf node and a second leaf node with the lowest current weight values, and merging the first leaf node and the second leaf node to obtain a third leaf node.
Obtaining the weight value of the third leaf node as the sum of the weight values of the first leaf node and the second leaf node.
In one embodiment, the non-repeating prefix codes are allocated to the leaf nodes by:
Comparing the weight values of the first leaf node and the second leaf node, and obtaining the label values of the first leaf node and the second leaf node respectively according to the comparison result.
Obtaining the non-repeating prefix code of the first leaf node from the sequence of label values along the path from the root node to the first leaf node in the optimal binary tree model.
In one embodiment, processing the word sequence set with the preset context association model to obtain the corresponding decoding result data includes:
Obtaining context semantic association features among words in the training corpus.
Obtaining, from the word sequence set according to the context semantic association features, the word sequence with the highest joint occurrence probability value as the corresponding decoding result data.
In one embodiment, the context semantic association features among words in the training corpus are obtained by:
Learning the context semantic association features among words in the training corpus with an LSTM-based neural network model.
In one embodiment, the word sequence with the highest joint occurrence probability value is obtained from the word sequence set by:
Modeling the joint probability distribution of the word sequences in the word sequence set with an N-gram model.
When the length of a word sequence is smaller than the preset context window size, obtaining the word sequence with the highest joint occurrence probability value by enumeration according to the N-gram model.
When the length of the word sequence is greater than the preset context window size, obtaining the word sequence with the highest joint occurrence probability value with a state-compression dynamic programming algorithm according to the N-gram model.
A source coding and decoding device based on context semantics, comprising:
an encoding module, configured to acquire a word frequency value of each word in a preset training corpus, sort the words of each part-of-speech class in the training corpus within that class according to the word frequency values to obtain an in-part-of-speech ranking value of each word, group the words with the same ranking value in each part-of-speech class into the same leaf node, obtain a weight value of the leaf node as the sum of the word frequency values of the words belonging to the leaf node, establish an optimal binary tree model, and allocate a non-repeating prefix code to each leaf node to obtain encoded data of a corpus to be encoded; and
a decoding module, configured to obtain a word sequence set corresponding to the encoded data according to the optimal binary tree model, and process the word sequence set with a preset context association model to obtain corresponding decoding result data.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
at the encoding end:
Acquiring a word frequency value of each word in a preset training corpus.
Sorting the words of each part-of-speech class in the training corpus within that class according to the word frequency values, to obtain an in-part-of-speech ranking value of each word.
Grouping the words with the same ranking value in each part-of-speech class into the same leaf node, obtaining a weight value of the leaf node as the sum of the word frequency values of the words belonging to the leaf node, and establishing an optimal binary tree model.
Allocating non-repeating prefix codes to the leaf nodes to obtain encoded data of the corpus to be encoded.
At the decoding end:
Obtaining a word sequence set corresponding to the encoded data according to the optimal binary tree model.
Processing the word sequence set with a preset context association model to obtain corresponding decoding result data.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
at the encoding end:
Acquiring a word frequency value of each word in a preset training corpus.
Sorting the words of each part-of-speech class in the training corpus within that class according to the word frequency values, to obtain an in-part-of-speech ranking value of each word.
Grouping the words with the same ranking value in each part-of-speech class into the same leaf node, obtaining a weight value of the leaf node as the sum of the word frequency values of the words belonging to the leaf node, and establishing an optimal binary tree model.
Allocating non-repeating prefix codes to the leaf nodes to obtain encoded data of the corpus to be encoded.
At the decoding end:
Obtaining a word sequence set corresponding to the encoded data according to the optimal binary tree model.
Processing the word sequence set with a preset context association model to obtain corresponding decoding result data.
Compared with the prior art, the context-semantics-based source coding and decoding method, device, computer device, and storage medium described above sort, at the encoding end, the words of the training corpus within their part-of-speech classes according to word frequency, combine the words with the same rank in each part-of-speech class into one leaf node, obtain the weight value of each leaf node as the sum of the word frequencies of all its words, generate the optimal binary tree model, and allocate a non-repeating prefix code to each leaf node; the encoded data of the corpus to be encoded are then obtained from the non-repeating prefix codes of the leaf nodes in the optimal binary tree model. At the decoding end, a corresponding candidate word set is obtained from the binary tree model according to the encoded data, and the decoding result is selected from the candidate word set according to the context association relation. By sorting the words, adding a semantic dimension to the coding and decoding process, and exploiting the semantic association of the context during decoding to obtain the optimal decoding result, the method and device achieve efficient transmission and recovery of semantic information and reduce transmission overhead.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a source coding and decoding method based on context semantics is provided, which is applied to an encoding end and a decoding end.
The method comprises the following steps at the encoding end:
step 102, obtaining a word frequency value of each word in a preset training corpus.
Step 104, sorting the words of each part-of-speech class in the training corpus within that class according to the word frequency values, to obtain the in-part-of-speech ranking value of each word.
Specifically, the part-of-speech classified words refer to all words in the training corpus that have been divided into part-of-speech classes according to their part-of-speech tags; the part-of-speech classes specifically include a noun class η, a verb class v, an adjective class a, and so on. The words in each part-of-speech class are then arranged in descending order of word frequency value, and the in-part-of-speech ranking value of each word within its class is obtained from this order.
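For illustration only, the frequency counting and in-part-of-speech ranking described above can be sketched in Python as follows; the representation of the training corpus as a list of (word, part-of-speech) pairs, each word carrying a single tag, is an assumption made for the sketch rather than a detail of the embodiment.

```python
from collections import Counter, defaultdict

def rank_within_pos(tagged_corpus):
    """tagged_corpus -- list of (word, pos) pairs; each word is assumed to
    carry a single part-of-speech tag.
    Returns (freq, rank): the word frequency values and the in-part-of-speech
    ranking value of each word (0 = most frequent word of its class)."""
    freq = Counter(word for word, _ in tagged_corpus)
    pos_of = {word: pos for word, pos in tagged_corpus}

    # Group the distinct words by part-of-speech class.
    by_pos = defaultdict(list)
    for word, pos in pos_of.items():
        by_pos[pos].append(word)

    # Within each class, sort by descending word frequency value.
    rank = {}
    for words in by_pos.values():
        for r, word in enumerate(sorted(words, key=lambda w: -freq[w])):
            rank[word] = r
    return freq, rank
```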
Step 106, grouping the words with the same ranking value in each part-of-speech class into the same leaf node, obtaining the weight value of the leaf node as the sum of the word frequency values of the words belonging to that leaf node, and establishing an optimal binary tree model.
The words at the same position (that is, with the same in-part-of-speech ranking value) in the sorted word sequence of each part-of-speech class are combined and assigned to the same leaf node. Specifically, the most frequent word of each part-of-speech class forms the leaf node A_0 = (η_0, v_0, a_0, ...), the next most frequent words form the leaf node A_1 = (η_1, v_1, a_1, ...), and so on, yielding M leaf nodes A_i = (η_i, v_i, a_i, ...), i = 0, ..., M-1. The weight of each leaf node is the sum of the word frequency values of all words contained in that leaf node. The optimal binary tree model is then established from the obtained leaf nodes.
Further, the optimal binary tree model is established as follows: the first leaf node and the second leaf node with the lowest current weight values are acquired and merged to obtain a third leaf node, and the weight value of the third leaf node is obtained as the sum of the weight values of the first leaf node and the second leaf node.
Step 108, allocating the non-repeating prefix codes to the leaf nodes to obtain the encoded data of the corpus to be encoded.
The non-repeating prefix codes are such that no code word is the prefix of another, so that the unique encoded data corresponding to each leaf node can be obtained from the prefix code of that leaf node in the optimal binary tree model.
For example, all words in the training corpus are divided into four major classes according to their part-of-speech tags, namely the noun class η, the verb class v, the adjective class a, and the other class o. In the training corpus, 'time' (appearing 1597 times) of the noun class, 'is' (appearing 10108 times) of the verb class, 'new' (appearing 1635 times) of the adjective class, and 'the' (appearing 69968 times) of the other class appear most frequently in their respective classes, so leaf node A_0 is {time, is, new, the}, and its weight is 83308, the sum of the frequencies of the four words. By analogy, all leaf nodes are obtained. Each time, the two leaf nodes with the lowest weights are merged to generate a new leaf node, whose weight is the sum of the weights of the two merged nodes, and the optimal binary tree is thereby constructed from bottom to top. Meanwhile, according to the weight values of the two leaf nodes to be merged, the labels '1' and '0' are respectively assigned to them, until all M leaf nodes of the optimal binary tree have been allocated code words; the code word of a leaf node is the sequence of labels on the path from the root node to that leaf node, and the resulting code is the non-repeating prefix code of the leaf node.
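A minimal Python sketch of this construction is given below, reusing the hypothetical freq/rank structures from the previous sketch; the heap-based merge loop is an illustrative choice, and the label rule (label '1' for the heavier of the two merged nodes, '0' for the lighter) follows the allocation embodiment described further below.

```python
import heapq
import itertools
from collections import defaultdict

def group_into_leaf_nodes(freq, rank):
    """Words with the same in-part-of-speech ranking value go to the same leaf
    node A_i; the weight of A_i is the sum of their word frequency values."""
    leaf_words, leaf_weights = defaultdict(set), defaultdict(int)
    for word, r in rank.items():
        leaf_words[r].add(word)
        leaf_weights[r] += freq[word]
    M = max(leaf_weights) + 1
    return [leaf_words[i] for i in range(M)], [leaf_weights[i] for i in range(M)]

def build_optimal_binary_tree(leaf_weights):
    """Returns a dict mapping leaf index i -> its non-repeating prefix code."""
    counter = itertools.count()          # tie-breaker for equal weights
    # A node is either a leaf index or a pair (heavier_child, lighter_child).
    heap = [(w, next(counter), i) for i, w in enumerate(leaf_weights)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, lighter = heapq.heappop(heap)   # lowest current weight
        w2, _, heavier = heapq.heappop(heap)   # second lowest
        heapq.heappush(heap, (w1 + w2, next(counter), (heavier, lighter)))
    codes = {}
    def assign(node, prefix):
        if isinstance(node, int):              # a leaf node A_i
            codes[node] = prefix or '0'        # degenerate single-leaf case
        else:                                  # heavier side labelled '1', lighter '0'
            assign(node[0], prefix + '1')
            assign(node[1], prefix + '0')
    assign(heap[0][2], '')
    return codes
```

For the example above, leaf index 0 would carry the words time, is, new, and the, with weight 83308, and its code word would be the sequence of labels on the path from the root to that leaf.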
At the decoding end, the method comprises the following steps:
and step 110, obtaining a word sequence set corresponding to the encoded data according to the optimal binary tree model.
And step 112, processing the word sequence set by using a preset context correlation model to obtain corresponding decoding result data.
The decoding end receives a group of codes, and the corresponding leaf nodes can be obtained from the optimal binary tree model according to the codes. Since each leaf node corresponds to a group of words, each non-repeating prefix code yields a corresponding set of candidate words. When semantics are expressed, there is a contextual association between the context words. Therefore, the context association model can be used to select, as the decoding result, the word sequence whose context words have the highest joint occurrence probability.
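Purely as an illustration, mapping the received bit stream back to candidate word sets can be sketched as follows, under the same hypothetical code table and per-leaf word sets as in the encoding sketch above.

```python
def decode_to_candidates(bitstream, codes, leaf_words):
    """bitstream  -- the received string of '0'/'1' characters
    codes        -- leaf index -> non-repeating prefix code (from the encoder)
    leaf_words   -- leaf index -> set of words grouped into that leaf node
    Returns a list of candidate word sets, one per decoded code word."""
    code_to_leaf = {code: leaf for leaf, code in codes.items()}
    candidates, buffer = [], ''
    for bit in bitstream:
        buffer += bit
        # Because no code word is a prefix of another, the first match is final.
        if buffer in code_to_leaf:
            candidates.append(leaf_words[code_to_leaf[buffer]])
            buffer = ''
    return candidates
```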
In this embodiment, the words are sorted and a semantic dimension is added to the coding and decoding process; the context semantic association is used as prior knowledge to optimize the allocation of code words and to achieve intelligent information recovery during decoding, so that the optimal decoding result is obtained from the corresponding word sequence set. This enables efficient transmission and recovery of semantic information and reduces transmission overhead.
In one embodiment, the non-repeating prefix codes are allocated to the leaf nodes as follows:
The weight values of the first leaf node and the second leaf node are compared, and the label values of the first leaf node and the second leaf node are obtained respectively according to the comparison result.
The non-repeating prefix code of the first leaf node is obtained from the sequence of label values along the path from the root node to the first leaf node in the optimal binary tree model.
Specifically, the weights of the two leaf nodes to be merged are compared; the label value of the leaf node with the higher weight is set to 1, and the label value of the leaf node with the lower weight is set to 0. The merging process is iterated until only two nodes remain to be merged into the root node, and in this process label values are set for all leaf nodes in the optimal binary tree. The non-repeating prefix code of a leaf node is then obtained from the sequence of label values along the path from the root node to that leaf node. The assignment of label values can be adjusted according to the coding requirements, as long as the two nodes to be merged can be distinguished.
This embodiment provides a simple way of allocating non-repeating prefix codes based on the generation process of the optimal binary tree; the coding scheme is simple and easy to implement.
In one embodiment, as shown in fig. 2, an N-gram model and a multi-layer LSTM-based neural network model are used to characterize and learn the association between context words, and a state-compression dynamic programming method is used to decode a series of adjacent words jointly as context, so as to obtain a globally optimal solution when one code corresponds to several words. In this embodiment, the word sequence with the highest joint occurrence probability value is obtained from the word sequence set as follows:
step 202, learning context semantic association features among words in the training corpus by using an LSTM-based neural network model.
Step 204, modeling the joint probability distribution of the word sequences in the word sequence set with an N-gram model.
Specifically, since the training corpus is divided into several part-of-speech classes, one non-repeating prefix code corresponds to at most one word from each class; thus, if there are C part-of-speech classes, a word sequence s of length n corresponds to at most C^n permutation combinations. The probability value P(w_1, w_2, ..., w_n) of each permutation combination is calculated. This embodiment uses the N-gram model to model the joint probability Pr(w_1 w_2 ... w_n). The process is a Markov chain, Pr(w_1 w_2 ... w_n) = Pr(w_1) Pr(w_2 | w_1) ... Pr(w_n | w_{n-1} ... w_2 w_1), i.e., the occurrence of each word is associated with all of the preceding historical words. However, as the distance between the positions at which two words appear increases, the correlation between their occurrence probabilities gradually decreases. Thus, under the Markov assumption that each word in the sequence is related only to the preceding N historical words, the joint probability can be reduced to Pr(w_1 w_2 ... w_n) ≈ ∏_{i=1}^{n} Pr(w_i | w_{i-N} ... w_{i-1}), where the context semantic association feature Pr(w_i | w_{i-N} ... w_{i-1}) can be learned by a deep network.
As shown in fig. 3, the present embodiment uses a multi-layer LSTM network, which includes LSTM layer I (256 nodes), LSTM layer II (256 nodes), Dense layer I (256 nodes, with the ReLU nonlinear activation function), and Dense layer II (whose number of nodes equals the number of words in the lexicon, with the Softmax nonlinear activation function). The input of the multi-layer LSTM neural network is the one-hot vectors of the L context words on either side of the central word to be predicted. (A one-hot vector is a "one-bit-effective" encoding: each state corresponds to one dimension of the vector, and at any time only one bit is active, i.e., takes the value 1, while the remaining bits take the value 0.) The output of the multi-layer LSTM neural network is the one-hot vector of the prediction target, w_Output = w_i; that is, the central word is predicted from the L context words before and after it. The activation function of the network output layer is the Softmax function, which maps the outputs of the neurons to the interval (0, 1); the output values are the computed probabilities. The network is trained by gradient descent so as to minimize the loss function E = -log Pr(w_Output | w_Input) of the multi-layer LSTM network, so that the output layer gives the probability of the central word given its context.
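A minimal tf.keras sketch of a network with this layer layout is shown below for illustration; the vocabulary size, the context length L, the use of one-hot input sequences of length 2L, and the plain-SGD training configuration are assumptions made for the sketch, not details fixed by this embodiment.

```python
import tensorflow as tf

def build_context_model(vocab_size: int, context_len: int):
    """Multi-layer LSTM network with the layer layout described above:
    LSTM layer I (256 nodes), LSTM layer II (256 nodes), Dense layer I
    (256 nodes, ReLU) and Dense layer II (one Softmax node per lexicon word)."""
    model = tf.keras.Sequential([
        # Input: the one-hot vectors of the 2*L context words surrounding
        # the central word, treated as a sequence of length 2*L.
        tf.keras.layers.LSTM(256, return_sequences=True,
                             input_shape=(2 * context_len, vocab_size)),   # LSTM layer I
        tf.keras.layers.LSTM(256),                                         # LSTM layer II
        tf.keras.layers.Dense(256, activation="relu"),                     # Dense layer I
        tf.keras.layers.Dense(vocab_size, activation="softmax"),           # Dense layer II
    ])
    # Categorical cross-entropy on a one-hot target equals -log Pr(w_Output | w_Input),
    # and plain SGD realizes the gradient-descent training described above.
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
                  loss="categorical_crossentropy")
    return model
```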
Step 206, as shown in fig. 4, when the length of the word sequence is smaller than the preset context window size, the word sequence with the highest joint occurrence probability value is obtained by enumeration according to the N-gram model.
The context window size is set to N, N ∈ Z+. For a word sequence s = (w_1, w_2, ..., w_n) of length n, n ∈ Z+, when n ≤ N an enumeration algorithm is used to find, among the permutation combinations S, the sequence with the strongest context association as the decoding result, i.e., s* = argmax_{s ∈ S} Pr(s). The sequence for which Pr(w_1 w_2 ... w_n) attains its maximum value is taken as the decoding result.
Step 208, when the length n of the word sequence is greater than the preset context window size N, the word sequence with the highest joint occurrence probability value is obtained with a state-compression dynamic programming algorithm according to the N-gram model. The state transition process is as follows.
When n > N, the state-compression dynamic programming algorithm first solves the smallest sub-problem, namely the joint probability values of the combinations of the first N words of the sequence, and then gradually enlarges the sub-problem: sub-problem i considers the globally optimal combination of the first i words, first for the first N+1 words, then for the first N+2 words, and so on, until the globally optimal solution for all n words of the sequence is obtained. The specific process is as follows:
(1) All probability values of the smallest sub-problem (i.e., i = N) are first calculated and recorded as P[S_N(k_1...k_N)], where k_j denotes the candidate word selected at position j. Under the Markov assumption that each word is related only to the preceding N historical words, this probability value is calculated as P[S_N(k_1...k_N)] = Pr(k_1) Pr(k_2 | k_1) ... Pr(k_N | k_1...k_{N-1}).
(2) When each larger sub-problem (i > N) is solved recursively, the optimal probability values of the states of the previous sub-problem i-1 are needed. That is, the optimal probability value P[S_i(k_1...k_N)], whose state k_1...k_N consists of the candidate words selected at positions i-N+1, ..., i, is obtained by choosing the candidate word l at position i-N that maximizes the product of the optimal probability value P[S_{i-1}(l, k_1, ..., k_{N-1})] of the corresponding state of sub-problem i-1 and the probability Pr(k_N | l, k_1, ..., k_{N-1}) of producing the next word k_N from the preceding N words; the state transition formula is therefore P[S_i(k_1...k_N)] = max_l P[S_{i-1}(l, k_1, ..., k_{N-1})] · Pr(k_N | l, k_1, ..., k_{N-1}).
In other words, the maximum probability value of the i-th sub-problem can be decomposed into smaller sub-problems, and since these sub-problems overlap, the state-compression dynamic programming algorithm stores the optimal results of the sub-problems in a table. Repeated computation is thereby avoided, which reduces the time complexity of finding the globally optimal solution among the large number of potential combinations.
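The following Python sketch is one possible rendering of the state-compression dynamic programming step, under the same hypothetical interfaces as the enumeration sketch above; it stores, for each state (the last N chosen words), only the best-scoring prefix, which realizes the state transition described above.

```python
import itertools
import math

def dp_best_sequence(candidate_sets, ngram_prob, N):
    """State-compression DP: the state of sub-problem i is the tuple of the
    N candidate words chosen at positions i-N+1 .. i; the table stores, for
    each state, the best log-probability and the full best prefix."""
    n = len(candidate_sets)
    assert n > N, "for n <= N the enumeration method is used instead"

    def cond_logp(word, history):
        return math.log(ngram_prob(word, history))  # smoothed model assumed

    # Smallest sub-problem i = N: score every combination of the first N words.
    table = {}
    for state in itertools.product(*candidate_sets[:N]):
        logp = sum(cond_logp(w, state[:j]) for j, w in enumerate(state))
        table[state] = (logp, state)

    # Sub-problems i = N+1 .. n: extend by one word, keeping only the best
    # predecessor for each new state (the state transition described above).
    for i in range(N, n):
        new_table = {}
        for state, (logp, prefix) in table.items():
            history = state                     # the N preceding words
            for w in candidate_sets[i]:
                new_state = state[1:] + (w,)
                cand = (logp + cond_logp(w, history), prefix + (w,))
                if new_state not in new_table or cand[0] > new_table[new_state][0]:
                    new_table[new_state] = cand
        table = new_table

    return max(table.values())[1]   # word sequence with the highest joint probability
```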
To illustrate the technical effects of the present application, the Brown word lexicon was tested based on the method provided in one of the above embodiments. The Brown word lexicon is divided into four classes according to part of speech: 30632 nouns, 10392 verbs, 8054 adjectives, and 4331 words of other classes. Compared with encoding each word individually, the number of required code words is reduced from 53409 to 30632. Fig. 5 compares the performance of the method of the present application with the Huffman coding method; it can be seen that, when the words of the same corpus are encoded, the dynamic average code length of the coding method of the present application is shorter than that of Huffman coding, and the difference increases with the number of characters to be encoded, which verifies the effectiveness of the algorithm.
Fig. 6 to 8 show simulation results of the method of the present application with context window lengths of 3, 4, and 5, respectively. It can be seen that the semantic similarity of the method provided by the present application peaks and remains stable when the context window size is equal to or larger than the feature window size used in learning with the LSTM neural network. As the context window increases, the semantic similarity score increases; as the feature window size increases, the semantic similarity score also increases.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, a source coding and decoding device based on context semantics is provided, including:
an encoding module, configured to acquire a word frequency value of each word in a preset training corpus, sort the words of each part-of-speech class in the training corpus within that class according to the word frequency values to obtain an in-part-of-speech ranking value of each word, group the words with the same ranking value in each part-of-speech class into the same leaf node, obtain a weight value of the leaf node as the sum of the word frequency values of the words belonging to the leaf node, establish an optimal binary tree model, and allocate a non-repeating prefix code to each leaf node to obtain encoded data of a corpus to be encoded; and
a decoding module, configured to obtain a word sequence set corresponding to the encoded data according to the optimal binary tree model, and process the word sequence set with a preset context association model to obtain corresponding decoding result data.
In one embodiment, the encoding module is configured to classify the words in the training corpus according to their parts of speech to obtain corresponding part-of-speech classes, the part-of-speech classes including a noun class, a verb class, an adjective class, an adverb class, and a conjunction class, and, within each part-of-speech class, to obtain the in-part-of-speech ranking value of each word according to the descending order of word frequency values.
In one embodiment, the encoding module is configured to acquire a first leaf node and a second leaf node with the lowest current weight values, merge the first leaf node and the second leaf node to obtain a third leaf node, and obtain the weight value of the third leaf node as the sum of the weight values of the first leaf node and the second leaf node.
In one embodiment, the encoding module is configured to compare the weight values of the first leaf node and the second leaf node, obtain the label values of the first leaf node and the second leaf node respectively according to the comparison result, and obtain the non-repeating prefix code of the first leaf node from the sequence of label values along the path from the root node to the first leaf node in the optimal binary tree model.
In one embodiment, the decoding module is configured to obtain context semantic association features among words in the training corpus, and to select, from the word sequence set according to the context semantic association features, the word sequence with the highest joint occurrence probability value as the corresponding decoding result data.
In one embodiment, the decoding module is configured to learn the context semantic association features among words in the training corpus with an LSTM-based neural network model.
In one embodiment, the decoding module is configured to model the joint probability distribution of the word sequences in the word sequence set with an N-gram model; when the length of a word sequence is smaller than the preset context window size, to obtain the word sequence with the highest joint occurrence probability value by enumeration according to the N-gram model; and when the length of the word sequence is greater than the preset context window size, to obtain the word sequence with the highest joint occurrence probability value with a state-compression dynamic programming algorithm according to the N-gram model.
For specific limitation of a source coding and decoding device based on context semantics, refer to the above limitation on a source coding and decoding method based on context semantics, which is not described herein again. The modules in the source coding and decoding device based on the context semantics can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and whose internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the optimal binary tree model and the context association model data. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a source coding and decoding method based on context semantics.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory storing a computer program and a processor that implements the following steps when executing the computer program:
at the encoding end:
Acquiring a word frequency value of each word in a preset training corpus.
Sorting the words of each part-of-speech class in the training corpus within that class according to the word frequency values, to obtain an in-part-of-speech ranking value of each word.
Grouping the words with the same ranking value in each part-of-speech class into the same leaf node, obtaining a weight value of the leaf node as the sum of the word frequency values of the words belonging to the leaf node, and establishing an optimal binary tree model.
Allocating non-repeating prefix codes to the leaf nodes to obtain encoded data of the corpus to be encoded.
At the decoding end:
Obtaining a candidate word set corresponding to the encoded data according to the optimal binary tree model.
Obtaining corresponding decoding result data with a preset context association model.
In one embodiment, the processor, when executing the computer program, further performs the following steps: classifying the words in the training corpus according to their parts of speech to obtain corresponding part-of-speech classes, the part-of-speech classes including a noun class, a verb class, an adjective class, an adverb class, and a conjunction class; and, within each part-of-speech class, obtaining the in-part-of-speech ranking value of each word according to the descending order of word frequency values.
In one embodiment, the processor, when executing the computer program, further performs the following steps: acquiring a first leaf node and a second leaf node with the lowest current weight values, merging the first leaf node and the second leaf node to obtain a third leaf node, and obtaining the weight value of the third leaf node as the sum of the weight values of the first leaf node and the second leaf node.
In one embodiment, the processor, when executing the computer program, further performs the following steps: comparing the weight values of the first leaf node and the second leaf node, obtaining the label values of the first leaf node and the second leaf node respectively according to the comparison result, and obtaining the non-repeating prefix code of the first leaf node from the sequence of label values along the path from the root node to the first leaf node in the optimal binary tree model.
In one embodiment, the processor, when executing the computer program, further performs the following steps: obtaining context semantic association features among words in the training corpus, and obtaining, according to the context semantic association features, the word sequence with the highest joint occurrence probability value as the corresponding decoding result data.
In one embodiment, the processor, when executing the computer program, further performs the following step: learning the context semantic association features among words in the training corpus with an LSTM-based neural network model.
In one embodiment, the processor, when executing the computer program, further performs the following steps: modeling the joint probability distribution of the word sequences with an N-gram model; when the length of a word sequence is smaller than the preset context window size, obtaining the word sequence with the highest joint occurrence probability value by enumeration according to the N-gram model; and when the length of the word sequence is greater than the preset context window size, obtaining the word sequence with the highest joint occurrence probability value with a state-compression dynamic programming algorithm according to the N-gram model.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, performs the following steps:
at the encoding end:
Acquiring a word frequency value of each word in a preset training corpus.
Sorting the words of each part-of-speech class in the training corpus within that class according to the word frequency values, to obtain an in-part-of-speech ranking value of each word.
Grouping the words with the same ranking value in each part-of-speech class into the same leaf node, obtaining a weight value of the leaf node as the sum of the word frequency values of the words belonging to the leaf node, and establishing an optimal binary tree model.
Allocating non-repeating prefix codes to the leaf nodes to obtain encoded data of the corpus to be encoded.
At the decoding end:
Obtaining a word sequence set corresponding to the encoded data according to the optimal binary tree model.
Obtaining corresponding decoding result data with a preset context association model.
In one embodiment, the computer program, when executed by the processor, further performs the following steps: classifying the words in the training corpus according to their parts of speech to obtain corresponding part-of-speech classes, the part-of-speech classes including a noun class, a verb class, an adjective class, an adverb class, and a conjunction class; and, within each part-of-speech class, arranging the words in descending order of word frequency value to obtain a word sequence, and obtaining the in-part-of-speech ranking value of each word from the word sequence.
In one embodiment, the computer program, when executed by the processor, further performs the following steps: acquiring a first leaf node and a second leaf node with the lowest current weight values, merging the first leaf node and the second leaf node to obtain a third leaf node, and obtaining the weight value of the third leaf node as the sum of the weight values of the first leaf node and the second leaf node.
In one embodiment, the computer program, when executed by the processor, further performs the following steps: comparing the weight values of the first leaf node and the second leaf node, obtaining the label values of the first leaf node and the second leaf node respectively according to the comparison result, and obtaining the non-repeating prefix code of the first leaf node from the sequence of label values along the path from the root node to the first leaf node in the optimal binary tree model.
In one embodiment, the computer program, when executed by the processor, further performs the following steps: obtaining context semantic association features among words in the training corpus, and obtaining, according to the context semantic association features, the word sequence with the highest joint occurrence probability value as the corresponding decoding result data.
In one embodiment, the computer program, when executed by the processor, further performs the following step: learning the context semantic association features among the words in the training corpus with an LSTM-based neural network model.
In one embodiment, the computer program, when executed by the processor, further performs the following steps: modeling the joint probability distribution of the word sequences with an N-gram model; when the length of a word sequence is smaller than the preset context window size, obtaining the word sequence with the highest joint occurrence probability value by enumeration according to the N-gram model; and when the length of the word sequence is greater than the preset context window size, obtaining the word sequence with the highest joint occurrence probability value with a state-compression dynamic programming algorithm according to the N-gram model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.