
CN112836506A - A method and device for source coding and decoding based on context semantics - Google Patents

A method and device for source coding and decoding based on context semantics

Info

Publication number
CN112836506A
Authority
CN
China
Prior art keywords
word
leaf node
value
speech
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110206745.3A
Other languages
Chinese (zh)
Other versions
CN112836506B (en)
Inventor
魏急波
赵海涛
张亦弛
熊俊
张姣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202110206745.3A priority Critical patent/CN112836506B/en
Publication of CN112836506A publication Critical patent/CN112836506A/en
Application granted granted Critical
Publication of CN112836506B publication Critical patent/CN112836506B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Machine Translation (AREA)

Abstract


The present application relates to a method and apparatus for source coding and decoding based on context semantics. The method includes: at the encoding end, sorting the words of the training corpus within each part-of-speech class according to word frequency, merging the words with the same rank in different part-of-speech classes into one leaf node, obtaining the weight value of each leaf node from the sum of the word frequencies of all words in that node, generating the optimal binary tree model, and assigning a non-repetitive prefix code to each leaf node. The encoded data of the corpus is obtained from the non-repetitive prefix codes of the leaf nodes. At the decoding end, the candidate word sequence set corresponding to the encoded data is obtained from the binary tree model, and the word sequence with the closest contextual relationship is selected as the decoding result according to the contextual association. The present application adds a semantic dimension to the encoding process and uses the semantic association of the context to obtain the optimal decoding result during decoding, which enables efficient expression and transmission of semantic information and saves transmission overhead.


Description

Information source coding and decoding method and device based on context semantics
Technical Field
The application relates to the technical field of intelligent body semantic communication, in particular to a method and a device for information source coding and decoding based on context semantics.
Background
As the intelligence and external cognitive abilities of communication devices continue to increase, intelligent semantic communication has become a major research trend in the communication field. The core of semantic communication is the accurate transmission of the meaning or content of data, rather than the accurate transmission of communication symbols.
At present, the data to be transmitted are processed at the sending end according to an understanding and analysis of the communication purpose and the historical communication process, so that a large amount of redundant transmission is avoided at the source; at the receiving end, intelligent error correction and recovery are performed on the received signal according to the context, prior information, the purpose of the individual communication and other knowledge. However, many fundamental problems in semantic communication remain to be solved, such as how to achieve efficient expression of semantic information through an efficient coding scheme.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a source coding and decoding method and apparatus based on context semantics.
A method for source coding and decoding based on context semantics, the method comprising:
at the encoding end:
and acquiring the word frequency value of each word in a preset training corpus.
And sequencing each word with part of speech classification in the training corpus in the part of speech according to the word frequency value to obtain the in-part-of-speech sequencing value of each word.
And classifying the words with the same rank value in each part of speech into the same leaf node, obtaining the weight value of the leaf node according to the sum of the word frequency values of the words corresponding to the leaf node, and establishing an optimal binary tree model.
And allocating non-repeated prefix codes to the leaf nodes to obtain the coded data of the linguistic data to be coded.
At the decoding end:
and obtaining a word sequence set corresponding to the coded data according to the optimal binary tree model.
And processing the word sequence set by using a preset context correlation model to obtain corresponding decoding result data.
In one embodiment, the step of sorting the part-of-speech-classified words in the training corpus within their part-of-speech classes according to the word frequency values to obtain the intra-class ranking value of each word includes:
Classifying the words in the training corpus according to part of speech to obtain the corresponding part-of-speech classes. The part-of-speech classes include a noun class, a verb class, an adjective class, an adverb class and a conjunction class.
Within each part-of-speech class, arranging the words in descending order of word frequency value and obtaining the intra-class ranking value of each word from the resulting word order.
In one embodiment, the method for establishing the optimal binary tree model includes:
and acquiring a first leaf node and a second leaf node with the lowest current weight value, and combining the first leaf node and the second leaf node to obtain a third leaf node.
And obtaining the weight value of the third leaf node according to the sum of the weight values of the first leaf node and the second leaf node.
In one embodiment, the manner of assigning non-repetitive prefix codes to the leaf nodes includes:
Comparing the weight values of the first leaf node and the second leaf node, and obtaining the label values of the first leaf node and the second leaf node respectively according to the comparison result.
Obtaining the non-repetitive prefix code of the first leaf node from the sequence of label values of all the nodes traversed from the root node to the first leaf node in the optimal binary tree model.
In one embodiment, the manner of processing the word sequence set with the preset context association model to obtain the corresponding decoding result data includes:
Obtaining the context semantic association features between the words in the training corpus.
Selecting, according to the context semantic association features, the word sequence with the highest joint occurrence probability value from the word sequence set as the corresponding decoding result data.
In one embodiment, the manner of obtaining the context semantic association features between the words in the training corpus includes:
Learning the context semantic association features between the words in the training corpus with an LSTM-based neural network model.
In one embodiment, the manner of obtaining the word sequence with the highest joint occurrence probability value from the word sequence set includes:
Modeling the joint probability distribution of the word sequences in the word sequence set with an N-gram model.
When the length of the word sequence is less than the preset context window value, obtaining the word sequence with the highest joint occurrence probability value by enumeration according to the N-gram model.
When the length of the word sequence is greater than the preset context window value, obtaining the word sequence with the highest joint occurrence probability value with a state compression dynamic programming algorithm according to the N-gram model.
A source coding/decoding device based on context semantics, comprising:
the encoding module is used for acquiring the word frequency value of each word in a preset training corpus, ordering each word classified by part of speech in the training corpus in the classification according to the word frequency value to obtain an ordering value in the part of speech of each word, classifying the words with the same ordering value in each part of speech into the same leaf node, obtaining the weighted value of the leaf node according to the sum of the word frequency values of the words corresponding to the leaf node, establishing an optimal binary tree model, and distributing a non-repeated prefix code to the leaf node to obtain the encoded data of the corpus to be encoded.
And the decoding module is used for obtaining a word sequence set corresponding to the encoded data according to the optimal binary tree model, and processing the word sequence set by using a preset context association model to obtain corresponding decoding result data.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
at the encoding end:
and acquiring the word frequency value of each word in a preset training corpus.
And sequencing each word with part of speech classification in the training corpus in the classification according to the word frequency value to obtain the in-word sequencing value of each word.
And classifying the words with the same rank value in each part of speech into the same leaf node, obtaining the weight value of the leaf node according to the sum of the word frequency values of the words corresponding to the leaf node, and establishing an optimal binary tree model.
And allocating non-repeated prefix codes to the leaf nodes to obtain the coded data of the linguistic data to be coded.
At the decoding end:
and obtaining a word sequence set corresponding to the coded data according to the optimal binary tree model.
And processing the word sequence set by using a preset context correlation model to obtain corresponding decoding result data.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
at the encoding end:
and acquiring the word frequency value of each word in a preset training corpus.
And sequencing each word with part of speech classification in the training corpus in the part of speech according to the word frequency value to obtain the in-part-of-speech sequencing value of each word.
And classifying the words with the same rank value in each part of speech into the same leaf node, obtaining the weight value of the leaf node according to the sum of the word frequency values of the words corresponding to the leaf node, and establishing an optimal binary tree model.
And allocating non-repeated prefix codes to the leaf nodes to obtain the coded data of the linguistic data to be coded.
At the decoding end:
and obtaining a word sequence set corresponding to the coded data according to the optimal binary tree model.
And processing the word sequence set by using a preset context correlation model to obtain corresponding decoding result data.
Compared with the prior art, in the context-semantics-based source coding and decoding method, apparatus, computer device and storage medium, the encoding end sorts the words of the training corpus within their part-of-speech classes according to word frequency, merges the words with the same rank in each part-of-speech class into one leaf node, obtains the weight value of each leaf node from the sum of the word frequencies of all words in that node, generates the optimal binary tree model, and assigns a non-repetitive prefix code to each leaf node; the encoded data of the corpus to be encoded is then obtained from the non-repetitive prefix codes of the leaf nodes in the optimal binary tree model. The decoding end obtains the corresponding candidate word sets from the binary tree model according to the encoded data, and selects the result with the closest contextual association as the decoding result. By sorting the words, the method and apparatus add a semantic dimension to the coding and decoding process and use the contextual semantic association to obtain the optimal decoding result during decoding, which enables efficient expression, transmission and recovery of semantic information and saves transmission overhead.
Drawings
FIG. 1 is a diagram illustrating the steps of a source coding method based on context semantics in one embodiment;
FIG. 2 is a schematic diagram illustrating a data processing flow at a decoding end in a context-based source coding and decoding method according to an embodiment;
FIG. 3 is a schematic diagram of an LSTM-based neural network according to an embodiment;
FIG. 4 is a schematic flow chart illustrating the calculation of word sequence joint probability values using an N-gram model in one embodiment;
fig. 5 is a performance comparison between the context-semantics-based source coding and decoding method provided by the present application and the Huffman coding method;
fig. 6 is a performance curve diagram of a context semantics-based source coding and decoding method provided by the present application when a context window is 3;
fig. 7 is a performance curve diagram of a context semantics-based source coding and decoding method provided by the present application when a context window is 4;
fig. 8 is a performance curve diagram of a context semantics-based source coding and decoding method provided by the present application when a context window is 5;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a source coding and decoding method based on context semantics is provided for an encoding side and a decoding side.
The method comprises the following steps at the encoding end:
step 102, obtaining a word frequency value of each word in a preset training corpus.
Step 104, sorting the part-of-speech-classified words in the training corpus within their part-of-speech classes according to the word frequency values to obtain the intra-class ranking value of each word.
Specifically, the part-of-speech-classified words are all the words of the training corpus, which have been divided into a number of part-of-speech classes according to their part-of-speech tags; the part-of-speech classes specifically include a noun class η, a verb class v, an adjective class a, and so on. The words in each part-of-speech class are then arranged in descending order of word frequency value, and the ranking value of each word within its part-of-speech class is obtained.
Step 106, grouping the words with the same ranking value across the part-of-speech classes into the same leaf node, obtaining the weight value of the leaf node from the sum of the word frequency values of the words corresponding to that node, and establishing an optimal binary tree model.
The words at the same position in the sorted word lists of the part-of-speech classes (i.e. with the same intra-class ranking value) are merged into the same leaf node. Specifically, the words with the highest word frequency value in each part-of-speech class form leaf node A0 = (η0, v0, a0, ...), the words with the next-highest frequency form leaf node A1 = (η1, v1, a1, ...), and so on, giving M leaf nodes Ai = (ηi, vi, ai, ...), i = 0, ..., M-1. The weight of each leaf node is the sum of the word frequency values of all the words it contains. The optimal binary tree model is then established from the resulting leaf nodes.
Further, the manner of establishing the optimal binary tree model includes: acquiring the two leaf nodes with the lowest current weight values (a first leaf node and a second leaf node), and merging the first leaf node and the second leaf node to obtain a third leaf node, whose weight value is the sum of the weight values of the first leaf node and the second leaf node.
Step 108, assigning non-repetitive prefix codes to the leaf nodes to obtain the encoded data of the corpus to be encoded.
A non-repetitive prefix code means that no code word is the prefix of another, so from the non-repetitive prefix code of each leaf node in the optimal binary tree model, unique encoded data corresponding to each leaf node can be obtained.
For example, all the words of the training corpus are divided by their part-of-speech tags into four major classes, namely the noun class η, the verb class v, the adjective class a and the other class o. In the training corpus, 'time' (appearing 1597 times) in the noun class, 'is' (appearing 10108 times) in the verb class, 'new' (appearing 1635 times) in the adjective class and 'the' (appearing 69968 times) in the other class are the most frequent words of their respective classes, so leaf node A0 is {time, is, new, the}, and its weight is 83308, the sum of the frequencies of the four words. All leaf nodes are obtained in the same way. Each time, the two nodes with the lowest weights are merged to generate a new node whose weight is the sum of the weights of the two merged nodes, and the optimal binary tree is thus constructed from the bottom up. Meanwhile, according to the weight values of the two nodes being merged, labels '1' and '0' are assigned to them respectively, until all M leaf nodes of the optimal binary tree have been assigned code words; a code word is the sequence of labels on the path from the root node to the leaf node, and the resulting codes are the non-repetitive prefix codes of the leaf nodes.
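To make the encoding procedure concrete, the following is a minimal Python sketch. It is not taken from the patent: the input format, variable names and the heap-based tie-breaking are assumptions, but it follows the steps described above (per-class frequency ranking, merging equally ranked words into leaf nodes, bottom-up tree construction with '1'/'0' labels assigned by weight, and reading off the prefix codes).

```python
import heapq
from collections import Counter, defaultdict
from itertools import count, zip_longest

def build_semantic_prefix_codes(tagged_corpus):
    """tagged_corpus: iterable of (word, pos_tag) pairs.
    Returns (codebook, leaves): word -> prefix code, and the list of leaf word groups."""
    # 1. Word frequencies and per-class ranking by descending frequency.
    freq = Counter(word for word, _ in tagged_corpus)
    by_pos = defaultdict(set)
    for word, pos in tagged_corpus:
        by_pos[pos].add(word)
    ranked = [sorted(words, key=lambda w: -freq[w]) for words in by_pos.values()]

    # 2. Words with the same intra-class rank form one leaf node A_i.
    leaves = [tuple(w for w in group if w is not None)
              for group in zip_longest(*ranked)]
    weights = [sum(freq[w] for w in leaf) for leaf in leaves]

    # 3. Bottom-up optimal binary tree over the weighted leaf nodes.
    tie = count()  # tie-breaker so heapq never has to compare tree dicts
    heap = [(w, next(tie), {"leaf": i}) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, n0 = heapq.heappop(heap)   # lower weight  -> label '0'
        w1, _, n1 = heapq.heappop(heap)   # higher weight -> label '1'
        heapq.heappush(heap, (w0 + w1, next(tie), {"0": n0, "1": n1}))
    root = heap[0][2]

    # 4. The prefix code of each leaf is the label path from the root to that leaf.
    codebook = {}
    def walk(node, path):
        if "leaf" in node:
            for word in leaves[node["leaf"]]:
                codebook[word] = path
        else:
            walk(node["0"], path + "0")
            walk(node["1"], path + "1")
    walk(root, "")
    return codebook, leaves
```

Encoding a sentence then amounts to concatenating codebook[word] for each word; every word in the same leaf shares one code word, which is what the context model at the decoding end has to disambiguate.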
The decoding end comprises the following steps:
and step 110, obtaining a word sequence set corresponding to the encoded data according to the optimal binary tree model.
And step 112, processing the word sequence set by using a preset context correlation model to obtain corresponding decoding result data.
The decoding end receives a group of codes, and the corresponding leaf nodes can be located in the optimal binary tree model from these codes. Since each leaf node corresponds to a group of words, each non-repetitive prefix code yields a corresponding set of candidate words. When semantics are expressed, contextual associations exist between neighbouring words, so the context association model can be used to select, as the decoding result, the word combination with the highest joint probability of co-occurrence.
In this embodiment, the words are sorted and a semantic dimension is added to the coding and decoding process; the contextual semantic association is used as prior knowledge both to optimize the assignment of code words and to achieve intelligent information recovery during decoding, with the optimal decoding result selected from the corresponding word sequence set. This enables efficient expression and transmission of semantic information and saves transmission overhead.
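A minimal sketch of this lookup step is given below. It is an illustration rather than the patent's code, reusing the codebook and leaves structures from the encoding sketch above and assuming the code words are transmitted as one concatenated bit string.

```python
def parse_candidates(bits, codebook, leaves):
    """Parse a bit string into a list of candidate word groups.
    bits: e.g. "0101110...", one prefix code per transmitted word."""
    # Invert the word -> code map into code -> leaf index (all words of a leaf share one code).
    code_to_leaf = {}
    for leaf_idx, leaf in enumerate(leaves):
        for word in leaf:
            code_to_leaf[codebook[word]] = leaf_idx

    groups, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in code_to_leaf:          # prefix property: the first match is the code word
            groups.append(leaves[code_to_leaf[buf]])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a complete code word")
    return groups  # e.g. [("time", "is", "new", "the"), ...]
```

Each entry of the returned list is one leaf node's word group; the context association model described next picks one word from each group.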
In one embodiment, the manner of assigning non-repetitive prefix codes to the leaf nodes includes:
Comparing the weight values of the first leaf node and the second leaf node, and obtaining the label values of the first leaf node and the second leaf node respectively according to the comparison result.
Obtaining the non-repetitive prefix code of the first leaf node from the sequence of label values of all the nodes traversed from the root node to the first leaf node in the optimal binary tree model.
Specifically, the weights of the two leaf nodes to be merged are compared; the label value of the node with the higher weight is set to 1 and that of the node with the lower weight is set to 0. The merging process is iterated until only two nodes remain, whose merger forms the root node, and in this process label values are set for all nodes of the optimal binary tree. The non-repetitive prefix code of a leaf node is obtained from the sequence of label values along the path from the root node to that leaf node. The assignment of label values can be adjusted according to the coding requirements, as long as the two nodes being merged can be distinguished.
This embodiment provides a simple non-repetitive prefix code assignment scheme based on the construction process of the optimal binary tree, which is simple to encode and easy to implement.
In one embodiment, as shown in fig. 2, the N-gram model and the multi-layer LSTM-based neural network model are used to characterize and learn the correlation between contexts, and a state compression dynamic programming method is used to decode a series of adjacent words as contexts together, so as to obtain a global optimal solution for a situation where one code corresponds to a plurality of words. In this embodiment, the manner of obtaining the word sequence with the highest joint occurrence probability value from the word sequence set includes:
step 202, learning context semantic association features among words in the training corpus by using an LSTM-based neural network model.
Step 204, modeling the joint probability distribution of the word sequences in the word sequence set with an N-gram model.
Specifically, denoting the number of part-of-speech classes by C, one non-repetitive prefix code corresponds to at most C words (one per class), so a word sequence s of length n has at most C^n candidate permutations and combinations. The probability value P(w1, w2, ..., wn) of each permutation is calculated. This embodiment models the joint probability Pr(w1 w2 ... wn) with the N-gram model; the sequence is treated as a Markov chain,
Pr(w1 w2 ... wn) = Pr(w1) Pr(w2 | w1) ... Pr(wn | wn-1 ... w2 w1) = ∏_{i=1}^{n} Pr(wi | w1 ... wi-1),
i.e. the occurrence of each word is conditioned on the preceding history. However, as the distance between word positions increases, the correlation between the occurrence probabilities of two words gradually weakens. Under the Markov assumption that each word in the sequence depends only on the preceding N words, the joint probability therefore simplifies to
Pr(w1 w2 ... wn) ≈ ∏_{i=1}^{n} Pr(wi | wi-N ... wi-1),
where the context semantic association feature Pr(wi | wi-N ... wi-1) can be learned by a deep network. As shown in fig. 3, this embodiment uses a multi-layer LSTM network comprising LSTM layer I (256 nodes), LSTM layer II (256 nodes), dense layer I (256 nodes, ReLU nonlinear activation) and dense layer II (one node per word in the lexicon, Softmax nonlinear activation). The input of the network is the one-hot vectors of the L context words on each side of the centre word to be predicted, wInput = (wi-L, ..., wi-1, wi+1, ..., wi+L); a one-hot ("one-bit-effective") vector represents each word state uniquely: it has one dimension per word in the lexicon, and at any time exactly one bit is active (set to 1) while the rest are 0. The output of the network is the one-hot vector of the predicted centre word, wOutput = wi. The activation function of the output layer is the Softmax function, which maps the neuron outputs to the interval (0, 1), so the output values are the predicted probabilities of the centre word given its context. The network is trained by gradient descent to minimize the loss function E = -log Pr(wOutput | wInput).
Step 206, as shown in fig. 4, when the length of the word sequence does not exceed the preset context window size, the word sequence with the highest joint occurrence probability value is obtained by enumeration according to the N-gram model.
Let the context window size be N, N ∈ Z+. For a word sequence s = (w1, w2, ..., wn), n ∈ Z+, when n ≤ N an enumeration algorithm searches the set S of all candidate permutations and combinations for the sequence with the strongest context association as the decoding result, i.e. s* = argmax_{s∈S} Pr(s): the joint probability Pr(w1 w2 ... wn) of every candidate sequence is evaluated with the N-gram model, and the sequence attaining the maximum probability value is taken as the decoding result.
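A minimal sketch of this enumeration step (not the patent's implementation) is shown below, assuming a scoring function ngram_logprob(seq) that returns the log joint probability of a word sequence under the trained N-gram model, and the candidate word groups produced by the prefix-code parsing step.

```python
from itertools import product

def decode_by_enumeration(candidate_groups, ngram_logprob):
    """candidate_groups: one tuple of candidate words per received code,
    with len(candidate_groups) <= context window N.
    ngram_logprob(seq): log joint probability of a word sequence under the N-gram model."""
    best_seq, best_score = None, float("-inf")
    for seq in product(*candidate_groups):     # at most C**n combinations
        score = ngram_logprob(seq)
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq                            # s* = argmax_s Pr(s)
```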
Step 208, when the length n of the word sequence is greater than the preset context window size N, the word sequence with the highest joint occurrence probability value is obtained with a state compression dynamic programming algorithm according to the N-gram model.
When n > N, the state compression dynamic programming algorithm first solves the smallest sub-problem, i.e. the joint probability values of the first N words of the sequence, and then gradually enlarges the sub-problem: it next considers the globally optimal combination of the first N+1 words, then of the first N+2 words, and so on, until sub-problem i considers the globally optimal combination of the first i words and finally the globally optimal solution over all n words is obtained. The specific process is as follows:
(1) First, all probability values of the smallest sub-problem (i.e. i = N) are calculated and recorded as the state values P[S_N(k1 ... kN)], where k1, ..., kN index the candidate words selected at the first N positions; each of these probability values is computed directly from the N-gram model as the joint probability of the corresponding N words.
(2) When each larger sub-problem (i > N) is solved recursively, the optimal probability values of the states of the previous sub-problem i-1 are needed. That is, the optimal probability value P[S_i(k1 ... kN)] of a state of the i-th sub-problem is the maximum, over the candidate word l at position i-N, of the sum of the optimal probability value P[S_{i-1}(l, k1 ... kN-1)] of the corresponding state of sub-problem i-1 and the (log-)probability, given by the N-gram model, that the preceding N words (l, k1, ..., kN-1) are followed by the candidate word kN at position i. This maximization is the state transition formula of the dynamic programming.
Further, under the Markov assumption that each word in the sequence depends only on the preceding N words, the joint probability of a word sequence is modelled as Pr(w1 w2 ... wn) ≈ ∏_{i=1}^{n} Pr(wi | wi-N ... wi-1). The maximum probability value of the smallest sub-problem is then the maximum joint probability of the first N words, and the maximum probability value of the i-th sub-problem can be written in terms of the maximum probability values of the states of sub-problem i-1 through the state transition above. In other words, the maximum probability value of the i-th sub-problem can be decomposed into smaller sub-problems; because these sub-problems overlap, the state compression dynamic programming algorithm stores the optimal results of the sub-problems in a table, so that repeated computation is avoided and the time complexity of finding a globally optimal solution among the large number of potential combinations is reduced.
To illustrate the technical effect of the present application, the Brown word lexicon was tested with the method provided in one of the above embodiments. The Brown word lexicon is divided into four classes according to part of speech, specifically 30632 nouns, 10392 verbs, 8054 adjectives and 4331 words of other classes. Compared with encoding each word separately, the number of required code words is reduced from 53409 to 30632. Fig. 5 compares the performance of the method of the present application with the Huffman coding method; it can be seen that, when the words of the same corpus are encoded, the coding method of the present application achieves a shorter dynamic average code length than Huffman coding, and the gap widens as the number of characters to be encoded increases, which verifies the effectiveness of the algorithm.
Fig. 6 to fig. 8 show simulation results of the method of the present application with context window lengths of 3, 4 and 5, respectively. It can be seen that the semantic similarity of the method provided by the present application peaks and remains stable when the context window size is equal to or larger than the feature window size used in LSTM-based learning. The semantic similarity score increases as the context window grows, and it also increases as the feature window size grows.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, a source coding and decoding device based on context semantics is provided, including:
the encoding module is used for obtaining word frequency values of all words in a preset training corpus, ordering the words classified according to word characteristics in the training corpus in word classes according to the word frequency values to obtain ordering values in word classes of all the words, classifying the words with the same ordering values in the word classes into the same leaf node, obtaining a weighted value of the leaf node according to the sum of the word frequency values of the words corresponding to the leaf node, establishing an optimal binary tree model, distributing a non-repeated prefix code to the leaf node, and obtaining encoded data of a corpus to be encoded.
And the decoding module is used for obtaining a word sequence set corresponding to the encoded data according to the optimal binary tree model and obtaining corresponding decoding result data by using a preset context correlation model.
In one embodiment, the encoding module is configured to classify the words in the training corpus according to part of speech to obtain the corresponding part-of-speech classes, the part-of-speech classes including a noun class, a verb class, an adjective class, an adverb class and a conjunction class, and, within each part-of-speech class, to obtain the intra-class ranking value of each word in descending order of word frequency value.
In one embodiment, the encoding module is configured to obtain the two leaf nodes with the lowest current weight values (a first leaf node and a second leaf node), merge the first leaf node and the second leaf node to obtain a third leaf node, and obtain the weight value of the third leaf node from the sum of the weight values of the first leaf node and the second leaf node.
In one embodiment, the encoding module is configured to compare the weight values of the first leaf node and the second leaf node, obtain the label values of the first leaf node and the second leaf node respectively according to the comparison result, and obtain the non-repetitive prefix code of the first leaf node from the sequence of label values of all the nodes traversed from the root node to the first leaf node in the optimal binary tree model.
In one embodiment, the decoding module is configured to obtain the context semantic association features between the words in the training corpus, and to obtain, according to the context semantic association features, the word sequence with the highest joint occurrence probability value as the corresponding decoding result data.
In one embodiment, the decoding module is configured to learn the context semantic association features between the words in the training corpus with an LSTM-based neural network model.
In one embodiment, the encoding module is configured to model the joint probability distribution of the context word sequences with an N-gram model; when the length of the word sequence is less than the preset context window value, to obtain the word sequence with the highest joint occurrence probability value by enumeration according to the N-gram model; and when the length of the word sequence is greater than the preset context window value, to obtain the word sequence with the highest joint occurrence probability value with a state compression dynamic programming algorithm according to the N-gram model.
For specific limitation of a source coding and decoding device based on context semantics, refer to the above limitation on a source coding and decoding method based on context semantics, which is not described herein again. The modules in the source coding and decoding device based on the context semantics can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the optimal binary tree model and the context association model data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a source coding method based on context semantics.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, there is provided a computer device comprising a memory storing a computer program and a processor implementing the following steps when the processor executes the computer program:
at the encoding end:
and acquiring the word frequency value of each word in a preset training corpus.
And sequencing each word with part of speech classification in the training corpus in the part of speech according to the word frequency value to obtain the in-part-of-speech sequencing value of each word.
And classifying the words with the same rank value in the part of speech into the same leaf node, obtaining the weight value of the leaf node according to the sum of the word frequency values of the words corresponding to the leaf node, and establishing an optimal binary tree model.
And allocating non-repeated prefix codes to the leaf nodes to obtain the coded data of the linguistic data to be coded.
At the decoding end:
and obtaining a candidate word set corresponding to the coded data according to the optimal binary tree model.
And obtaining corresponding decoding result data by using a preset context correlation model.
In one embodiment, the processor, when executing the computer program, further implements the following steps: classifying the words in the training corpus according to part of speech to obtain the corresponding part-of-speech classes, the part-of-speech classes including a noun class, a verb class, an adjective class, an adverb class and a conjunction class; and, within each part-of-speech class, obtaining the intra-class ranking value of each word in descending order of word frequency value.
In one embodiment, the processor, when executing the computer program, further implements the following steps: acquiring the two leaf nodes with the lowest current weight values (a first leaf node and a second leaf node), merging the first leaf node and the second leaf node to obtain a third leaf node, and obtaining the weight value of the third leaf node from the sum of the weight values of the first leaf node and the second leaf node.
In one embodiment, the processor, when executing the computer program, further implements the following steps: comparing the weight values of the first leaf node and the second leaf node, obtaining the label values of the first leaf node and the second leaf node respectively according to the comparison result, and obtaining the non-repetitive prefix code of the first leaf node from the sequence of label values of all the nodes traversed from the root node to the first leaf node in the optimal binary tree model.
In one embodiment, the processor, when executing the computer program, further implements the following steps: obtaining the context semantic association features between the words in the training corpus, and obtaining, according to the context semantic association features, the word sequence with the highest joint occurrence probability value as the corresponding decoding result data.
In one embodiment, the processor, when executing the computer program, further implements the following step: learning the context semantic association features between the words in the training corpus with an LSTM-based neural network model.
In one embodiment, the processor, when executing the computer program, further implements the following steps: modeling the joint probability distribution of the context word sequences with an N-gram model; when the length of the word sequence is less than the preset context window value, obtaining the word sequence with the highest joint occurrence probability value by enumeration according to the N-gram model; and when the length of the word sequence is greater than the preset context window value, obtaining the word sequence with the highest joint occurrence probability value with a state compression dynamic programming algorithm according to the N-gram model.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
at the encoding end:
and acquiring the word frequency value of each word in a preset training corpus.
And ordering the words classified by parts of speech in the training corpus in the parts of speech according to the word frequency value to obtain the in-part-of-speech ordering value of each word.
And classifying the words with the same rank value in each part of speech into the same leaf node, obtaining the weight value of the leaf node according to the sum of the word frequency values of the words corresponding to the leaf node, and establishing an optimal binary tree model.
And allocating non-repeated prefix codes to the leaf nodes to obtain the coded data of the linguistic data to be coded.
At the decoding end:
and obtaining a word sequence set corresponding to the coded data according to the optimal binary tree model.
And obtaining corresponding decoding result data by using a preset context correlation model.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: classifying the words in the training corpus according to part of speech to obtain the corresponding part-of-speech classes, the part-of-speech classes including a noun class, a verb class, an adjective class, an adverb class and a conjunction class; and, within each part-of-speech class, arranging the words in descending order of word frequency value and obtaining the intra-class ranking value of each word from the resulting word order.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: acquiring the two leaf nodes with the lowest current weight values (a first leaf node and a second leaf node), merging the first leaf node and the second leaf node to obtain a third leaf node, and obtaining the weight value of the third leaf node from the sum of the weight values of the first leaf node and the second leaf node.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: comparing the weight values of the first leaf node and the second leaf node, obtaining the label values of the first leaf node and the second leaf node respectively according to the comparison result, and obtaining the non-repetitive prefix code of the first leaf node from the sequence of label values of all the nodes traversed from the root node to the first leaf node in the optimal binary tree model.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: obtaining the context semantic association features between the words in the training corpus, and obtaining, according to the context semantic association features, the word sequence with the highest joint occurrence probability value as the corresponding decoding result data.
In one embodiment, the computer program, when executed by the processor, further implements the following step: learning the context semantic association features between the context words in the training corpus with an LSTM-based neural network model.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: modeling the joint probability distribution of the context word sequences with an N-gram model; when the length of the word sequence is less than the preset context window value, obtaining the context word sequence with the highest joint occurrence probability value by enumeration according to the N-gram model; and when the length of the word sequence is greater than the preset value, obtaining the context word sequence with the highest joint occurrence probability value with a state compression dynamic programming algorithm according to the N-gram model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed it can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A source coding and decoding method based on context semantics, characterized in that the method comprises:
at the encoding end:
obtaining the word frequency value of each word in a preset training corpus;
sorting the part-of-speech-classified words in the training corpus within their part-of-speech classes according to the word frequency values, to obtain the intra-part-of-speech ranking value of each word;
assigning the words with equal intra-part-of-speech ranking values to the same leaf node, obtaining the weight value of each leaf node from the sum of the word frequency values of the words corresponding to that leaf node, and building an optimal binary tree model;
assigning a non-repetitive prefix code to each leaf node, to obtain the encoded data of the corpus to be encoded;
at the decoding end:
obtaining the word sequence set corresponding to the encoded data according to the optimal binary tree model;
processing the word sequence set with a preset context association model, to obtain the corresponding decoding result data.

2. The method according to claim 1, characterized in that the step of sorting the part-of-speech-classified words in the training corpus within their part-of-speech classes according to the word frequency values to obtain the intra-part-of-speech ranking value of each word comprises:
classifying the words in the training corpus by part of speech to obtain the corresponding part-of-speech classes, the part-of-speech classes comprising a noun class, a verb class, an adjective class, an adverb class and a conjunction class;
within each part-of-speech class, obtaining the intra-part-of-speech ranking value of each word in descending order of the word frequency values.

3. The method according to claim 1, characterized in that building the optimal binary tree model comprises:
obtaining the first leaf node and the second leaf node with the currently lowest weight values, and merging the first leaf node and the second leaf node to obtain a third leaf node;
obtaining the weight value of the third leaf node from the sum of the weight values of the first leaf node and the second leaf node.

4. The method according to claim 3, characterized in that assigning the non-repetitive prefix codes to the leaf nodes comprises:
comparing the weight values of the first leaf node and the second leaf node, and obtaining the label values of the first leaf node and the second leaf node respectively according to the comparison result;
obtaining the non-repetitive prefix code of the first leaf node according to the sequence of label values of the nodes traversed from the root node to the first leaf node in the optimal binary tree model.

5. The method according to claim 1, characterized in that processing the word sequence set with the preset context association model to obtain the corresponding decoding result data comprises:
obtaining the contextual semantic correlation features between the words in the training corpus;
obtaining, according to the contextual semantic correlation features, the word sequence with the highest joint occurrence probability value as the decoding result data.

6. The method according to claim 5, characterized in that obtaining the contextual semantic correlation features between the words in the training corpus comprises:
learning the contextual semantic correlation features between the words in the training corpus with an LSTM-based neural network model.

7. The method according to claim 6, characterized in that obtaining the word sequence with the highest joint occurrence probability value comprises:
modelling the joint probability distribution of the word sequences in the word sequence set with an N-gram model;
when the length of the word sequence is less than the preset value of the context window, obtaining the word sequence with the highest joint occurrence probability value by enumeration according to the N-gram model;
when the length of the word sequence is greater than the preset value of the context window, obtaining the word sequence with the highest joint occurrence probability value with a state-compression dynamic programming algorithm according to the N-gram model.

8. A source coding and decoding apparatus based on context semantics, characterized in that the apparatus comprises:
an encoding module, configured to obtain the word frequency value of each word in a preset training corpus, sort the part-of-speech-classified words in the training corpus within their part-of-speech classes according to the word frequency values to obtain the intra-part-of-speech ranking value of each word, assign the words with equal intra-part-of-speech ranking values to the same leaf node, obtain the weight value of each leaf node from the sum of the word frequency values of the words corresponding to that leaf node, build an optimal binary tree model, and assign a non-repetitive prefix code to each leaf node to obtain the encoded data of the corpus to be encoded;
a decoding module, configured to obtain the word sequence set corresponding to the encoded data according to the optimal binary tree model, and process the word sequence set with a preset context association model to obtain the corresponding decoding result data.

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
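In effect, claims 1 to 4 describe Huffman coding over in-class frequency ranks rather than over individual words: each word keeps only its rank inside its part-of-speech class, words sharing a rank share one leaf, and the leaf weights drive an optimal binary tree whose branch labels give the prefix-free codes. The Python sketch below is a minimal illustration of that reading, not the patented implementation; the toy corpus, the one-part-of-speech-per-word simplification and all helper names are assumptions.

import heapq
from collections import Counter, defaultdict

def build_leaves(tagged_corpus):
    # tagged_corpus: list of (word, part_of_speech) pairs from the training corpus
    freq = Counter(word for word, _ in tagged_corpus)          # word frequency values
    pos_of = {word: pos for word, pos in tagged_corpus}        # assume one POS per word
    by_pos = defaultdict(list)
    for word in freq:
        by_pos[pos_of[word]].append(word)
    rank_of, leaf_weight = {}, Counter()
    for pos, words in by_pos.items():
        for rank, word in enumerate(sorted(words, key=freq.get, reverse=True)):
            rank_of[word] = rank                               # intra-part-of-speech ranking value
            leaf_weight[rank] += freq[word]                    # leaf weight = summed word frequency
    return rank_of, leaf_weight

def huffman_codes(leaf_weight):
    # Repeatedly merge the two lowest-weight nodes (claim 3); the lighter group
    # inherits a leading 0 and the heavier a leading 1, mirroring the weight
    # comparison in claim 4, which yields prefix-free leaf codes.
    heap = [(weight, [leaf]) for leaf, weight in leaf_weight.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                                         # degenerate single-leaf case
        return {heap[0][1][0]: "0"}
    codes = {leaf: "" for leaf in leaf_weight}
    while len(heap) > 1:
        w1, group1 = heapq.heappop(heap)
        w2, group2 = heapq.heappop(heap)
        for leaf in group1:
            codes[leaf] = "0" + codes[leaf]
        for leaf in group2:
            codes[leaf] = "1" + codes[leaf]
        heapq.heappush(heap, (w1 + w2, group1 + group2))
    return codes

# Toy usage: encode a short word sequence as the concatenation of its leaf codes.
corpus = [("dog", "noun"), ("runs", "verb"), ("cat", "noun"),
          ("dog", "noun"), ("fast", "adverb")]
rank_of, leaf_weight = build_leaves(corpus)
codes = huffman_codes(leaf_weight)
bitstring = "".join(codes[rank_of[word]] for word in ["dog", "runs", "fast"])

Because every word that shares an in-class rank receives the same code, the code table stays small but each decoded leaf is ambiguous at the word level; resolving which word was meant is left to the context model of claims 5 to 7.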
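Claims 5 and 7 recover the intended words by scoring every expansion of the decoded leaf sequence with a language model and keeping the sequence with the highest joint probability, switching from plain enumeration to a state-compression dynamic program once the sequence exceeds the context window. The sketch below is a simplified illustration under assumed choices: a bigram (N = 2) model with add-one smoothing stands in for the N-gram model, a Viterbi-style dynamic program over the previous word stands in for the state-compression algorithm, and candidates[i] is assumed to be the list of words sharing the i-th decoded leaf's in-class rank.

import itertools
import math

def bigram_logprob(prev, word, bigram_counts, unigram_counts, vocab_size):
    # Add-one smoothed bigram log-probability; counts are collections.Counter
    # objects built from the training corpus (the smoothing choice is an assumption).
    return math.log((bigram_counts[(prev, word)] + 1.0) /
                    (unigram_counts[prev] + vocab_size))

def best_sequence(candidates, bigram_counts, unigram_counts, vocab_size, window=4):
    # candidates[i]: list of words sharing the i-th decoded leaf's in-class rank.
    def score_pair(prev, word):
        return bigram_logprob(prev, word, bigram_counts, unigram_counts, vocab_size)

    if len(candidates) <= window:
        # Enumeration branch (claim 7, sequence shorter than the context window).
        best, best_score = None, float("-inf")
        for seq in itertools.product(*candidates):
            score = sum(score_pair(p, w) for p, w in zip(("<s>",) + seq, seq))
            if score > best_score:
                best, best_score = seq, score
        return list(best)

    # Dynamic-programming branch (claim 7, longer sequences): for every word at the
    # current position keep only the best score over all choices of the previous word.
    scores = {w: score_pair("<s>", w) for w in candidates[0]}
    backptr = [{}]
    for position in range(1, len(candidates)):
        new_scores, new_back = {}, {}
        for w in candidates[position]:
            prev = max(scores, key=lambda p: scores[p] + score_pair(p, w))
            new_scores[w] = scores[prev] + score_pair(prev, w)
            new_back[w] = prev
        scores = new_scores
        backptr.append(new_back)

    word = max(scores, key=scores.get)                         # best final word
    decoded = [word]
    for position in range(len(candidates) - 1, 0, -1):         # trace the path back
        word = backptr[position][word]
        decoded.append(word)
    return list(reversed(decoded))

With a bigram model the dynamic program only has to remember the best score per current word; for larger N the state grows to the last N-1 words, which is where the state compression named in claim 7 becomes relevant.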
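Claim 6 obtains the contextual semantic correlation features with an LSTM-based neural network, which the claims do not further specify. The following PyTorch sketch shows one plausible form, a next-word language model whose sequence log-probability could replace the bigram scorer above when ranking candidate sequences; the layer sizes, vocabulary handling and scoring interface are assumptions.

import torch
import torch.nn as nn

class LSTMContextModel(nn.Module):
    # Learns contextual associations between words as a next-word language model;
    # embedding and hidden dimensions are illustrative assumptions.
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word ids -> next-word logits per position
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.out(hidden)

def sequence_logprob(model, token_ids):
    # Joint log-probability of one candidate word sequence (shape (1, seq_len)),
    # usable as the joint occurrence probability when comparing candidate sequences.
    with torch.no_grad():
        logits = model(token_ids[:, :-1])
        logp = torch.log_softmax(logits, dim=-1)
        targets = token_ids[:, 1:]
        return logp.gather(-1, targets.unsqueeze(-1)).sum().item()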
CN202110206745.3A 2021-02-24 2021-02-24 Information source coding and decoding method and device based on context semantics Active CN112836506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110206745.3A CN112836506B (en) 2021-02-24 2021-02-24 Information source coding and decoding method and device based on context semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110206745.3A CN112836506B (en) 2021-02-24 2021-02-24 Information source coding and decoding method and device based on context semantics

Publications (2)

Publication Number Publication Date
CN112836506A true CN112836506A (en) 2021-05-25
CN112836506B CN112836506B (en) 2024-06-28

Family

ID=75933215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110206745.3A Active CN112836506B (en) 2021-02-24 2021-02-24 Information source coding and decoding method and device based on context semantics

Country Status (1)

Country Link
CN (1) CN112836506B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6230168B1 (en) * 1997-11-26 2001-05-08 International Business Machines Corp. Method for automatically constructing contexts in a hypertext collection
US20070061720A1 (en) * 2005-08-29 2007-03-15 Kriger Joshua K System, device, and method for conveying information using a rapid serial presentation technique
CN108280064A (en) * 2018-02-28 2018-07-13 北京理工大学 Participle, part-of-speech tagging, Entity recognition and the combination treatment method of syntactic analysis
CN108733653A (en) * 2018-05-18 2018-11-02 华中科技大学 A kind of sentiment analysis method of the Skip-gram models based on fusion part of speech and semantic information
CN109376235A (en) * 2018-07-24 2019-02-22 西安理工大学 Feature selection method based on document-level word frequency reordering
CN109858020A (en) * 2018-12-29 2019-06-07 航天信息股份有限公司 A kind of method and system obtaining taxation informatization problem answers based on grapheme
CN111125349A (en) * 2019-12-17 2020-05-08 辽宁大学 Graph model text abstract generation method based on word frequency and semantics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
金瑜, 陆启明, 高峰: "Maximum-probability automatic Chinese word segmentation algorithm based on context", Computer Engineering (计算机工程), no. 16, 5 April 2005 (2005-04-05), pages 146-148 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114519346A (en) * 2022-02-11 2022-05-20 中国人民解放军国防科技大学 Decoding processing method, device, equipment and medium based on language model
CN114328939A (en) * 2022-03-17 2022-04-12 天津思睿信息技术有限公司 Natural language processing model construction method based on big data
CN114328939B (en) * 2022-03-17 2022-05-27 天津思睿信息技术有限公司 Natural language processing model construction method based on big data
CN115146125A (en) * 2022-05-27 2022-10-04 北京科技大学 Method and device for data filtering at receiver end in semantic communication multiple access scenario
CN115146125B (en) * 2022-05-27 2023-02-03 北京科技大学 Receiving end data filtering method and device under semantic communication multi-address access scene
CN115883018A (en) * 2022-11-03 2023-03-31 北京邮电大学 semantic communication system
CN115955297A (en) * 2023-03-14 2023-04-11 中国人民解放军国防科技大学 Semantic coding method, semantic coding device, semantic decoding method and device
WO2024234673A1 (en) * 2023-05-18 2024-11-21 中兴通讯股份有限公司 Vector matrix determination method and system, storage medium, and electronic device

Also Published As

Publication number Publication date
CN112836506B (en) 2024-06-28

Similar Documents

Publication Publication Date Title
CN112836506B (en) Information source coding and decoding method and device based on context semantics
CN108986908B (en) Method and device for processing inquiry data, computer equipment and storage medium
US11314939B2 (en) Method and apparatus for performing hierarchiacal entity classification
US20210200961A1 (en) Context-based multi-turn dialogue method and storage medium
US12380142B2 (en) Sequenced data processing method and device, and text processing method and device
EP3912042B1 (en) A deep learning model for learning program embeddings
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN110598206A (en) Text semantic recognition method and device, computer equipment and storage medium
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
JP2020520492A (en) Document abstract automatic extraction method, device, computer device and storage medium
CN111226222A (en) Deep Context-Based Grammatical Error Correction Using Artificial Neural Networks
CN112101042B (en) Text emotion recognition method, device, terminal equipment and storage medium
CN113761868A (en) Text processing method and device, electronic equipment and readable storage medium
CN112686306B (en) ICD operation classification automatic matching method and system based on graph neural network
WO2021223882A1 (en) Prediction explanation in machine learning classifiers
CN112000777A (en) Text generation method and device, computer equipment and storage medium
CN113918696A (en) Question-answer matching method, device, equipment and medium based on K-means clustering algorithm
US20220374426A1 (en) Semantic reasoning for tabular question answering
CN114090747A (en) Automatic question answering method, device, equipment and medium based on multiple semantic matching
CN111737406B (en) Text retrieval method, device and equipment and training method of text retrieval model
CN111507108A (en) Alias generation method and device, electronic equipment and computer readable storage medium
US20220138425A1 (en) Acronym definition network
CN115062619B (en) Chinese entity linking method, device, equipment and storage medium
Chen et al. Improving the prediction of therapist behaviors in addiction counseling by exploiting class confusions
CN111191439A (en) Natural sentence generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant