CN106598999B - Method and device for calculating text theme attribution degree
- Publication number: CN106598999B
- Application number: CN201510680602.0A
- Authority: CN (China)
- Legal status: Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
Abstract
The invention discloses a method and a device for calculating the topic attribution degree of a text, relates to the field of computer technology, and solves the problem that attribution-degree calculation suffers large errors when topic keywords appearing in a text are unrelated to the text's actual topic. The main technical scheme of the invention is as follows: a corresponding topic model with a tree structure is selected according to the service type, wherein the nodes in the topic model divide topic keywords into categories, each node in the topic model contains at least one topic keyword, and each node is provided with a node weight value; the text to be tested is split into sentences to obtain a sentence list; the number of sentences of the text to be tested contained in each node of the topic model is counted according to the topic keywords of each node and the sentence list; and the topic attribution degree of the text to be tested is calculated according to the node weight value and the sentence count of each node in the topic model. The method is mainly used for calculating the topic attribution degree of a text.
Description
Technical Field
The invention relates to the field of computer technology, and in particular to a method and a device for calculating the topic attribution degree of a text.
Background
In the context of big data, extracting relevant information is an important problem. Information extraction techniques do not attempt to understand an entire document completely; instead, they analyze the portions of the document that contain relevant information, and the topic expressed by an article is determined by extracting its characteristic keywords.
Most existing information extraction algorithms decide whether the content of an article belongs to a certain topic simply by checking whether the article contains feature keywords related to that topic. Although treating keyword occurrence as the feature can capture the relevant information in an article fairly comprehensively, the extracted information may contain a lot of noise, because not every word in an article is closely related to its topic. As a result, the final judgment of the topic expressed by the article may even be the opposite of the true one, which introduces large errors into subsequent analysis.
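By way of illustration only (the function name and keywords below are editorial examples, not part of any cited scheme), this keyword-presence approach amounts to a single boolean test per topic:

```python
# Naive prior-art style baseline: an article "belongs" to a topic as soon as
# any topic keyword occurs anywhere in it -- the binary judgment criticized above.
def belongs_to_topic(article: str, topic_keywords: list[str]) -> bool:
    return any(keyword in article for keyword in topic_keywords)

print(belongs_to_topic("Ticket prices at the scenic spot rose this year.",
                       ["scenic spot", "hotel", "tourist"]))  # True
```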
Disclosure of Invention
In view of this, the present invention provides a method and a device for calculating the topic attribution degree of a text. Its main purpose is to classify topic keywords and assign them weights through a preset topic model, so that the topic attribution degree of the text to be tested can be calculated comprehensively and the accuracy of the judgment is improved.
In order to achieve the above purpose, the invention mainly provides the following technical solutions:
In one aspect, the invention provides a method for calculating the topic attribution degree of a text, the method comprising:
selecting a corresponding topic model with a tree structure according to the service type, wherein the nodes in the topic model are used for dividing topic keywords into categories, each node in the topic model contains at least one topic keyword, each node is provided with a node weight value, and the node weight value is used for expressing the degree of correlation between the node and its parent node;
splitting the text to be tested into sentences to obtain a sentence list;
counting the number of sentences of the text to be tested contained in each node in the topic model according to the topic keywords of each node in the topic model and the sentence list;
and calculating the topic attribution degree of the text to be tested according to the node weight value and the sentence count of each node in the topic model.
In another aspect, the invention also provides a device for calculating the topic attribution degree of a text, the device comprising:
a selecting unit, configured to select a corresponding topic model with a tree structure according to the service type, wherein the nodes in the topic model are used for dividing topic keywords into categories, each node in the topic model contains at least one topic keyword, each node is provided with a node weight value, and the node weight value is used for expressing the degree of correlation between the node and its parent node;
a sentence segmentation unit, configured to split the text to be tested into sentences to obtain a sentence list;
a statistics unit, configured to count the number of sentences of the text to be tested contained in each node in the topic model according to the topic keywords of each node in the topic model selected by the selecting unit and the sentence list obtained by the sentence segmentation unit;
and a calculating unit, configured to calculate the topic attribution degree of the text to be tested according to the node weight value of each node in the topic model and the sentence counts obtained by the statistics unit.
According to the method and the device for calculating the topic attribution degree of a text provided by the invention, the topic attribution degree of the text to be tested is calculated with a preset topic model: in the topic model, topic keywords are classified into categories, nodes are created according to these categories and the relationships among them, and different weight values are set for the nodes. When the topic attribution degree of the text to be tested is calculated, the text is first split into sentences; the node corresponding to each sentence, and thus the relevant node weight value, is determined according to the topic keywords the sentence contains; the number of sentences falling into each node is counted and, using the structure of the topic model, the sentence count attributed to the root node is computed; the ratio of this count to the total number of sentences in the text to be tested is the topic attribution degree of the text with respect to the topic model. Compared with existing methods of calculating topic attribution, classifying the topic keywords in a topic model and assigning them different weight values refines the degree of correlation between each topic keyword and the test topic, and matching the model against the text to be tested weighs all the keywords the text contains, so the calculated attribution degree depends both on the weight values of the topic keywords and on how often they appear in the text, which improves the accuracy of the calculation. In addition, the result is a probability value, which overcomes the overly absolute result of existing binary (yes-or-no) methods; expressing the degree of correlation between the text under test and the test topic as a probability is more intuitive and clear.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart of a method for calculating the topic attribution degree of a text according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for calculating the topic attribution degree of a text according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a topic model structure proposed by an embodiment of the invention;
FIG. 4 is a block diagram of a first apparatus for calculating the topic attribution degree of a text according to an embodiment of the present invention;
FIG. 5 is a block diagram of a second apparatus for calculating the topic attribution degree of a text according to an embodiment of the present invention;
FIG. 6 is a block diagram of a third apparatus for calculating the topic attribution degree of a text according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The embodiment of the invention provides a method for calculating the topic attribution degree of a text which, as shown in FIG. 1, comprises the following specific steps:
101. Select a corresponding topic model with a tree structure according to the service type.
When calculating the topic attribution degree of a text, a predetermined topic and a number of topic keywords related to that topic are usually given, and the relevance of the text to the predetermined topic is judged by checking whether the text contains those topic keywords. The categories of predetermined topics are distinguished according to industries, disciplines or business scopes. In this embodiment, topics are divided according to service types, and for each service type a corresponding topic model can be selected to test the text to be tested and compute its correlation with the topic.
The topic model is built as a tree structure, i.e. a data structure with one-to-many relationships among data elements. Under this structure, the topic model contains a number of nodes that diverge outward from a root node, and the relationship between a parent node and its child nodes is one of inclusion and subordination. Each node in the topic model contains at least one topic keyword, and the correspondence between topic keywords and nodes is determined by the classification of the topic keywords and the inclusion relationships among the categories.
In addition to containing the corresponding topic keywords, each node in the topic model of this embodiment is also provided with a node weight value that indicates its degree of correlation. It should be noted that the node weight value in the topic model is a relative weight value, i.e. the weight value of a node is defined relative to its parent node. For example, if node 1 has two child nodes, node 2 and node 3, then the node weight values of node 2 and node 3 are both set relative to node 1; if the total weight of node 1 is defined as 1, the sum of the weight values of node 2 and node 3 is 1, and the two values can be set freely relative to each other. In this way, the service scope covered by the topic model can be expanded by adjusting node weight values and adding related topic keywords, so that the calculation of the topic attribution degree becomes more comprehensive.
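As a purely illustrative sketch of such a structure (the Python function and field names are editorial examples, not part of the invention), each node can be represented as a record holding its topic keywords, its weight relative to its parent, and its child nodes:

```python
# One node of a tree-structured topic model: a set of topic keywords, a weight
# relative to the parent node, and a list of child nodes of the same shape.
def make_node(name, keywords, weight, children=None):
    return {"name": name, "keywords": set(keywords), "weight": weight,
            "children": children or []}

# Mirrors the node 1 / node 2 / node 3 example: the sibling weights 0.6 and 0.4
# sum to 1 relative to their parent, whose own total weight is taken as 1.
node2 = make_node("node 2", ["scenic spot"], weight=0.6)
node3 = make_node("node 3", ["hotel"], weight=0.4)
node1 = make_node("node 1", ["travel"], weight=1.0, children=[node2, node3])
```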
102. Split the text to be tested into sentences to obtain a sentence list.
After the topic model is selected, the attribution degree of the text to be tested with respect to the related topic can be calculated. During the calculation, the text to be tested is first split into sentences to obtain a sentence list. Compared with the word segmentation processing used in the prior art, splitting the text to be tested into sentences is simpler to implement and faster to execute. Moreover, word segmentation of Chinese text suffers from inaccurate segmentation, whereas sentence splitting only needs to follow fixed punctuation characters and can be performed accurately. Sentence splitting is therefore simpler and more efficient than word segmentation.
The number of sentences in the sentence list obtained after splitting is also counted, for use in the subsequent calculation of the topic attribution degree.
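A minimal sketch of this splitting step, assuming fixed Chinese and Western sentence-ending punctuation as delimiters (the punctuation set and function name are illustrative choices):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split on fixed sentence-ending punctuation and drop empty fragments;
    # no word segmentation is required.
    parts = re.split(r"[。！？!?.;；\n]+", text)
    return [p.strip() for p in parts if p.strip()]

sentence_list = split_sentences("颐和园的门票涨价了。游客却没有减少！")
print(len(sentence_list))  # 2 sentences to be matched against the topic model
```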
103. Count the number of sentences of the text to be tested contained in each node in the topic model, according to the topic keywords of each node in the topic model and the sentence list.
After the text to be tested has been split into sentences, the sentences in the sentence list are fed one by one into the selected topic model and matched against the topic keywords in the model, to check whether each sentence contains a topic keyword. If it does, the node where that keyword is located is determined and the counter of that node is incremented by 1. The counter records how many sentences of the text to be tested fall into the node: whenever a sentence contains a topic keyword of the node, the sentence is recorded at that node, i.e. the counter is incremented by 1.
It should be noted that after a sentence from the sentence list is fed into the topic model, it is matched to only one node; a sentence cannot be recorded repeatedly. That is, when a sentence contains several topic keywords, the topic model determines a main keyword among them to fix the correspondence between the sentence and a node, and increments the counter of that node by 1.
Through these steps, every sentence in the sentence list that contains a keyword is matched to a unique node in the topic model, so the distribution of the sentences of the text to be tested over the nodes of the topic model can be examined.
104. Calculate the topic attribution degree of the text to be tested according to the node weight value and the sentence count of each node in the topic model.
After the number of sentences of the text to be tested at each node of the topic model has been counted, the sentence count of a node can be converted into a sentence count contribution at its parent node according to the relative node weight values; by applying this step recursively, the number of sentences of the text to be tested attributed to the root node of the topic model is obtained. The topic attribution degree of the text to be tested with respect to the topic model is then this count divided by the total number of sentences in the sentence list.
Further, as an extension of this calculation, the topic model can be split: a parent node together with its child nodes and all nodes below them can form a separate topic model, used to calculate the attribution degree of the text to be tested with respect to that parent node. For this reason, a topic name can be specified for each node when the topic model is created, so that the topic attribution degrees of several related topics can be calculated within the same topic model as required.
It can be seen from the above implementation that, in the method for calculating the topic attribution degree of a text adopted by the embodiment of the present invention, the topic attribution degree of the text to be tested is calculated with a preset topic model: topic keywords are classified into categories, nodes are created in the topic model according to these categories and the relationships among them, and different weight values are set for the nodes. When the topic attribution degree of the text to be tested is calculated, the text is first split into sentences; the node corresponding to each sentence, and thus the relevant node weight value, is determined according to the topic keywords the sentence contains; the number of sentences falling into each node is counted and, using the structure of the topic model, the sentence count attributed to the root node is computed; the ratio of this count to the total number of sentences in the text to be tested is the topic attribution degree of the text with respect to the topic model. Compared with existing calculation methods, classifying the topic keywords in a topic model and assigning them different weight values refines the degree of correlation between each topic keyword and the test topic, and matching the model against the text to be tested weighs all the keywords the text contains, so the calculated attribution degree depends both on the weight values of the topic keywords and on how often they appear in the text, which improves accuracy. In addition, the result is a probability value, which overcomes the overly absolute result of existing binary methods; expressing the degree of correlation between the text under test and the test topic as a probability is more intuitive and clear.
In order to describe the method for calculating the topic attribution degree of a text in more detail, the embodiment of the present invention is explained below through a specific implementation. As shown in FIG. 2, the method comprises the following steps when calculating the topic attribution degree of a text:
201. Create a topic model with a tree structure.
As described in step 101 above, different services have different topics. To create a topic model, the related topic keywords are first obtained according to the business scope to which the topic belongs, and a topic model with a tree structure is then created according to the classification of those topic keywords. In this embodiment, a topic model is created for the topic "travel" as an example, as shown in FIG. 3. First, topic keywords related to travel are obtained, including: scenic spot, destination, hotel, tourist, ticket price, and so on. Then "travel" is taken as the root node of the model, with child nodes such as: scenic spot, hotel, tourist; the "scenic spot" node in turn has child nodes such as: scene name, consumption. After the nodes in the topic model have been set up, the obtained topic keywords are assigned to the corresponding nodes so that every node contains at least one topic keyword; at this point the main framework of the topic model is in place. Finally, a corresponding node weight value must be set for each node in the topic model. Note that the node weight value expresses the degree of correlation between a node and its parent node, not between the node and the topic; that is, it is a weight relative to the parent node, not an absolute weight with respect to the topic.
It should be noted that node weight values may be assigned automatically by a computer according to a certain algorithm, or set manually according to experience; the specific way of setting them is not limited in this embodiment.
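By way of illustration, the travel model of FIG. 3 might be laid out as follows in the node representation sketched earlier; the keyword sets and weight values are example choices only, since the patent leaves both to the modeler:

```python
# Illustrative travel topic model (after FIG. 3). Sibling weights are chosen to
# sum to 1 relative to their parent; real values depend on the business need.
travel_model = {
    "name": "travel", "keywords": {"travel"}, "weight": 1.0, "children": [
        {"name": "scenic spot", "keywords": {"scenic spot", "destination"}, "weight": 0.5,
         "children": [
             {"name": "scene name", "keywords": {"Yihe Garden"}, "weight": 0.6, "children": []},
             {"name": "consumption", "keywords": {"ticket price"}, "weight": 0.4, "children": []},
         ]},
        {"name": "hotel", "keywords": {"hotel"}, "weight": 0.2, "children": []},
        {"name": "tourist", "keywords": {"tourist"}, "weight": 0.3, "children": []},
    ],
}
```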
202. Select a corresponding topic model with a tree structure according to the service type.
Before the attribution degree is calculated, a topic model belonging to the topic to be tested is selected. The model may be chosen by a computer according to a specific algorithm, or a specific topic model may be designated manually; this embodiment does not limit the specific way of selecting it.
203. Split the text to be tested into sentences to obtain a sentence list.
This step is the same as step 102 above; for details, refer to step 102, and they are not repeated here.
204. Count the number of sentences of the text to be tested contained in each node in the topic model, according to the topic keywords of each node in the topic model and the sentence list.
To calculate the topic attribution degree of the text to be tested with the topic model, each sentence of the text to be tested is first fed into the topic model, and it is judged whether the sentence contains any of the topic keywords in the model. In one specific implementation, the sentence is first segmented into words, and the resulting words are matched one by one against all the topic keywords in the topic model. Alternatively, each topic keyword can be compared against the sentence directly to judge whether the sentence contains it. Both ways are already widely used in the prior art, so their implementation details are not described again in this embodiment.
Next, the judgment determines whether the sentence contains a topic keyword. When it contains one, the topic model determines the node where that topic keyword is located and adds 1 to the sentence count recorded at that node. When the judgment shows that the sentence contains several topic keywords, the topic model first determines the nodes where those keywords are located, selects one of them according to the positions of the nodes, and updates the sentence count of the selected node. The selection works as follows: the positions of the nodes containing the topic keywords are examined; if all the topic keywords lie in a single node, that node is taken as the node of the sentence. If the topic keywords belong to different nodes, it is further judged whether those nodes are child nodes of the same parent node. If they are, the node with the larger node weight value is selected as the node of the sentence, because among nodes at the same level a larger weight value means a higher correlation with the parent node and the root node; if they are not, the node closest to the root node is selected, because among nodes at different levels the one nearer the root node is more strongly correlated with the topic. Taking the topic model shown in FIG. 3 as an example, when a sentence contains the keywords "Yihe Garden" and "ticket price", the sentence is counted at whichever of the two nodes has the larger node weight value; when a sentence contains the keywords "Yihe Garden" and "tourist", the sentence is counted at the node where "tourist" is located.
This matching rule avoids counting a sentence repeatedly when it contains several keywords, so that every sentence in the sentence list that contains a topic keyword corresponds to a unique node in the topic model.
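A minimal sketch of this selection rule (the node layout, keywords and weights are illustrative; the rule itself follows the two cases above):

```python
# Pick the single node a sentence is counted at: one matching node -> that node;
# several matching sibling nodes -> the one with the larger weight; nodes on
# different levels -> the one closest to the root.
def flatten(node, parent=None, depth=0):
    yield node, parent, depth
    for child in node["children"]:
        yield from flatten(child, node, depth + 1)

def select_node(sentence, model):
    hits = [(n, p, d) for n, p, d in flatten(model)
            if any(k in sentence for k in n["keywords"])]
    if not hits:
        return None
    if len(hits) == 1:
        return hits[0][0]
    if len({id(p) for _, p, _ in hits}) == 1:          # all children of one parent
        return max(hits, key=lambda h: h[0]["weight"])[0]
    return min(hits, key=lambda h: h[2])[0]            # otherwise: closest to root

model = {"name": "travel", "keywords": set(), "weight": 1.0, "children": [
    {"name": "scenic spot", "keywords": set(), "weight": 0.5, "children": [
        {"name": "scene name", "keywords": {"Yihe Garden"}, "weight": 0.6, "children": []},
        {"name": "consumption", "keywords": {"ticket price"}, "weight": 0.4, "children": []}]},
    {"name": "tourist", "keywords": {"tourist"}, "weight": 0.3, "children": []}]}

print(select_node("Yihe Garden raised its ticket price", model)["name"])    # scene name
print(select_node("Yihe Garden was crowded with tourists", model)["name"])  # tourist
```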
205. Calculate the topic attribution degree of the text to be tested according to the node weight value and the sentence count of each node in the topic model.
After the sentence count recorded at each node in the topic model has been determined, the sentence count of each node can be converted to its parent node by combining it with the node weight values: the total sentence count of a parent node is the sum of its own sentence count and the converted counts of all of its child nodes. The specific calculation formula is as follows:
Fre_j = sentFre_j + Σ_{i ∈ children(J)} Weight_i · Fre_i
wherein Fre_j is the total sentence count of node J, sentFre_j is the number of sentences counted directly at node J, I ranges over the child nodes of node J, Weight_i is the node weight value of child node I, Fre_i is the total sentence count of node I computed in the same way, and Weight_j, the node weight value of node J itself, is used in the same manner when the total of node J is converted in turn to its parent node.
Using this formula, the total sentence count of the root node in the topic model is calculated, and the ratio of this total to the total number of sentences is defined as the topic attribution degree of the text to be tested with respect to the topic model. The value of the topic attribution degree is a probability value that represents how similar the topic content or central idea expressed by the text to be tested is to the topic specified by the topic model. By analyzing the text to be tested against nodes at different levels of the topic model and their different weight values, the degree of correlation between the text and the topic model is evaluated comprehensively, which greatly improves the accuracy of the judgment.
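A minimal sketch of this roll-up, assuming the reconstruction of the formula given above; the per-node counts and weights are made-up example values:

```python
# A node's total sentence count is its own counter plus the weighted totals of
# its children; the attribution degree is the root's total divided by the
# number of sentences in the sentence list.
def rolled_up_count(node):
    return node["count"] + sum(child["weight"] * rolled_up_count(child)
                               for child in node["children"])

def attribution_degree(root, total_sentences):
    return rolled_up_count(root) / total_sentences if total_sentences else 0.0

# Example tree with per-node sentence counters already filled in.
root = {"count": 1, "weight": 1.0, "children": [
    {"count": 4, "weight": 0.5, "children": [
        {"count": 6, "weight": 0.6, "children": []},
        {"count": 2, "weight": 0.4, "children": []}]},
    {"count": 3, "weight": 0.3, "children": []}]}

print(round(attribution_degree(root, total_sentences=20), 3))  # 0.305
```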
Further, as an implementation of the above method, an embodiment of the present invention also provides a device for calculating the topic attribution degree of a text, as shown in FIG. 4. This device embodiment corresponds to the method embodiment above; for ease of reading, details already described in the method embodiment are not repeated here, but it should be clear that the device of this embodiment can implement all the contents of the method embodiment. The device includes:
a selecting unit 41, configured to select a corresponding topic model with a tree structure according to the service type, wherein the nodes in the topic model are used for dividing topic keywords into categories, each node in the topic model contains at least one topic keyword, and each node is provided with a node weight value that expresses the degree of correlation between the node and its parent node;
a sentence segmentation unit 42, configured to split the text to be tested into sentences to obtain a sentence list;
a statistics unit 43, configured to count the number of sentences of the text to be tested contained in each node in the topic model according to the topic keywords of each node in the topic model selected by the selecting unit 41 and the sentence list obtained by the sentence segmentation unit 42;
and a calculating unit 44, configured to calculate the topic attribution degree of the text to be tested according to the node weight value of each node in the topic model and the sentence counts obtained by the statistics unit 43.
Further, as shown in fig. 5, the apparatus further includes:
an obtaining unit 45, configured to obtain the corresponding topic keywords according to the service type before the selecting unit 41 selects the corresponding topic model with a tree structure according to the service type;
a creating unit 46, configured to create a topic model with a tree structure according to the classification of the topic keywords obtained by the obtaining unit 45;
and a setting unit 47, configured to set the node weight value of each node relative to its parent node according to the degree of correlation between the node and its parent node in the topic model created by the creating unit 46.
Further, as shown in fig. 6, the statistics unit 43 includes:
a judging module 431, configured to judge whether a sentence in the sentence list contains a topic keyword in the topic model;
a determining module 432, configured to determine the node in the topic model where the topic keyword is located when the judging module 431 judges that the sentence contains a topic keyword;
and a counting module 433, configured to count the sentence into the number of sentences contained in the node determined by the determining module 432 and update that count.
Further, as shown in fig. 6, the determining module 432 includes:
a judging sub-module 4321, configured to judge, when the sentence contains topic keywords of a plurality of different nodes, whether the plurality of different nodes are child nodes of the same parent node;
a selecting sub-module 4322, configured to select, when the judgment result of the judging sub-module 4321 is yes, the node with the larger node weight value as the node where the sentence is located;
the selecting sub-module 4322 is further configured to select, when the judgment result of the judging sub-module 4321 is no, the node closest to the root node as the node where the sentence is located.
Further, as shown in fig. 6, the determining module 431 includes:
a word segmentation sub-module 4311, configured to perform word segmentation on the sentence;
a matching sub-module 4312, configured to match the words obtained by the word segmentation sub-module 4311 one by one with the topic keywords in the topic model.
Further, as shown in fig. 6, the calculation unit 44 of the apparatus includes:
a conversion module 441, configured to convert the sentence count of each child node into a sentence count at its parent node according to the node weight value of each node;
and a calculating module 442, configured to calculate the sentence count of the root node in the topic model using a recursive algorithm, and then calculate the quotient of the sentence count of the root node and the number of sentences in the sentence list, to obtain the topic attribution degree of the text to be tested with respect to the topic model.
In summary, the method and the device for calculating the topic attribution degree of a text according to the embodiments of the present invention calculate the topic attribution degree of the text to be tested with a preset topic model: topic keywords are classified into categories, nodes are created in the topic model according to these categories and the relationships among them, and different weight values are set for the nodes. When the topic attribution degree of the text to be tested is calculated, the text is first split into sentences; the node corresponding to each sentence, and thus the relevant node weight value, is determined according to the topic keywords the sentence contains; the number of sentences falling into each node is counted and, using the structure of the topic model, the sentence count attributed to the root node is computed; the ratio of this count to the total number of sentences in the text to be tested is the topic attribution degree of the text with respect to the topic model. Compared with existing calculation methods, classifying the topic keywords in a topic model and assigning them different weight values refines the degree of correlation between each topic keyword and the test topic, and matching the model against the text to be tested weighs all the keywords the text contains, so the calculated attribution degree depends both on the weight values of the topic keywords and on how often they appear in the text, which improves accuracy. In addition, the result is a probability value, which overcomes the overly absolute result of existing binary methods; expressing the degree of correlation between the text under test and the test topic as a probability is more intuitive and clear.
The device for calculating the topic attribution degree of a text comprises a processor and a memory. The selecting unit, the sentence segmentation unit, the statistics unit, the calculating unit and so on are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel calls the corresponding program units from the memory. One or more kernels may be provided; by adjusting kernel parameters, the topic attribution degree of the text under test with respect to the topic model is calculated, which improves the accuracy of the topic attribution judgment.
The memory may include volatile memory in the form of a computer readable medium, random access memory (RAM), and/or non-volatile memory such as read only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps: selecting a corresponding topic model with a tree structure according to the service type, wherein the nodes in the topic model are used for dividing topic keywords into categories, each node in the topic model contains at least one topic keyword, each node is provided with a node weight value, and the node weight value is used for expressing the degree of correlation between the node and its parent node; splitting the text to be tested into sentences to obtain a sentence list; counting the number of sentences of the text to be tested contained in each node in the topic model according to the topic keywords of each node in the topic model and the sentence list; and calculating the topic attribution degree of the text to be tested according to the node weight value and the sentence count of each node in the topic model.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory media such as modulated data signals and carrier waves.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (10)
1. A method for calculating attribution of a text topic, the method comprising:
selecting a corresponding topic model with a tree structure according to a service type, wherein nodes in the topic model are used for dividing topic keywords into categories, each node in the topic model contains at least one topic keyword, each node is provided with a node weight value, and the node weight value is used for expressing the degree of correlation between the node and its parent node;
splitting a text to be tested into sentences to obtain a sentence list;
counting the number of sentences of the text to be tested contained in each node in the topic model according to the topic keywords of each node in the topic model and the sentence list;
and calculating the topic attribution degree of the text to be tested according to the node weight value and the sentence count of each node in the topic model, which specifically comprises: converting the sentence count of each child node into a sentence count at its parent node according to the node weight value of each node; and calculating the sentence count of the root node in the topic model using a recursive algorithm, and then calculating the quotient of the sentence count of the root node and the number of sentences in the sentence list, to obtain the topic attribution degree of the text to be tested with respect to the topic model.
2. The method according to claim 1, wherein before selecting the corresponding topic model with a tree structure according to the service type, the method further comprises:
obtaining corresponding topic keywords according to the service type;
creating a topic model with a tree structure according to the classification of the topic keywords;
and setting the node weight value of each node relative to its parent node according to the degree of correlation between the node in the topic model and its parent node.
3. The method according to claim 2, wherein the counting of the number of sentences of the text to be tested contained in each node in the topic model according to the topic keywords of each node in the topic model and the sentence list comprises:
judging whether a sentence in the sentence list contains a topic keyword in the topic model;
if yes, determining the node in the topic model where the topic keyword is located;
and counting the sentence into the number of sentences contained in that node and updating the number of sentences contained in the node.
4. The method of claim 3, wherein the determining of the node in the topic model where the topic keyword is located comprises:
when the sentence contains topic keywords of a plurality of different nodes, judging whether the plurality of different nodes are child nodes of the same parent node;
if yes, selecting the node with the larger node weight value as the node where the sentence is located;
and if not, selecting the node closest to the root node as the node where the sentence is located.
5. The method of claim 3, wherein the judging whether a sentence in the sentence list contains a topic keyword in the topic model comprises:
performing word segmentation on the sentence;
and matching the segmented words one by one with the topic keywords in the topic model.
6. An apparatus for calculating attribution of a text topic, the apparatus comprising:
a selecting unit, configured to select a corresponding topic model with a tree structure according to a service type, wherein nodes in the topic model are used for dividing topic keywords into categories, each node in the topic model contains at least one topic keyword, each node is provided with a node weight value, and the node weight value is used for expressing the degree of correlation between the node and its parent node;
a sentence segmentation unit, configured to split a text to be tested into sentences to obtain a sentence list;
a statistics unit, configured to count the number of sentences of the text to be tested contained in each node in the topic model according to the topic keywords of each node in the topic model selected by the selecting unit and the sentence list obtained by the sentence segmentation unit;
and a calculating unit, configured to calculate the topic attribution degree of the text to be tested according to the node weight value of each node in the topic model and the sentence counts obtained by the statistics unit;
wherein the calculating unit specifically comprises:
a conversion module, configured to convert the sentence count of each child node into a sentence count at its parent node according to the node weight value of each node;
and a calculating module, configured to calculate the sentence count of the root node in the topic model using a recursive algorithm, and then calculate the quotient of the sentence count of the root node and the number of sentences in the sentence list, to obtain the topic attribution degree of the text to be tested with respect to the topic model.
7. The apparatus of claim 6, further comprising:
an obtaining unit, configured to obtain corresponding topic keywords according to the service type before the selecting unit selects the corresponding topic model with a tree structure according to the service type;
a creating unit, configured to create a topic model with a tree structure according to the classification of the topic keywords obtained by the obtaining unit;
and a setting unit, configured to set the node weight value of each node relative to its parent node according to the degree of correlation between the node in the topic model created by the creating unit and its parent node.
8. The apparatus of claim 7, wherein the statistics unit comprises:
a judging module, configured to judge whether a sentence in the sentence list contains a topic keyword in the topic model;
a determining module, configured to determine the node in the topic model where the topic keyword is located when the judging module judges that the sentence contains a topic keyword;
and a counting module, configured to count the sentence into the number of sentences contained in the node determined by the determining module and update that count.
9. A storage medium comprising a stored program, wherein, when the program runs, a device where the storage medium is located is controlled to execute the method for calculating the topic attribution degree of a text according to any one of claims 1 to 5.
10. A processor configured to run a program, wherein, when the program runs, the method for calculating the topic attribution degree of a text according to any one of claims 1 to 5 is executed.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510680602.0A | 2015-10-19 | 2015-10-19 | Method and device for calculating text theme attribution degree |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN106598999A CN106598999A (en) | 2017-04-26 |
| CN106598999B true CN106598999B (en) | 2020-02-04 |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101315624A (en) * | 2007-05-29 | 2008-12-03 | 阿里巴巴集团控股有限公司 | Text subject recommending method and device |
| CN101727487A (en) * | 2009-12-04 | 2010-06-09 | 中国人民解放军信息工程大学 | Network criticism oriented viewpoint subject identifying method and system |
| CN102254011A (en) * | 2011-07-18 | 2011-11-23 | 哈尔滨工业大学 | Method for modeling dynamic multi-document abstracts |
| CN103226580A (en) * | 2013-04-02 | 2013-07-31 | 西安交通大学 | Interactive-text-oriented topic detection method |
| CN103744953A (en) * | 2014-01-02 | 2014-04-23 | 中国科学院计算机网络信息中心 | Network hotspot mining method based on Chinese text emotion recognition |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing. Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing. Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
| | GR01 | Patent grant | |