
CN118626649A - A method and system for generating event context based on topic heat index - Google Patents

A method and system for generating event context based on topic heat index

Info

Publication number
CN118626649A
Authority
CN
China
Prior art keywords
topic
text
similarity
tags
tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410829338.1A
Other languages
Chinese (zh)
Inventor
宋金宝
何雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China
Priority to CN202410829338.1A
Publication of CN118626649A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The invention relates to the technical field of text processing, and in particular to a method and system for generating an event context based on a topic heat index. The method comprises: filtering the texts found under a given keyword on a social media platform by the topic heat index of each text under its topic; filtering the topic tags of the remaining texts by semantic similarity; concatenating the topic tag vector and the text body vector of each text, then accumulating and averaging the results to obtain a feature representation of each topic tag; calculating the feature-representation similarity and the time similarity between topic tags; and merging the topic tags accordingly to obtain topic tag nodes. An improved hierarchical clustering algorithm based on the time axis then places the topic tag nodes into their corresponding branches. By calculating the similarity between topic tags comprehensively, the invention effectively merges similar or repeated topic tags, and the time-axis-based hierarchical clustering yields a clear and orderly event context.

Description

Event context generation method and system based on topic heat index
Technical Field
The invention relates to the technical field of text processing, in particular to an event context generation method and system based on topic heat index.
Background
Extracting event-related story contexts from massive social media data, such as microblog posts, involves two main tasks: event detection and story context construction. Event detection aims to collect and integrate information related to an event and its sub-events; story context construction aims to capture the relations between events.
Event detection methods fall into four types: methods based on keyword relevance, methods based on topic modeling, methods based on incremental clustering, and hybrid methods.
At present, microblog event keywords are mainly extracted in three ways: by frequency statistics, by TF-IDF, and by text clustering. The frequency-statistics method counts how often each word occurs in the microblog texts and selects the most frequent words as event keywords. It is simple and direct, but it ignores the semantic information of words and is easily disturbed by common words. The TF-IDF method uses term frequency-inverse document frequency: it calculates the importance of each word in the microblog data set and selects the words with the highest TF-IDF values as event keywords. Because it considers both the frequency of a word and its importance across the text collection, it captures keywords better. The text-clustering method divides the texts into clusters and then selects representative words from each cluster as event keywords. It can extract representative words through the cluster structure, but it places high demands on the clustering algorithm and on the quality of the clustering result. In short, the traditional methods based on frequency statistics and TF-IDF ignore the semantic information of words, are easily disturbed by common words, and struggle to capture the semantic features of event keywords accurately. When representative words are selected from clustering results, diversity and representativeness must be balanced; some approaches tend to select words that are representative but highly repetitive, and thereby lose semantic information.
Detection methods based on topic modeling discover the topics in text data by building topic models such as PLSA and LDA, and thereby perform event detection. Because social media texts are generally short and sparse, traditional topic models suffer from high dimensionality and sparsity. Shi et al. proposed a topic discovery method based on RNNs and topic models, which uses the relationships between words learned by an RNN as prior knowledge for the topic model and constructs word pairs to address the sparsity problem of text topic modeling. However, topic models favor high-frequency feature words, which leads to insufficient discrimination between texts. Topic models also generally assume that the words in a text are independent, ignoring the complex contextual relationships between words, which limits their use in complex contexts.
Unlike topic modeling, which models the entire data set at once, incremental clustering can process new text data in real time. It aims to discover new event types as they emerge over time by clustering and categorizing the data, and it overcomes some limitations of existing event detection models, such as a small set of predefined event types and the inability to handle new types of events. Commonly used clustering algorithms include K-means and DBSCAN. The texts are first vectorized and then clustered, and each cluster represents an event, which accomplishes event detection. In detection methods based on incremental clustering, the vectorized representation of the text is the key to the whole method: an accurate and effective vector representation of microblog text has a critical influence on the final clustering.
In recent years, with the rapid progress of deep learning, its applications have also expanded to event detection. Deep learning methods have achieved significant results in improving accuracy and in reducing the redundancy of hand-designed features, but they incur higher training costs and consume considerable time and computing resources. Furthermore, once training is complete, the learned parameters of the model cannot be changed without retraining.
As for event context construction, given the huge volume of social media data and its high noise level, obtaining a clear story context directly from a social platform is a challenging task, and arranging storylines manually requires a great deal of time and effort. Building the story context automatically is therefore an important task in event context analysis.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide an event context generation method and system based on topic heat index, so as to solve the problems of improper keyword or event detection and inaccurate event context construction on social media platforms.
To solve these problems, the invention adopts the following technical scheme:
An event context generation method based on topic heat index comprises the following steps:
Filtering the texts found under a given keyword on a social media platform by the topic heat index of each text under its topic;
For the filtered texts, filtering the topic tags of the filtered texts by the semantic similarity between the topic tag and the text body of each text;
Concatenating the topic tag vector and the text body vector of each text, then accumulating and averaging the results to obtain a feature representation of each topic tag; calculating the similarity of the feature representations between topic tags by cosine similarity; and calculating the time similarity between topic tags based on the time-series features of the topic tags;
Merging the topic tags of the texts based on the feature-representation similarity and the time similarity between topic tags to obtain topic tag nodes;
Using an improved hierarchical clustering algorithm based on the time axis, selecting the earliest topic tag node as the root node of the event context; then, in chronological order, comparing the next topic tag node with each node already in the current event context, selecting the existing node with the highest similarity as its parent node and attaching it to the corresponding branch, until every topic tag node has been incorporated into the event context.
As an implementation manner, filtering the texts by the topic heat index of each text under its topic comprises:
Calculating the topic heat index $H$ of each text under its topic on the social media platform:
$$H = \alpha R + \beta C + \gamma L$$
where $\alpha$, $\beta$ and $\gamma$ are weight coefficients that sum to 1, $R$ is the number of reposts of the text under the topic, $C$ is the number of comments on the text under the topic, and $L$ is the number of likes of the text under the topic;
Comparing the calculated topic heat index with a set topic heat index threshold and filtering the texts accordingly.
As an implementation manner, filtering the topic tags of the filtered texts by the semantic similarity between the topic tag and the text body of each text comprises:
For the filtered texts, calculating the topic tag vector and the text body vector of each text with the pre-trained word vector model RBT3, calculating the semantic similarity between the topic tag vector and the text body vector of each text by cosine similarity, comparing the semantic similarity with a set semantic similarity threshold, and filtering the topic tags of the filtered texts accordingly.
As an implementation manner, merging the topic tags of the texts based on the feature-representation similarity and the time similarity between topic tags to obtain topic tag nodes comprises:
Assigning a weight to the feature-representation similarity and a weight to the time similarity between topic tags, summing the weighted values to obtain the comprehensive similarity between topic tags, and merging the topic tags whose comprehensive similarity is higher than a set comprehensive similarity threshold.
As an implementation manner, the comprehensive similarity $S$ is calculated by the following formula:
$$S = w_1 S_{feat} + w_2 S_{time}$$
where $S_{feat}$ is the feature-representation similarity, $S_{time}$ is the time similarity, and $w_1$ and $w_2$ are their respective weights.
An event context generation system based on topic heat index comprises a first screening module, a second screening module, a topic tag similarity calculation module, a topic tag node construction module and an event context generation module;
The first screening module is used for filtering the texts found under a given keyword on the social media platform by the topic heat index of each text under its topic;
The second screening module is used for filtering the topic tags of the filtered texts by the semantic similarity between the topic tag and the text body of each text;
The topic tag similarity calculation module is used for concatenating the topic tag vector and the text body vector of each text, then accumulating and averaging the results to obtain a feature representation of each topic tag, calculating the similarity of the feature representations between topic tags by cosine similarity, and calculating the time similarity between topic tags based on the time-series features of the topic tags;
The topic tag node construction module is used for merging the topic tags of the texts based on the feature-representation similarity and the time similarity between topic tags to obtain topic tag nodes;
The event context generation module is used for selecting, with an improved hierarchical clustering algorithm based on the time axis, the earliest topic tag node as the root node of the event context, and then, in chronological order, comparing the next topic tag node with each node already in the current event context, selecting the existing node with the highest similarity as its parent node and attaching it to the corresponding branch, until every topic tag node has been incorporated into the event context.
As an implementation manner, filtering the texts by the topic heat index of each text under its topic comprises:
Calculating the topic heat index $H$ of each text under its topic on the social media platform:
$$H = \alpha R + \beta C + \gamma L$$
where $\alpha$, $\beta$ and $\gamma$ are weight coefficients that sum to 1, $R$ is the number of reposts of the text under the topic, $C$ is the number of comments on the text under the topic, and $L$ is the number of likes of the text under the topic;
Comparing the calculated topic heat index with a set topic heat index threshold and filtering the texts accordingly.
As an implementation manner, filtering the topic tags of the filtered texts by the semantic similarity between the topic tag and the text body of each text comprises:
For the filtered texts, calculating the topic tag vector and the text body vector of each text with the pre-trained word vector model RBT3, calculating the semantic similarity between the topic tag vector and the text body vector of each text by cosine similarity, comparing the semantic similarity with a set semantic similarity threshold, and filtering the topic tags of the filtered texts accordingly.
As an implementation manner, merging the topic tags of the texts based on the feature-representation similarity and the time similarity between topic tags to obtain topic tag nodes comprises:
Assigning a weight to the feature-representation similarity and a weight to the time similarity between topic tags, summing the weighted values to obtain the comprehensive similarity between topic tags, and merging the topic tags whose comprehensive similarity is higher than a set comprehensive similarity threshold.
As an implementation manner, the comprehensive similarity $S$ is calculated by the following formula:
$$S = w_1 S_{feat} + w_2 S_{time}$$
where $S_{feat}$ is the feature-representation similarity, $S_{time}$ is the time similarity, and $w_1$ and $w_2$ are their respective weights.
The beneficial effects of the invention are as follows: the microblog texts are filtered by topic heat index, so that the corpus obtained from massive social media data is of higher quality and the accuracy of subsequent analysis is improved; the semantic similarity between topic tags and text bodies is calculated so that tags irrelevant to the text are filtered out and topic tags with event relevance are kept as candidate nodes of the event context; the distribution of each topic tag along the time axis is used as a time feature and combined with the feature representation of the tag to calculate the similarity between topic tags comprehensively, so that similar or repeated topic tags are effectively merged; and the key topic tags are clustered hierarchically with a time-axis-based hierarchical clustering algorithm to form a clear and orderly event context.
Drawings
Fig. 1 is a schematic flow chart of an event context generation method based on topic heat index in an embodiment of the invention.
Fig. 2 is a schematic diagram of an event context generation system based on topic heat index in an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
It should be noted that these examples are only for illustrating the present invention, and not for limiting the present invention, and simple modifications of the method under the premise of the inventive concept are all within the scope of the claimed invention.
Referring to fig. 1, a topic heat index-based event context generation method includes:
Taking microblogs as the example social media platform, related topics are collected from massive microblog texts (microblog texts under multiple topics can be obtained from a single keyword), but these texts contain a great deal of noise. The noise may include content that is irrelevant to the topic, of low quality, or spurious, which makes the topic hard to understand and analyze. The acquired texts therefore need to be preprocessed first.
S100, filtering the texts found under a given keyword on the social media platform by the topic heat index of each text under its topic.
Specifically, the topic heat index $H$ of each text under its topic is calculated:
$$H = \alpha R + \beta C + \gamma L$$
where $\alpha$, $\beta$ and $\gamma$ are weight coefficients that sum to 1, $R$ is the number of reposts of the text under the topic, $C$ is the number of comments on the text under the topic, and $L$ is the number of likes of the text under the topic. On microblogs, liking takes less effort than commenting or reposting, so comments and reposts better reflect the real influence of a text and the attention users pay to it. Accordingly, $\alpha$ may be 0.4, $\beta$ may be 0.4 and $\gamma$ may be 0.2.
The calculated topic heat index is compared with a set topic heat index threshold, and the texts are filtered accordingly. Microblogs that truly cover important event content tend to have relatively high popularity on social media, so the threshold may be set so that the top 10% of texts by heat index are kept, which retains the microblog texts that attract relatively high interest.
Calculating the topic heat index of each microblog text combines social interaction information (repost, comment and like counts) into a single index that quantifies the relevance of each text to its topic. This allows the relevance of a text to be evaluated more accurately and texts closely related to the topic to be distinguished, so that noise irrelevant to the topic is removed.
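Step S100 can be illustrated with a minimal Python sketch. It assumes the heat index is a plain weighted sum of the three interaction counts, using the example weights 0.4, 0.4 and 0.2 and the top-10% cutoff mentioned above; the `Post` class, the function names and the sample data are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical container for one microblog text and its interaction counts."""
    text: str
    reposts: int
    comments: int
    likes: int


def heat_index(post, w_repost=0.4, w_comment=0.4, w_like=0.2):
    """Weighted sum of interaction counts (assumed form); weights should sum to 1."""
    return w_repost * post.reposts + w_comment * post.comments + w_like * post.likes


def filter_by_heat(posts, top_fraction=0.10):
    """Keep the top `top_fraction` of posts ranked by heat index (at least one)."""
    if not posts:
        return []
    ranked = sorted(posts, key=heat_index, reverse=True)
    keep = max(1, round(len(ranked) * top_fraction))
    return ranked[:keep]


# Made-up sample data: post i has i reposts, 2i comments and 10i likes.
posts = [Post(f"post {i}", reposts=i, comments=2 * i, likes=10 * i) for i in range(20)]
hot = filter_by_heat(posts)
```

With 20 sample posts, the 10% cutoff keeps the two posts with the highest index. A fixed numeric threshold could be used instead of a rank-based cutoff; the text above leaves both readings open.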
Because microblog content is highly unconstrained, publishers may add arbitrary topic tags to their microblog texts in order to increase exposure and attract more readers. As a result, the microblog data set is filled with a large number of tags that are unrelated to the keyword event, and these tags need to be filtered effectively.
S200, for the filtered texts, filtering the topic tags of the filtered texts by the semantic similarity between the topic tag and the text body of each text.
Specifically: for the filtered texts, the topic tag vector $v_{tag}$ and the text body vector $v_{text}$ of each text are calculated with the pre-trained word vector model RBT3:
$$v_{tag} = \mathrm{RBT3}(\text{tag}), \qquad v_{text} = \mathrm{RBT3}(\text{body})$$
and the semantic similarity between the topic tag vector $v_{tag}$ and the text body vector $v_{text}$ of each text is calculated by cosine similarity:
$$\mathrm{sim}(v_{tag}, v_{text}) = \frac{v_{tag} \cdot v_{text}}{\lVert v_{tag} \rVert \, \lVert v_{text} \rVert}$$
The semantic similarity is compared with a set semantic similarity threshold, and the topic tags of the filtered texts are filtered accordingly.
This step reduces the noise introduced by irrelevant topic tags and makes the extracted tag set more accurate and representative. The key topic tags related to the event are captured more precisely and serve as candidate context nodes, providing a more reliable data basis for the subsequent event context analysis.
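The tag filtering of S200 can be sketched as follows. The embeddings here are small hand-made vectors standing in for RBT3 outputs, and the threshold of 0.5 is an illustrative assumption, since the text does not state a value; `cosine_similarity` and `filter_tags` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_tags(tag_vectors, body_vector, threshold=0.5):
    """Keep the tags whose vector is similar enough to the text body vector."""
    return [
        tag
        for tag, vector in tag_vectors.items()
        if cosine_similarity(vector, body_vector) >= threshold
    ]


# Stand-in embeddings (assumed); a real pipeline would obtain them from the model.
body = [0.9, 0.1, 0.0]
tags = {"#flood": [1.0, 0.0, 0.0], "#giveaway": [0.0, 0.0, 1.0]}
kept = filter_tags(tags, body)
```

In this toy example the tag `#giveaway` is orthogonal to the body vector and is dropped, while `#flood` is kept.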
Since there are different expressions of the same event in a microblog, it is necessary to effectively merge these similar or duplicate topic tags in order to improve tag consistency and overall data quality. Aiming at the problem of label expression diversity in microblog data, the method aims at more comprehensively processing and combining similar or repeated labels by combining means of text feature extraction, time sequence analysis, semantic similarity calculation and the like, so that the consistency and quality of the data are improved.
S300, concatenating the topic tag vector $v_{tag}$ and the text body vector $v_{text}$ of each text, then accumulating and averaging the results over the $n$ texts that carry the tag to obtain the feature representation $F$ of each topic tag:
$$F = \frac{1}{n} \sum_{i=1}^{n} \left[ v_{tag}^{(i)} ; v_{text}^{(i)} \right]$$
The similarity of the feature representations between two topic tags $a$ and $b$ is calculated by cosine similarity:
$$S_{feat}(a, b) = \frac{F_a \cdot F_b}{\lVert F_a \rVert \, \lVert F_b \rVert}$$
The time similarity $S_{time}$ between topic tags is calculated based on the time-series features of the topic tags.
The time-series feature of a tag records its number of occurrences and their distribution over the time axis. Time similarity therefore concerns the distribution and pattern of occurrence of tags on the time axis: tags that are highly similar also tend to be similar in their time series. By using time features, the behavior and trends of tags in the time dimension can be analyzed so that similar or duplicate tags are processed and merged better.
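The feature-representation part of S300 might look like the sketch below. It rests on one reading of "splice, accumulate, then average": for each text carrying a tag, the tag vector and the body vector are concatenated, and the concatenated vectors are averaged over those texts. That reading, the function names and the toy vectors are assumptions. The time similarity is not sketched because its formula is not given in the text.

```python
import math


def tag_feature(tag_vector, body_vectors):
    """Average of [tag_vector ; body_vector] over all texts that carry the tag."""
    if not body_vectors:
        raise ValueError("a tag needs at least one text")
    total = [0.0] * (len(tag_vector) + len(body_vectors[0]))
    for body in body_vectors:
        spliced = list(tag_vector) + list(body)  # concatenation step
        for i, value in enumerate(spliced):      # accumulation step
            total[i] += value
    return [value / len(body_vectors) for value in total]  # averaging step


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy two-dimensional vectors (assumed); real vectors would come from the model.
feature_a = tag_feature([1.0, 0.0], [[0.0, 2.0], [2.0, 0.0]])
feature_b = tag_feature([0.0, 1.0], [[0.0, 2.0]])
feature_similarity = cosine_similarity(feature_a, feature_b)
```

Because the tag vector is the same in every concatenation, averaging leaves it unchanged and only the body half reflects the different texts in which the tag appears.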
S400, combining topic labels of the text based on the similarity and the time similarity of feature representation among the topic labels to obtain topic label nodes.
To evaluate the similarity between two tags more comprehensively, the feature-representation similarity and the time similarity are combined: each is given a weight, the weighted values are summed to obtain the comprehensive similarity between the topic tags, and the topic tags whose comprehensive similarity is higher than the set comprehensive similarity threshold are merged.
The comprehensive similarity is:
$$S = w_1 S_{feat} + w_2 S_{time}$$
The feature-representation similarity may be given the higher weight ($w_1 = 0.8$) to emphasize the importance of text content in judging tag similarity. The time similarity may be given the lower weight ($w_2 = 0.2$) for balance: time features are important, but in some cases they do not reflect the semantic relevance of tags as directly as the text content does.
A comprehensive similarity threshold is set to select the highly similar topic tags to be merged. The default threshold is 0.6; adjusting it controls how strict the merging is, so that different research requirements can be met.
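The merging rule of S400 can be sketched with the example weights (0.8 and 0.2) and the default threshold (0.6) given above. Grouping tags transitively with a union-find structure is a design choice made for this sketch, not something the text prescribes, and the pairwise similarity values are made up.

```python
def combined_similarity(feature_sim, time_sim, w_feature=0.8, w_time=0.2):
    """Weighted sum of feature-representation similarity and time similarity."""
    return w_feature * feature_sim + w_time * time_sim


def merge_tags(tags, feature_sim, time_sim, threshold=0.6):
    """Group tags whose combined similarity exceeds the threshold.

    `feature_sim` and `time_sim` map each pair (a, b), with a listed before b
    in `tags`, to a similarity score. Merging is transitive (assumed).
    """
    parent = {tag: tag for tag in tags}

    def find(tag):
        # Follow parent links to the group representative, compressing the path.
        while parent[tag] != tag:
            parent[tag] = parent[parent[tag]]
            tag = parent[tag]
        return tag

    for i, a in enumerate(tags):
        for b in tags[i + 1:]:
            score = combined_similarity(feature_sim[(a, b)], time_sim[(a, b)])
            if score > threshold:
                parent[find(b)] = find(a)

    groups = {}
    for tag in tags:
        groups.setdefault(find(tag), []).append(tag)
    return list(groups.values())


tags = ["#A", "#A!", "#B"]
feature_sim = {("#A", "#A!"): 0.9, ("#A", "#B"): 0.2, ("#A!", "#B"): 0.1}
time_sim = {("#A", "#A!"): 0.7, ("#A", "#B"): 0.5, ("#A!", "#B"): 0.4}
nodes = merge_tags(tags, feature_sim, time_sim)
```

Here `#A` and `#A!` score 0.86 and are merged into one node, while `#B` stays on its own. Each resulting group corresponds to one topic tag node.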
S500, using an improved hierarchical clustering algorithm based on the time axis, the earliest topic tag node is selected from the topic tag nodes as the root node of the event context; then, in chronological order, the next topic tag node is compared with each node already in the current event context, the existing node with the highest similarity is selected as its parent node, and the new node is attached to the corresponding branch, until every topic tag node has been incorporated into the event context.
Once the topic tag nodes have been obtained, the core task is to sort out the relations among the nodes so as to construct the event context. The invention introduces an improved hierarchical clustering algorithm based on the time axis, whose core idea is to use the order of the time axis to cluster the nodes hierarchically and thereby build an orderly and clear event context. First, the earliest node on the timeline is extracted from the topic tag nodes as the starting point of the story; this node is the root node of the whole event context. Then, following the order of the time axis, the next node is compared with each node already in the current context, the existing node with the highest similarity is selected as its parent node, and the new node is attached to the corresponding branch. This iterative process continues until every node has been merged into the context. The distinctive feature of the algorithm is that it establishes associations between topic tag nodes along the time dimension, so that related nodes are organized in chronological order to form a complete context.
Through this algorithm the topic tag nodes are organized hierarchically, so that the development of the event is presented in a clear and orderly structure. Compared with traditional clustering methods, the algorithm makes full use of time information, so that the timeline of events runs through the whole context construction process. This design offers a way to understand the evolution of microblog events in depth: building the context is no longer a simple matter of connecting nodes, but follows the actual development of the event more closely.
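The tree construction of S500 can be sketched as follows. The sketch assumes each node has a name and a timestamp and that a pairwise similarity function is available; the node names, timestamps and scores are invented for illustration, and ties are resolved in favor of the earlier node.

```python
def build_event_context(nodes, similarity):
    """Build the event context as a tree.

    `nodes` is a list of (name, timestamp) pairs and `similarity(a, b)` returns
    a score for two node names. Returns {name: parent_name}; the root maps to None.
    """
    if not nodes:
        return {}
    ordered = sorted(nodes, key=lambda node: node[1])  # chronological order
    root_name = ordered[0][0]
    parent_of = {root_name: None}
    placed = [root_name]
    for name, _ in ordered[1:]:
        # Attach the new node under the most similar node already in the tree.
        best = max(placed, key=lambda existing: similarity(existing, name))
        parent_of[name] = best
        placed.append(name)
    return parent_of


# Made-up pairwise scores, stored symmetrically via frozenset keys.
scores = {
    frozenset(("storm warning", "flooding")): 0.8,
    frozenset(("storm warning", "rescue")): 0.3,
    frozenset(("flooding", "rescue")): 0.7,
    frozenset(("storm warning", "donations")): 0.1,
    frozenset(("flooding", "donations")): 0.4,
    frozenset(("rescue", "donations")): 0.6,
}


def lookup(a, b):
    return scores[frozenset((a, b))]


nodes = [("rescue", 3), ("storm warning", 1), ("flooding", 2), ("donations", 4)]
tree = build_event_context(nodes, lookup)
```

In this example the earliest node, `storm warning`, becomes the root, and each later node hangs under its most similar predecessor, giving a single chain from warning to donations. With different scores, a node could attach to an earlier ancestor and start a new branch.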
Referring to fig. 2, an event context generation system based on topic heat index includes a first screening module 100, a second screening module 200, a topic tag similarity calculation module 300, a topic tag node construction module 400, and an event context generation module 500.
The first screening module 100 is configured to filter the texts found under a given keyword on the social media platform by the topic heat index of each text under its topic.
The second screening module 200 is configured to filter the topic tags of the filtered texts by the semantic similarity between the topic tag and the text body of each text.
The topic tag similarity calculation module 300 is configured to concatenate the topic tag vector and the text body vector of each text, then accumulate and average the results to obtain a feature representation of each topic tag, to calculate the similarity of the feature representations between topic tags by cosine similarity, and to calculate the time similarity between topic tags based on the time-series features of the topic tags.
The topic tag node construction module 400 is configured to merge the topic tags of the texts based on the feature-representation similarity and the time similarity between topic tags to obtain topic tag nodes.
The event context generation module 500 is configured to select, with an improved hierarchical clustering algorithm based on the time axis, the earliest topic tag node as the root node of the event context, and then, in chronological order, to compare the next topic tag node with each node already in the current event context, select the existing node with the highest similarity as its parent node and attach it to the corresponding branch, until every topic tag node has been incorporated into the event context.
Filtering the texts by the topic heat index of each text under its topic comprises:
Calculating the topic heat index $H$ of each text under its topic on the social media platform:
$$H = \alpha R + \beta C + \gamma L$$
where $\alpha$, $\beta$ and $\gamma$ are weight coefficients that sum to 1, $R$ is the number of reposts of the text under the topic, $C$ is the number of comments on the text under the topic, and $L$ is the number of likes of the text under the topic;
Comparing the calculated topic heat index with a set topic heat index threshold and filtering the texts accordingly.
Filtering the topic tags of the filtered texts by the semantic similarity between the topic tag and the text body of each text comprises:
For the filtered texts, calculating the topic tag vector and the text body vector of each text with the pre-trained word vector model RBT3, calculating the semantic similarity between the topic tag vector and the text body vector of each text by cosine similarity, comparing the semantic similarity with a set semantic similarity threshold, and filtering the topic tags of the filtered texts accordingly.
Merging the topic tags of the texts based on the feature-representation similarity and the time similarity between topic tags to obtain topic tag nodes comprises:
Assigning a weight to the feature-representation similarity and a weight to the time similarity between topic tags, summing the weighted values to obtain the comprehensive similarity between topic tags, and merging the topic tags whose comprehensive similarity is higher than a set comprehensive similarity threshold.
The comprehensive similarity $S$ is calculated by the following formula:
$$S = w_1 S_{feat} + w_2 S_{time}$$
where $S_{feat}$ is the feature-representation similarity, $S_{time}$ is the time similarity, and $w_1$ and $w_2$ are their respective weights.
Finally, it is noted that the above embodiments illustrate rather than limit the invention, and those skilled in the art will understand that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for generating event context based on a topic heat index, characterized by comprising:
screening texts based on the topic heat index, under its topic, of each text under a given keyword in a social media platform;
for the screened texts, screening the topic tags of the screened texts by the semantic similarity between the topic tag and the text body of each text;
concatenating the topic tag vector and the text body vector of each text, accumulating and then averaging them to obtain the feature representation of the topic tag of each text, calculating the similarity of the feature representations between topic tags by cosine similarity, and then calculating the time similarity between topic tags based on the time series features of the topic tags;
merging the topic tags of the texts based on the similarity of the feature representations and the time similarity between the topic tags to obtain topic tag nodes;
using an improved timeline-based hierarchical clustering algorithm, selecting the earliest topic tag node among the topic tag nodes as the root node of the event context, comparing, in chronological order, each next topic tag node with the nodes already present in the current event context, selecting the existing node with the highest similarity as its parent node and placing it in the corresponding branch, until every topic tag node has been incorporated into the event context.

2. The method for generating event context based on a topic heat index according to claim 1, characterized in that screening texts based on the topic heat index, under its topic, of each text under a given keyword in the social media platform comprises:
calculating the topic heat index of each text under its topic in the social media platform, wherein the three weight coefficients sum to 1 and the remaining three quantities are, respectively, the number of times the text under the topic has been forwarded, the number of comments on the text under the topic, and the number of likes of the text under the topic;
comparing the calculated topic heat index with a set topic heat index threshold to screen the texts.

3. The method for generating event context based on a topic heat index according to claim 1, characterized in that, for the screened texts, screening the topic tags of the screened texts by the semantic similarity between the topic tag and the text body of each text comprises:
for the screened texts, calculating the topic tag vector and the text body vector of each text with the pre-trained word vector model RBT3, calculating the semantic similarity between the topic tag vector and the text body vector of each text by cosine similarity, and comparing the semantic similarity with a set semantic similarity threshold to screen the topic tags of the screened texts.

4. The method for generating event context based on a topic heat index according to claim 1, characterized in that merging the topic tags of the texts based on the similarity of the feature representations and the time similarity between the topic tags to obtain topic tag nodes comprises:
assigning a weight to the similarity of the feature representations and to the time similarity between the topic tags of the texts and summing the weighted values to obtain the comprehensive similarity between topic tags, and merging the topic tags whose comprehensive similarity is higher than a set comprehensive similarity threshold.

5. The method for generating event context based on a topic heat index according to claim 4, characterized in that the comprehensive similarity is calculated as the weighted sum of two terms: the similarity of the feature representations and the time similarity.

6. A system for generating event context based on a topic heat index, characterized by comprising a first screening module, a second screening module, a topic tag similarity calculation module, a topic tag node construction module and an event context generation module;
the first screening module is configured to screen texts based on the topic heat index, under its topic, of each text under a given keyword in a social media platform;
the second screening module is configured to screen the screened texts again by the semantic similarity between the topic tag and the text body of each text;
the topic tag similarity calculation module is configured to, for the re-screened texts, concatenate the topic tag vector and the text body vector of each text, accumulate and then average them to obtain the feature representation of the topic tag of each text, calculate the similarity of the feature representations between topic tags by cosine similarity, and then calculate the time similarity between topic tags based on the time series features of the topic tags;
the topic tag node construction module is configured to merge the topic tags of the texts based on the similarity of the feature representations and the time similarity between the topic tags to obtain topic tag nodes;
the event context generation module is configured to use an improved timeline-based hierarchical clustering algorithm to select the earliest topic tag node among the topic tag nodes as the root node of the event context, compare, in chronological order, each next topic tag node with the nodes already present in the current event context, select the existing node with the highest similarity as its parent node and place it in the corresponding branch, until every topic tag node has been incorporated into the event context.

7. The system for generating event context based on a topic heat index according to claim 6, characterized in that screening texts based on the topic heat index, under its topic, of each text under a given keyword in the social media platform comprises:
calculating the topic heat index of each text under its topic in the social media platform, wherein the three weight coefficients sum to 1 and the remaining three quantities are, respectively, the number of times the text under the topic has been forwarded, the number of comments on the text under the topic, and the number of likes of the text under the topic;
comparing the calculated topic heat index with a set topic heat index threshold to screen the texts.

8. The system for generating event context based on a topic heat index according to claim 6, characterized in that, for the screened texts, screening the screened texts again by the semantic similarity between the topic tag and the text body of each text comprises:
for the screened texts, calculating the topic tag vector and the text body vector of each text with the pre-trained word vector model RBT3, calculating the semantic similarity between the topic tag vector and the text body vector of each text by cosine similarity, and comparing the semantic similarity with a set semantic similarity threshold to screen the screened texts again.

9. The system for generating event context based on a topic heat index according to claim 6, characterized in that merging the topic tags of the texts based on the similarity of the feature representations and the time similarity between the topic tags to obtain topic tag nodes comprises:
assigning a weight to the similarity of the feature representations and to the time similarity between the topic tags of the texts and summing the weighted values to obtain the comprehensive similarity between topic tags, and merging the topic tags whose comprehensive similarity is higher than a set comprehensive similarity threshold.

10. The system for generating event context based on a topic heat index according to claim 9, characterized in that the comprehensive similarity is calculated as the weighted sum of two terms: the similarity of the feature representations and the time similarity.
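The tree construction recited for the event context generation step can be sketched as follows. This is an illustrative reading, not the patent's implementation: nodes are assumed to carry a single timestamp, the similarity between two nodes is supplied as a callable (for instance a comprehensive similarity), ties go to the earliest existing node, and all names are hypothetical.

```python
def build_event_context(nodes, similarity):
    """Attach each topic tag node to its most similar earlier node.

    nodes: list of (node_id, timestamp) in any order.
    similarity: callable(existing_id, new_id) -> float.
    Returns (root_id, parent_of), where parent_of maps child id -> parent id.
    """
    if not nodes:
        return None, {}
    ordered = sorted(nodes, key=lambda n: n[1])   # chronological order
    root_id = ordered[0][0]                       # earliest node is the root
    placed = [root_id]
    parent_of = {}
    for node_id, _ in ordered[1:]:
        # Compare the new node with every node already in the context and
        # pick the most similar one; max() keeps the earliest node on ties.
        best_parent = max(placed, key=lambda p: similarity(p, node_id))
        parent_of[node_id] = best_parent
        placed.append(node_id)
    return root_id, parent_of


# Toy pairwise similarities between four topic tag nodes.
sims = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "C")): 0.3,
    frozenset(("B", "C")): 0.7,
    frozenset(("A", "D")): 0.6,
    frozenset(("B", "D")): 0.2,
    frozenset(("C", "D")): 0.1,
}
nodes = [("B", 2), ("A", 1), ("C", 3), ("D", 4)]
root, parent_of = build_event_context(
    nodes, lambda a, b: sims[frozenset((a, b))]
)
```

In this toy run A becomes the root, C hangs under B because B is its closest predecessor, and D opens a second branch directly under A, which is how separate branches of one event emerge.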
CN202410829338.1A 2024-06-25 2024-06-25 A method and system for generating event context based on topic heat index Pending CN118626649A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410829338.1A CN118626649A (en) 2024-06-25 2024-06-25 A method and system for generating event context based on topic heat index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410829338.1A CN118626649A (en) 2024-06-25 2024-06-25 A method and system for generating event context based on topic heat index

Publications (1)

Publication Number Publication Date
CN118626649A true CN118626649A (en) 2024-09-10

Family

ID=92595678

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410829338.1A Pending CN118626649A (en) 2024-06-25 2024-06-25 A method and system for generating event context based on topic heat index

Country Status (1)

Country Link
CN (1) CN118626649A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050234955A1 (en) * 2004-04-15 2005-10-20 Microsoft Corporation Clustering based text classification
CN107193797A (en) * 2017-04-26 2017-09-22 天津大学 The much-talked-about topic detection of Chinese microblogging and trend forecasting method
CN114860936A (en) * 2022-05-14 2022-08-05 北京清博智能科技有限公司 Topic generation system and method based on hotspot list
CN116361468A (en) * 2023-04-03 2023-06-30 北京中科闻歌科技股份有限公司 Event context generation method, electronic equipment and storage medium
CN116992886A (en) * 2023-07-28 2023-11-03 中国电子科技集团公司第五十四研究所 A BERT-based hot news event context generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20240910