Background
Extracting event-related story contexts from social media platforms, for example from massive microblog data, mainly involves two tasks: event detection and story context construction. Event detection aims to collect and integrate information related to events and their sub-events, while story context construction aims to capture the relations between events.
Event detection methods fall into four types according to the detection approach: methods based on keyword relevance, methods based on topic modeling, methods based on incremental clustering, and hybrid methods.
At present, the main methods for extracting microblog event keywords are based on frequency statistics, TF-IDF, and text clustering. The frequency-statistics method counts how often each word occurs in the microblog text and selects the more frequent words as event keywords. It is simple and direct, but it ignores the semantic information of words and is easily disturbed by common words. The TF-IDF method uses term frequency-inverse document frequency: it measures the importance of each word in the microblog dataset and selects words with higher TF-IDF values as event keywords. Because it considers both the frequency of a word and its importance in the text collection, it captures keywords better. The text-clustering method divides the texts into clusters and then selects representative words from each cluster as event keywords. It can extract representative words through the cluster structure, but it places high demands on the clustering algorithm and on the quality of the clustering result. In summary, the traditional methods based on frequency statistics and TF-IDF ignore the semantic information of words, are easily disturbed by common words, and have difficulty accurately capturing the semantic features of event keywords. When representative words are selected from clustering results, diversity and representativeness must be balanced; some approaches tend to select words that are representative but highly repetitive, and thereby miss further semantic information.
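The TF-IDF keyword selection described above can be illustrated with a minimal sketch. It assumes the microblog texts are already tokenized into non-empty word lists; the function name `tfidf_keywords`, the toy corpus `docs`, and the choice to rank words by their TF-IDF summed over all documents are illustrative assumptions, not part of the prior-art methods being discussed.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Rank words by TF-IDF summed over all documents (illustrative choice).

    docs: list of non-empty token lists (tokenization is assumed done upstream).
    Returns the top_k (word, score) pairs.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for word, count in tf.items():
            idf = math.log(n_docs / df[word])
            scores[word] += (count / len(doc)) * idf
    return scores.most_common(top_k)


# Hypothetical toy corpus: "today" occurs in every document, so its IDF is 0.
docs = [
    ["flood", "city", "rescue", "today"],
    ["flood", "rescue", "team", "today"],
    ["concert", "city", "today"],
]

if __name__ == "__main__":
    for word, score in tfidf_keywords(docs, top_k=5):
        print(f"{word}: {score:.3f}")
```

A word that appears in every document receives an IDF of zero, which is how TF-IDF suppresses the common words that disturb pure frequency counting. The sketch also shows the limitation noted above: scores depend only on counts, not on word meaning.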
Detection methods based on topic modeling discover the topics in text data by building topic models such as PLSA and LDA, and thereby complete event detection. Because social media texts are generally short and sparse, traditional topic models face high-dimensionality and sparsity problems. Shi et al. propose a topic discovery method based on RNNs and topic models, which uses the relationships between words learned by an RNN as prior knowledge for the topic model and constructs word pairs to address the sparsity problem in topic modeling of such texts. However, topic models are biased toward high-frequency feature words, which leads to insufficient discrimination between texts. In addition, topic models generally assume that the words in a text are independent, ignoring the complex contextual relationships between words, which limits their application in complex contexts.
Unlike topic modeling, which models the entire dataset at once, incremental clustering can process new text data in real time. It aims to discover new event types that emerge over time by clustering and categorizing the data, and it overcomes some limitations of existing event detection models, such as a small set of predefined event types and the inability to handle new types of events. Commonly used clustering algorithms include K-means and DBSCAN. Such an algorithm first vectorizes the texts and then clusters them, with each cluster representing an event, thereby achieving event detection. In detection methods based on incremental clustering, the vectorized representation of the text is the key to the whole method: an accurate and effective vector representation of microblog text has a critical influence on the final clustering.
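To make the idea of processing texts one at a time concrete, the following is a minimal single-pass clustering sketch. It is a generic scheme chosen for illustration (it is neither K-means nor DBSCAN), it assumes the texts have already been vectorized, and the names `incremental_cluster` and `threshold` as well as the value 0.8 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def incremental_cluster(vectors, threshold=0.8):
    """Assign each incoming vector to the most similar centroid, or open a new cluster.

    Returns a list of cluster ids, one per input vector, in arrival order.
    """
    centroids, sizes, assignments = [], [], []
    for vec in vectors:
        best_id, best_sim = -1, -1.0
        for cid, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id >= 0 and best_sim >= threshold:
            # Update the running mean of the matched cluster.
            n = sizes[best_id]
            centroids[best_id] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best_id], vec)
            ]
            sizes[best_id] = n + 1
            assignments.append(best_id)
        else:
            # No sufficiently similar cluster: treat the text as a new event.
            centroids.append(list(vec))
            sizes.append(1)
            assignments.append(len(centroids) - 1)
    return assignments


if __name__ == "__main__":
    stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(incremental_cluster(stream))
```

Because every decision depends on the cosine similarity between vectors, the quality of the text vectorization directly determines which texts end up in the same event cluster.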
In recent years, with the rapid progress of deep learning, its application has expanded to the field of event detection. Deep learning methods have achieved significant results in improving accuracy and in reducing the need for hand-crafted features, but they incur high training costs and consume a great deal of time and computing resources. Furthermore, once training is complete, the learned parameters of the model cannot be modified without retraining.
For event context construction, given the huge volume of social media data and its high level of noise, obtaining a clear story context directly from a social platform is a challenging task. Arranging story lines manually requires a great deal of time and effort. Automatically building the story context is therefore an important task in event context analysis.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide an event context generation method and system based on topic heat index, so as to solve the problems of inaccurate keyword or event detection and inaccurate event context construction on social media platforms.
To solve the above problems, the invention adopts the following technical solution:
An event context generation method based on topic heat index comprises the following steps:
screening the texts under a given keyword in a social media platform based on the topic heat index of each text under its topic;
for the screened texts, screening the topic labels of the screened texts according to the semantic similarity between the topic label and the text body of each text;
concatenating the topic label vector and the text body vector of each text, accumulating the results and then averaging them to obtain the feature representation of each topic label; calculating the similarity of the feature representations between topic labels using cosine similarity; and calculating the time similarity between topic labels based on the time-series features of the topic labels;
merging the topic labels of the texts based on the feature-representation similarity and the time similarity between topic labels to obtain topic label nodes;
using an improved hierarchical clustering algorithm based on a time axis, selecting the earliest topic label node from the topic label nodes as the root node of the event context; then, in chronological order, comparing each subsequent topic label node one by one with the nodes already in the current event context, selecting the existing node with the highest similarity as its parent node, and incorporating the new node into the corresponding branch, until every topic label node has been merged into the event context.
In one implementation, screening the texts based on the topic heat index of each text under its topic in the social media platform includes:
calculating the topic heat index $H$ of each text under its topic in the social media platform:
$$H = \alpha R + \beta C + \gamma L$$
wherein $\alpha$, $\beta$ and $\gamma$ are weight coefficients whose sum is 1, $R$ is the number of times the text under the topic is forwarded, $C$ is the number of comments on the text under the topic, and $L$ is the number of likes of the text under the topic;
comparing the calculated topic heat index with a set topic heat index threshold to screen the texts.
In one implementation, screening the screened texts again according to the semantic similarity between the topic label and the text body of each text includes:
for the screened texts, calculating the topic label vector and the text body vector of each text using the pre-trained word vector model RBT3, calculating the semantic similarity between the topic label vector and the text body vector of each text by cosine similarity, comparing the semantic similarity with a set semantic similarity threshold, and screening the screened texts again.
In one implementation, merging the topic labels of the texts based on the feature-representation similarity and the time similarity between topic labels to obtain topic label nodes includes:
assigning a weight value to each of the feature-representation similarity and the time similarity between the topic labels of the texts and summing the weighted values to obtain the comprehensive similarity between topic labels, and merging the topic labels whose comprehensive similarity is higher than a set comprehensive similarity threshold.
In one implementation, the comprehensive similarity $Sim$ is calculated by the following formula:
$$Sim = w_f \cdot Sim_f + w_t \cdot Sim_t$$
wherein $Sim_f$ is the feature-representation similarity, $Sim_t$ is the time similarity, and $w_f$ and $w_t$ are their respective weight values.
An event context generation system based on topic heat index comprises a first screening module, a second screening module, a topic label similarity calculation module, a topic label node construction module and an event context generation module;
the first screening module is used for screening the texts under a given keyword in the social media platform based on the topic heat index of each text under its topic;
the second screening module is used for screening, for the screened texts, the topic labels of the screened texts according to the semantic similarity between the topic label and the text body of each text;
the topic label similarity calculation module is used for concatenating the topic label vector and the text body vector of each text, accumulating the results and then averaging them to obtain the feature representation of each topic label, calculating the similarity of the feature representations between topic labels using cosine similarity, and calculating the time similarity between topic labels based on the time-series features of the topic labels;
the topic label node construction module is used for merging the topic labels of the texts based on the feature-representation similarity and the time similarity between topic labels to obtain topic label nodes;
the event context generation module is used for selecting, using an improved hierarchical clustering algorithm based on a time axis, the earliest topic label node from the topic label nodes as the root node of the event context; then, in chronological order, comparing each subsequent topic label node one by one with the nodes already in the current event context, selecting the existing node with the highest similarity as its parent node, and incorporating the new node into the corresponding branch, until every topic label node has been merged into the event context.
In one implementation, screening the texts based on the topic heat index of each text under its topic in the social media platform includes:
calculating the topic heat index $H$ of each text under its topic in the social media platform:
$$H = \alpha R + \beta C + \gamma L$$
wherein $\alpha$, $\beta$ and $\gamma$ are weight coefficients whose sum is 1, $R$ is the number of times the text under the topic is forwarded, $C$ is the number of comments on the text under the topic, and $L$ is the number of likes of the text under the topic;
comparing the calculated topic heat index with a set topic heat index threshold to screen the texts.
In one implementation, screening the screened texts again according to the semantic similarity between the topic label and the text body of each text includes:
for the screened texts, calculating the topic label vector and the text body vector of each text using the pre-trained word vector model RBT3, calculating the semantic similarity between the topic label vector and the text body vector of each text by cosine similarity, comparing the semantic similarity with a set semantic similarity threshold, and screening the screened texts again.
In one implementation, merging the topic labels of the texts based on the feature-representation similarity and the time similarity between topic labels to obtain topic label nodes includes:
assigning a weight value to each of the feature-representation similarity and the time similarity between the topic labels of the texts and summing the weighted values to obtain the comprehensive similarity between topic labels, and merging the topic labels whose comprehensive similarity is higher than a set comprehensive similarity threshold.
In one implementation, the comprehensive similarity $Sim$ is calculated by the following formula:
$$Sim = w_f \cdot Sim_f + w_t \cdot Sim_t$$
wherein $Sim_f$ is the feature-representation similarity, $Sim_t$ is the time similarity, and $w_f$ and $w_t$ are their respective weight values.
The beneficial effects of the invention are as follows: microblog texts are screened using the topic heat index, so that the corpus obtained from massive social media platform data is of higher quality and the accuracy of subsequent analysis is improved; the semantic similarity between topic labels and text bodies is calculated, texts irrelevant to their topic labels are screened out, and topic labels relevant to the event are used as candidate nodes of the event context; the distribution of topic labels along the time axis is used as a time feature, and the similarity between topic labels is calculated comprehensively by combining this time feature with the feature representation of the topic labels, so that similar or repeated topic labels are effectively merged; and the key topic labels are hierarchically clustered using a hierarchical clustering algorithm based on a time axis to form a clear and ordered event context.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
It should be noted that these examples only illustrate the invention and do not limit it; simple modifications of the method made within the inventive concept all fall within the scope of protection claimed by the invention.
Referring to FIG. 1, an event context generation method based on topic heat index includes the following steps.
Taking microblogs as an example of a social media platform, related topics are collected from massive microblog texts (microblog texts under a plurality of topics can be obtained based on a given keyword), but these texts are mixed with a large amount of noise. The noise may include content that is irrelevant to the topic, of low quality, or false, which makes the topic difficult to understand and analyze. Therefore, the acquired texts first need to be preprocessed.
S100, screening the texts under a given keyword in the social media platform based on the topic heat index of each text under its topic.
Specifically, the topic heat index $H$ of each text under its topic in the social media platform is calculated:
$$H = \alpha R + \beta C + \gamma L$$
wherein $\alpha$, $\beta$ and $\gamma$ are weight coefficients whose sum is 1, $R$ is the number of times the text under the topic is forwarded, $C$ is the number of comments on the text under the topic, and $L$ is the number of likes of the text under the topic. In microblog use, liking is an easier action to perform than commenting or forwarding, so comments and forwards better represent the real influence of a text and the attention users pay to it. Therefore, $\alpha$ may be 0.4, $\beta$ may be 0.4, and $\gamma$ may be 0.2.
The calculated topic heat index is compared with a set topic heat index threshold to screen the texts. Microblogs that genuinely cover important event content tend to have relatively high popularity on social media, so the topic heat index threshold may be set at the top 10%, that is, the texts whose topic heat index ranks in the top 10% are retained, in order to select microblog texts that receive relatively high attention on social media.
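Step S100 can be sketched as follows. This is a minimal illustration that assumes the topic heat index is a weighted sum of the forward, comment and like counts with the example weights 0.4, 0.4 and 0.2, and that the top 10% threshold is applied by rank; the `Post` class and the function names are hypothetical and not taken from the invention.

```python
import math
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical container for one microblog text and its interaction counts."""
    text: str
    forwards: int
    comments: int
    likes: int


def heat_index(post, alpha=0.4, beta=0.4, gamma=0.2):
    """Weighted sum of interaction counts (assumed form of the topic heat index)."""
    return alpha * post.forwards + beta * post.comments + gamma * post.likes


def filter_top_fraction(posts, fraction=0.1):
    """Keep the posts whose heat index ranks in the top `fraction` (at least one post)."""
    if not posts:
        return []
    ranked = sorted(posts, key=heat_index, reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


if __name__ == "__main__":
    # Hypothetical data: ten posts with increasing interaction counts.
    posts = [Post(f"post {i}", forwards=i, comments=2 * i, likes=5 * i) for i in range(10)]
    for post in filter_top_fraction(posts):
        print(post.text, heat_index(post))
```

Ranking by heat index rather than comparing against a fixed number keeps the cut-off meaningful across topics with very different overall activity levels.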
By calculating the topic heat index of each microblog text, social interaction information such as the numbers of forwards, comments and likes is combined into a single index that quantifies the relevance of each text to the topic. The relevance of a text can thus be evaluated more accurately and texts closely related to the topic can be distinguished, so that noise unrelated to the topic is removed.
Because microblog content is highly uncertain, publishers may freely add topic labels to their microblog texts in order to increase exposure and attract more readers. As a result, the microblog dataset is filled with a large number of labels unrelated to the keyword event, so effective screening is required.
S200, for the screened texts, screening the topic labels of the screened texts according to the semantic similarity between the topic label and the text body of each text.
Specifically, for the screened texts, the pre-trained word vector model RBT3 is used to calculate the topic label vector $V_h$ and the text body vector $V_t$ of each text:
$$V_h = \mathrm{RBT3}(h), \quad V_t = \mathrm{RBT3}(t)$$
where $h$ is the topic label of the text and $t$ is its text body. The semantic similarity between the topic label vector $V_h$ and the text body vector $V_t$ of each text is then calculated by cosine similarity:
$$Sim(V_h, V_t) = \frac{V_h \cdot V_t}{\lVert V_h \rVert \, \lVert V_t \rVert}$$
The semantic similarity is compared with a set semantic similarity threshold to screen the topic labels of the screened texts.
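The screening in step S200 can be sketched as below. Loading RBT3 is outside the scope of this sketch, so the embedding model is passed in as a callable; the `toy_embed` bag-of-words function, the vocabulary `VOCAB`, and the threshold value 0.5 are stand-ins chosen purely for illustration and are not values given by the invention.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_labels(posts, embed, threshold=0.5):
    """Keep (label, body) pairs whose label and body vectors are similar enough.

    posts: list of (topic_label, text_body) string pairs.
    embed: callable mapping a string to a vector (RBT3 in the described method).
    """
    kept = []
    for label, body in posts:
        if cosine(embed(label), embed(body)) >= threshold:
            kept.append((label, body))
    return kept


# Stand-in embedding for the demo only: counts of a few vocabulary words.
VOCAB = ["flood", "rescue", "concert", "ticket"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(word)) for word in VOCAB]


if __name__ == "__main__":
    posts = [
        ("flood rescue", "rescue teams reach flood area"),
        ("flood rescue", "concert ticket giveaway"),  # label added only for exposure
    ]
    print(filter_labels(posts, toy_embed))
```

The second post carries a popular label that has nothing to do with its body, so its label and body vectors are orthogonal and the pair is dropped.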
This step aims to effectively reduce the noise caused by irrelevant topic labels and to ensure that the extracted label set is more accurate and representative. Key topic labels related to the event can be captured more accurately and used as candidate nodes of the event context, providing a more reliable data basis for subsequent event context analysis.
Since the same event is expressed in different ways in microblogs, similar or duplicate topic labels need to be merged effectively in order to improve label consistency and overall data quality. To address this diversity of label expression in microblog data, the method combines text feature extraction, time-series analysis and semantic similarity calculation so that similar or duplicate labels are processed and merged more comprehensively, thereby improving the consistency and quality of the data.
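S300, concatenating the topic label vector $V_h$ and the text body vector $V_t$ of each text, accumulating the concatenated vectors and then averaging them to obtain the feature representation of each topic label:
$$F_h = \frac{1}{n} \sum_{i=1}^{n} \left[ V_h ; V_{t_i} \right]$$
where $n$ is the number of texts carrying topic label $h$, $V_{t_i}$ is the text body vector of the $i$-th such text, and $[\,\cdot\,;\,\cdot\,]$ denotes vector concatenation.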
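The similarity of the feature representations between topic labels is calculated using cosine similarity:
$$Sim_f(i, j) = \frac{F_i \cdot F_j}{\lVert F_i \rVert \, \lVert F_j \rVert}$$
where $F_i$ and $F_j$ are the feature representations of topic labels $i$ and $j$.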
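The feature representation and its cosine similarity can be sketched as follows. The sketch reads "splicing, accumulating and averaging" as: concatenate the label vector with each text body vector, sum the concatenated vectors over all texts carrying that label, and divide by the number of texts. That reading, and the names `label_features` and `cosine`, are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_features(posts):
    """Average the concatenated [label_vec ; body_vec] over all texts of each label.

    posts: list of (label, label_vec, body_vec) triples.
    Returns {label: feature_vector}.
    """
    sums, counts = {}, {}
    for label, label_vec, body_vec in posts:
        joined = list(label_vec) + list(body_vec)  # concatenation ("splicing")
        if label not in sums:
            sums[label] = [0.0] * len(joined)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], joined)]  # accumulation
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]  # averaging
        for label, total in sums.items()
    }


if __name__ == "__main__":
    # Hypothetical 2-dimensional vectors standing in for RBT3 outputs.
    posts = [
        ("#flood", [1.0, 0.0], [2.0, 0.0]),
        ("#flood", [1.0, 0.0], [0.0, 2.0]),
        ("#flooding", [1.0, 0.0], [1.0, 1.0]),
    ]
    feats = label_features(posts)
    print(feats)
    print(cosine(feats["#flood"], feats["#flooding"]))
```

Averaging over all texts that carry a label means the feature representation reflects how the label is actually used, not only the wording of the label itself.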
Based on the time sequence characteristics of the topic labels, calculating the time similarity among the topic labels:
。
The time-series feature of a topic label indicates the number of occurrences of the label and its distribution over the time axis. Time similarity concerns the distribution and pattern of occurrence of labels on the time axis: labels with high similarity also tend to be similar in their time series. Thus, by using time features, the behavior and trends of labels in the time dimension can be analyzed to better process and merge similar or duplicate labels.
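One possible way to turn "number of occurrences and distribution over the time axis" into a similarity score is sketched below. It is only an assumption for illustration, not the formula of the invention: each label's occurrence timestamps are counted into fixed-width bins on a shared time axis, and the two count histograms are compared by cosine similarity. The names `time_histogram` and `time_similarity` and the bin parameters are hypothetical.

```python
import math


def time_histogram(timestamps, start, bin_width, n_bins):
    """Count occurrences per fixed-width time bin; timestamps outside the axis are ignored."""
    counts = [0] * n_bins
    for t in timestamps:
        idx = int((t - start) // bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def time_similarity(ts_a, ts_b, start, bin_width, n_bins):
    """Cosine similarity of two occurrence histograms (assumed stand-in measure)."""
    a = time_histogram(ts_a, start, bin_width, n_bins)
    b = time_histogram(ts_b, start, bin_width, n_bins)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical occurrence times in hours since the start of collection.
    label_a = [0.0, 1.0, 2.0]
    label_b = [0.5, 1.5, 2.5]   # active in the same hours as label_a
    label_c = [10.0, 11.0]      # active much later
    print(time_similarity(label_a, label_b, start=0.0, bin_width=1.0, n_bins=12))
    print(time_similarity(label_a, label_c, start=0.0, bin_width=1.0, n_bins=12))
```

Under this stand-in measure, two labels that are active in the same time bins score close to 1, while labels active in disjoint periods score 0, which matches the intuition that duplicate labels for one event rise and fall together.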
S400, merging the topic labels of the texts based on the feature-representation similarity and the time similarity between topic labels to obtain topic label nodes.
To evaluate the similarity between two labels more comprehensively, the feature-representation similarity and the time similarity are combined: each is given a weight value and the weighted values are summed to obtain the comprehensive similarity between topic labels, and topic labels whose comprehensive similarity is higher than a set comprehensive similarity threshold are merged.
The comprehensive similarity is as follows:
$$Sim = w_f \cdot Sim_f + w_t \cdot Sim_t$$
where $Sim_f$ is the feature-representation similarity, $Sim_t$ is the time similarity, and $w_f$ and $w_t$ are their respective weight values. The feature-representation similarity may be given a higher weight (0.8) to emphasize the importance of text content in judging label similarity. The time similarity may be given a lower weight (0.2) for balance because, although time features are important, in some cases they do not reflect the semantic relevance of labels as directly as the text content does.
A comprehensive similarity threshold is set to select the topic labels that are similar enough to be merged. The default value of the threshold is 0.6; the strictness of the merging operation can be flexibly controlled by adjusting the threshold, so as to better meet different research requirements.
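The merging in step S400 can be sketched as follows, using the example weights 0.8 and 0.2 and the default threshold 0.6. The sketch assumes the pairwise similarities are already available, and it merges transitively with a small union-find structure; transitive merging and the names `combined_similarity` and `merge_labels` are assumptions made for illustration.

```python
def combined_similarity(feat_sim, time_sim, w_feat=0.8, w_time=0.2):
    """Weighted sum of feature-representation similarity and time similarity."""
    return w_feat * feat_sim + w_time * time_sim


def merge_labels(labels, feat_sim, time_sim, threshold=0.6):
    """Group labels whose combined similarity exceeds the threshold.

    labels: list of label names.
    feat_sim, time_sim: dicts keyed by index pairs (i, j) with i < j.
    Returns a list of groups; each group becomes one topic label node.
    """
    parent = list(range(len(labels)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            score = combined_similarity(feat_sim[(i, j)], time_sim[(i, j)])
            if score > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(find(i), []).append(label)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical labels and pairwise similarities.
    labels = ["#flood", "#flooding", "#concert"]
    feat_sim = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.2}
    time_sim = {(0, 1): 0.5, (0, 2): 0.9, (1, 2): 0.1}
    print(merge_labels(labels, feat_sim, time_sim))
```

In this toy data the first and third labels are active at similar times (time similarity 0.9), but their content differs, so the low time weight keeps them apart; raising the threshold makes merging stricter, lowering it makes merging more permissive.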
S500, using an improved hierarchical clustering algorithm based on a time axis, the earliest topic label node is selected from the topic label nodes as the root node of the event context; then, in chronological order, each subsequent topic label node is compared one by one with the nodes already in the current event context, the existing node with the highest similarity is selected as its parent node, and the new node is incorporated into the corresponding branch, until every topic label node has been merged into the event context.
After the topic label nodes are obtained, the core task is to sort out the relations among the nodes so as to construct the event context. The invention introduces an improved hierarchical clustering algorithm based on a time axis, whose core idea is to use the order of the time axis to cluster the nodes hierarchically and thereby construct an ordered and clear event context. First, the earliest node on the time axis is extracted from the topic label nodes as the starting point of the story; this node is regarded as the root node of the whole event context. Then, following the order of the time axis, the next node is compared one by one with the nodes already in the current context, the existing node with the highest similarity is selected as its parent node, and the new node is attached to the corresponding branch. This iterative process continues until every node has been merged into the context. The distinctive feature of this design is that it effectively establishes associations between topic label nodes in the time dimension, so that related nodes are organized in chronological order to form a complete context.
Through this algorithm, the topic label nodes are organized hierarchically, so that the development of the event is presented in a clear and orderly structure. Compared with traditional clustering methods, the method makes full use of time information, so that the timeline of events runs through the whole context construction process. This design provides a way to understand the evolution of microblog events in depth: building the context is no longer a simple connection of nodes, but a presentation that follows the actual course of the event more closely.
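The tree construction in step S500 can be sketched as follows. The sketch assumes each topic label node has a name and an earliest timestamp and that a pairwise similarity function is supplied by the caller (for instance the comprehensive similarity); `build_event_context`, `table_similarity`, and the toy data are hypothetical names and values introduced for illustration.

```python
def build_event_context(nodes, similarity):
    """Attach nodes in time order, each to its most similar already-placed node.

    nodes: list of (name, timestamp) pairs.
    similarity: callable (name_a, name_b) -> float.
    Returns {node_name: parent_name}; the root maps to None.
    """
    ordered = sorted(nodes, key=lambda node: node[1])
    parent_of = {}
    placed = []
    for name, _timestamp in ordered:
        if not placed:
            parent_of[name] = None  # earliest node becomes the root
        else:
            # Compare with every node already in the context, keep the best match.
            parent_of[name] = max(placed, key=lambda other: similarity(name, other))
        placed.append(name)
    return parent_of


# Hypothetical symmetric similarity table for the demo.
SIM = {
    frozenset(["outbreak", "rescue"]): 0.8,
    frozenset(["outbreak", "donation"]): 0.2,
    frozenset(["rescue", "donation"]): 0.7,
    frozenset(["outbreak", "rebuild"]): 0.6,
    frozenset(["rescue", "rebuild"]): 0.3,
    frozenset(["donation", "rebuild"]): 0.4,
}


def table_similarity(a, b):
    return SIM[frozenset([a, b])]


if __name__ == "__main__":
    nodes = [("rebuild", 4), ("outbreak", 1), ("donation", 3), ("rescue", 2)]
    for child, parent in build_event_context(nodes, table_similarity).items():
        print(f"{child} <- {parent}")
```

Because a node can only attach to nodes that occurred earlier, every branch of the resulting tree reads in chronological order from the root outward.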
Referring to FIG. 2, an event context generation system based on topic heat index includes a first screening module 100, a second screening module 200, a topic label similarity calculation module 300, a topic label node construction module 400, and an event context generation module 500.
The first screening module 100 is configured to screen the texts under a given keyword in the social media platform based on the topic heat index of each text under its topic.
The second screening module 200 is configured to screen, for the screened texts, the topic labels of the screened texts according to the semantic similarity between the topic label and the text body of each text.
The topic label similarity calculation module 300 is configured to concatenate the topic label vector and the text body vector of each text, accumulate the results and then average them to obtain the feature representation of each topic label, to calculate the similarity of the feature representations between topic labels using cosine similarity, and to calculate the time similarity between topic labels based on the time-series features of the topic labels.
The topic label node construction module 400 is configured to merge the topic labels of the texts based on the feature-representation similarity and the time similarity between topic labels to obtain topic label nodes.
The event context generation module 500 is configured to select, using an improved hierarchical clustering algorithm based on a time axis, the earliest topic label node from the topic label nodes as the root node of the event context; then, in chronological order, to compare each subsequent topic label node one by one with the nodes already in the current event context, to select the existing node with the highest similarity as its parent node, and to incorporate the new node into the corresponding branch, until every topic label node has been merged into the event context.
Screening the texts based on the topic heat index of each text under its topic in the social media platform includes:
calculating the topic heat index $H$ of each text under its topic in the social media platform:
$$H = \alpha R + \beta C + \gamma L$$
wherein $\alpha$, $\beta$ and $\gamma$ are weight coefficients whose sum is 1, $R$ is the number of times the text under the topic is forwarded, $C$ is the number of comments on the text under the topic, and $L$ is the number of likes of the text under the topic;
comparing the calculated topic heat index with a set topic heat index threshold to screen the texts.
Screening the screened texts again according to the semantic similarity between the topic label and the text body of each text includes:
for the screened texts, calculating the topic label vector and the text body vector of each text using the pre-trained word vector model RBT3, calculating the semantic similarity between the topic label vector and the text body vector of each text by cosine similarity, comparing the semantic similarity with a set semantic similarity threshold, and screening the screened texts again.
Merging the topic labels of the re-screened texts based on the feature-representation similarity and the time similarity between topic labels to obtain topic label nodes includes:
assigning a weight value to each of the feature-representation similarity and the time similarity between the topic labels of the texts and summing the weighted values to obtain the comprehensive similarity between topic labels, and merging the topic labels whose comprehensive similarity is higher than a set comprehensive similarity threshold.
The comprehensive similarity $Sim$ is calculated by the following formula:
$$Sim = w_f \cdot Sim_f + w_t \cdot Sim_t$$
wherein $Sim_f$ is the feature-representation similarity, $Sim_t$ is the time similarity, and $w_f$ and $w_t$ are their respective weight values.
Finally, it is noted that the above embodiments illustrate rather than limit the invention, and those skilled in the art will understand that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.