CN109284379A - Adaptive microblog topic tracking method based on dual vector model

Info

Publication number: CN109284379A
Application number: CN201811106923.XA
Authority: CN
Inventors: 郭文忠; 黄畅; 郭昆; 陈羽中
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2018-09-21
Filing date: 2018-09-21
Publication date: 2019-01-29
Anticipated expiration: 2038-09-21
Also published as: CN109284379B

Abstract

The present invention relates to an adaptive microblog topic tracking method based on a dual vector model, comprising: S1: microblog slicing, in which microblogs are divided into time slots by day; S2: building the dual vector model of the topic; S3: building the dual vector model of the microblogs, so that both the topic and the microblogs are expressed as vectors; S4: computing the cosine similarity between the topic and each microblog, where a larger cosine similarity means the topic and the microblog are more similar; S5: adaptive learning of the similarity threshold and threshold comparison, which overcomes the topic drift caused by a fixed similarity threshold; S6: updating the topic model, which overcomes the topic drift caused by a fixed topic model; S7: judging whether all time slots have been processed; if not, entering the next time slot and repeating steps S4-S7; otherwise, ending the algorithm. The present invention can track topics in real time and reduce the missed detection rate and false detection rate of topic-related microblogs.

Description

Adaptive microblog topic tracking method based on dual vector model

Technical field

The present invention relates to the technical field of Chinese text processing in natural language processing, and in particular to an adaptive microblog topic tracking method based on a dual vector model.

Background technique

As a representative form of social media, microblogging has attracted wide public attention and generates massive amounts of data every day. Microblog users tend to follow the progress of hot topics, so within the real-time microblog stream they have an urgent need for dynamic topic updates. Topic tracking, one of the subtasks of topic detection and tracking, offers a good way to address information overload on the Internet. Topic tracking continuously follows the subsequent texts of a known topic and extracts the evolution of the topic for the user. It plays an important guiding role in practical applications such as personalized recommendation for users, generation of summaries and viewpoints, and emergency monitoring of sudden events.

Microblog topic tracking methods fall broadly into classification-based methods and query-vector-based methods. Classification-based methods train a classifier on a large corpus of microblogs from known topics and use it to classify subsequent documents. Query-vector-based methods build a query vector from a prior data set, compute the similarity between each subsequent microblog and the query vector, and make a decision against a similarity threshold to complete topic tracking. At present, microblog topic tracking suffers from sparse features, topic drift, and the loss of part of a microblog's information during vectorization. For feature sparsity, several feature expansion methods have been proposed. To cope with topic drift, methods such as feedback iteration and word probability have been proposed. For microblog vectorization, VSM or word embedding vectorization is generally used to retain either the new words or the semantic information of a microblog. However, shortcomings remain: after vectorization either the semantics of the microblog are lost or its new words are ignored, and topic drift cannot be fully overcome.

Summary of the invention

In view of this, the object of the present invention is to provide an adaptive microblog topic tracking method based on a dual vector model, which can track topics in real time and reduce the missed detection rate and false detection rate of topic-related microblogs.

To achieve the above object, the present invention adopts the following technical scheme:

An adaptive microblog topic tracking method based on a dual vector model, comprising the following steps:

Step S1: divide the microblogs to be tracked into time slots by date, with microblogs from the same day belonging to one time slot;

Step S2: build the dual vector model of the initial topic;

Step S3: build the dual vector model of the microblogs;

Step S4: compute the similarity between the topic and each microblog from the dual vector model of the initial topic and the dual vector model of the microblog;

Step S5: based on the similarity obtained between the topic and the microblog, perform adaptive learning of the similarity threshold and threshold comparison;

Step S6: update the topic model;

Step S7: judge whether all time slots have been processed; if not, enter the next time slot; otherwise, end the algorithm and complete the microblog topic tracking.
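Step S1 can be illustrated with a minimal sketch. It assumes each microblog is available as a `(timestamp, text)` pair; the function name `shard_by_day` and the sample posts are hypothetical and not taken from the patent.

```python
from collections import defaultdict
from datetime import datetime


def shard_by_day(microblogs):
    """Group (timestamp, text) pairs into one time slot per calendar day."""
    slots = defaultdict(list)
    for timestamp, text in microblogs:
        slots[timestamp.date()].append(text)
    # Return the slots in chronological order so they can be processed in turn.
    return [slots[day] for day in sorted(slots)]


# Hypothetical sample data: two posts on one day, one on the next.
posts = [
    (datetime(2018, 9, 21, 8, 0), "post a"),
    (datetime(2018, 9, 22, 9, 30), "post c"),
    (datetime(2018, 9, 21, 23, 59), "post b"),
]
time_slots = shard_by_day(posts)
```

Each element of `time_slots` then corresponds to one pass of steps S4-S7.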

Further, the dual vector model of the initial topic is constructed as follows:

Step S21: use the BTM topic model to mine the latent topic-word distribution from randomly selected initial topic microblogs, and select the m words with the highest probability together with their normalized probability values as the features representing the initial topic;

Step S22: express the feature set as vectors using both the VSM vectorization method and the word embedding vectorization method, so that the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is obtained from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (2):

$$k = \{k_1, k_2, \ldots, k_n\} \tag{1}$$

$$k_i = \sum_{j=1}^{m} w_{ij} \times rate_j \tag{2}$$

where $k$ denotes the vector, $n$ the vector dimension, $k_i$ the value of the $i$-th dimension of $k$, $m$ the number of feature words, $w_{ij}$ the value of the $i$-th dimension of the word vector of the $j$-th feature word, and $rate_j$ the feature weight of the $j$-th feature word. The VSM vector is expressed in the manner of the vector space model: each feature corresponds to one dimension of the vector, the value in that dimension is the weight of the corresponding feature, and the value is 0 if the feature does not occur in the text.
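The weighted sum of word vectors described for the Word2Vec half of the model can be sketched as follows. This is only an illustration: the embeddings are a toy dictionary rather than a trained Word2Vec model, the name `weighted_word_vector` is hypothetical, and skipping words that have no embedding is an assumption the text does not state.

```python
def weighted_word_vector(features, embeddings):
    """Sum the word vectors of the feature words, each scaled by its weight.

    features:   {word: feature weight}
    embeddings: {word: list of floats}, all of the same dimension n
    """
    dim = len(next(iter(embeddings.values())))
    k = [0.0] * dim
    for word, rate in features.items():
        vec = embeddings.get(word)
        if vec is None:
            # Assumption: a feature word without an embedding contributes nothing.
            continue
        for i in range(dim):
            k[i] += vec[i] * rate
    return k


# Toy 2-dimensional embeddings, purely illustrative.
toy_embeddings = {"quake": [1.0, 0.0], "rescue": [0.0, 2.0]}
topic_features = {"quake": 0.75, "rescue": 0.25}
topic_w2v = weighted_word_vector(topic_features, toy_embeddings)
```

The same function applies unchanged to a microblog's feature set, since both halves of the method use the same weighted sum.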

Further, the dual vector model of the microblog is constructed as follows:

Step S31: use the TFIDF algorithm to extract from the microblog the m words with the largest TFIDF values, and use their normalized TFIDF values as the feature weights. The TFIDF value is computed as shown in formula (3):

$$TFIDF_w = tf_w \times \lg\left(\frac{M}{M_w} + 0.01\right) \tag{3}$$

where $TFIDF_w$ denotes the TFIDF value of word $w$, $tf_w$ the number of occurrences of $w$ in the current microblog, $M$ the total number of microblogs, and $M_w$ the number of texts containing $w$.
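A minimal sketch of the TFIDF weighting in formula (3) follows. It assumes the microblog is already tokenized and that document frequencies are precomputed for every token. The text does not say how the values are normalized, so dividing by the sum of the selected values is an assumption, and the helper names are hypothetical.

```python
import math
from collections import Counter


def tfidf(tokens, doc_freq, total_docs):
    """Formula (3): tf_w * lg(M / M_w + 0.01), with lg as the base-10 logarithm."""
    tf = Counter(tokens)
    return {w: tf[w] * math.log10(total_docs / doc_freq[w] + 0.01) for w in tf}


def top_m_normalized(scores, m):
    """Keep the m highest-scoring words and rescale their scores to sum to 1.

    Assumption: sum-to-one normalization; the source only says "normalized".
    """
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:m]
    total = sum(value for _, value in top)
    return {word: value / total for word, value in top} if total else {}


# Hypothetical counts: 100 microblogs, "flood" in 10 of them, "city" in all.
scores = tfidf(["flood", "flood", "city"], {"flood": 10, "city": 100}, 100)
weights = top_m_normalized(scores, m=2)
```

A word that appears in every text still receives a small positive weight because of the 0.01 term inside the logarithm.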

Step S32: express the feature set as vectors using both the VSM vectorization method and the word embedding vectorization method, so that the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is obtained from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (5):

$$p = \{p_1, p_2, \ldots, p_n\} \tag{4}$$

$$p_i = \sum_{j=1}^{m} w_{ij} \times rate_j \tag{5}$$

where $p$ denotes the vector, $n$ the vector dimension, $p_i$ the value of the $i$-th dimension of $p$, $m$ the number of feature words, $w_{ij}$ the value of the $i$-th dimension of the word vector of the $j$-th feature word, and $rate_j$ the feature weight of the $j$-th feature word. The VSM vector is expressed in the manner of the vector space model: each feature corresponds to one dimension of the vector, the value in that dimension is the weight of the corresponding feature, and the value is 0 if the feature does not occur in the text.
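The VSM half of the model can be sketched as below. To compare a topic with a microblog, both vectors need the same dimensions; using the sorted union of the two feature sets as the shared vocabulary is an assumption made for this illustration, and `vsm_vectors` is a hypothetical name.

```python
def vsm_vectors(topic_features, blog_features):
    """Build aligned VSM vectors: one dimension per feature, 0 where it is absent."""
    # Assumption: the shared vocabulary is the union of both feature sets.
    vocab = sorted(set(topic_features) | set(blog_features))
    topic_vec = [topic_features.get(word, 0.0) for word in vocab]
    blog_vec = [blog_features.get(word, 0.0) for word in vocab]
    return vocab, topic_vec, blog_vec


vocab, topic_vsm, blog_vsm = vsm_vectors(
    {"quake": 0.6, "rescue": 0.4},
    {"quake": 0.5, "donate": 0.5},
)
```

Because every distinct word gets its own dimension, a newly coined word that has no trained embedding still contributes to the VSM vector.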

Further, step S4 is specifically:

Step S41: compute the cosine similarity between the VSM vector in the topic dual vector model and the VSM vector in the microblog dual vector model. The cosine similarity is computed by formula (6):

$$Sim_{kd} = \frac{\sum_{i=1}^{n} k_i d_i}{\sqrt{\sum_{i=1}^{n} k_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}} \tag{6}$$

where $Sim_{kd}$ denotes the cosine similarity of vector $k$ and vector $d$, $k_i$ the value of the $i$-th dimension of $k$, and $d_i$ the value of the $i$-th dimension of $d$;

Step S42: compute the cosine similarity between the Word2Vec vector in the topic dual vector model and the Word2Vec vector in the microblog dual vector model;

Step S43: combine the cosine similarity between the VSM vectors and the similarity between the Word2Vec vectors as the similarity between the topic and the microblog, computed as shown in formula (7):

where $Sim$ denotes the similarity between the topic and the microblog, $sim_{vsm}$ the similarity between the VSM vectors in the dual vector models, and $sim_{word2vec}$ the similarity between the Word2Vec vectors in the dual vector models. The larger the value, the more similar the topic and the microblog.
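The cosine similarity used in steps S41 and S42 can be sketched as follows. The way the two similarities are combined in formula (7) is not recoverable from the text, so `combined_similarity` below uses a plain weighted average with a hypothetical parameter `alpha`; that choice is an assumption for illustration only and should not be read as the patent's formula.

```python
import math


def cosine(k, d):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(k, d))
    norm = math.sqrt(sum(a * a for a in k)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


def combined_similarity(sim_vsm, sim_word2vec, alpha=0.5):
    # ASSUMPTION: a weighted average stands in for formula (7), which is
    # missing from the source text. alpha is a hypothetical parameter.
    return alpha * sim_vsm + (1 - alpha) * sim_word2vec
```

For example, `cosine([1, 0], [0, 1])` is 0.0 for orthogonal vectors, and a vector compared with itself scores 1.0.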

Further, step S5 is specifically:

Step S51: the similarity threshold is divided into a lowest similarity threshold $\varepsilon$ and a feedback threshold $\delta$. The initial feedback threshold of each topic is the average similarity between the initial topic and the initial topic-related microblogs. During tracking, the feedback threshold depends on the average similarity between the topic and the feedback microblogs of the preceding s time slots; the closer a time slot is in time, the stronger its influence. The thresholds $\varepsilon$ and $\delta$ are computed as shown in formulas (8)-(9):

$$\varepsilon_t = \delta_t - C \tag{9}$$

where $t$ denotes the $t$-th time slot, $\delta_t$ the feedback threshold of time slot $t$, $feedsim_i$ the average similarity between the topic and the feedback microblogs of the $i$-th time slot, $\varepsilon_t$ the lowest threshold of time slot $t$, and $C$ the topic tolerance. The lowest threshold depends on the feedback threshold: it equals the feedback threshold minus the topic tolerance $C$;

Step S52: if the similarity between a microblog and the topic is greater than the feedback threshold, the microblog is highly related to the topic and is added to the feedback microblog set used to generate the new topic model; if the similarity between the microblog and the topic is greater than the lowest threshold, the microblog is judged to be a topic-related microblog; otherwise, if the similarity is not greater than the lowest threshold, the microblog is judged to be a topic-unrelated microblog.
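The two-threshold decision of step S52, together with formula (9) and the initial feedback threshold, can be sketched as below. The adaptive update of the feedback threshold in formula (8) is not reproduced because it is missing from the text. The function names and the string labels are hypothetical.

```python
def initial_feedback_threshold(similarities):
    """Mean similarity between the initial topic and its related microblogs."""
    return sum(similarities) / len(similarities)


def lowest_threshold(feedback_threshold, tolerance):
    """Formula (9): lowest threshold = feedback threshold - topic tolerance C."""
    return feedback_threshold - tolerance


def classify(similarity, feedback_threshold, tolerance):
    """Return 'feedback', 'related', or 'unrelated' for one microblog.

    'feedback' microblogs are also topic-related, since the feedback
    threshold lies above the lowest threshold.
    """
    if similarity > feedback_threshold:
        return "feedback"
    if similarity > lowest_threshold(feedback_threshold, tolerance):
        return "related"
    return "unrelated"
```

With a feedback threshold of 0.75 and a tolerance of 0.25, a similarity of 0.9 is labelled `feedback`, 0.6 is `related`, and 0.5 is `unrelated` because the comparison is strict.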

Further, step S6 is specifically:

Step S61: use the BTM topic model to select topic features from the initial topic microblog set and generate the initial topic model;

Step S62: use the BTM topic model to select topic features from the feedback microblog set and generate the dynamic topic model;

Step S63: add the features of the initial topic model and the dynamic topic model to the original topic model. If a feature already exists in the original topic model, update its weight in the original topic model to the maximum weight of that feature across the three models. Then sort the features of the original topic model in descending order of weight, and select the top T features and their weights as the new topic model, which replaces the original topic model.
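The merge in step S63 can be sketched as follows, assuming each topic model is a plain `{feature: weight}` dictionary (the patent obtains these from BTM, which is not reproduced here). The function name `update_topic_model` and the sample weights are hypothetical.

```python
def update_topic_model(original, initial, dynamic, top_t):
    """Merge three {feature: weight} models, keeping the maximum weight per feature."""
    merged = dict(original)
    for model in (initial, dynamic):
        for word, weight in model.items():
            # A feature seen in several models keeps its largest weight.
            merged[word] = max(merged.get(word, 0.0), weight)
    # Sort by weight in descending order and keep the top T features.
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_t])


new_model = update_topic_model(
    original={"quake": 0.5, "rescue": 0.3},
    initial={"quake": 0.6, "magnitude": 0.2},
    dynamic={"rescue": 0.4, "aftershock": 0.35},
    top_t=3,
)
```

Keeping the initial topic model in the merge means the original core features can always re-enter the model, even after the feedback microblogs have shifted its vocabulary.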

Further, step S7 is specifically:

Step S71: judge whether all time slots have been processed; if not, enter the next time slot and execute step S72; otherwise, end the algorithm and complete the microblog topic tracking;

Step S72: express the microblogs as vectors using the microblog dual vector model construction method described in step S22;

Step S73: express the topic features of the new topic model as vectors using the vectorization method described in step S21;

Step S74: repeat steps S4-S7.
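The per-slot loop of steps S4-S7 can be sketched as a skeleton. This is a deliberately simplified illustration: the feedback threshold `delta` is held fixed instead of being re-learned by formula (8), which is missing from the text, and the scoring and refresh functions are passed in as hypothetical callables. The toy `overlap` and `absorb` helpers stand in for the dual vector similarity and the BTM-based model update.

```python
def track_topic(time_slots, topic, score, delta, tolerance, refresh):
    """Process time slots in order and return (related microblogs, final topic)."""
    related = []
    for slot in time_slots:                  # S7: one pass per time slot
        eps = delta - tolerance              # formula (9)
        feedback = []
        for blog in slot:
            sim = score(topic, blog)         # S4: topic-microblog similarity
            if sim > eps:                    # S52: topic-related
                related.append(blog)
            if sim > delta:                  # S52: highly related, kept as feedback
                feedback.append(blog)
        if feedback:
            topic = refresh(topic, feedback)  # S6: update the topic model
    return related, topic


def overlap(topic, blog):
    # Toy similarity: share of the microblog's words that occur in the topic.
    return len(topic & blog) / len(blog)


def absorb(topic, feedback):
    # Toy update: add every word of the feedback microblogs to the topic.
    return topic.union(*feedback)


slots = [
    [{"quake", "rescue", "aftershock"}, {"cat"}],
    [{"aftershock", "magnitude"}],
]
related, final_topic = track_topic(
    slots, {"quake", "rescue"}, overlap, delta=0.6, tolerance=0.5, refresh=absorb
)
```

In this toy run the second-day microblog is only recognized as related because the first day's feedback added "aftershock" to the topic, which is the effect the topic model update is meant to achieve.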

Compared with the prior art, the invention has the following beneficial effects:

The present invention proposes a dual vector model to represent topics and microblogs: word embedding preserves the semantic properties of the text, while VSM vectorization preserves new-word information. It introduces a time attribute and proposes a strategy for adaptively learning the similarity threshold, which reduces the missed detection rate of topic-related microblogs and improves the performance of the topic tracking algorithm. It dynamically updates the topic model during tracking to cope with topic drift as the topic evolves, reducing both the missed detection rate and the false detection rate of topic-related microblogs.

Brief description of the drawings

Fig. 1 is the implementation flow chart in one embodiment of the invention.

Specific embodiment

The present invention will be further described with reference to the accompanying drawings and embodiments.

Referring to Fig. 1, the present invention provides an adaptive microblog topic tracking method based on a dual vector model, comprising the following steps:

Step S2: build the dual vector model of the initial topic;

$$k = \{k_1, k_2, \ldots, k_n\} \tag{1}$$

Step S3: build the dual vector model of the microblogs;

$$TFIDF_w = tf_w \times \lg\left(\frac{M}{M_w} + 0.01\right) \tag{3}$$

$$p = \{p_1, p_2, \ldots, p_n\} \tag{4}$$

$$\varepsilon_t = \delta_t - C \tag{9}$$

Step S52: if the similarity between a microblog and the topic is greater than the feedback threshold, the microblog is highly related to the topic and is added to the feedback microblog set used to generate the new topic model; if the similarity between the microblog and the topic is greater than the lowest threshold, the microblog is judged to be a topic-related microblog; otherwise, if the similarity is not greater than the lowest threshold, the microblog is judged to be a topic-unrelated microblog. The feedback threshold is used to select microblogs highly related to the topic as feedback microblogs for updating the topic model, whereas the lowest similarity threshold is the minimum boundary for a microblog to belong to the topic. The feedback threshold is greater than the lowest threshold.

Step S6: update the topic model;

Step S63: add the features of the initial topic model and the dynamic topic model to the original topic model. If a feature already exists in the original topic model, update its weight in the original topic model to the maximum weight of that feature across the three models. Then sort the features of the original topic model in descending order of weight, and select the top T features and their weights as the new topic model, which replaces the original topic model. To improve the efficiency of the topic tracking method, topic model updates are subject to a time condition and a feedback microblog count threshold feed (feed is set to 10). If the topic were updated whenever any feedback microblog was added, updates would be too frequent and would hurt tracking efficiency. Moreover, if too few feedback microblogs are added in a time slot, they may be noise microblogs, so the topic is not updated. Therefore, at the end of a time slot, the topic is updated if the number of newly added feedback microblogs is greater than feed; otherwise, the topic is not updated. In general, 20 features are enough to represent a topic, so T is set to 20.
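The update condition described in this embodiment reduces to a single comparison at the end of each time slot. A minimal sketch, using the embodiment's values feed = 10 and T = 20 as defaults; the constant and function names are hypothetical.

```python
FEED = 10   # feedback microblog count threshold from the embodiment
TOP_T = 20  # number of features kept per topic model in the embodiment


def should_update_topic(new_feedback_count, feed=FEED):
    """Update only if strictly more than `feed` feedback microblogs arrived in the slot."""
    return new_feedback_count > feed
```

With these defaults, a slot that contributes exactly 10 feedback microblogs leaves the topic model unchanged, and a slot with 11 triggers an update.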

Step S7: judge whether all time slots have been processed; if not, enter the next time slot; otherwise, end the algorithm and complete the microblog topic tracking.

Step S71: judge whether all time slots have been processed; if not, enter the next time slot and execute step S72; otherwise, end the algorithm and complete the microblog topic tracking;

Step S74: repeat steps S4-S7.

The foregoing is merely a preferred embodiment of the present invention. All equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope of the present invention.

Claims

1. An adaptive microblog topic tracking method based on a dual vector model, characterized by comprising the following steps:

Step S2: build the dual vector model of the initial topic;

Step S3: build the dual vector model of the microblogs;

Step S4: compute the similarity between the topic and each microblog from the dual vector model of the initial topic and the dual vector model of the microblog;

Step S5: based on the similarity obtained between the topic and the microblog, perform adaptive learning of the similarity threshold and threshold comparison;

Step S6: update the topic model;

Step S7: judge whether all time slots have been processed; if not, enter the next time slot; otherwise, end the algorithm and complete the microblog topic tracking.

2. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that the dual vector model of the initial topic is constructed as follows:

Step S21: use the BTM topic model to mine the latent topic-word distribution from randomly selected initial topic microblogs, and select the m words with the highest probability together with their normalized probability values as the features representing the initial topic;

Step S22: express the feature set as vectors using both the VSM vectorization method and the word embedding vectorization method, so that the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is obtained from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (2):

$$k = \{k_1, k_2, \ldots, k_n\} \tag{1}$$

$$k_i = \sum_{j=1}^{m} w_{ij} \times rate_j \tag{2}$$

where $k$ denotes the vector, $n$ the vector dimension, $k_i$ the value of the $i$-th dimension of $k$, $m$ the number of feature words, $w_{ij}$ the value of the $i$-th dimension of the word vector of the $j$-th feature word, and $rate_j$ the feature weight of the $j$-th feature word. The VSM vector is expressed in the manner of the vector space model: each feature corresponds to one dimension of the vector, the value in that dimension is the weight of the corresponding feature, and the value is 0 if the feature does not occur in the text.

3. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that the dual vector model of the microblog is constructed as follows:

$$TFIDF_w = tf_w \times \lg\left(\frac{M}{M_w} + 0.01\right) \tag{3}$$

where $TFIDF_w$ denotes the TFIDF value of word $w$, $tf_w$ the number of occurrences of $w$ in the current microblog, $M$ the total number of microblogs, and $M_w$ the number of texts containing $w$;

Step S32: express the feature set as vectors using both the VSM vectorization method and the word embedding vectorization method, so that the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is obtained from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (5):

$$p = \{p_1, p_2, \ldots, p_n\} \tag{4}$$

$$p_i = \sum_{j=1}^{m} w_{ij} \times rate_j \tag{5}$$

where $p$ denotes the vector, $n$ the vector dimension, $p_i$ the value of the $i$-th dimension of $p$, $m$ the number of feature words, $w_{ij}$ the value of the $i$-th dimension of the word vector of the $j$-th feature word, and $rate_j$ the feature weight of the $j$-th feature word. The VSM vector is expressed in the manner of the vector space model: each feature corresponds to one dimension of the vector, the value in that dimension is the weight of the corresponding feature, and the value is 0 if the feature does not occur in the text.

4. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that step S4 is specifically:

Step S41: compute the cosine similarity between the VSM vector in the topic dual vector model and the VSM vector in the microblog dual vector model. The cosine similarity is computed by formula (6):

$$Sim_{kd} = \frac{\sum_{i=1}^{n} k_i d_i}{\sqrt{\sum_{i=1}^{n} k_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}} \tag{6}$$

where $Sim_{kd}$ denotes the cosine similarity of vector $k$ and vector $d$, $k_i$ the value of the $i$-th dimension of $k$, and $d_i$ the value of the $i$-th dimension of $d$;

Step S42: compute the cosine similarity between the Word2Vec vector in the topic dual vector model and the Word2Vec vector in the microblog dual vector model;

Step S43: combine the cosine similarity between the VSM vectors and the similarity between the Word2Vec vectors as the similarity between the topic and the microblog, computed as shown in formula (7):

where $Sim$ denotes the similarity between the topic and the microblog, $sim_{vsm}$ the similarity between the VSM vectors in the dual vector models, and $sim_{word2vec}$ the similarity between the Word2Vec vectors in the dual vector models. The larger the value, the more similar the topic and the microblog.

5. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that step S5 is specifically:

Step S51: the similarity threshold is divided into a lowest similarity threshold $\varepsilon$ and a feedback threshold $\delta$. The initial feedback threshold of each topic is the average similarity between the initial topic and the initial topic-related microblogs. During tracking, the feedback threshold depends on the average similarity between the topic and the feedback microblogs of the preceding s time slots; the closer a time slot is in time, the stronger its influence. The thresholds $\varepsilon$ and $\delta$ are computed as shown in formulas (8)-(9):

$$\varepsilon_t = \delta_t - C \tag{9}$$

where $t$ denotes the $t$-th time slot, $\delta_t$ the feedback threshold of time slot $t$, $feedsim_i$ the average similarity between the topic and the feedback microblogs of the $i$-th time slot, $\varepsilon_t$ the lowest threshold of time slot $t$, and $C$ the topic tolerance. The lowest threshold depends on the feedback threshold: it equals the feedback threshold minus the topic tolerance $C$;

Step S52: if the similarity between a microblog and the topic is greater than the feedback threshold, the microblog is highly related to the topic and is added to the feedback microblog set used to generate the new topic model; if the similarity between the microblog and the topic is greater than the lowest threshold, the microblog is judged to be a topic-related microblog; otherwise, if the similarity is not greater than the lowest threshold, the microblog is judged to be a topic-unrelated microblog.

6. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that step S6 is specifically:

Step S61: use the BTM topic model to select topic features from the initial topic microblog set and generate the initial topic model;

Step S63: add the features of the initial topic model and the dynamic topic model to the original topic model. If a feature already exists in the original topic model, update its weight in the original topic model to the maximum weight of that feature across the three models. Then sort the features of the original topic model in descending order of weight, and select the top T features and their weights as the new topic model, which replaces the original topic model.

7. The adaptive microblog topic tracking method based on a dual vector model according to claim 2, characterized in that step S7 is specifically:

Step S71: judge whether all time slots have been processed; if not, enter the next time slot and execute step S72; otherwise, end the algorithm and complete the microblog topic tracking;

Step S73: express the topic features of the new topic model as vectors using the vectorization method described in step S21;

Step S74: repeat steps S4-S7.