CN109284379A - Adaptive microblog topic tracking method based on dual vector model - Google Patents
- Publication number
- CN109284379A (application CN201811106923.XA)
- Authority
- CN
- China
- Prior art keywords
- topic
- vector
- microblogging
- value
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Economics (AREA)
- Human Resources & Organizations (AREA)
- Marketing (AREA)
- Primary Health Care (AREA)
- Strategic Management (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to an adaptive microblog topic tracking method based on a dual vector model, comprising: S1: microblog sharding, in which microblogs are sharded by day; S2: building the topic dual vector model; S3: building the microblog dual vector model, so that the topic and the microblog are each represented as vectors; S4: computing the cosine similarity between the topic and the microblog, where a larger cosine similarity indicates that the topic and the microblog are more similar; S5: adaptive learning of the similarity threshold and threshold comparison, which overcomes the topic drift caused by a fixed similarity threshold; S6: updating the topic model, which overcomes the topic drift caused by a fixed topic model; S7: determining whether all time slots have been processed; if not, moving to the next time slot and repeating steps 4-7; otherwise, terminating the algorithm. The invention can track topics in real time and reduces the miss rate and false detection rate of topic-related microblogs.
Description
Technical field
The present invention relates to the field of Chinese text processing within natural language processing, and in particular to an adaptive microblog topic tracking method based on a dual vector model.
Background
As a representative form of social media, microblogging has attracted wide public attention and generates massive amounts of data every day. Microblog users tend to follow the progress of hot topics, so within the real-time microblog stream there is an urgent demand for dynamic topic updates. Topic tracking, one of the subtasks of topic detection and tracking, offers a good way to address information overload on the Internet. Topic tracking continuously tracks subsequent texts for a known topic and extracts the evolution of the topic for the user; it plays an important guiding role in practical applications such as personalized recommendation, summary and opinion generation, and emergency-event monitoring.
Microblog topic tracking methods fall broadly into classification-based methods and query-vector-based methods. Classification-based methods train a classifier on a large corpus of microblogs with known topics and use it to classify subsequent documents. Query-vector-based methods build a query vector from a prior data set, compute the similarity between each subsequent microblog and the query vector, and decide according to a similarity threshold, thereby completing topic tracking. At present, microblog topic tracking suffers from feature sparsity, topic drift, and loss of part of a microblog's information during vectorization. For feature sparsity, various feature-expansion methods have been proposed; to cope with topic drift, methods such as feedback iteration and word probability have been proposed; for microblog vectorization, VSM or word-embedding vectorization is generally used to retain a microblog's new words or its semantic information, respectively. However, shortcomings remain: after vectorization either the microblog's semantics are lost or its new words are ignored, and topic drift cannot be fully overcome.
Summary of the invention
In view of this, the object of the present invention is to provide an adaptive microblog topic tracking method based on a dual vector model that can track topics in real time and reduce the miss rate and false detection rate of topic-related microblogs.
To achieve the above object, the present invention adopts the following technical solution:
An adaptive microblog topic tracking method based on a dual vector model, comprising the following steps:
Step S1: shard the microblogs to be tracked into time slots by date, with microblogs from the same day assigned to one time slot;
Step S2: build the initial topic dual vector model;
Step S3: build the microblog dual vector model;
Step S4: compute the similarity between the topic and the microblog from the initial topic dual vector model and the microblog dual vector model;
Step S5: based on the computed similarity between the topic and the microblog, perform adaptive learning of the similarity threshold and threshold comparison;
Step S6: update the topic model;
Step S7: determine whether all time slots have been processed; if not, move to the next time slot; otherwise, terminate the algorithm and complete the microblog topic tracking.
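As an illustration of step S1, the following is a minimal Python sketch of date-based time-slot sharding. The function name `shard_by_day` and the `(timestamp, text)` input layout are assumptions made for this example; the source only specifies that microblogs from the same day share one time slot.

```python
from collections import defaultdict
from datetime import datetime


def shard_by_day(posts):
    """Group (timestamp, text) pairs into day-level time slots (step S1).

    Hypothetical helper: the input layout is an assumption of this sketch.
    """
    slots = defaultdict(list)
    for timestamp, text in posts:
        # All microblogs posted on the same calendar day share one slot.
        slots[timestamp.date()].append(text)
    # Return the slots in chronological order so they can be processed one by one.
    return [slots[day] for day in sorted(slots)]


posts = [
    (datetime(2018, 9, 21, 8, 30), "post a"),
    (datetime(2018, 9, 22, 9, 0), "post c"),
    (datetime(2018, 9, 21, 23, 59), "post b"),
]
time_slots = shard_by_day(posts)
```

Each element of `time_slots` is then processed in order by steps S4 to S7.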
Further, the initial topic dual vector model is built as follows:
Step S21: use the BTM topic model to mine the latent topic-word distribution from randomly selected initial topic microblogs, and select the m words with the highest probability, together with their normalized probability values, as the features representing the initial topic;
Step S22: represent the feature set as vectors using both the VSM vectorization method and the word-embedding vectorization method, so that the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is obtained from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (2):
$$K = \{k_1, k_2, \ldots, k_n\} \tag{1}$$
$$k_i = \sum_{j=1}^{m} w_{ij} \cdot rate_j \tag{2}$$
where K denotes the vector, n the vector dimension, $k_i$ the value of the i-th dimension of the vector, m the number of feature words, $w_{ij}$ the value of the i-th dimension of the word vector of the j-th feature word, and $rate_j$ the feature weight of the j-th feature word. The VSM vector is built with the vector space model: each feature corresponds to one dimension of the vector, the value in that dimension is the weight of the corresponding feature, and the value is 0 if the feature does not appear in the text.
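The two representations in step S22 can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: the embedding table is a toy dictionary standing in for a trained Word2Vec model, and skipping feature words that have no embedding is an assumption of this example.

```python
def word2vec_vector(features, embeddings, dim):
    """Weighted sum of word vectors: k_i = sum_j w_ij * rate_j."""
    vec = [0.0] * dim
    for word, rate in features.items():
        word_vec = embeddings.get(word)
        if word_vec is None:
            # ASSUMPTION: words without an embedding (e.g. new words) are skipped here;
            # they are still captured by the VSM vector below.
            continue
        for i in range(dim):
            vec[i] += word_vec[i] * rate
    return vec


def vsm_vector(features, vocabulary):
    """One dimension per vocabulary feature: its weight if present, otherwise 0."""
    return [features.get(word, 0.0) for word in vocabulary]


# Toy 2-dimensional embeddings (hypothetical values).
embeddings = {"quake": [1.0, 0.0], "rescue": [0.0, 2.0]}
# Feature words with normalized weights, as produced by step S21.
topic_features = {"quake": 0.5, "rescue": 0.25, "newword": 0.25}
vocabulary = ["quake", "rescue", "newword", "other"]

k_w2v = word2vec_vector(topic_features, embeddings, dim=2)
k_vsm = vsm_vector(topic_features, vocabulary)
```

The example shows why both vectors are kept: "newword" contributes nothing to `k_w2v` but still occupies a non-zero dimension of `k_vsm`.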
Further, the microblog dual vector model is built as follows:
Step S31: use the TF-IDF algorithm to extract from the microblog the m words with the largest TF-IDF values, and use their normalized TF-IDF values as the feature weights; the TF-IDF value is computed as in formula (3):
$$TFIDF_w = tf_w \times \lg(M / M_w + 0.01) \tag{3}$$
where $TFIDF_w$ denotes the TF-IDF value of word w, $tf_w$ the number of occurrences of word w in the current microblog, M the total number of microblogs, and $M_w$ the number of texts containing word w.
Step S32: represent the feature set as vectors using both the VSM vectorization method and the word-embedding vectorization method, so that the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is obtained from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (5):
$$P = \{p_1, p_2, \ldots, p_n\} \tag{4}$$
$$p_i = \sum_{j=1}^{m} w_{ij} \cdot rate_j \tag{5}$$
where P denotes the vector, n the vector dimension, $p_i$ the value of the i-th dimension of the vector, m the number of feature words, $w_{ij}$ the value of the i-th dimension of the word vector of the j-th feature word, and $rate_j$ the feature weight of the j-th feature word. The VSM vector is built with the vector space model: each feature corresponds to one dimension of the vector, the value in that dimension is the weight of the corresponding feature, and the value is 0 if the feature does not appear in the text.
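The feature extraction of step S31 can be sketched in Python as below. The formula follows formula (3) as written, with lg taken as the base-10 logarithm. The source does not state how the TF-IDF values are normalized, so dividing by the sum of the selected values is an assumption of this sketch, as are the function name and the document-frequency table.

```python
import math
from collections import Counter


def top_tfidf_features(tokens, doc_freq, total_docs, m):
    """Pick the m words with the largest TF-IDF values and normalize their weights."""
    tf = Counter(tokens)
    scores = {
        # Formula (3): TFIDF_w = tf_w * lg(M / M_w + 0.01)
        word: count * math.log10(total_docs / doc_freq[word] + 0.01)
        for word, count in tf.items()
    }
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:m]
    total = sum(score for _, score in top)
    # ASSUMPTION: normalize so that the selected weights sum to 1.
    return {word: score / total for word, score in top} if total > 0 else {}


tokens = ["quake", "quake", "rescue", "the"]
doc_freq = {"quake": 10, "rescue": 100, "the": 1000}  # hypothetical counts (M_w)
features = top_tfidf_features(tokens, doc_freq, total_docs=1000, m=2)
```

The resulting dictionary of feature words and weights is the input to the vectorization of step S32.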
Further, step S4 is specifically:
Step S41: compute the cosine similarity between the VSM vector in the topic dual vector model and the VSM vector in the microblog dual vector model; the cosine similarity is computed as in formula (6):
$$Sim_{kd} = \frac{\sum_{i=1}^{n} k_i d_i}{\sqrt{\sum_{i=1}^{n} k_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}} \tag{6}$$
where $Sim_{kd}$ denotes the cosine similarity of vector k and vector d, $k_i$ the value of the i-th dimension of vector k, and $d_i$ the value of the i-th dimension of vector d;
Step S42: compute the cosine similarity between the Word2Vec vector in the topic dual vector model and the Word2Vec vector in the microblog dual vector model;
Step S43: combine the cosine similarity between the VSM vectors and the similarity between the Word2Vec vectors into the similarity between the topic and the microblog, computed according to formula (7),
where Sim denotes the similarity between the topic and the microblog, $sim_{vsm}$ the similarity between the VSM vectors in the dual vector models, and $sim_{word2vec}$ the similarity between the Word2Vec vectors in the dual vector models; the larger the value, the more similar the topic and the microblog.
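Steps S41 to S43 can be sketched in Python as follows. The cosine function is the standard definition. Formula (7) is not reproduced in this text, so the linear blend with weight `alpha` used here is purely an assumption for illustration and should not be read as the combination rule of the patent; the dictionary layout of the dual vector model is likewise hypothetical.

```python
import math


def cosine(a, b):
    """Standard cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares no features with anything
    return dot / (norm_a * norm_b)


def combined_similarity(topic, post, alpha=0.5):
    """Blend the VSM and Word2Vec similarities.

    ASSUMPTION: a linear blend with weight alpha; the actual formula (7)
    is not available in this text.
    """
    sim_vsm = cosine(topic["vsm"], post["vsm"])
    sim_w2v = cosine(topic["w2v"], post["w2v"])
    return alpha * sim_vsm + (1 - alpha) * sim_w2v


topic = {"vsm": [1.0, 0.0], "w2v": [1.0, 1.0]}
post = {"vsm": [1.0, 0.0], "w2v": [1.0, 0.0]}
sim = combined_similarity(topic, post)
```

Whatever combination is used, it should be monotone in both component similarities so that a larger value still means a more similar topic and microblog.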
Further, step S5 is specifically:
Step S51: the similarity threshold is divided into a lowest similarity threshold ε and a feedback threshold δ. The initial feedback threshold of each topic is the average similarity between the initial topic and the initial topic-related microblogs. During tracking, the feedback threshold depends on the average similarity between the topic and the feedback microblogs of the preceding s time slots, and the closer a time slot is, the stronger its influence. The thresholds ε and δ are computed according to formulas (8)-(9):
$$\varepsilon_t = \delta_t - C \tag{9}$$
where t denotes the t-th time slot, $\delta_t$ the feedback threshold of time slot t, $feedsim_i$ the average similarity between the feedback microblogs of the i-th time slot and the topic, $\varepsilon_t$ the lowest threshold of time slot t, and C the topic tolerance; the lowest threshold is tied to the feedback threshold and equals the feedback threshold minus the topic tolerance C;
Step S52: if the similarity between a microblog and the topic is greater than the feedback threshold, the microblog is highly relevant to the topic and is added to the feedback microblog set used to generate the new topic model; if the similarity between the microblog and the topic is greater than the lowest threshold, the microblog is judged to be topic-related; conversely, if the similarity is not greater than the lowest threshold, the microblog is judged to be topic-unrelated.
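The two-threshold decision of step S52 can be sketched in Python as below. The function name and the string labels are hypothetical; the comparison logic and the relation between the two thresholds follow formula (9) and the text above.

```python
def classify(similarity, feedback_threshold, tolerance):
    """Classify one microblog against the two thresholds of step S5."""
    # Formula (9): lowest threshold = feedback threshold - topic tolerance C.
    lowest_threshold = feedback_threshold - tolerance
    if similarity > feedback_threshold:
        # Highly relevant: topic-related and also added to the feedback set.
        return "feedback"
    if similarity > lowest_threshold:
        return "related"
    return "unrelated"


labels = [classify(s, feedback_threshold=0.8, tolerance=0.3) for s in (0.9, 0.6, 0.2)]
```

A microblog labelled "feedback" is also topic-related; the label only records that it additionally feeds the topic model update.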
Further, step S6 is specifically:
Step S61: use the BTM topic model to select topic features from the initial topic microblog set and generate the initial topic model;
Step S62: use the BTM topic model to select topic features from the feedback microblog set and generate the dynamic topic model;
Step S63: add the features of the initial topic model and the dynamic topic model to the original topic model; if a feature already exists in the original topic model, update its weight in the original topic model with the maximum weight of that feature across the three models; then sort the features of the original topic model by weight in descending order and select the top T features and their weights as the new topic model, which replaces the original topic model.
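The merge of step S63 can be sketched in Python as below, representing each topic model as a dictionary from feature word to weight (a representation assumed for this example). The default `top_t=20` follows the value of T given in the embodiment.

```python
def merge_topic_models(original, initial, dynamic, top_t=20):
    """Merge three feature->weight models and keep the top T features (step S63)."""
    merged = dict(original)
    for model in (initial, dynamic):
        for word, weight in model.items():
            # A feature present in several models keeps its maximum weight.
            merged[word] = max(merged.get(word, 0.0), weight)
    # Sort by weight in descending order and keep the first T features.
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_t])


original = {"quake": 0.4, "rescue": 0.3}
initial = {"quake": 0.5, "city": 0.2}
dynamic = {"rescue": 0.1, "aftershock": 0.35}
new_topic = merge_topic_models(original, initial, dynamic, top_t=3)
```

In this toy run "city" is dropped because it ranks fourth, while "aftershock", which appears only in the dynamic model, enters the topic.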
Further, step S7 is specifically:
Step S71: determine whether all time slots have been processed; if not, move to the next time slot and execute step S72; otherwise, terminate the algorithm and complete the microblog topic tracking;
Step S72: represent the microblog as vectors using the method for building the microblog dual vector model described in step S22;
Step S73: represent the topic features of the new topic model as vectors using the vectorization method described in step S21;
Step S74: repeat steps S4-S7.
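The overall control flow of steps S4 to S7 can be sketched in Python as a loop over time slots. This is only a skeleton under stated assumptions: `similarity`, `update_topic` and `update_threshold` are hypothetical callables supplied by the caller (formula (8) for the feedback threshold is not reproduced in this text), and the `feed` gate on topic updates is taken from the embodiment.

```python
def track_topic(time_slots, topic, delta, tolerance,
                similarity, update_topic, update_threshold, feed=10):
    """Skeleton of the slot-by-slot tracking loop (steps S4-S7)."""
    related = []
    for slot in time_slots:
        feedback = []  # (post, similarity) pairs above the feedback threshold
        for post in slot:
            sim = similarity(topic, post)       # step S4
            if sim > delta - tolerance:         # step S5: lowest threshold
                related.append(post)
            if sim > delta:                     # step S5: feedback threshold
                feedback.append((post, sim))
        if len(feedback) > feed:                # step S6: gated topic update
            topic = update_topic(topic, [p for p, _ in feedback])
        # Hypothetical hook for the adaptive threshold of formula (8).
        delta = update_threshold(delta, [s for _, s in feedback])
    return related


# Toy run: posts are plain numbers that serve as their own similarity.
result = track_topic(
    time_slots=[[0.9, 0.2], [0.6]],
    topic=None, delta=0.8, tolerance=0.3,
    similarity=lambda topic, post: post,
    update_topic=lambda topic, posts: topic,
    update_threshold=lambda delta, sims: delta,
)
```

Passing the three operations in as callables keeps the skeleton independent of any particular choice for the missing formulas.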
Compared with the prior art, the present invention has the following beneficial effects:
The invention represents topics and microblogs with a dual vector model, which preserves the semantic properties of the text through word embedding while retaining new-word information through VSM vectorization. It introduces a time attribute and proposes a strategy for adaptively learning the similarity threshold, which reduces the miss rate of topic-related microblogs and improves the performance of the topic tracking algorithm. It dynamically updates the topic model during tracking to cope with topic drift as the topic evolves, reducing the miss rate and false detection rate of topic-related microblogs.
Brief description of the drawings
Fig. 1 is the implementation flow chart of an embodiment of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the drawing and an embodiment.
Referring to Fig. 1, the present invention provides an adaptive microblog topic tracking method based on a dual vector model, comprising the following steps:
Step S1: shard the microblogs to be tracked into time slots by date, with microblogs from the same day assigned to one time slot;
Step S2: build the initial topic dual vector model;
Step S21: use the BTM topic model to mine the latent topic-word distribution from randomly selected initial topic microblogs, and select the m words with the highest probability, together with their normalized probability values, as the features representing the initial topic;
Step S22: represent the feature set as vectors using both the VSM vectorization method and the word-embedding vectorization method, so that the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is obtained from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (2):
$$K = \{k_1, k_2, \ldots, k_n\} \tag{1}$$
$$k_i = \sum_{j=1}^{m} w_{ij} \cdot rate_j \tag{2}$$
where K denotes the vector, n the vector dimension, $k_i$ the value of the i-th dimension of the vector, m the number of feature words, $w_{ij}$ the value of the i-th dimension of the word vector of the j-th feature word, and $rate_j$ the feature weight of the j-th feature word. The VSM vector is built with the vector space model: each feature corresponds to one dimension of the vector, the value in that dimension is the weight of the corresponding feature, and the value is 0 if the feature does not appear in the text.
Step S3: build the microblog dual vector model;
Step S31: use the TF-IDF algorithm to extract from the microblog the m words with the largest TF-IDF values, and use their normalized TF-IDF values as the feature weights; the TF-IDF value is computed as in formula (3):
$$TFIDF_w = tf_w \times \lg(M / M_w + 0.01) \tag{3}$$
where $TFIDF_w$ denotes the TF-IDF value of word w, $tf_w$ the number of occurrences of word w in the current microblog, M the total number of microblogs, and $M_w$ the number of texts containing word w.
Step S32: represent the feature set as vectors using both the VSM vectorization method and the word-embedding vectorization method, so that the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is obtained from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (5):
$$P = \{p_1, p_2, \ldots, p_n\} \tag{4}$$
$$p_i = \sum_{j=1}^{m} w_{ij} \cdot rate_j \tag{5}$$
where P denotes the vector, n the vector dimension, $p_i$ the value of the i-th dimension of the vector, m the number of feature words, $w_{ij}$ the value of the i-th dimension of the word vector of the j-th feature word, and $rate_j$ the feature weight of the j-th feature word. The VSM vector is built with the vector space model: each feature corresponds to one dimension of the vector, the value in that dimension is the weight of the corresponding feature, and the value is 0 if the feature does not appear in the text.
Step S4: compute the similarity between the topic and the microblog from the initial topic dual vector model and the microblog dual vector model;
Step S41: compute the cosine similarity between the VSM vector in the topic dual vector model and the VSM vector in the microblog dual vector model; the cosine similarity is computed as in formula (6):
$$Sim_{kd} = \frac{\sum_{i=1}^{n} k_i d_i}{\sqrt{\sum_{i=1}^{n} k_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}} \tag{6}$$
where $Sim_{kd}$ denotes the cosine similarity of vector k and vector d, $k_i$ the value of the i-th dimension of vector k, and $d_i$ the value of the i-th dimension of vector d;
Step S42: compute the cosine similarity between the Word2Vec vector in the topic dual vector model and the Word2Vec vector in the microblog dual vector model;
Step S43: combine the cosine similarity between the VSM vectors and the similarity between the Word2Vec vectors into the similarity between the topic and the microblog, computed according to formula (7),
where Sim denotes the similarity between the topic and the microblog, $sim_{vsm}$ the similarity between the VSM vectors in the dual vector models, and $sim_{word2vec}$ the similarity between the Word2Vec vectors in the dual vector models; the larger the value, the more similar the topic and the microblog.
Step S5: based on the computed similarity between the topic and the microblog, perform adaptive learning of the similarity threshold and threshold comparison;
Step S51: the similarity threshold is divided into a lowest similarity threshold ε and a feedback threshold δ. The initial feedback threshold of each topic is the average similarity between the initial topic and the initial topic-related microblogs. During tracking, the feedback threshold depends on the average similarity between the topic and the feedback microblogs of the preceding s time slots, and the closer a time slot is, the stronger its influence. The thresholds ε and δ are computed according to formulas (8)-(9):
$$\varepsilon_t = \delta_t - C \tag{9}$$
where t denotes the t-th time slot, $\delta_t$ the feedback threshold of time slot t, $feedsim_i$ the average similarity between the feedback microblogs of the i-th time slot and the topic, $\varepsilon_t$ the lowest threshold of time slot t, and C the topic tolerance; the lowest threshold is tied to the feedback threshold and equals the feedback threshold minus the topic tolerance C;
Step S52: if the similarity between a microblog and the topic is greater than the feedback threshold, the microblog is highly relevant to the topic and is added to the feedback microblog set used to generate the new topic model; if the similarity between the microblog and the topic is greater than the lowest threshold, the microblog is judged to be topic-related; conversely, if the similarity is not greater than the lowest threshold, the microblog is judged to be topic-unrelated. The feedback threshold is used to select microblogs highly relevant to the topic as feedback microblogs for updating the topic model, whereas the lowest similarity threshold is the lower bound for a microblog to belong to the topic; the feedback threshold is greater than the lowest threshold.
Step S6: update the topic model;
Step S61: use the BTM topic model to select topic features from the initial topic microblog set and generate the initial topic model;
Step S62: use the BTM topic model to select topic features from the feedback microblog set and generate the dynamic topic model;
Step S63: add the features of the initial topic model and the dynamic topic model to the original topic model; if a feature already exists in the original topic model, update its weight in the original topic model with the maximum weight of that feature across the three models; then sort the features of the original topic model by weight in descending order and select the top T features and their weights as the new topic model, which replaces the original topic model. To improve the efficiency of the topic tracking method, topic model updates are subject to a time condition and a feedback microblog count threshold feed (feed is set to 10). If the topic were updated whenever a feedback microblog was added, updates would be too frequent and would hurt tracking efficiency. Moreover, if too few feedback microblogs are added within a time slot, they may be noise microblogs, so the topic is not updated. Therefore, at the end of a time slot, the topic is updated if the number of newly added feedback microblogs is greater than feed; otherwise, the topic is not updated. In general, 20 features suffice to represent a topic, so T is set to 20.
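The update condition described above can be sketched in Python as a simple end-of-slot check. The constant and function names are hypothetical; the value 10 for feed and the strict "greater than" comparison come from the text.

```python
FEED_MIN = 10  # feedback microblog count threshold "feed" from the embodiment


def should_update_topic(new_feedback_posts, feed=FEED_MIN):
    """Decide at the end of a time slot whether the topic model is refreshed.

    Too few feedback microblogs may be noise, so the topic is updated only
    when strictly more than `feed` were collected in the slot.
    """
    return len(new_feedback_posts) > feed


update_with_11 = should_update_topic(["post"] * 11)
update_with_10 = should_update_topic(["post"] * 10)
```

Checking once per time slot, rather than after every feedback microblog, is what keeps the number of topic updates low.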
Step S7: determine whether all time slots have been processed; if not, move to the next time slot; otherwise, terminate the algorithm and complete the microblog topic tracking;
Step S71: determine whether all time slots have been processed; if not, move to the next time slot and execute step S72; otherwise, terminate the algorithm and complete the microblog topic tracking;
Step S72: represent the microblog as vectors using the method for building the microblog dual vector model described in step S22;
Step S73: represent the topic features of the new topic model as vectors using the vectorization method described in step S21;
Step S74: repeat steps S4-S7.
The foregoing is merely a preferred embodiment of the present invention; all equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope of the present invention.
Claims (7)
1. An adaptive microblog topic tracking method based on a dual vector model, characterized by comprising the following steps:
Step S1: shard the microblogs to be tracked into time slots by date, with microblogs from the same day assigned to one time slot;
Step S2: build the initial topic dual vector model;
Step S3: build the microblog dual vector model;
Step S4: compute the similarity between the topic and the microblog from the initial topic dual vector model and the microblog dual vector model;
Step S5: based on the computed similarity between the topic and the microblog, perform adaptive learning of the similarity threshold and threshold comparison;
Step S6: update the topic model;
Step S7: determine whether all time slots have been processed; if not, move to the next time slot; otherwise, terminate the algorithm and complete the microblog topic tracking.
2. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that the initial topic dual vector model is built as follows:
Step S21: use the BTM topic model to mine the latent topic-word distribution from randomly selected initial topic microblogs, and select the m words with the highest probability, together with their normalized probability values, as the features representing the initial topic;
Step S22: represent the feature set as vectors using both the VSM vectorization method and the word-embedding vectorization method, so that the dual vector model consists of a Word2Vec vector and a VSM vector; the Word2Vec vector is obtained from the word vectors of the feature words and equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (2):
$$K = \{k_1, k_2, \ldots, k_n\} \tag{1}$$
$$k_i = \sum_{j=1}^{m} w_{ij} \cdot rate_j \tag{2}$$
where K denotes the vector, n the vector dimension, $k_i$ the value of the i-th dimension of the vector, m the number of feature words, $w_{ij}$ the value of the i-th dimension of the word vector of the j-th feature word, and $rate_j$ the feature weight of the j-th feature word; the VSM vector is built with the vector space model: each feature corresponds to one dimension of the vector, the value in that dimension is the weight of the corresponding feature, and the value is 0 if the feature does not appear in the text.
3. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that the microblog dual vector model is built as follows:
Step S31: use the TF-IDF algorithm to extract from the microblog the m words with the largest TF-IDF values, and use their normalized TF-IDF values as the feature weights; the TF-IDF value is computed as in formula (3):
$$TFIDF_w = tf_w \times \lg(M / M_w + 0.01) \tag{3}$$
where $TFIDF_w$ denotes the TF-IDF value of word w, $tf_w$ the number of occurrences of word w in the current microblog, M the total number of microblogs, and $M_w$ the number of texts containing word w;
Step S32: represent the feature set as vectors using both the VSM vectorization method and the word-embedding vectorization method, so that the dual vector model consists of a Word2Vec vector and a VSM vector; the Word2Vec vector is obtained from the word vectors of the feature words and equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (5):
$$P = \{p_1, p_2, \ldots, p_n\} \tag{4}$$
$$p_i = \sum_{j=1}^{m} w_{ij} \cdot rate_j \tag{5}$$
where P denotes the vector, n the vector dimension, $p_i$ the value of the i-th dimension of the vector, m the number of feature words, $w_{ij}$ the value of the i-th dimension of the word vector of the j-th feature word, and $rate_j$ the feature weight of the j-th feature word; the VSM vector is built with the vector space model: each feature corresponds to one dimension of the vector, the value in that dimension is the weight of the corresponding feature, and the value is 0 if the feature does not appear in the text.
4. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that step S4 is specifically:
Step S41: compute the cosine similarity between the VSM vector in the topic dual vector model and the VSM vector in the microblog dual vector model; the cosine similarity is computed as in formula (6):
$$Sim_{kd} = \frac{\sum_{i=1}^{n} k_i d_i}{\sqrt{\sum_{i=1}^{n} k_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}} \tag{6}$$
where $Sim_{kd}$ denotes the cosine similarity of vector k and vector d, $k_i$ the value of the i-th dimension of vector k, and $d_i$ the value of the i-th dimension of vector d;
Step S42: compute the cosine similarity between the Word2Vec vector in the topic dual vector model and the Word2Vec vector in the microblog dual vector model;
Step S43: combine the cosine similarity between the VSM vectors and the similarity between the Word2Vec vectors into the similarity between the topic and the microblog, computed according to formula (7),
where Sim denotes the similarity between the topic and the microblog, $sim_{vsm}$ the similarity between the VSM vectors in the dual vector models, and $sim_{word2vec}$ the similarity between the Word2Vec vectors in the dual vector models; the larger the value, the more similar the topic and the microblog.
5. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that step S5 is specifically:
Step S51: the similarity threshold is divided into a lowest similarity threshold ε and a feedback threshold δ; the initial feedback threshold of each topic is the average similarity between the initial topic and the initial topic-related microblogs; during tracking, the feedback threshold depends on the average similarity between the topic and the feedback microblogs of the preceding s time slots, and the closer a time slot is, the stronger its influence; the thresholds ε and δ are computed according to formulas (8)-(9):
$$\varepsilon_t = \delta_t - C \tag{9}$$
where t denotes the t-th time slot, $\delta_t$ the feedback threshold of time slot t, $feedsim_i$ the average similarity between the feedback microblogs of the i-th time slot and the topic, $\varepsilon_t$ the lowest threshold of time slot t, and C the topic tolerance; the lowest threshold is tied to the feedback threshold and equals the feedback threshold minus the topic tolerance C;
Step S52: if the similarity between a microblog and the topic is greater than the feedback threshold, the microblog is highly relevant to the topic and is added to the feedback microblog set used to generate the new topic model; if the similarity between the microblog and the topic is greater than the lowest threshold, the microblog is judged to be topic-related; conversely, if the similarity is not greater than the lowest threshold, the microblog is judged to be topic-unrelated.
6. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that step S6 is specifically:
Step S61: use the BTM topic model to select topic features from the initial topic microblog set and generate the initial topic model;
Step S62: use the BTM topic model to select topic features from the feedback microblog set and generate the dynamic topic model;
Step S63: add the features of the initial topic model and the dynamic topic model to the original topic model; if a feature already exists in the original topic model, update its weight in the original topic model with the maximum weight of that feature across the three models; then sort the features of the original topic model by weight in descending order and select the top T features and their weights as the new topic model, which replaces the original topic model.
7. The adaptive microblog topic tracking method based on a dual vector model according to claim 2, characterized in that step S7 specifically comprises:
Step S71: determine whether all time slots have been processed; if not, move to the next time slot and execute step S72; otherwise, terminate the algorithm, completing microblog topic tracking;
Step S72: represent each microblog as a vector using the method for constructing the microblog dual vector model described in step S22;
Step S73: represent the topic features of the new topic model as vectors using the vectorization method described in step S21;
Step S74: repeat steps S4 to S7.
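The per-time-slot loop of steps S71 to S74 can be sketched as a simple driver. This is only a sketch under several assumptions that are not taken from the claims: cosine similarity stands in for the similarity measure, the feedback threshold is held fixed instead of being adapted per time slot, and the callables `embed` (the vectorization of step S72) and `rebuild_topic` (the model update of step S6 followed by the vectorization of step S73) are hypothetical placeholders supplied by the caller.

```python
import math
from typing import Callable, Iterable, List, Sequence, TypeVar

Post = TypeVar("Post")
Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def track(
    time_slots: Iterable[Iterable[Post]],
    topic_vector: Vector,
    embed: Callable[[Post], Vector],
    rebuild_topic: Callable[[Vector, List[Post]], Vector],
    feedback_threshold: float,
    tolerance: float,
) -> List[Post]:
    """Return every post judged topic-relevant across all time slots."""
    relevant: List[Post] = []
    for slot in time_slots:  # S71: the loop ends when no slots remain
        feedback: List[Post] = []
        for post in slot:
            sim = cosine(embed(post), topic_vector)  # S72, then similarity
            if sim > feedback_threshold - tolerance:
                relevant.append(post)
            if sim > feedback_threshold:
                feedback.append(post)
        if feedback:  # assumption: rebuild only when feedback was collected
            topic_vector = rebuild_topic(topic_vector, feedback)
    return relevant
```

Passing the vectorization and model-update steps in as callables keeps the control flow visible without committing to any particular embedding or topic model.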
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811106923.XA CN109284379B (en) | 2018-09-21 | 2018-09-21 | Adaptive microblog topic tracking method based on dual vector model |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811106923.XA CN109284379B (en) | 2018-09-21 | 2018-09-21 | Adaptive microblog topic tracking method based on dual vector model |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109284379A (en) | 2019-01-29 |
| CN109284379B (en) | 2022-01-04 |
Family
ID=65181961
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811106923.XA Active CN109284379B (en) | 2018-09-21 | 2018-09-21 | Adaptive microblog topic tracking method based on dual vector model |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109284379B (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107562919A (en) * | 2017-09-13 | 2018-01-09 | 云南大学 | A kind of more indexes based on information retrieval integrate software component retrieval method and system |
| CN107609121A (en) * | 2017-09-14 | 2018-01-19 | 深圳市玛腾科技有限公司 | Newsletter archive sorting technique based on LDA and word2vec algorithms |
| US20180032606A1 (en) * | 2016-07-26 | 2018-02-01 | Qualtrics, Llc | Recommending topic clusters for unstructured text documents |
| US20180068371A1 (en) * | 2016-09-08 | 2018-03-08 | Adobe Systems Incorporated | Learning Vector-Space Representations of Items for Recommendations using Word Embedding Models |
| CN108062307A (en) * | 2018-01-04 | 2018-05-22 | 中国科学技术大学 | The text semantic steganalysis method of word-based incorporation model |
Non-Patent Citations (4)
| Title |
|---|
| DANIEL MORARIU ET AL.: "An Extension of the VSM Documents Representation using Word Embedding", 《PROCEEDINGS OF THE BRCEBE-ICEBE’17 CONFERENCE, SIBIU, ROMANIA》 * |
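| 唐明 et al.: "A Document Vector Representation Based on Word2Vec", 《Computer Science》 * |
| 武军娜: "Research on Adaptive Topic Tracking Technology", 《China Master's Theses Full-text Database, Information Science and Technology》 * |
| 程林骏: "Research on Topic Detection and Tracking Based on Multi-Source Data", 《China Master's Theses Full-text Database, Information Science and Technology》 * |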
Also Published As
| Publication number | Publication date |
|---|---|
| CN109284379B (en) | 2022-01-04 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN111191466B (en) | Homonymous author disambiguation method based on network characterization and semantic characterization | |
| CN109657135B (en) | Scholars user portrait information extraction method and model based on neural network | |
| CN104462253B (en) | A kind of topic detection or tracking of network-oriented text big data | |
| CN108052593A (en) | A kind of subject key words extracting method based on descriptor vector sum network structure | |
| CN106815310A (en) | A kind of hierarchy clustering method and system to magnanimity document sets | |
| CN104376406A (en) | Enterprise innovation resource management and analysis system and method based on big data | |
| CN106383877A (en) | On-line short text clustering and topic detection method of social media | |
| CN105653668A (en) | Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment | |
| CN111522923B (en) | A Multi-Task Dialogue State Tracking Method | |
| CN104298765A (en) | Dynamic recognizing and tracking method of internet public opinion topics | |
| CN111061939B (en) | Scientific research academic news keyword matching recommendation method based on deep learning | |
| CN105956093A (en) | Individual recommending method based on multi-view anchor graph Hash technology | |
| Zhang et al. | Enhancing video event recognition using automatically constructed semantic-visual knowledge base | |
| CN106202480A (en) | A kind of network behavior based on K means and LDA bi-directional verification custom clustering method | |
| CN108647322A (en) | The method that word-based net identifies a large amount of Web text messages similarities | |
| An et al. | A heuristic approach on metadata recommendation for search engine optimization | |
| CN107038184A (en) | A kind of news based on layering latent variable model recommends method | |
| CN115408621B (en) | Point-of-interest recommendation method considering linear and nonlinear interaction of auxiliary information features | |
| CN110532450A (en) | A Method of Topic Crawling Based on Improved Shark Search | |
| CN112561599A (en) | Click rate prediction method based on attention network learning and fusing domain feature interaction | |
| CN113051397A (en) | Academic paper homonymy disambiguation method based on heterogeneous information network representation learning and word vector representation | |
| Sun et al. | A hybrid approach to news recommendation based on knowledge graph and long short-term user preferences | |
| CN110717068B (en) | Video retrieval method based on deep learning | |
| CN110674313B (en) | Method for dynamically updating knowledge graph based on user log | |
| Izquierdo et al. | Word vs. class-based word sense disambiguation | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |