
CN109284379A - Adaptive microblog topic tracking method based on dual vector model - Google Patents


Info

Publication number
CN109284379A
CN109284379A CN201811106923.XA CN201811106923A CN109284379A CN 109284379 A CN109284379 A CN 109284379A CN 201811106923 A CN201811106923 A CN 201811106923A CN 109284379 A CN109284379 A CN 109284379A
Authority
CN
China
Prior art keywords
topic
vector
microblogging
value
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811106923.XA
Other languages
Chinese (zh)
Other versions
CN109284379B (en)
Inventor
郭文忠
黄畅
郭昆
陈羽中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN201811106923.XA
Publication of CN109284379A
Application granted
Publication of CN109284379B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to an adaptive microblog topic tracking method based on a dual vector model, comprising: S1: microblog sharding, in which microblogs are partitioned into time slots by day; S2: building the dual vector model of the topic; S3: building the dual vector model of the microblogs, so that both the topic and the microblogs are represented as vectors; S4: computing the cosine similarity between the topic and each microblog, where a larger cosine similarity means the topic and the microblog are more similar; S5: adaptively learning the similarity threshold and comparing against it, which overcomes the topic drift caused by a fixed similarity threshold; S6: updating the topic model, which overcomes the topic drift caused by a fixed topic model; S7: checking whether all time slots have been processed; if not, moving to the next time slot and repeating steps S4 to S7; otherwise ending the algorithm. The invention can track topics in real time and reduce the miss rate and false detection rate of topic-relevant microblogs.

Description

Adaptive microblog topic tracking method based on a dual vector model
Technical field
The present invention relates to the technical field of Chinese text processing within natural language processing, and in particular to an adaptive microblog topic tracking method based on a dual vector model.
Background
As a representative social medium, microblogging has attracted wide public attention and generates massive amounts of data every day. Microblog users tend to care most about how hot topics develop, so there is an urgent demand for dynamic topic updates within the real-time microblog stream. Topic tracking, one of the subtasks of topic detection and tracking, offers a good way to address information overload on the Internet. It continuously follows the subsequent texts of a known topic and extracts the evolution of that topic for the user, and it plays an important guiding role in practical applications such as personalized recommendation, summary and opinion generation, and emergency event monitoring.
Microblog topic tracking methods fall broadly into classification-based methods and query-vector-based methods. Classification-based methods train a classifier on a large corpus of microblogs with known topics and use it to classify subsequent documents. Query-vector-based methods build a query vector from a prior data set, compute the similarity between each subsequent microblog and the query vector, and decide by a similarity threshold, thereby completing topic tracking. At present, microblog topic tracking suffers from feature sparsity, topic drift, and the loss of part of the microblog information during vectorization. For feature sparsity, various feature expansion methods have been proposed; for topic drift, methods such as feedback iteration and word probability have been proposed; for vectorization, VSM or word embedding is generally used to retain either the new words or the semantic information of a microblog. Shortcomings remain, however: after vectorization either the microblog semantics are lost or the new words in the microblog are ignored, and topic drift cannot be fully overcome.
Summary of the invention
In view of this, the purpose of the present invention is to provide an adaptive microblog topic tracking method based on a dual vector model that can track topics in real time and reduce the miss rate and false detection rate of topic-relevant microblogs.
To achieve this purpose, the present invention adopts the following technical solution:
An adaptive microblog topic tracking method based on a dual vector model, comprising the following steps:
Step S1: partition the microblogs to be tracked into time slots by date, so that microblogs posted on the same day belong to one time slot;
Step S2: build the dual vector model of the initial topic;
Step S3: build the dual vector model of the microblogs;
Step S4: compute the similarity between the topic and each microblog from the dual vector model of the initial topic and the dual vector model of the microblog;
Step S5: based on the computed similarity between topic and microblog, adaptively learn the similarity threshold and compare the similarity against it;
Step S6: update the topic model;
Step S7: check whether all time slots have been processed; if not, move to the next time slot; otherwise end the algorithm, completing the microblog topic tracking.
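Step S1 can be illustrated with a minimal Python sketch. The input shape, a list of `(timestamp, text)` pairs, and the function name `shard_by_day` are assumptions made for illustration and do not come from the patent text.

```python
from collections import defaultdict
from datetime import datetime


def shard_by_day(posts):
    """Group (timestamp, text) pairs into one time slot per calendar day."""
    slots = defaultdict(list)
    for timestamp, text in posts:
        # Microblogs posted on the same date share a slot (step S1).
        slots[timestamp.date()].append(text)
    # Return the slots in chronological order so they can be processed in turn.
    return [(day, slots[day]) for day in sorted(slots)]


# Hypothetical sample data.
posts = [
    (datetime(2018, 9, 21, 8, 30), "post a"),
    (datetime(2018, 9, 22, 9, 0), "post b"),
    (datetime(2018, 9, 21, 23, 59), "post c"),
]
for day, texts in shard_by_day(posts):
    print(day, texts)
```

Sorting the slots by date keeps the later steps, which depend on the order of time slots, straightforward to drive from a simple loop.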
Further, the dual vector model of the initial topic is constructed as follows:
Step S21: use the BTM topic model to mine the latent topic-word distribution from randomly selected initial topic microblogs, and select the m words with the highest probability together with their normalized probability values as the features representing the initial topic;
Step S22: represent the feature set as vectors using both the VSM vectorization method and the word embedding vectorization method; the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is derived from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (2):
$$K = \{k_1, k_2, \ldots, k_n\} \qquad (1)$$
$$k_i = \sum_{j=1}^{m} w_{ij} \times rate_j \qquad (2)$$
where $K$ denotes the vector, $n$ the vector dimension, $k_i$ the value of the i-th dimension of $K$, $m$ the number of feature words, $w_{ij}$ the value of the i-th dimension of the word vector of the j-th feature word, and $rate_j$ the feature weight of the j-th feature word. The VSM vector is built according to the vector space model: each feature corresponds to one dimension of the vector, and the value in that dimension is the weight of the corresponding feature, or 0 if the feature does not appear in the text.
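The two halves of the dual vector model can be sketched in a few lines of Python. The sketch assumes the feature weights are already available as a `{word: weight}` dictionary and the word vectors as a `{word: list}` dictionary; skipping feature words that have no word vector is an assumption, since the text does not say how such words are handled in the Word2Vec half.

```python
def weighted_embedding(features, word_vectors, dim):
    """Word2Vec half: sum of feature-word vectors, each scaled by its feature weight."""
    k = [0.0] * dim
    for word, rate in features.items():
        vec = word_vectors.get(word)
        if vec is None:
            # Assumption: words without an embedding (e.g. new words) are skipped here;
            # they are still represented in the VSM half below.
            continue
        for i in range(dim):
            k[i] += vec[i] * rate
    return k


def vsm_vector(features, vocabulary):
    """VSM half: one dimension per vocabulary term, 0 when the feature is absent."""
    return [features.get(term, 0.0) for term in vocabulary]


# Hypothetical two-dimensional word vectors and feature weights.
features = {"earthquake": 0.6, "rescue": 0.4}
word_vectors = {"earthquake": [1.0, 0.0], "rescue": [0.0, 2.0]}

print(weighted_embedding(features, word_vectors, dim=2))
print(vsm_vector(features, ["earthquake", "flood", "rescue"]))
```

In practice the word vectors would come from a trained Word2Vec model; plain dictionaries are used here only to keep the sketch self-contained.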
Further, the dual vector model of the initial topic microblog is constructed as follows:
Step S31: use the TFIDF algorithm to extract from the microblog the m words with the largest TFIDF values, and take the corresponding normalized TFIDF values as the feature weights; the TFIDF value is computed as shown in formula (3):
$$TFIDF_w = tf_w \times \lg(M / M_w + 0.01) \qquad (3)$$
where $TFIDF_w$ denotes the TFIDF value of word w, $tf_w$ the frequency of word w in the current microblog, $M$ the total number of microblogs, and $M_w$ the number of texts containing word w.
Step S32: represent the feature set as vectors using both the VSM vectorization method and the word embedding vectorization method; the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is derived from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (5):
$$P = \{p_1, p_2, \ldots, p_n\} \qquad (4)$$
$$p_i = \sum_{j=1}^{m} w_{ij} \times rate_j \qquad (5)$$
where $P$ denotes the vector, $n$ the vector dimension, $p_i$ the value of the i-th dimension of $P$, $m$ the number of feature words, $w_{ij}$ the value of the i-th dimension of the word vector of the j-th feature word, and $rate_j$ the feature weight of the j-th feature word. The VSM vector is built according to the vector space model: each feature corresponds to one dimension of the vector, and the value in that dimension is the weight of the corresponding feature, or 0 if the feature does not appear in the text.
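The feature extraction of step S31 can be sketched as follows. Several details are assumptions rather than statements from the patent: $tf_w$ is taken as the raw count of the word in the microblog, normalization divides by the sum of the selected scores, and a word missing from the document-frequency table is treated as appearing in one text.

```python
import math
from collections import Counter


def tfidf_features(tokens, doc_freq, total_docs, m):
    """Return the m highest-scoring words of one microblog with normalized TFIDF weights."""
    counts = Counter(tokens)
    scores = {}
    for word, tf in counts.items():
        # Assumption: unseen words are treated as occurring in a single text.
        m_w = doc_freq.get(word, 1)
        # Formula (3): tf_w * lg(M / M_w + 0.01), with lg the base-10 logarithm.
        scores[word] = tf * math.log10(total_docs / m_w + 0.01)
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:m]
    total = sum(score for _, score in top)
    if total <= 0:
        return {}
    # Assumption: "normalized" means the selected weights sum to 1.
    return {word: score / total for word, score in top}


# Hypothetical tokenized microblog and corpus statistics.
tokens = ["quake", "quake", "rescue", "the"]
doc_freq = {"quake": 10, "rescue": 100, "the": 1000}
print(tfidf_features(tokens, doc_freq, total_docs=1000, m=2))
```

The resulting `{word: weight}` dictionary is the feature set that step S32 turns into a Word2Vec vector and a VSM vector.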
Further, step S4 is specifically:
Step S41: compute the cosine similarity between the VSM vector in the dual vector model of the topic and the VSM vector in the dual vector model of the microblog; the cosine similarity is computed by formula (6):
$$Sim_{kd} = \frac{\sum_{i=1}^{n} k_i d_i}{\sqrt{\sum_{i=1}^{n} k_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}} \qquad (6)$$
where $Sim_{kd}$ denotes the cosine similarity between vector k and vector d, $k_i$ the value of the i-th dimension of vector k, and $d_i$ the value of the i-th dimension of vector d;
Step S42: compute the cosine similarity between the Word2Vec vector in the dual vector model of the topic and the Word2Vec vector in the dual vector model of the microblog;
Step S43: combine the cosine similarity between the VSM vectors and the similarity between the Word2Vec vectors into the similarity between the topic and the microblog, computed by formula (7), where $Sim$ denotes the similarity between the topic and the microblog, $sim_{vsm}$ the similarity between the VSM vectors in the dual vector models, and $sim_{word2vec}$ the similarity between the Word2Vec vectors in the dual vector models; the larger the value, the more similar the topic and the microblog.
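The cosine similarity used in steps S41 and S42 can be sketched as below. Returning 0 for a zero vector is an assumption made to avoid division by zero. The sketch stops at the two cosine values, because the way they are combined in step S43 is defined by formula (7), which is not reproduced in this text.

```python
import math


def cosine(k, d):
    """Cosine similarity of two equal-length vectors, as in formula (6)."""
    dot = sum(a * b for a, b in zip(k, d))
    norm_k = math.sqrt(sum(a * a for a in k))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_k == 0 or norm_d == 0:
        # Assumption: an all-zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_k * norm_d)


# Hypothetical dual vectors of a topic and a microblog.
topic = {"vsm": [0.6, 0.0, 0.4], "word2vec": [0.6, 0.8]}
post = {"vsm": [0.5, 0.5, 0.0], "word2vec": [0.3, 0.9]}

sim_vsm = cosine(topic["vsm"], post["vsm"])
sim_word2vec = cosine(topic["word2vec"], post["word2vec"])
print(sim_vsm, sim_word2vec)
```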
Further, step S5 is specifically:
Step S51: the similarity threshold is divided into a lowest similarity threshold ε and a feedback threshold δ. The initial feedback threshold of each topic is the average similarity between the initial topic and the initial topic-relevant microblogs. During tracking, the feedback threshold depends on the average similarity between the topic and the feedback microblogs of the previous s time slots; the closer a time slot is in time, the stronger its influence. The thresholds ε and δ are computed as shown in formulas (8) and (9):
$$\varepsilon_t = \delta_t - C \qquad (9)$$
where $t$ denotes the t-th time slot, $\delta_t$ the feedback threshold of time slot t, $feedsim_i$ the average similarity between the topic and the feedback microblogs of the i-th time slot, $\varepsilon_t$ the lowest threshold of time slot t, and $C$ the topic tolerance; the lowest threshold is tied to the feedback threshold and equals the feedback threshold minus the topic tolerance C;
Step S52: if the similarity between a microblog and the topic is greater than the feedback threshold, the microblog is highly relevant to the topic and is added to the feedback microblog set, which is used to generate the new topic model; if the similarity is greater than the lowest threshold, the microblog is judged to be topic-relevant; conversely, if the similarity is not greater than the lowest threshold, the microblog is judged to be topic-irrelevant.
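The decision rule of step S52 and the relation in formula (9) can be sketched as follows. The function names and sample values are illustrative assumptions. The sketch covers only the initial feedback threshold (a plain average) and the comparison itself; the time-weighted update of the feedback threshold follows formula (8), which is not reproduced in this text.

```python
from statistics import mean


def initial_feedback_threshold(initial_sims):
    """Initial feedback threshold: mean similarity of the initial topic-relevant microblogs."""
    return mean(initial_sims)


def judge(sim, feedback_threshold, tolerance):
    """Return (is_relevant, is_feedback) for one microblog in the current time slot."""
    # Formula (9): the lowest threshold is the feedback threshold minus the tolerance C.
    lowest_threshold = feedback_threshold - tolerance
    is_feedback = sim > feedback_threshold
    is_relevant = sim > lowest_threshold
    return is_relevant, is_feedback


# Hypothetical values for the feedback threshold and topic tolerance.
delta = initial_feedback_threshold([0.6, 0.8])
for sim in (0.9, 0.6, 0.3):
    print(sim, judge(sim, delta, tolerance=0.2))
```

Because the feedback threshold is never below the lowest threshold when C is non-negative, every feedback microblog is also topic-relevant, which is why the sketch returns two flags rather than a single label.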
Further, step S6 is specifically:
Step S61: use the BTM topic model to select topic features from the initial topic microblog set and generate the initial topic model;
Step S62: use the BTM topic model to select topic features from the feedback microblog set and generate the dynamic topic model;
Step S63: add the features of the initial topic model and the dynamic topic model to the original topic model; if a feature already exists in the original topic model, update its weight with the maximum weight of that feature across the three models; then sort the features of the original topic model by weight in descending order and select the top T features with their weights as the new topic model, which replaces the original topic model.
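The merge in step S63 can be sketched with plain dictionaries mapping each feature word to its weight. The representation and the function name are assumptions; the BTM step that produces the initial and dynamic models is outside this sketch, and no renormalization is applied after the merge because the text does not mention one.

```python
def update_topic_model(original, initial, dynamic, top_t):
    """Merge three {feature: weight} models, keep the max weight per feature, return the top T."""
    merged = dict(original)
    for model in (initial, dynamic):
        for feature, weight in model.items():
            # A feature present in several models keeps its largest weight.
            merged[feature] = max(merged.get(feature, weight), weight)
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_t])


# Hypothetical feature weights.
original = {"a": 0.5, "b": 0.3}
initial = {"a": 0.4, "c": 0.2}
dynamic = {"b": 0.6, "d": 0.1}
print(update_topic_model(original, initial, dynamic, top_t=3))
```

Keeping the initial topic model in every merge anchors the topic to its starting features, while the dynamic model lets newly prominent features enter the top T.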
Further, step S7 is specifically:
Step S71: check whether all time slots have been processed; if not, move to the next time slot and execute step S72; otherwise end the algorithm, completing the microblog topic tracking;
Step S72: represent the microblogs as vectors using the method for building the microblog dual vector model described in step S22;
Step S73: represent the topic features of the new topic model as vectors using the vectorization method described in step S21;
Step S74: repeat steps S4 to S7.
Compared with the prior art, the invention has the following beneficial effects:
The invention represents topics and microblogs with a dual vector model: word embedding preserves the semantic properties of the text, while VSM vectorization preserves new-word information. By introducing the time attribute, it proposes a strategy for adaptively learning the similarity threshold, which reduces the miss rate of topic-relevant microblogs and improves the performance of the topic tracking algorithm. By dynamically updating the topic model during tracking, it copes with topic drift during topic evolution and reduces the miss rate and false detection rate of topic-relevant microblogs.
Brief description of the drawings
Fig. 1 is the implementation flow chart of an embodiment of the invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawing and embodiments.
Referring to Fig. 1, the present invention provides an adaptive microblog topic tracking method based on a dual vector model, comprising the following steps:
Step S1: partition the microblogs to be tracked into time slots by date, so that microblogs posted on the same day belong to one time slot;
Step S2: build the dual vector model of the initial topic;
Step S21: use the BTM topic model to mine the latent topic-word distribution from randomly selected initial topic microblogs, and select the m words with the highest probability together with their normalized probability values as the features representing the initial topic;
Step S22: represent the feature set as vectors using both the VSM vectorization method and the word embedding vectorization method; the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is derived from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (2):
$$K = \{k_1, k_2, \ldots, k_n\} \qquad (1)$$
$$k_i = \sum_{j=1}^{m} w_{ij} \times rate_j \qquad (2)$$
where $K$ denotes the vector, $n$ the vector dimension, $k_i$ the value of the i-th dimension of $K$, $m$ the number of feature words, $w_{ij}$ the value of the i-th dimension of the word vector of the j-th feature word, and $rate_j$ the feature weight of the j-th feature word. The VSM vector is built according to the vector space model: each feature corresponds to one dimension of the vector, and the value in that dimension is the weight of the corresponding feature, or 0 if the feature does not appear in the text.
Step S3: build the dual vector model of the microblogs;
Step S31: use the TFIDF algorithm to extract from the microblog the m words with the largest TFIDF values, and take the corresponding normalized TFIDF values as the feature weights; the TFIDF value is computed as shown in formula (3):
$$TFIDF_w = tf_w \times \lg(M / M_w + 0.01) \qquad (3)$$
where $TFIDF_w$ denotes the TFIDF value of word w, $tf_w$ the frequency of word w in the current microblog, $M$ the total number of microblogs, and $M_w$ the number of texts containing word w.
Step S32: represent the feature set as vectors using both the VSM vectorization method and the word embedding vectorization method; the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is derived from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (5):
$$P = \{p_1, p_2, \ldots, p_n\} \qquad (4)$$
$$p_i = \sum_{j=1}^{m} w_{ij} \times rate_j \qquad (5)$$
where $P$ denotes the vector, $n$ the vector dimension, $p_i$ the value of the i-th dimension of $P$, $m$ the number of feature words, $w_{ij}$ the value of the i-th dimension of the word vector of the j-th feature word, and $rate_j$ the feature weight of the j-th feature word. The VSM vector is built according to the vector space model: each feature corresponds to one dimension of the vector, and the value in that dimension is the weight of the corresponding feature, or 0 if the feature does not appear in the text.
Step S4: compute the similarity between the topic and each microblog from the dual vector model of the initial topic and the dual vector model of the microblog;
Step S41: compute the cosine similarity between the VSM vector in the dual vector model of the topic and the VSM vector in the dual vector model of the microblog; the cosine similarity is computed by formula (6):
$$Sim_{kd} = \frac{\sum_{i=1}^{n} k_i d_i}{\sqrt{\sum_{i=1}^{n} k_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}} \qquad (6)$$
where $Sim_{kd}$ denotes the cosine similarity between vector k and vector d, $k_i$ the value of the i-th dimension of vector k, and $d_i$ the value of the i-th dimension of vector d;
Step S42: compute the cosine similarity between the Word2Vec vector in the dual vector model of the topic and the Word2Vec vector in the dual vector model of the microblog;
Step S43: combine the cosine similarity between the VSM vectors and the similarity between the Word2Vec vectors into the similarity between the topic and the microblog, computed by formula (7), where $Sim$ denotes the similarity between the topic and the microblog, $sim_{vsm}$ the similarity between the VSM vectors in the dual vector models, and $sim_{word2vec}$ the similarity between the Word2Vec vectors in the dual vector models; the larger the value, the more similar the topic and the microblog.
Step S5: based on the computed similarity between topic and microblog, adaptively learn the similarity threshold and compare the similarity against it;
Step S51: the similarity threshold is divided into a lowest similarity threshold ε and a feedback threshold δ. The initial feedback threshold of each topic is the average similarity between the initial topic and the initial topic-relevant microblogs. During tracking, the feedback threshold depends on the average similarity between the topic and the feedback microblogs of the previous s time slots; the closer a time slot is in time, the stronger its influence. The thresholds ε and δ are computed as shown in formulas (8) and (9):
$$\varepsilon_t = \delta_t - C \qquad (9)$$
where $t$ denotes the t-th time slot, $\delta_t$ the feedback threshold of time slot t, $feedsim_i$ the average similarity between the topic and the feedback microblogs of the i-th time slot, $\varepsilon_t$ the lowest threshold of time slot t, and $C$ the topic tolerance; the lowest threshold is tied to the feedback threshold and equals the feedback threshold minus the topic tolerance C;
Step S52: if the similarity between a microblog and the topic is greater than the feedback threshold, the microblog is highly relevant to the topic and is added to the feedback microblog set, which is used to generate the new topic model; if the similarity is greater than the lowest threshold, the microblog is judged to be topic-relevant; conversely, if the similarity is not greater than the lowest threshold, the microblog is judged to be topic-irrelevant. The feedback threshold selects the microblogs that are highly relevant to the topic as feedback microblogs for updating the topic model. The lowest similarity threshold is the minimum bound for a microblog to belong to the topic, and the feedback threshold is greater than the lowest threshold.
Step S6: update the topic model;
Step S61: use the BTM topic model to select topic features from the initial topic microblog set and generate the initial topic model;
Step S62: use the BTM topic model to select topic features from the feedback microblog set and generate the dynamic topic model;
Step S63: add the features of the initial topic model and the dynamic topic model to the original topic model; if a feature already exists in the original topic model, update its weight with the maximum weight of that feature across the three models; then sort the features of the original topic model by weight in descending order and select the top T features with their weights as the new topic model, which replaces the original topic model. To improve the efficiency of the topic tracking method, topic model updates are gated by a time condition and a feedback microblog count threshold feed (feed is set to 10). If the topic were updated whenever a feedback microblog was added, updates would be too frequent and would hurt tracking efficiency. Moreover, if too few feedback microblogs are added in a time slot, they may be noise, so the topic is not updated. Therefore, at the end of a time slot, the topic is updated if the number of newly added feedback microblogs is greater than feed; otherwise it is not updated. In general, 20 features are enough to represent a topic, so T is set to 20.
Step S7: check whether all time slots have been processed; if not, move to the next time slot; otherwise end the algorithm, completing the microblog topic tracking:
Step S71: check whether all time slots have been processed; if not, move to the next time slot and execute step S72; otherwise end the algorithm, completing the microblog topic tracking;
Step S72: represent the microblogs as vectors using the method for building the microblog dual vector model described in step S22;
Step S73: represent the topic features of the new topic model as vectors using the vectorization method described in step S21;
Step S74: repeat steps S4 to S7.
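The update gating described under step S63 (update only when a time slot contributes more than feed = 10 feedback microblogs) can be sketched as a loop over time slots. The sketch takes precomputed similarity values per slot and, as a simplifying assumption, holds the feedback threshold fixed; in the method itself the threshold is re-learned per slot by formula (8), which is not reproduced in this text.

```python
def slots_triggering_update(slot_similarities, feedback_threshold, feed=10):
    """Return the indices of the time slots after which the topic model would be updated."""
    updated = []
    for index, sims in enumerate(slot_similarities):
        # Feedback microblogs are those whose similarity exceeds the feedback threshold.
        feedback_count = sum(1 for sim in sims if sim > feedback_threshold)
        # Update only if strictly more than `feed` feedback microblogs were added.
        if feedback_count > feed:
            updated.append(index)
    return updated


# Hypothetical similarities: 11 strong posts, 10 strong posts, then 50 weak posts.
slot_similarities = [[0.9] * 11, [0.9] * 10, [0.1] * 50]
print(slots_triggering_update(slot_similarities, feedback_threshold=0.7))
```

With these sample values only the first slot exceeds the count threshold, so the second slot, with exactly 10 feedback microblogs, leaves the topic model unchanged.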
The foregoing describes only preferred embodiments of the present invention; all equivalent changes and modifications made within the scope of the claims of the present invention fall within the scope of the present invention.

Claims (7)

1. An adaptive microblog topic tracking method based on a dual vector model, characterized by comprising the following steps:
Step S1: partition the microblogs to be tracked into time slots by date, so that microblogs posted on the same day belong to one time slot;
Step S2: build the dual vector model of the initial topic;
Step S3: build the dual vector model of the microblogs;
Step S4: compute the similarity between the topic and each microblog from the dual vector model of the initial topic and the dual vector model of the microblog;
Step S5: based on the computed similarity between topic and microblog, adaptively learn the similarity threshold and compare the similarity against it;
Step S6: update the topic model;
Step S7: check whether all time slots have been processed; if not, move to the next time slot; otherwise end the algorithm, completing the microblog topic tracking.
2. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that the dual vector model of the initial topic is constructed as follows:
Step S21: use the BTM topic model to mine the latent topic-word distribution from randomly selected initial topic microblogs, and select the m words with the highest probability together with their normalized probability values as the features representing the initial topic;
Step S22: represent the feature set as vectors using both the VSM vectorization method and the word embedding vectorization method; the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is derived from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (2):
$$K = \{k_1, k_2, \ldots, k_n\} \qquad (1)$$
$$k_i = \sum_{j=1}^{m} w_{ij} \times rate_j \qquad (2)$$
where $K$ denotes the vector, $n$ the vector dimension, $k_i$ the value of the i-th dimension of $K$, $m$ the number of feature words, $w_{ij}$ the value of the i-th dimension of the word vector of the j-th feature word, and $rate_j$ the feature weight of the j-th feature word. The VSM vector is built according to the vector space model: each feature corresponds to one dimension of the vector, and the value in that dimension is the weight of the corresponding feature, or 0 if the feature does not appear in the text.
3. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that the dual vector model of the initial topic microblog is constructed as follows:
Step S31: use the TFIDF algorithm to extract from the microblog the m words with the largest TFIDF values, and take the corresponding normalized TFIDF values as the feature weights; the TFIDF value is computed as shown in formula (3):
$$TFIDF_w = tf_w \times \lg(M / M_w + 0.01) \qquad (3)$$
where $TFIDF_w$ denotes the TFIDF value of word w, $tf_w$ the frequency of word w in the current microblog, $M$ the total number of microblogs, and $M_w$ the number of texts containing word w;
Step S32: represent the feature set as vectors using both the VSM vectorization method and the word embedding vectorization method; the dual vector model consists of a Word2Vec vector and a VSM vector. The Word2Vec vector is derived from the word vectors of the feature words: it equals the sum of the word vectors of the m most representative feature words, each multiplied by its feature weight, as shown in formula (5):
$$P = \{p_1, p_2, \ldots, p_n\} \qquad (4)$$
$$p_i = \sum_{j=1}^{m} w_{ij} \times rate_j \qquad (5)$$
where $P$ denotes the vector, $n$ the vector dimension, $p_i$ the value of the i-th dimension of $P$, $m$ the number of feature words, $w_{ij}$ the value of the i-th dimension of the word vector of the j-th feature word, and $rate_j$ the feature weight of the j-th feature word. The VSM vector is built according to the vector space model: each feature corresponds to one dimension of the vector, and the value in that dimension is the weight of the corresponding feature, or 0 if the feature does not appear in the text.
4. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that step S4 is specifically:
Step S41: compute the cosine similarity between the VSM vector in the dual vector model of the topic and the VSM vector in the dual vector model of the microblog; the cosine similarity is computed by formula (6):
$$Sim_{kd} = \frac{\sum_{i=1}^{n} k_i d_i}{\sqrt{\sum_{i=1}^{n} k_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}} \qquad (6)$$
where $Sim_{kd}$ denotes the cosine similarity between vector k and vector d, $k_i$ the value of the i-th dimension of vector k, and $d_i$ the value of the i-th dimension of vector d;
Step S42: compute the cosine similarity between the Word2Vec vector in the dual vector model of the topic and the Word2Vec vector in the dual vector model of the microblog;
Step S43: combine the cosine similarity between the VSM vectors and the similarity between the Word2Vec vectors into the similarity between the topic and the microblog, computed by formula (7), where $Sim$ denotes the similarity between the topic and the microblog, $sim_{vsm}$ the similarity between the VSM vectors in the dual vector models, and $sim_{word2vec}$ the similarity between the Word2Vec vectors in the dual vector models; the larger the value, the more similar the topic and the microblog.
5. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that step S5 is specifically:
Step S51: the similarity threshold is divided into a lowest similarity threshold ε and a feedback threshold δ. The initial feedback threshold of each topic is the average similarity between the initial topic and the initial topic-relevant microblogs. During tracking, the feedback threshold depends on the average similarity between the topic and the feedback microblogs of the previous s time slots; the closer a time slot is in time, the stronger its influence. The thresholds ε and δ are computed as shown in formulas (8) and (9):
$$\varepsilon_t = \delta_t - C \qquad (9)$$
where $t$ denotes the t-th time slot, $\delta_t$ the feedback threshold of time slot t, $feedsim_i$ the average similarity between the topic and the feedback microblogs of the i-th time slot, $\varepsilon_t$ the lowest threshold of time slot t, and $C$ the topic tolerance; the lowest threshold is tied to the feedback threshold and equals the feedback threshold minus the topic tolerance C;
Step S52: if the similarity between a microblog and the topic is greater than the feedback threshold, the microblog is highly relevant to the topic and is added to the feedback microblog set, which is used to generate the new topic model; if the similarity is greater than the lowest threshold, the microblog is judged to be topic-relevant; conversely, if the similarity is not greater than the lowest threshold, the microblog is judged to be topic-irrelevant.
6. The adaptive microblog topic tracking method based on a dual vector model according to claim 1, characterized in that step S6 specifically comprises:
Step S61: using the BTM topic model, select topic features from the initial topic microblog set to generate an initial topic model;
Step S62: using the BTM topic model, select topic features from the feedback microblog set to generate a dynamic topic model;
Step S63: add the features of the initial topic model and the dynamic topic model to the original topic model; if a feature already exists in the original topic model, update its weight in the original topic model to the maximum weight of that feature across the three models; then sort the features of the original topic model by weight in descending order and select the top T features and their weights as the new topic model, which replaces the original topic model.
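The merge in step S63 can be sketched in Python as follows. This is only an illustration under stated assumptions: each topic model is represented as a plain dict mapping a feature word to its weight, the function name `update_topic_model` is hypothetical, and a feature that appears in both the initial and dynamic models but not in the original model is also given its maximum weight (the claim only specifies the maximum rule for features already in the original model).

```python
def update_topic_model(original: dict, initial: dict, dynamic: dict, top_t: int) -> dict:
    """Merge three feature->weight models and keep the top_t heaviest features."""
    merged = dict(original)  # copy so the caller's original model is not mutated
    for model in (initial, dynamic):
        for feature, weight in model.items():
            # Keep the maximum weight seen for this feature across the models.
            merged[feature] = max(weight, merged.get(feature, weight))
    # Sort by weight in descending order and keep the first top_t features.
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:top_t])
```

With `original = {"quake": 0.5, "rescue": 0.2}`, `initial = {"quake": 0.4, "aid": 0.3}`, `dynamic = {"quake": 0.7, "flood": 0.1}` and `top_t = 3`, the new model keeps "quake" (0.7), "aid" (0.3) and "rescue" (0.2) and drops "flood".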
7. The adaptive microblog topic tracking method based on a dual vector model according to claim 2, characterized in that step S7 specifically comprises:
Step S71: determine whether all time slots have been processed; if not, move to the next time slot and execute step S72; otherwise, end the algorithm, completing the microblog topic tracking;
Step S72: represent the microblogs as vectors using the microblog dual-vector-model construction method described in step S22;
Step S73: represent the topic features of the new topic model as vectors using the vectorization method described in step S21;
Step S74: repeat steps S4-S7.
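The control flow of steps S71-S74 amounts to a loop over time slots, which the following Python sketch illustrates. The function `track_topic` and its callable parameters (`vectorize_microblog` for step S22, `vectorize_topic` for step S21, `process_slot` for steps S4-S6) are hypothetical placeholders supplied by the caller, not names taken from the patent.

```python
def track_topic(slots, topic_model, vectorize_microblog, vectorize_topic, process_slot):
    """Iterate over time slots, re-vectorizing the topic after each model update.

    slots: iterable of lists of microblogs, one list per time slot.
    process_slot: callable(vectors, topic_vector, topic_model)
                  -> (relevant_items, updated_topic_model); stands in for S4-S6.
    """
    history = []
    for microblogs in slots:  # S71: continue while unprocessed slots remain
        vectors = [vectorize_microblog(m) for m in microblogs]  # S72
        topic_vector = vectorize_topic(topic_model)             # S73
        # S74: similarity, threshold decision and model update for this slot
        relevant, topic_model = process_slot(vectors, topic_vector, topic_model)
        history.append(relevant)
    return history, topic_model  # S71: all slots processed, tracking ends
```

Passing the stages in as callables keeps the loop independent of how the dual vector model, the similarity, and the topic model update are actually implemented.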
CN201811106923.XA 2018-09-21 2018-09-21 Adaptive microblog topic tracking method based on dual vector model Active CN109284379B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811106923.XA CN109284379B (en) 2018-09-21 2018-09-21 Adaptive microblog topic tracking method based on dual vector model

Publications (2)

Publication Number Publication Date
CN109284379A true CN109284379A (en) 2019-01-29
CN109284379B CN109284379B (en) 2022-01-04

Family

ID=65181961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811106923.XA Active CN109284379B (en) 2018-09-21 2018-09-21 Adaptive microblog topic tracking method based on dual vector model

Country Status (1)

Country Link
CN (1) CN109284379B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107562919A (en) * 2017-09-13 2018-01-09 云南大学 A kind of more indexes based on information retrieval integrate software component retrieval method and system
CN107609121A (en) * 2017-09-14 2018-01-19 深圳市玛腾科技有限公司 Newsletter archive sorting technique based on LDA and word2vec algorithms
US20180032606A1 (en) * 2016-07-26 2018-02-01 Qualtrics, Llc Recommending topic clusters for unstructured text documents
US20180068371A1 (en) * 2016-09-08 2018-03-08 Adobe Systems Incorporated Learning Vector-Space Representations of Items for Recommendations using Word Embedding Models
CN108062307A (en) * 2018-01-04 2018-05-22 中国科学技术大学 The text semantic steganalysis method of word-based incorporation model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DANIEL MORARIU ET AL.: "An Extension of the VSM Documents Representation using Word Embedding", 《PROCEEDINGS OF THE BRCEBE-ICEBE’17 CONFERENCE, SIBIU, ROMANIA》 *
唐明等: "基于Word2Vec的一种文档向量表示", 《计算机科学》 *
武军娜: "自适应话题跟踪技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
程林骏: "基于多源数据的话题检测与追踪研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Also Published As

Publication number Publication date
CN109284379B (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN111191466B (en) Homonymous author disambiguation method based on network characterization and semantic characterization
CN109657135B (en) Scholars user portrait information extraction method and model based on neural network
CN104462253B (en) A kind of topic detection or tracking of network-oriented text big data
CN108052593A (en) A kind of subject key words extracting method based on descriptor vector sum network structure
CN106815310A (en) A kind of hierarchy clustering method and system to magnanimity document sets
CN104376406A (en) Enterprise innovation resource management and analysis system and method based on big data
CN106383877A (en) On-line short text clustering and topic detection method of social media
CN105653668A (en) Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment
CN111522923B (en) A Multi-Task Dialogue State Tracking Method
CN104298765A (en) Dynamic recognizing and tracking method of internet public opinion topics
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN105956093A (en) Individual recommending method based on multi-view anchor graph Hash technology
Zhang et al. Enhancing video event recognition using automatically constructed semantic-visual knowledge base
CN106202480A (en) A kind of network behavior based on K means and LDA bi-directional verification custom clustering method
CN108647322A (en) The method that word-based net identifies a large amount of Web text messages similarities
An et al. A heuristic approach on metadata recommendation for search engine optimization
CN107038184A (en) A kind of news based on layering latent variable model recommends method
CN115408621B (en) Point-of-interest recommendation method considering linear and nonlinear interaction of auxiliary information features
CN110532450A (en) A Method of Topic Crawling Based on Improved Shark Search
CN112561599A (en) Click rate prediction method based on attention network learning and fusing domain feature interaction
CN113051397A (en) Academic paper homonymy disambiguation method based on heterogeneous information network representation learning and word vector representation
Sun et al. A hybrid approach to news recommendation based on knowledge graph and long short-term user preferences
CN110717068B (en) Video retrieval method based on deep learning
CN110674313B (en) Method for dynamically updating knowledge graph based on user log
Izquierdo et al. Word vs. class-based word sense disambiguation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant