CN103678642A

CN103678642A - Concept semantic similarity measurement method based on search engine

Info

Publication number: CN103678642A
Application number: CN201310713182.2A
Authority: CN
Inventors: 徐峥; 齐力; 梅林; 胡传平; 支凤麟; 梁辰; 骆祥峰; 魏晓; 张顺香
Original assignee: Third Research Institute of the Ministry of Public Security
Current assignee: Third Research Institute of the Ministry of Public Security
Priority date: 2013-12-20
Filing date: 2013-12-20
Publication date: 2014-03-26

Abstract

The invention discloses a concept semantic similarity measurement method based on a search engine. The method integrates page counts, semantic fragments, and the number of displayed search results. It effectively removes the noise and redundancy in search engine data, which prior-art methods do not handle.

Description

A concept semantic similarity measurement method based on a search engine

Technical field

The present invention relates to data mining, and specifically to a method for measuring the semantic similarity of concepts.

Background technology

In Web mining, information retrieval, and natural language processing, accurately measuring the semantic similarity between concepts is an important problem. Web mining applications such as community extraction, relation detection, and concept disambiguation require the semantic similarity between concepts or entities to be measured accurately. In information retrieval, a central problem is to retrieve a set of semantically relevant documents for a user query. For natural language processing tasks such as word sense disambiguation, textual entailment, and automatic text summarization, efficiently estimating the semantic similarity between words is vital.

Earlier research produced many Web-based semantic similarity measures, which fall into three main groups:

(1) Measures based on the number of web pages returned by a search engine: the larger the number returned, the greater the similarity between the concepts.

(2) Measures that download a number of top-ranked documents and then apply text-processing techniques to them. These measures rest on the assumption that similar contexts imply similar meanings, i.e., that words appearing in similar lexical environments are closely related semantically.

(3) Measures that combine (1) and (2).

In summary, these methods measure the semantic similarity of concepts, but methods that measure subjective and objective association relations seldom remove the noise and redundancy of web page fragments.

Many concept semantic similarity measures have been proposed. They fall into two main groups: taxonomy-based methods and Web-based methods. Taxonomy-based methods compute semantic similarity from information theory and a hierarchical taxonomy. Web-based methods, in contrast, treat the Web as a dynamic corpus that is updated in real time and compute semantic similarity from that corpus.

Information content can be used to evaluate concept semantic similarity. The information content of a concept C is the negative log-likelihood of C, where the likelihood is the probability that C occurs; thesaurus-based similarity software has been developed that measures the semantic similarity of a pair of concepts on this basis. The taxonomic distance between two words is, however, a more natural and direct way to measure semantic similarity: the shorter the path from one word to the other, the more similar they are. Taking into account link type, depth, and density, measuring concept semantic similarity with formulas for edge density, edge depth, and edge strength is also a sound approach. A model that combines information content with the distance between two words can measure concept semantic similarity, as can vector space models and random walks. Earlier work also explored definitions of semantic similarity over multiple information resources, consisting of the structured semantic information of a thesaurus taxonomy and the information content of a corpus; to assess the usefulness of these resources, a large number of techniques using the various possible information resources were implemented. However, new words are constantly created, and new meanings are assigned to existing words. Manually maintaining a thesaurus so that it captures new words and new meanings is costly, if it is possible at all, which makes taxonomy-based methods inflexible for Web-related tasks.

Unlike taxonomy-based methods, the pointwise mutual information method identifies synonyms using the hit counts returned by a Web search engine. Co-occurrence double checking uses the Web as a continuously updated corpus, and its core is the ranking algorithm of the search engine. A similarity kernel function can define the semantic similarity of concepts retrieved through Google; in a large-scale system it serves to suggest related queries to search engine users. A corpus-based method called second-order co-occurrence PMI computes the semantic similarity of two target words by using mutual information to sort lists of important neighboring words of the two targets. The page counts and fragments provided by a Web search engine can also be used to measure semantic similarity. This method relies on syntactic patterns extracted automatically from the fragments: 200 patterns are extracted from the 900 top-ranked fragments, and these 200 patterns are selected from 4,562,471 unique patterns. Because the top-ranked patterns change over time, the large set of unique patterns must be regenerated, which makes the method very time-consuming; pattern extraction therefore strongly limits this method.

In summary, existing Web-based methods for measuring semantic similarity lack mechanisms for handling the noise and redundancy in Web data.

Summary of the invention

To address the inability of existing semantic similarity measures to handle noise and redundancy in Web data, the object of the present invention is to provide a concept semantic similarity measurement method based on a search engine that effectively removes the noise and redundancy present in search engine data.

To achieve the above object, the present invention adopts the following technical solution:

A concept semantic similarity measurement method based on a search engine, the method comprising the following steps:

(1) web page counting: searching for the concepts concerned with a search engine and returning the corresponding number of web pages;

(2) semantic fragments: obtaining, through the search engine, the semantic fragments that contain all of the concepts, and calculating the proportion of such fragments among all semantic fragments returned by the search;

(3) number of displayed search results: searching with the search engine, displaying the results found, and providing the number of displayed results;

(4) calculating the concept semantic similarity from the results provided by steps (1) to (3).

In a preferred embodiment of the present invention, in step (1) the search engine is queried for concept p and concept q, whose similarity is to be measured, and also for concept p ∧ q, which denotes the co-occurrence of concept p and concept q.

Further, in step (2), concept p ∧ q is searched with the search engine, the returned web pages are examined, and the proportion of the top n ranked fragments in which p and q co-occur is calculated and denoted SS(p ∧ q).

Further, in step (3), the "repeat search" interface provided by the search engine is used to omit entries similar to search results already displayed.

Further, in step (4), the results obtained in steps (2) and (3) are used to remove noise and redundancy, respectively, from the web page counts returned in step (1), and the pointwise mutual information method is applied to the processed results to measure semantic similarity.

The measurement method provided by the above solution is a new method formed by integrating page counts, semantic fragments, and the number of displayed search results. It removes noise in Web fragments by requiring that concept p and concept q both appear in a sentence of a semantic fragment, i.e., that they co-occur within one sentence. It removes redundancy in Web fragments by using the "repeat search" interface provided by the search engine, which omits entries similar to search results already displayed. The solution thus effectively removes the noise and redundancy present in search engine data and greatly improves the efficiency and precision of concept semantic similarity measurement.

Brief description of the drawings

The invention is further described below with reference to the drawing and specific embodiments.

Fig. 1 is a schematic block diagram of an implementation of the invention.

Embodiment

To make the technical means, inventive features, objects, and effects of the present invention easy to understand, the invention is further described below with reference to the drawing.

The object of the invention is to provide a method that computes concept semantic similarity while removing the noise and redundancy of web page fragments. To this end, the method provided by the invention is as follows.

Semantic similarity is the degree of matching between information concepts expressed in a machine-processable form. The invention provides a search-engine-based method for measuring concept semantic similarity. The method mainly comprises three steps: A, web page counting; B, semantic fragment processing; and C, counting the displayed search results.

In step A, web page counting, the corresponding concepts are searched with a Web search engine and the number of web pages it returns is recorded. Specifically, the Web search engine is queried for concept p, for concept q, and for concept p ∧ q, which denotes the co-occurrence of p and q. The query for p yields the total number of search results N(p), the query for q yields N(q), and the query for p ∧ q yields N(p ∧ q).
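The counting step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `SearchClient` interface, its `total_hits` method, the quoted `AND` query syntax for p ∧ q, and the `FakeClient` stand-in are all hypothetical names introduced here, since the source does not specify how the search engine is accessed.

```python
from typing import Dict, Protocol, Tuple


class SearchClient(Protocol):
    """Hypothetical search engine interface (an assumption, not a real API)."""

    def total_hits(self, query: str) -> int:
        """Return the total number of results reported for a query."""
        ...


def page_counts(client: SearchClient, p: str, q: str) -> Tuple[int, int, int]:
    """Return (N(p), N(q), N(p AND q)) for two concepts.

    The conjunctive query string is an assumed syntax for p ∧ q.
    """
    joint_query = f'"{p}" AND "{q}"'
    return (
        client.total_hits(p),
        client.total_hits(q),
        client.total_hits(joint_query),
    )


class FakeClient:
    """In-memory stand-in used only to exercise the sketch."""

    def __init__(self, hits: Dict[str, int]) -> None:
        self.hits = hits

    def total_hits(self, query: str) -> int:
        # Unknown queries are treated as having no results.
        return self.hits.get(query, 0)
```

A real client would wrap whatever search service is available; keeping the interface this small makes the later steps independent of the engine used.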

Applying the pointwise mutual information (PMI) method to the search result counts obtained in web page counting gives PMI(p, q), a measure of the similarity between concept p and concept q. Let N be the number of web pages indexed by the search engine (N = 10^{11}). PMI(p, q) is computed by multiplying N by the ratio of the number of pages in which p and q co-occur to the product of the page counts of p and q, taking the logarithm of the result, and dividing by the logarithm of N:

$$\mathrm{PMI}(p, q) = \frac{\log\left(\dfrac{N \cdot N(p \wedge q)}{N(p)\, N(q)}\right)}{\log N}, \qquad N = 10^{11}.$$
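The PMI computation described above can be sketched in a few lines. Two points are assumptions made for this illustration: the total page count N is taken to be 10^11, and the function returns 0 when any count is zero, a case the source does not address.

```python
import math

# Assumed total number of web pages indexed by the search engine.
N_TOTAL = 10**11


def pmi(n_p: float, n_q: float, n_pq: float, n_total: float = N_TOTAL) -> float:
    """Normalized PMI from page counts N(p), N(q), and N(p AND q).

    Computes log(N * N(p AND q) / (N(p) * N(q))) / log(N).
    """
    if n_p <= 0 or n_q <= 0 or n_pq <= 0:
        # Guard added for this sketch: the logarithm is undefined here.
        return 0.0
    return math.log(n_total * n_pq / (n_p * n_q)) / math.log(n_total)
```

Dividing by log N makes the value independent of the logarithm base, so natural logarithms are used here without loss of generality.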

However, computing the measure directly from the search result counts obtained in web page counting cannot remove the noise and redundancy present in search engine data.

For this reason, the solution provided by the invention uses step B, semantic fragment processing, and step C, counting the displayed search results, to remove the noise and the redundancy, respectively, present in search engine data.

First, regarding semantic fragment processing: a semantic fragment is a passage of semantic information, provided by the Web search engine in response to a search, that is similar to the searched content.

In this solution, concept p and concept q must both appear in one sentence, that is, co-occur within a single declarative, exclamatory, or other sentence. If p and q do not appear in the same sentence, the information returned by the search engine may concern only p or only q, or it may contain both p and q without relating them to each other. Requiring that p and q both appear in a sentence of the semantic fragment therefore allows the number of web pages in which p and q co-occur, as used in the PMI(p, q) formula, to be computed accurately.

Specifically, in this solution, concept p ∧ q is queried, and the proportion of the top n ranked fragments returned in which p and q co-occur is calculated and denoted SS(p ∧ q). SS(p ∧ q) * N(p ∧ q) then replaces N(p ∧ q) in the PMI(p, q) formula, which removes the noise in Web fragments.

Because this solution is based on a search engine, the n fragments are the search results that the engine presents in fragment form after the user enters keywords. The user reads a fragment to judge whether it is the desired content and, if so, clicks it to open the corresponding web page.

The fragments returned for the keywords do not necessarily all contain both p and q, so the ratio is calculated as: ratio = (number of fragments containing both concept p and concept q) / (total number of fragments returned by the search engine).
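The ratio SS(p ∧ q) might be computed as in the sketch below, which counts a fragment only when both concepts occur in the same sentence, as the text requires. The sentence splitting on terminal punctuation and the case-insensitive substring matching are simplifying assumptions for illustration; the source does not say how sentences are delimited or how concepts are matched.

```python
import re
from typing import Sequence

# Assumed sentence delimiters (ASCII and full-width terminal punctuation).
_SENTENCE_END = re.compile(r"[.!?。！？]+")


def cooccur_in_sentence(fragment: str, p: str, q: str) -> bool:
    """True if p and q both appear in at least one sentence of the fragment."""
    p, q = p.lower(), q.lower()
    sentences = _SENTENCE_END.split(fragment.lower())
    return any(p in s and q in s for s in sentences)


def fragment_ratio(fragments: Sequence[str], p: str, q: str) -> float:
    """SS(p AND q): share of the given top-n fragments where p and q co-occur."""
    if not fragments:
        return 0.0
    hits = sum(1 for f in fragments if cooccur_in_sentence(f, p, q))
    return hits / len(fragments)
```

Substring matching also counts inflected forms such as "cars" as occurrences of "car"; a stricter tokenizer could be substituted without changing the rest of the computation.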

Next, for counting the displayed search results, the Web search engine is queried, the results are displayed, and the number of displayed results is provided; this number excludes entries similar to results already displayed. The number of displayed results comes from the "repeat search" interface of the Web search engine (e.g., Google), which omits entries similar to search results already displayed. If this interface is not used, the number of pages returned by the search engine can reach 1000, and the returned results do not necessarily correspond to the searched content. Using the displayed search results therefore improves the web page counts used in the PMI(p, q) formula.

Specifically, in this solution, the numbers of displayed search results for concept p, concept q, and concept p ∧ q are obtained and denoted R(p), R(q), and R(p ∧ q). R(p) * N(p), R(q) * N(q), and R(p ∧ q) * N(p ∧ q) then replace N(p), N(q), and N(p ∧ q), respectively, in the PMI(p, q) formula, which removes the redundancy in Web fragments.

The above solution is further illustrated below with a concrete measurement example.

The example is implemented on a concept semantic similarity measurement system that mainly comprises a web page counting module, a semantic fragment processing module, a displayed search result counting module, and a similarity calculation module, each of which implements the corresponding function described above.

Referring to Fig. 1, which shows the process of measuring the semantic similarity of concept p and concept q on this system, the detailed process is as follows:

Step 1: the web page counting module works with a Web search engine (the Google search engine here and below) to search for concept p, concept q, and concept p ∧ q, which denotes the co-occurrence of concept p and concept q.

Step 2: the web page counting module searches for concept p with the Web search engine and obtains the total number of search results N(p); searches for concept q and obtains the total number of search results N(q); and searches for concept p ∧ q and obtains the total number of search results N(p ∧ q).

Step 3: a threshold α is set.

The threshold α is used for the comparison in step 5, and its value is set according to the specific requirements.

Step 4: the semantic fragment processing module determines the proportion of semantic fragments in which concept p and concept q co-occur within a sentence. From the results of the search for concept p ∧ q in step 1, it takes the top n ranked fragments returned and calculates the proportion of them in which p ∧ q occurs, denoted SS(p ∧ q).

Step 5: the semantic fragment processing module compares the calculated ratio SS(p ∧ q) with the threshold α set earlier. If SS(p ∧ q) > α, the process proceeds to step 6; otherwise the semantic similarity of concept p and concept q is taken to be SPPMI(p, q) = 0.

Step 6: the displayed search result counting module works with the search engine to count the displayed search results for concept p, concept q, and concept p ∧ q, denoted R(p), R(q), and R(p ∧ q), respectively.

Step 7: the similarity calculation module receives the data produced by the web page counting module, the semantic fragment processing module, and the displayed search result counting module, and from these data calculates N(p) * R(p), N(q) * R(q), and SS(p ∧ q) * N(p ∧ q) * R(p ∧ q).

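Step 8: the similarity calculation module calculates the semantic similarity SPPMI(p, q) of concept p and concept q by substituting the results of step 7 for N(p), N(q), and N(p ∧ q) in the PMI(p, q) formula, where N is the number of web pages indexed by the search engine:

$$\mathrm{SPPMI}(p, q) = \frac{\log\left(\dfrac{N \cdot SS(p \wedge q)\, N(p \wedge q)\, R(p \wedge q)}{N(p)\, R(p) \cdot N(q)\, R(q)}\right)}{\log N}$$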
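Steps 5 to 8 can be sketched as a single function. This is an illustration under stated assumptions rather than the patent's own code: the formula is taken to be the PMI formula with the substitutions described in the text, N is assumed to be 10^11, and the zero-count guard is an addition for this sketch. The parameter names are hypothetical.

```python
import math


def sppmi(
    n_p: float, n_q: float, n_pq: float,   # page counts N(p), N(q), N(p AND q)
    r_p: float, r_q: float, r_pq: float,   # displayed-result counts R(.)
    ss_pq: float,                          # fragment ratio SS(p AND q)
    alpha: float,                          # threshold from step 3
    n_total: float = 10**11,               # assumed total indexed pages N
) -> float:
    """Noise- and redundancy-adjusted PMI, following steps 5 to 8."""
    # Step 5: below or at the threshold, the similarity is defined as 0.
    if ss_pq <= alpha:
        return 0.0
    # Step 7: adjusted counts.
    adj_p = n_p * r_p
    adj_q = n_q * r_q
    adj_pq = ss_pq * n_pq * r_pq
    if adj_p <= 0 or adj_q <= 0 or adj_pq <= 0:
        # Guard added for this sketch: the logarithm is undefined here.
        return 0.0
    # Step 8: PMI formula applied to the adjusted counts.
    return math.log(n_total * adj_pq / (adj_p * adj_q)) / math.log(n_total)
```

For example, with N(p) = N(q) = 10^6, N(p ∧ q) = 10^5, R(p) = R(q) = 100, R(p ∧ q) = 50, SS(p ∧ q) = 0.5, and α = 0.1, the argument of the logarithm is 25, so the result is log 25 / log 10^11.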

The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the claimed scope of the invention. The claimed scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A concept semantic similarity measurement method based on a search engine, characterized in that the method comprises the following steps:

(1) web page counting: searching for the concepts concerned with a search engine and returning the corresponding number of web pages;

(2) semantic fragments: obtaining, through the search engine, the semantic fragments that contain all of the concepts, and calculating the proportion of such fragments among all semantic fragments returned by the search engine;

(3) number of displayed search results: searching with the search engine, displaying the results found, and providing the number of displayed results;

(4) calculating the concept semantic similarity from the results provided by steps (1) to (3).

2. The concept semantic similarity measurement method based on a search engine according to claim 1, characterized in that in step (1) the search engine is queried for concept p and concept q, whose similarity is to be measured, and also for concept p∧q, which denotes the co-occurrence of concept p and concept q.

3. The concept semantic similarity measurement method based on a search engine according to claim 1, characterized in that in step (2) concept p∧q is searched with the search engine, the returned web pages are examined, and the proportion of the top n ranked fragments in which p and q co-occur is calculated and denoted SS(p∧q).

4. The concept semantic similarity measurement method based on a search engine according to claim 1, characterized in that in step (3) the repeat search interface provided by the search engine is used to omit entries similar to search results already displayed.

5. The concept semantic similarity measurement method based on a search engine according to claim 1, characterized in that in step (4) the results obtained in steps (2) and (3) are used to remove noise and redundancy, respectively, from the web page counts returned in step (1), and the pointwise mutual information method is applied to the processed results to measure semantic similarity.