
CN103678642A - Concept semantic similarity measurement method based on search engine - Google Patents

Concept semantic similarity measurement method based on search engine

Info

Publication number
CN103678642A
Authority
CN
China
Prior art keywords
concept
search
search engine
semantic
semantic similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310713182.2A
Other languages
Chinese (zh)
Inventor
徐峥
齐力
梅林
胡传平
支凤麟
梁辰
骆祥峰
魏晓
张顺香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Research Institute of the Ministry of Public Security
Original Assignee
Third Research Institute of the Ministry of Public Security
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Research Institute of the Ministry of Public Security filed Critical Third Research Institute of the Ministry of Public Security
Priority to CN201310713182.2A priority Critical patent/CN103678642A/en
Publication of CN103678642A publication Critical patent/CN103678642A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a concept semantic similarity measurement method based on a search engine. The new method integrates page counts, semantic snippets, and the number of displayed search results. It effectively removes noise and redundancy in search engine data and thereby addresses the shortcomings of the prior art.

Description

A search-engine-based method for measuring concept semantic similarity
Technical field
The present invention relates to the field of data mining, and specifically to a method for measuring concept semantic similarity.
Background art
In web mining, information retrieval, and natural language processing, accurately measuring the semantic similarity between concepts is an important problem. Web mining applications such as community extraction, relation detection, and concept disambiguation require accurate measurement of the semantic similarity between concepts or entities. In information retrieval, a central problem is to retrieve a set of documents that are semantically related to the user's query. For natural language processing tasks such as word sense disambiguation, textual entailment, and automatic text summarization, efficiently estimating the semantic similarity between words is vital.
Previous research includes many web-based semantic similarity measures, which fall into three groups:
(1) Measures based on the number of web pages a search engine returns: the larger the returned count, the greater the similarity between the concepts.
(2) Measures that download a number of top-ranked documents and then apply text-processing techniques to them. These measures rest on the assumptions that similar contexts imply similar meanings and that words appearing in similar lexical environments are semantically close.
(3) Measures that combine (1) and (2).
In summary, these methods measure the semantic similarity of concepts, but few of them, whether they assess association subjectively or objectively, remove the noise and redundancy in web page snippets.
Many concept semantic similarity measures have been proposed. They fall into two groups: taxonomy-based methods and web-based methods. Taxonomy-based methods compute semantic similarity using information theory and a hierarchical taxonomy. Web-based methods, in contrast, treat the web as a dynamic, continuously updated corpus and compute semantic similarity from that corpus.
Information content can be used to evaluate concept semantic similarity. The information content of a concept C is the negative log-likelihood of C, that is, the negative logarithm of the probability that C occurs, and thesaurus-based similarity software has been developed from this idea to measure the semantic similarity of a pair of concepts. The taxonomic distance between two words, however, is a more natural and direct measure of semantic similarity: the shorter the distance from one word to the other, the more similar they are. Taking link type, depth, and density into account, concept semantic similarity can also be measured with formulas for edge density, edge depth, and edge strength. Models that combine information content with the distance between two words can measure concept semantic similarity, as can vector space models and random walks. Earlier work has also explored definitions of semantic similarity over large information resources, which consist of structured semantic information from a lexical taxonomy together with information content from a corpus. To assess the effectiveness of such resources, many techniques using a variety of information resources have been implemented. New words are created constantly, and new senses are assigned to existing words. Manually maintaining a thesaurus so that it captures new words and new senses is costly, if it is possible at all, which makes taxonomy-based methods inflexible in web-related tasks.
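As a worked form of the definition above, the information content of a concept can be written as follows. The symbol P(c), the probability of encountering concept c in a corpus, is an assumed notation, and the logarithm base is not stated in the source.

```latex
\[
  \mathrm{IC}(c) = -\log P(c)
\]
```

A concept that occurs rarely has a small P(c) and therefore a high information content, while a concept that occurs almost everywhere has an information content close to zero.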
Unlike taxonomy-based methods, the pointwise mutual information (PMI) method identifies synonyms using the hit counts returned by a web search engine. Co-occurrence double checking treats the web as a continuously updated corpus; the core of that method is the search engine's ranking algorithm. A similarity kernel function can define the semantic similarity of concepts retrieved through Google; its purpose is to suggest related queries to search engine users in a large-scale system. A corpus-based method called second-order co-occurrence PMI computes the semantic similarity of two target words by using mutual information to sort lists of important neighbor words of the two targets. The page counts and snippets provided by a web search engine can also be used to measure semantic similarity. This method relies on lexico-syntactic patterns extracted automatically from snippets: 200 patterns are extracted from the 900 top-ranked snippets, and these 200 patterns are selected from 4,562,471 unique patterns. Because the top-ranked patterns change over time, regenerating the large set of unique patterns makes this method very time-consuming, so pattern extraction greatly limits it.
In summary, existing web-based semantic similarity measures lack a mechanism for handling the noise and redundancy in web data.
Summary of the invention
Existing semantic similarity measures cannot handle the noise and redundancy in web data. To address this problem, the object of the present invention is to provide a search-engine-based method for measuring concept semantic similarity that effectively removes the noise and redundancy present in search engine data.
To achieve this object, the present invention adopts the following technical solution:
A search-engine-based method for measuring concept semantic similarity, comprising the following steps:
(1) Page count: search for the relevant concepts with a search engine and return the corresponding number of web pages;
(2) Semantic snippets: search with the search engine to obtain semantic snippets that contain all the concepts, and compute the proportion of such snippets among all the semantic snippets the search returns;
(3) Number of displayed search results: search with the search engine, display the results found, and provide the number of results displayed;
(4) Compute the concept semantic similarity from the results of steps (1) to (3).
In a preferred embodiment of the invention, step (1) uses the search engine to search for concept p and concept q, whose similarity is to be measured, and also for the concept p ∧ q, which denotes the co-occurrence of p and q.
Further, step (2) searches for the concept p ∧ q with the search engine, queries the returned web pages, and computes the proportion of the n top-ranked snippets in which p and q co-occur, denoted SS(p ∧ q).
Further, step (3) uses the "repeat search" interface provided by the search engine, which omits entries similar to search results already displayed.
Further, step (4) uses the results of steps (2) and (3) to remove noise and redundancy, respectively, from the page counts returned in step (1), and applies the pointwise mutual information method to the processed results to measure semantic similarity.
The measurement method of the above scheme is a new method formed by integrating page counts, semantic snippets, and the number of displayed search results. It removes noise from web snippets by requiring that a sentence of the semantic snippet contain both concept p and concept q, that is, that p and q co-occur within one sentence. It removes redundancy from web snippets by using the "repeat search" interface provided by the search engine to omit entries similar to search results already displayed. The scheme thereby effectively removes the noise and redundancy present in search engine data and greatly improves the efficiency and precision of concept semantic similarity measurement.
Brief description of the drawings
The invention is further described below with reference to the drawing and specific embodiments.
Fig. 1 is a block diagram of the principle by which the invention is implemented.
Detailed description of the embodiments
To make the technical means, inventive features, objects, and effects of the invention easy to understand, the invention is further described below with reference to the drawing.
The object of the invention is to provide a method that computes concept semantic similarity while removing the noise and redundancy in web page snippets. To this end, the method provided by the invention is as follows.
Semantic similarity is the degree of match between information concepts expressed in a computer-processable form. The invention provides a search-engine-based method for measuring concept semantic similarity. The method mainly comprises three steps: A, page counting; B, semantic snippet processing; C, counting the search results displayed.
For the page counting of step A, the corresponding concepts are searched with a web search engine, and the page counts the engine returns are recorded. Specifically, the web search engine is used to search for concept p, concept q, and the co-occurrence of p and q, written as concept p ∧ q. Searching for concept p yields the total number of search results N(p); searching for concept q yields N(q); searching for concept p ∧ q yields N(p ∧ q).
Applying the pointwise mutual information method to these page counts gives PMI(p, q), which measures the similarity between concept p and concept q. PMI(p, q) is computed as follows: take the ratio of the number of pages on which p and q co-occur to the product of the page count of p and the page count of q, multiply it by the number of pages N of the search engine (N = 10^11), take the logarithm of the result, and divide by the logarithm of N. That is, PMI(p, q) = log(N × N(p ∧ q) / (N(p) × N(q))) / log N.
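The page-count PMI described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function name `pmi`, the constant `WEB_SIZE`, and the choice to return 0.0 when a count is zero are assumptions, and N is read as 10^11 on the assumption that the superscript was lost in extraction.

```python
import math

# Assumed reading of the garbled "N=1011" in the text: N = 10^11 pages.
WEB_SIZE = 10**11


def pmi(n_p: int, n_q: int, n_pq: int, web_size: int = WEB_SIZE) -> float:
    """Normalized PMI from page counts.

    Computes log(N * N(p^q) / (N(p) * N(q))) / log(N), where n_p, n_q and
    n_pq are the page counts for p, q and the co-occurrence query p ^ q.
    """
    if min(n_p, n_q, n_pq) <= 0:
        # Assumption: a zero count makes the logarithm undefined, so the
        # pair is treated as having no measurable similarity.
        return 0.0
    return math.log(web_size * n_pq / (n_p * n_q)) / math.log(web_size)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(round(pmi(1000, 1000, 1000), 4))
```

Because the result is a ratio of two logarithms, the logarithm base does not affect the value.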
Using the raw page counts directly in this way, however, cannot remove the noise and redundancy present in search engine data.
The scheme of the invention therefore uses B, semantic snippet processing, and C, counting the search results displayed, to remove the noise and the redundancy, respectively.
First, semantic snippet processing. A semantic snippet is a short passage, returned by a web search engine search, whose content is semantically similar to the search query.
In this scheme, concept p and concept q must both appear in one sentence; that is, p and q co-occur in a single sentence, whether declarative, exclamatory, or of another type. If p and q do not appear in the same sentence, the information the search engine returns may concern only p or only q, or it may contain both p and q without relating them to each other. Requiring a sentence of the semantic snippet to contain both p and q therefore gives an accurate value for the number of pages on which p and q co-occur in the PMI(p, q) formula.
Specifically, the scheme queries the pages returned for concept p ∧ q, computes the proportion of the n top-ranked snippets in which p and q co-occur, denoted SS(p ∧ q), and replaces N(p ∧ q) in the PMI(p, q) formula with SS(p ∧ q) × N(p ∧ q). This removes the noise in web snippets.
Because the scheme is based on a search engine, the n snippets here are the search results that the engine presents in snippet form after the user enters keywords. The user reads a snippet to judge whether it is the content wanted and, if so, clicks the snippet to open the corresponding web page.
The snippets returned for the keywords do not necessarily all contain both p and q, so the ratio computed here is: (number of snippets containing both concept p and concept q) / (total number of snippets the search engine returns).
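The ratio SS(p ∧ q) over the top-ranked snippets (the semantic segments or fragments of the text) could be computed along the following lines. This is a hedged sketch: the function names, the punctuation-based sentence splitter, the case-insensitive substring match, and the default n = 100 are all assumptions, since the source does not say how sentences are delimited or how a concept is matched.

```python
import re
from typing import Sequence

# Assumed sentence delimiters (Western and CJK end-of-sentence punctuation).
_SENTENCE_END = re.compile(r"[.!?。！？]+")


def cooccur_in_sentence(snippet: str, p: str, q: str) -> bool:
    """True if at least one sentence of the snippet mentions both p and q."""
    p, q = p.lower(), q.lower()
    sentences = _SENTENCE_END.split(snippet.lower())
    # Naive substring matching; a real system would likely tokenize.
    return any(p in s and q in s for s in sentences)


def snippet_ratio(snippets: Sequence[str], p: str, q: str, n: int = 100) -> float:
    """SS(p ^ q): share of the n top-ranked snippets where p and q co-occur."""
    top = list(snippets[:n])
    if not top:
        return 0.0
    hits = sum(cooccur_in_sentence(s, p, q) for s in top)
    return hits / len(top)


if __name__ == "__main__":
    # Hypothetical snippets, for illustration only.
    demo = [
        "A car is an automobile.",
        "Cars are fast. An automobile has wheels.",
    ]
    print(snippet_ratio(demo, "car", "automobile"))
```

In the second demo snippet both words appear, but in different sentences, so it does not count as a co-occurrence; this is the noise the sentence-level requirement is meant to filter out.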
Next, counting the displayed search results. A search is run on the web search engine, the results found are displayed, and the number of displayed results is recorded; this number omits entries similar to search results already displayed. The count is obtained through the "repeat search" interface that a web search engine such as Google provides, which omits entries similar to search results already displayed. Without that interface, the number of pages the search engine returns reaches 1000, and the results returned do not necessarily correspond to the search content. Using the displayed search results therefore improves the page counts used in the PMI(p, q) formula.
Specifically, the scheme obtains the number of displayed search results for concept p, concept q, and concept p ∧ q, denoted R(p), R(q), and R(p ∧ q), and replaces N(p), N(q), and N(p ∧ q) in the PMI(p, q) formula with R(p) × N(p), R(q) × N(q), and R(p ∧ q) × N(p ∧ q), respectively. This removes the redundancy in web snippets.
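The substitution just described can be illustrated with a small data structure that keeps the two figures for each query together. The class `QueryStats`, its field names, and the function `redundancy_adjusted` are hypothetical names introduced for this sketch; the source specifies only the product R(·) × N(·).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryStats:
    """Figures for one query (hypothetical container, not from the source)."""

    total: int      # N(.): total number of search results reported
    displayed: int  # R(.): results displayed after similar entries are omitted


def redundancy_adjusted(stats: QueryStats) -> int:
    """Return R(.) * N(.), the value substituted for N(.) in the PMI formula."""
    return stats.displayed * stats.total


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    print(redundancy_adjusted(QueryStats(total=2000, displayed=50)))
```

The same function would be applied to the statistics for p, q, and p ∧ q before they are passed to the PMI computation.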
The scheme is further illustrated below with a concrete measurement example.
The example is implemented on a concept semantic similarity measurement system that mainly comprises a page counting module, a semantic snippet processing module, a displayed-results counting module, and a similarity calculation module; each module implements the corresponding function described above.
Fig. 1 shows the process by which this system measures the semantic similarity of concept p and concept q. The process is as follows:
Step 1: The page counting module works with a web search engine (Google in this example, and likewise below) to search for concept p, concept q, and the co-occurrence of p and q, written as concept p ∧ q.
Step 2: The page counting module searches for concept p with the web search engine to obtain the total number of search results N(p), searches for concept q to obtain N(q), and searches for concept p ∧ q to obtain N(p ∧ q).
Step 3: Set a threshold α.
The threshold α is used for the comparison in step 5; its value is set according to the specific requirements.
Step 4: The semantic snippet processing module determines the proportion of semantic snippets in which concept p and concept q co-occur within a sentence. From the search results for concept p ∧ q obtained in step 1, the module queries the pages returned and computes, over the n top-ranked snippets returned, the proportion in which p and q co-occur, denoted SS(p ∧ q).
Step 5: The semantic snippet processing module compares the computed ratio SS(p ∧ q) with the threshold α set earlier. If SS(p ∧ q) > α, proceed to step 6; otherwise the semantic similarity of concept p and concept q is taken to be SPPMI(p, q) = 0.
Step 6: The displayed-results counting module works with the search engine to count the displayed search results for concept p, concept q, and concept p ∧ q, denoted R(p), R(q), and R(p ∧ q), respectively.
Step 7: The similarity calculation module receives the data produced by the page counting module, the semantic snippet processing module, and the displayed-results counting module, and from these data computes N(p) × R(p), N(q) × R(q), and SS(p ∧ q) × N(p ∧ q) × R(p ∧ q).
Step 8: The similarity calculation module uses these results in the following formula to compute the semantic similarity SPPMI(p, q) of concept p and concept q:
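\[
\mathrm{SPPMI}(p, q) = \frac{\log \dfrac{N \cdot SS(p \wedge q)\, N(p \wedge q)\, R(p \wedge q)}{N(p)\, R(p) \cdot N(q)\, R(q)}}{\log N}
\]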
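Steps 3 to 8 above can be put together in one function. This is a sketch under stated assumptions, not the patented implementation: the formula is inferred from the PMI definition and the substitutions in steps 4 to 7, N is read as 10^11, the default threshold `alpha=0.1` is an arbitrary placeholder (the source leaves α to the user), and the zero-count guard is an added convention.

```python
import math


def sppmi(
    n_p: int, n_q: int, n_pq: int,   # N(p), N(q), N(p ^ q): total result counts
    r_p: int, r_q: int, r_pq: int,   # R(p), R(q), R(p ^ q): displayed result counts
    ss_pq: float,                    # SS(p ^ q): co-occurrence ratio in top-n snippets
    alpha: float = 0.1,              # threshold from step 3 (placeholder value)
    web_size: int = 10**11,          # N, assumed reading of "N=1011"
) -> float:
    """Noise- and redundancy-adjusted PMI, following steps 5 to 8."""
    # Step 5: below or at the threshold, the similarity is defined as 0.
    if ss_pq <= alpha:
        return 0.0
    # Step 7: adjusted counts.
    a_p = n_p * r_p
    a_q = n_q * r_q
    a_pq = ss_pq * n_pq * r_pq
    if min(a_p, a_q, a_pq) <= 0:
        # Assumption: treat undefined logarithms as zero similarity.
        return 0.0
    # Step 8: PMI on the adjusted counts, normalized by log N.
    return math.log(web_size * a_pq / (a_p * a_q)) / math.log(web_size)


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    print(sppmi(1000, 1000, 1000, 100, 100, 100, ss_pq=0.5))
```

Passing the counts in explicitly keeps the sketch independent of any particular search engine API, which the source does not specify.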
The foregoing shows and describes the basic principle, main features, and advantages of the present invention. Those skilled in the art should understand that the invention is not limited to the embodiments above; the embodiments and the description merely illustrate the principle of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims (5)

1. A search-engine-based method for measuring concept semantic similarity, characterized in that the method comprises the following steps: (1) page count: search for the relevant concepts with a search engine and return the corresponding number of web pages; (2) semantic snippets: search with the search engine to obtain semantic snippets that contain all the concepts, and compute the proportion of such snippets among all the semantic snippets the search returns; (3) number of displayed search results: search with the search engine, display the results found, and provide the number of results displayed; (4) compute the concept semantic similarity from the results of steps (1) to (3).
2. The search-engine-based method for measuring concept semantic similarity according to claim 1, characterized in that step (1) uses the search engine to search for concept p and concept q, whose similarity is to be measured, and also for the concept p∧q, which denotes the co-occurrence of concept p and concept q.
3. The search-engine-based method for measuring concept semantic similarity according to claim 1, characterized in that step (2) searches for the concept p∧q with the search engine, queries the returned web pages, and computes the proportion of the n top-ranked snippets in which p and q co-occur, denoted SS(p∧q).
4. The search-engine-based method for measuring concept semantic similarity according to claim 1, characterized in that step (3) uses the repeat search interface provided by the search engine to omit entries similar to search results already displayed.
5. The search-engine-based method for measuring concept semantic similarity according to claim 1, characterized in that step (4) uses the results of steps (2) and (3) to remove noise and redundancy, respectively, from the page counts returned in step (1), and applies the pointwise mutual information method to the processed results to measure semantic similarity.
CN201310713182.2A 2013-12-20 2013-12-20 Concept semantic similarity measurement method based on search engine Pending CN103678642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310713182.2A CN103678642A (en) 2013-12-20 2013-12-20 Concept semantic similarity measurement method based on search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310713182.2A CN103678642A (en) 2013-12-20 2013-12-20 Concept semantic similarity measurement method based on search engine

Publications (1)

Publication Number Publication Date
CN103678642A true CN103678642A (en) 2014-03-26

Family

ID=50316186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310713182.2A Pending CN103678642A (en) 2013-12-20 2013-12-20 Concept semantic similarity measurement method based on search engine

Country Status (1)

Country Link
CN (1) CN103678642A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335505A (en) * 2015-10-29 2016-02-17 成都博睿德科技有限公司 Information searching method based on natural language
CN105335504A (en) * 2015-10-29 2016-02-17 成都博睿德科技有限公司 Information retrieval method based on natural language
CN107408156A (en) * 2015-03-09 2017-11-28 皇家飞利浦有限公司 For carrying out semantic search and the system and method for extracting related notion from clinical document
CN108917677A (en) * 2018-07-19 2018-11-30 福建天晴数码有限公司 Cube room inside dimension measurement method, storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107408156A (en) * 2015-03-09 2017-11-28 皇家飞利浦有限公司 For carrying out semantic search and the system and method for extracting related notion from clinical document
CN105335505A (en) * 2015-10-29 2016-02-17 成都博睿德科技有限公司 Information searching method based on natural language
CN105335504A (en) * 2015-10-29 2016-02-17 成都博睿德科技有限公司 Information retrieval method based on natural language
CN108917677A (en) * 2018-07-19 2018-11-30 福建天晴数码有限公司 Cube room inside dimension measurement method, storage medium
CN108917677B (en) * 2018-07-19 2020-03-17 福建天晴数码有限公司 Cubic room internal dimension measuring method and storage medium

Similar Documents

Publication Publication Date Title
Phan et al. Pair-linking for collective entity disambiguation: Two could be better than all
CN108132929A (en) A kind of similarity calculation method of magnanimity non-structured text
WO2020215667A1 (en) Text content quick duplicate removal method and apparatus, computer device, and storage medium
CN109359172B (en) An Entity Alignment Optimization Method Based on Graph Partitioning
WO2021218322A1 (en) Paragraph search method and apparatus, and electronic device and storage medium
CN107967256B (en) Word weight prediction model generation method, position recommendation method and computing device
CN106294762B (en) Entity identification method based on learning
WO2017166912A1 (en) Method and device for extracting core words from commodity short text
CN111488740A (en) Causal relationship judging method and device, electronic equipment and storage medium
CN105389349A (en) Dictionary update method and device
CN112148843B (en) Text processing method, device, terminal device and storage medium
JP2017504105A5 (en)
Jain et al. Query2vec: An evaluation of NLP techniques for generalized workload analytics
CN105183770A (en) Chinese integrated entity linking method based on graph model
CN103049470A (en) Opinion retrieval method based on emotional relevancy
CN106294863A (en) A kind of abstract method for mass text fast understanding
CN111737966B (en) Document repetition detection method, device, equipment and readable storage medium
CN104750813A (en) Data cleaning method based on data reduction model
US10198497B2 (en) Search term clustering
CN105550169A (en) Method and device for identifying point of interest names based on character length
CN103678642A (en) Concept semantic similarity measurement method based on search engine
CN105608075A (en) Related knowledge point acquisition method and system
TW201316191A (en) Method and apparatus of searching information
Chen et al. Finding keywords in blogs: Efficient keyword extraction in blog mining via user behaviors
WO2018205391A1 (en) Method, system and apparatus for evaluating accuracy of information retrieval, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20140326