WO2008131597A1 - Search engine and method for filtering intermediary (agency) information - Google Patents

Search engine and method for filtering intermediary (agency) information

Info

Publication number
WO2008131597A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
mediation
intermediary
search engine
webpage
Prior art date
Application number
PCT/CN2007/001474
Other languages
English (en)
Chinese (zh)
Inventor
Haitao Lin
Original Assignee
Haitao Lin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haitao Lin
Priority to PCT/CN2007/001474 (WO2008131597A1)
Priority to CN200780052784A (CN101849232A)
Publication of WO2008131597A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Definitions

  • The present invention relates to computer search engine technology, and more particularly to a search engine and a method by which it filters intermediary information.
  • The Internet provides instant and rich information (and a platform for people to communicate and participate in entertainment), which deeply influences the lives of modern people. But with the rapid increase in the number and content of websites, the Internet has become like a huge encyclopedia with no table of contents, making it hard for people to find the information they want.
  • Search engines have added a table of contents and an index to this encyclopedia. A user just types a keyword in the search box to get the relevant information or URL.
  • Search engines provide an entry point for all Internet users. It is no exaggeration to say that, starting from a search, almost any user can reach any place on the Internet they want. Search has therefore become the most used online service besides email.
  • Figure 1 shows the system architecture diagram of a typical search engine in the prior art.
  • The parts of the search engine are interdependent.
  • The processing flow is as follows:
  • The web spider crawls webpages from the Internet.
  • The crawling process is as follows: (1) One or more URLs (Uniform Resource Locators, also known as webpage addresses) of starting webpages are manually added to the URL database; these URLs are also called seeds. (2) The web spider obtains a URL from the URL database, fetches the webpage content corresponding to that URL, and puts the content into the webpage database. (3) URLs in the fetched webpage that satisfy the requirements are extracted and put into the URL database.
  • Whether a URL satisfies the requirements is judged by pattern matching. (4) Steps (2) and (3) are repeated until no new records are added to the webpage database.
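  • The four crawling steps can be sketched roughly as follows. This is a minimal illustration and not the patent's implementation: `FAKE_WEB`, `HREF_RE`, and `crawl` are hypothetical names, the "Internet" is an in-memory dictionary instead of real HTTP fetches, and the loop ends when the URL queue is empty, which is the point at which no new webpage records can be added.

```python
import re
from collections import deque

# Hypothetical in-memory "Internet": URL -> page HTML (stands in for HTTP fetches).
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/rent/1">1</a> '
                           '<a href="http://other.org/x">x</a>',
    "http://example.com/rent/1": '<a href="http://example.com/rent/2">2</a> '
                                 '<a href="http://example.com/">home</a>',
    "http://example.com/rent/2": "no links here",
}

HREF_RE = re.compile(r'href="([^"]+)"')

def crawl(seeds, url_pattern, fetch=FAKE_WEB.get):
    """Crawl from the seed URLs, following only URLs that match url_pattern."""
    url_db = deque(seeds)   # URL database: URLs waiting to be fetched (step 1)
    seen = set(seeds)       # every URL ever queued, so nothing is fetched twice
    webpage_db = {}         # webpage database: URL -> page content
    wanted = re.compile(url_pattern)
    while url_db:                            # step 4: repeat until nothing is left
        url = url_db.popleft()               # step 2: take a URL ...
        page = fetch(url)
        if page is None:                     # unknown URL in the fake web
            continue
        webpage_db[url] = page               # ... and store its content
        for link in HREF_RE.findall(page):   # step 3: keep URLs matching the pattern
            if wanted.match(link) and link not in seen:
                seen.add(link)
                url_db.append(link)
    return webpage_db

pages = crawl(["http://example.com/"], r"^http://example\.com/")
```

  • With the seed above, the link to `other.org` fails the pattern match and is never queued, so only the three `example.com` pages end up in the webpage database.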
  • The system obtains the original page from the webpage database and extracts the textual information from it, that is, removes all the HTML markup. The extracted text is then sent to the text indexing module to establish an index.
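  • As a rough illustration of removing the HTML markup, the sketch below uses Python's standard `html.parser`. The class and function names are hypothetical, and skipping the contents of `<script>` and `<style>` is an added assumption; the text only says that markup is removed.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of a page and drops all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside <script>/<style> (assumption: skip them)

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

text = extract_text(
    "<html><head><title>Flat for rent</title><style>p{}</style></head>"
    "<body><p>Two rooms, <b>Tel</b> 13800000000</p></body></html>"
)
```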
  • The indexing process first calculates the relevance (or importance) of each keyword in the page content and the hyperlinks, and then uses this information to build the webpage index, forming the index database.
  • While the text index is being built, the link information of the website must be consulted, mainly to guard against illegitimate websites, for example sites that link to themselves in multiple loops.
  • The link information is extracted from the webpage database, and the link information (including the anchor text and the link itself) is sent to the link database to provide a basis for webpage rating.
  • The user submits a query request to the query server, and the server searches the index database for relevant webpages. The webpage rating module combines the query request and the link information to evaluate the relevance of the search results, and the query server sorts the results by relevance.
  • A content summary around the keyword is extracted, and finally the page generation system assembles the link addresses of the search results and the page content summaries and returns them to the user.
  • The web spider (Spider) and the link information extraction (Parser) module are the most important parts. Among them:
  • The web spider uses multi-threaded concurrent search technology and mainly implements the document access agent, the path selection engine, and the access control engine.
  • The web spider mainly consists of a URL server, a crawler, a memory, and a URL parser, together with three major data resources (the resource library, i.e. the webpage database, the anchor library, and the URL database), and it is assisted by part of the indexer.
  • The specific process is as follows: the URL server obtains the URLs to be crawled from the URL database; the crawler fetches each webpage according to its URL and sends it to the memory, which compresses the webpage and stores it in the webpage database; the indexer then analyzes all the links in each webpage and stores the relevant important information in the anchors file.
  • The URL parser reads the anchors file and resolves each URL, converting it into a docID.
  • The anchor text is then indexed and sent to the index database.
  • The specific process is shown in Figure 2.
  • The analyzer in Figure 2 can be regarded as part of the indexer, or as an auxiliary part of it. Since the processing flow of the web spider is well known, it is not described in detail here.
  • The link information extraction module reads the webpage database, decompresses each document, and then analyzes it. Each document is converted into a set of word occurrences, called samples. Each sample records the word, its position in the document, the font size, and the case information. Search engines have two types of samples: (1) Title: the title of the HTML page or the URL, together with the meta information in the HTML file. These are indexed by analyzing the individual words, and users can search this information through the index.
  • A general search engine extracts and indexes only the title and content of a webpage, and does not further extract information from within the content.
  • An object of the embodiments of the present invention is to provide a search engine and a method by which it filters intermediary information, so that some or all of the intermediary information is filtered out of the search results.
  • The present invention provides a search engine, including: a web spider, a link information extraction module, and a query server;
  • the link information extraction module is configured to extract the webpage title, the webpage content, and intermediary feature information from a webpage database, and to determine, using a set intermediary information judgment condition, whether the information corresponding to the intermediary feature information is intermediary information;
  • the search engine filters the index entries corresponding to the intermediary information out of its index database.
  • The invention also provides a search engine, comprising: a web spider, a link information extraction module, and a query server;
  • the link information extraction module is configured to extract the webpage title and the webpage content from a webpage database, analyze the webpage content, and determine that content containing intermediary tendency information is intermediary information;
  • the search engine filters the index entries corresponding to the intermediary information out of its index database.
  • The present invention also provides a method by which a search engine filters intermediary information, including: fetching webpages from the Internet and sending them to a webpage database;
  • analyzing the extracted intermediary feature information and, if the set intermediary information judgment condition is met, determining that the information corresponding to the intermediary feature information is intermediary information;
  • The present invention also provides a method by which a search engine filters intermediary information, including:
  • The search engine and filtering method of the embodiments of the present invention can filter some or all of the intermediary information out of the search results, effectively preventing intermediary information from interfering with the user, improving the usability of the search results, and providing greater convenience to the user.
  • FIG. 1 is a system architecture diagram of a typical search engine in the prior art
  • FIG. 2 is a schematic diagram of a processing flow of a web spider in the prior art
  • FIG. 3 is a schematic flowchart of filtering intermediary information according to an embodiment of the present invention.
  • Detailed description
  • Intermediary information generally has one or more of the following characteristics:
  • The same intermediary publishes many different postings. Taking rental housing as an example, an intermediary usually publishes rental postings for many different locations.
  • The published information contains company information, for example a company address and company contact information.
  • The published information contains implausible details, for example incorrect phone numbers (mobile numbers, landline numbers, PHS numbers, etc.) or very low prices.
  • The embodiment of the present invention modifies the link information extraction part (the link information extraction module) of a general vertical search engine.
  • The search engine in this embodiment mainly includes a web spider (Spider), a link information extraction module (Parser), and a query server.
  • The web spider (Spider) and the query server use common processing techniques, which are not described in detail here.
  • The link information extraction module is improved to target the characteristics of intermediary information: in addition to the webpage title and content, it further extracts, from within the content, intermediary feature information used to identify intermediary information (such as phone number, email, and price), and the extracted content can be further processed:
  • Analysis and processing of the webpage content can be used to find further information about a company or other intermediary.
  • The improved link information extraction module adds the following functions:
  • Phone numbers are extracted by pattern matching: each webpage is searched for strings such as "mobile phone", "telephone", "PHS", "Mobile Phone", and "Cell Phone". Once one is found, the first run of consecutive digits following it is extracted and taken as the user's phone number.
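  • A minimal sketch of this label-then-digits rule, assuming English labels and a regular expression. `PHONE_LABELS`, `PHONE_RE`, and `extract_phones` are hypothetical names, and the label list is only illustrative.

```python
import re

# Illustrative label list (assumption); the text names labels such as
# "mobile phone", "telephone", "PHS" and "Cell Phone".
PHONE_LABELS = ["mobile phone", "cell phone", "telephone", "PHS"]

# A label, then any non-digit characters, then the first run of consecutive digits.
PHONE_RE = re.compile(
    r"(?:" + "|".join(re.escape(label) for label in PHONE_LABELS) + r")\D*(\d+)",
    re.IGNORECASE,
)

def extract_phones(text):
    """Return the first digit run that follows each phone label in the text."""
    return PHONE_RE.findall(text)

phones = extract_phones("Contact Mr. Wang, mobile phone: 13812345678, telephone 62345678")
```

  • Taken literally, "first consecutive number" means a number written with separators (for example an area code followed by a hyphen) is captured only up to the first separator; a real extractor would probably need to normalize such forms first.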
  • Email addresses are also extracted by pattern matching: each webpage is searched for strings such as "email box" and "Email". Once one is found, the consecutive characters following it are extracted, stopping at the first space.
  • The extracted string is the user's email address.
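  • The email rule can be sketched in the same way. The names below are hypothetical; skipping a colon after the label and trimming trailing punctuation are added assumptions, since the text only says that extraction stops at a space.

```python
import re

# Longer label first, so "email box" is consumed whole rather than as "email" + "box".
EMAIL_LABELS = ["email box", "email"]

# A label, optional whitespace or colons (assumption), then everything up to
# the next whitespace character.
EMAIL_RE = re.compile(
    r"(?:" + "|".join(re.escape(label) for label in EMAIL_LABELS) + r")[\s::]*(\S+)",
    re.IGNORECASE,
)

def extract_emails(text):
    """Return the string that follows each email label, up to the first space."""
    # Trailing sentence punctuation is trimmed (assumption, not in the text).
    return [match.rstrip(".,;") for match in EMAIL_RE.findall(text)]

emails = extract_emails("Email: wang@example.com, call after 6pm")
```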
  • For example, in a number starting with 010, the digit that follows must be 5, 6, or 8. Otherwise, all information corresponding to this number is considered intermediary information.
  • The link information extraction module can further identify intermediary information by analyzing and processing the extracted content. For example, the main body of the webpage can be analyzed: if it contains words such as "company", "company address", "my company", or "large amount of listings", the information is considered intermediary information.
  • After the link information extraction module extracts the above information, either only the information determined to be non-intermediary is indexed, or all the information is indexed and everything determined to be intermediary information is then deleted from the index database. Indexing uses the generic "inverted index" technique (which is well known in the art and is not described in detail here).
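  • The two alternatives (index only non-intermediary postings, or index everything and delete afterwards) can be illustrated with a toy inverted index. This is a sketch under simple assumptions: whitespace tokenization, integer document ids, and hypothetical function names.

```python
from collections import defaultdict

def build_index(postings, is_intermediary):
    """Alternative 1: index only postings not judged to be intermediary."""
    index = defaultdict(set)  # word -> set of document ids
    for doc_id, text in postings.items():
        if is_intermediary(doc_id):
            continue  # intermediary postings never enter the index
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def remove_from_index(index, doc_ids):
    """Alternative 2: delete already-indexed documents from every word entry."""
    for word in list(index):       # copy the keys, since entries may be deleted
        index[word] -= doc_ids
        if not index[word]:
            del index[word]        # drop words that no longer point anywhere

postings = {
    1: "two rooms near Zhongguancun",
    2: "our company has many rooms",
    3: "one room owner direct",
}
flagged = {2}  # ids judged to be intermediary information

filtered_first = build_index(postings, lambda doc_id: doc_id in flagged)
indexed_first = build_index(postings, lambda doc_id: False)
remove_from_index(indexed_first, flagged)
```

  • Both routes leave the same index, so the choice between them is mainly about when the judgment result becomes available.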
  • Once the index entries corresponding to intermediary information have been filtered out of the index database, the user submits a query request to the query server, the server searches the index database for relevant webpages, and the intermediary information is largely absent from the returned search results.
  • FIG. 3 is a schematic diagram of a filtering process of a mediation information by a search engine according to an embodiment of the present invention. As shown in Figure 3, the following steps are included:
  • Step 100: Extract intermediary feature information (such as a phone number and an email address), specifically including the following: i. a mobile phone number;
  • Step 200: Count the occurrences of identical extracted information.
  • In this embodiment, this is implemented with a table in the search engine's back-end database: the first field is a phone number or email address, and the second field is the number of times it has occurred. Each time an item is extracted, the table is queried first. If a record already exists, its count is incremented by one; if not, a record is inserted with its count set to 1.
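  • The query-then-increment-or-insert logic can be sketched with an in-memory SQLite table. The table name, column names, and `record_contact` are hypothetical; the text does not name the database or its schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# First field: phone number or email; second field: number of occurrences.
conn.execute(
    "CREATE TABLE contact_counts (contact TEXT PRIMARY KEY, repeats INTEGER NOT NULL)"
)

def record_contact(conn, contact):
    """Query the table first, then increment an existing record or insert a new one."""
    row = conn.execute(
        "SELECT repeats FROM contact_counts WHERE contact = ?", (contact,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO contact_counts (contact, repeats) VALUES (?, 1)", (contact,)
        )
        return 1
    conn.execute(
        "UPDATE contact_counts SET repeats = repeats + 1 WHERE contact = ?", (contact,)
    )
    return row[0] + 1

for item in ["13812345678", "wang@example.com", "13812345678", "13812345678"]:
    record_contact(conn, item)

counts = dict(conn.execute("SELECT contact, repeats FROM contact_counts"))
```

  • The repeat count at which a posting would be treated as intermediary information is not stated in this part of the text, so no threshold is applied here.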
  • Step 500: Determine whether the mobile, landline, or PHS number is legal.
  • The judgment is based on the numbering rule tables of the various regions of China. For example, telephone numbers in Beijing have 8 digits. For a number that does not comply with the rules, all postings corresponding to that mobile, landline, or PHS number are deleted from the search engine's index database.
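  • A rule-table check of this kind might look like the sketch below. `NUMBER_RULES` is a hypothetical one-entry excerpt holding only the Beijing example from the text (area code 010, 8-digit local numbers), and accepting numbers with an unknown prefix is an assumption.

```python
# Hypothetical excerpt of a regional numbering rule table: area code -> rule.
NUMBER_RULES = {
    "010": {"length": 8},  # Beijing: local numbers have 8 digits (from the text)
}

def is_legal_landline(number, rules=NUMBER_RULES):
    """Check a landline number against the rule for its area code."""
    digits = "".join(ch for ch in number if ch.isdigit())  # drop hyphens, spaces
    for area_code, rule in rules.items():
        if digits.startswith(area_code):
            local = digits[len(area_code):]
            return len(local) == rule["length"]
    return True  # assumption: no rule known for this prefix, so do not reject

checks = {n: is_legal_landline(n) for n in ["010-62345678", "010-6234567"]}
```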
  • Step 600: Determine whether the extracted webpage content has an intermediary tendency. If the content contains "the company" or "large number of listings", or contains multiple different addresses (for example, listings in Dongzhimen, Xizhimen, and Zhongguancun at once), then this information is not indexed, or it is deleted from the search engine's index.
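  • Step 600 can be sketched as a phrase check plus a count of distinct place names. The phrase list and the three-entry gazetteer are illustrative assumptions taken from the examples in the text, and `has_intermediary_tendency` is a hypothetical name.

```python
# Phrases and place names taken from the examples in the text (illustrative only).
AGENCY_PHRASES = ["the company", "large number of listings"]
KNOWN_PLACES = ["Dongzhimen", "Xizhimen", "Zhongguancun"]

def has_intermediary_tendency(content, max_places=1):
    """Flag content that uses agency phrases or names several different places."""
    lowered = content.lower()
    if any(phrase in lowered for phrase in AGENCY_PHRASES):
        return True
    places = {place for place in KNOWN_PLACES if place.lower() in lowered}
    # "Multiple different addresses" is read here as more than max_places (assumption).
    return len(places) > max_places

flags = [
    has_intermediary_tendency("The company has flats in many districts"),
    has_intermediary_tendency("Flats available in Dongzhimen, Xizhimen and Zhongguancun"),
    has_intermediary_tendency("One room near Zhongguancun, owner direct"),
]
```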
  • Information determined to be intermediary information may also be left unprocessed at the time of each judgment; after all the conditions have been checked, all the information judged to be intermediary is deleted from the search engine's index database. Alternatively, after all the conditions have been checked, the non-intermediary information is added to the index database for users to query, and the information determined to be intermediary is not indexed.
  • The present invention is not limited to these approaches; any approach that filters the identified intermediary information out of the index database falls within the scope of the present invention.
  • The steps of the present embodiment shown in FIG. 3 are not limited to a particular order, and the intermediary feature information is not limited to the phone number or email given in the embodiment; it may be other information such as price.
  • The index database records of the search engine can then be provided to the query server for users to query.
  • The proportion of intermediary information in the search results can be reduced from 90% before processing to 10% or less.
  • The search engine and filtering method of the embodiments of the present invention can filter some or all of the intermediary information out of the search results, effectively preventing intermediary information from interfering with the user, improving the usability of the search results, and providing greater convenience to the user.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a search engine and a method for filtering intermediary (agency) information. The method consists of: fetching webpages from the Internet; sending said pages to a webpage database; extracting link information; extracting the titles and contents of the webpages from the database; extracting further intermediary feature information; analyzing the extracted intermediary feature information and, if the set intermediary-information determination condition is satisfied, determining that the information corresponding to the intermediary feature information is intermediary information; and filtering the intermediary information out of the search results.
PCT/CN2007/001474 2007-04-29 2007-04-29 Search engine and method for filtering intermediary (agency) information WO2008131597A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2007/001474 WO2008131597A1 (fr) 2007-04-29 2007-04-29 Search engine and method for filtering intermediary (agency) information
CN200780052784A CN101849232A (zh) 2007-04-29 2007-04-29 Search engine and its method for filtering intermediary information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2007/001474 WO2008131597A1 (fr) 2007-04-29 2007-04-29 Search engine and method for filtering intermediary (agency) information

Publications (1)

Publication Number Publication Date
WO2008131597A1 (fr) 2008-11-06

Family

ID=39925170

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2007/001474 WO2008131597A1 (fr) 2007-04-29 2007-04-29 Search engine and method for filtering intermediary (agency) information

Country Status (2)

Country Link
CN (1) CN101849232A (fr)
WO (1) WO2008131597A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062328A (zh) * 2016-11-08 2018-05-22 北京国双科技有限公司 获取网站自然搜索排名的方法和装置

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1536483A (zh) * 2003-04-04 2004-10-13 陈文中 网络信息抽取及处理的方法及系统
US20060136411A1 (en) * 2004-12-21 2006-06-22 Microsoft Corporation Ranking search results using feature extraction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7962510B2 (en) * 2005-02-11 2011-06-14 Microsoft Corporation Using content analysis to detect spam web pages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG MAOYUAN AND ZOU CHUNYAN: "Research for Web Page Filter with Natural Language Processing", COMPUTER & DIGITAL ENGINEERING, vol. 31, no. 3, March 2003 (2003-03-01), pages 11, 24 - 28 *

Also Published As

Publication number Publication date
CN101849232A (zh) 2010-09-29

Similar Documents

Publication Publication Date Title
US8341150B1 (en) Filtering search results using annotations
Li et al. Tag-based social interest discovery
CN100498790C (zh) 一种搜索方法和系统
US8224809B2 (en) System and method for matching entities
CN100440224C (zh) 一种搜索引擎性能评价的自动化处理方法
US9367637B2 (en) System and method for searching a bookmark and tag database for relevant bookmarks
Bharat et al. A comparison of techniques to find mirrored hosts on the WWW
Jansen et al. Determining the user intent of web search engine queries
JP4857075B2 (ja) ウェブドキュメントの集合において効率的に日付を検索する方法、コンピュータプログラム
US9104772B2 (en) System and method for providing tag-based relevance recommendations of bookmarks in a bookmark and tag database
US20070250501A1 (en) Search result delivery engine
US20110196861A1 (en) Propagating Information Among Web Pages
US20100115003A1 (en) Methods For Merging Text Snippets For Context Classification
CN101169780A (zh) 一种基于语义本体的检索系统和方法
CN101630327A (zh) 一种主题网络爬虫系统的设计方法
WO2009000174A1 (fr) Procédé et dispositif de classement de pages web
Chau et al. Web searching in Chinese: A study of a search engine in Hong Kong
Cetintas et al. Effective query generation and postprocessing strategies for prior art patent search
CN101133415A (zh) 使用页面集而提供信息搜索服务的服务器、方法和系统
JP5364012B2 (ja) データ抽出装置、データ抽出方法、および、データ抽出プログラム
WO2017000659A1 (fr) Procédé et appareil d'identification de localisateur uniforme de ressources (url) enrichi
US8037073B1 (en) Detection of bounce pad sites
CN103617225A (zh) 一种关联网页搜索方法和系统
WO2015074455A1 (fr) Procédé et appareil pour calculer un modèle d'adresse url d'une page internet associée
WO2008131597A1 (fr) Search engine and method for filtering intermediary (agency) information

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200780052784.0

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07721047

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07721047

Country of ref document: EP

Kind code of ref document: A1