WO2008131597A1 - Search engine and method for filtering intermediary information - Google Patents
Search engine and method for filtering intermediary information
- Publication number
- WO2008131597A1 PCT/CN2007/001474
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- information
- mediation
- intermediary
- search engine
- webpage
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9536—Search customisation based on social or collaborative filtering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
Definitions
- The present invention relates to computer search engine technology, and more particularly to a search engine and a method by which it filters mediation information.
- The Internet provides instant and rich information (and a platform for people to communicate and participate in entertainment), which deeply influences the lives of modern people. But with the rapid increase in the number and content of websites, the Internet has become like a huge encyclopedia with no catalog, making it hard for people to find the information they want.
- Search engines have added catalogs and indexes to this encyclopedia. Just type a keyword in the search box and you will get the relevant information or URLs.
- Search engines provide an entry point for all web surfers. It is no exaggeration to say that, starting from a search, almost any user can reach any place on the Internet they want. Search has therefore become the most used online service apart from email.
- Figure 1 shows the system architecture diagram of a typical search engine in the prior art.
- The parts of the search engine are interwoven and interdependent.
- the processing flow is as follows:
- the web spider crawls the webpage from the Internet.
- The crawling process is as follows: (1) Manually add the URLs (Uniform Resource Locators, also known as webpage addresses) of one or more starting webpages to the URL database. These URLs are also called seeds; (2) The web spider obtains a URL from the URL database, grabs the webpage content corresponding to the URL, and puts the webpage content into the webpage database; (3) The URLs in the fetched webpage that satisfy the requirements are extracted and put into the URL database.
- The method for judging whether a URL satisfies the requirements is pattern matching; (4) Repeat steps (2) to (3) until no new records are added to the webpage database.
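The four crawling steps can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the in-memory `FAKE_WEB` dictionary stands in for real HTTP fetching, and `URL_PATTERN`, `HREF_RE`, and `crawl` are hypothetical names chosen for the example.

```python
import re
from collections import deque

# Hypothetical in-memory "Internet" (URL -> HTML); a real spider would fetch over HTTP.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/rent/1">1</a> '
                           '<a href="http://other.org/x">x</a>',
    "http://example.com/rent/1": '<a href="http://example.com/rent/2">2</a> '
                                 '<a href="http://example.com/">home</a>',
    "http://example.com/rent/2": "no links here",
}

HREF_RE = re.compile(r'href="([^"]+)"')
# Assumed requirement for step (3): keep only URLs on example.com.
URL_PATTERN = re.compile(r"^http://example\.com/")


def crawl(seeds, fetch=FAKE_WEB.get):
    """Return a webpage database (URL -> raw content) reachable from the seeds."""
    url_db = deque(seeds)          # step (1): URL database seeded manually
    seen = set(seeds)
    webpage_db = {}
    while url_db:                  # step (4): repeat until nothing new is queued
        url = url_db.popleft()     # step (2): take a URL and grab its page
        html = fetch(url)
        if html is None:
            continue
        webpage_db[url] = html
        for link in HREF_RE.findall(html):
            # step (3): keep only URLs that match the pattern and are new
            if URL_PATTERN.match(link) and link not in seen:
                seen.add(link)
                url_db.append(link)
    return webpage_db
```

Calling `crawl(["http://example.com/"])` visits the three example.com pages and skips the link to other.org, because that URL fails the pattern match.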
- The system obtains the original webpage from the webpage database and extracts the textual information from it, that is, removes all the HTML markup. The extracted text is then sent to the text indexing module to build an index.
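As an illustration of removing the HTML markup, the sketch below uses Python's standard `html.parser` module. The class name `TextExtractor` and the function `extract_text` are hypothetical, and skipping the contents of `script` and `style` elements is an added assumption that the source does not mention.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a page and drops every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

The resulting plain text is what a text indexing module would receive.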
- The indexing process first calculates the relevance (or importance) of each keyword in the page content and hyperlinks, and then uses this information to build the webpage index, forming an index database.
- While the text index is being built, the link information of the website must be consulted, mainly to guard against illegitimate websites, for example sites that link to themselves in multiple loops.
- The link information is extracted from the webpage database, and the link information (including the anchor text and the link itself) is sent to the link database to provide a basis for webpage rating.
- The user submits a query request to the query server, and the server searches the index database for relevant webpages. The webpage rating combines the query request with the link information to evaluate the relevance of the search results, and the query server sorts the results by relevance.
- A content summary around the keyword is extracted, and finally the page generation system assembles the link addresses of the search results and the page content summaries and returns them to the user.
- Among these, the web spider (Spider) and the link information extraction module (Parser) are the most important parts:
- The web spider uses multi-threaded concurrent search technology and mainly implements a document access agent, a path selection engine, and an access control engine.
- The web spider mainly consists of a URL server, a crawler, a store, and a URL parser, together with three major data resources: the resource library (webpage database), the anchor library, and the URL database. An indexer also plays an auxiliary role.
- The specific process is as follows: the URL server obtains the URLs to be crawled from the URL database; the crawler grabs each web page according to its URL and sends it to the store, which compresses the web page and saves it in the webpage database; the indexer then analyzes all the links in each web page and stores the relevant important information in the anchors file.
- The URL parser reads the anchors file and resolves each URL, which is in turn converted into a docID.
- The anchor text is then indexed and sent to the index database.
- the specific process is shown in Figure 2.
- the analyzer in Figure 2 can be seen as part of the indexer, or as an auxiliary part of the indexer. Since the processing flow of the web spider is a well-known technique, it is not described in detail herein.
- The link information extraction module is configured to read the webpage database, decompress each document, and then analyze it. Each document is converted into a set of word occurrences, called hits. Each hit records the word together with its position in the document, its font size, and its capitalization. Search engines have two types of hits: (1) Title: the title of the HTML page or the URL, and the meta information in the HTML file. An index is built by analyzing the individual words, and users can search this information through the index.
- the general search engine only extracts and indexes the title and content in the webpage, and does not further extract the information in the content.
- An object of the embodiments of the present invention is to provide a search engine and a method for filtering the mediation information, so that some or all of the mediation information is filtered out in the search result.
- the present invention provides a search engine, including: a web spider, a link information extraction module, and a query server;
- the link information extraction module is configured to extract a webpage title, a webpage content, and an intermediary feature information from a webpage database, and determine whether the information corresponding to the mediation feature information is the intermediary information by using the set mediation information judgment condition;
- the search engine filters out the index corresponding to the mediation information from its index database.
- the invention also provides a search engine, comprising: a web spider, a link information extraction module and a query server;
- the link information extraction module is configured to extract a webpage title and a webpage content from a webpage database, analyze the webpage content, and determine that the content including the intermediary propensity information is the intermediary information.
- the search engine filters out the index corresponding to the mediation information from its index database.
- The present invention also provides a method for a search engine to filter mediation information, including: grabbing a web page from the Internet and sending it to a web page database;
- the extracted mediation feature information is analyzed, and if the set mediation information judgment condition is met, the information corresponding to the mediation feature information is determined to be mediation information;
- The present invention also provides a method for a search engine to filter mediation information, including:
- The search engine and its method for filtering intermediary information in the embodiments of the present invention can filter out some or all of the intermediary information in the search results, effectively preventing intermediary information from interfering with the user, improving the usability of the search results, and providing the user with greater convenience.
- FIG. 1 is a system architecture diagram of a typical search engine in the prior art;
- FIG. 2 is a schematic diagram of the processing flow of a web spider in the prior art;
- FIG. 3 is a schematic flowchart of filtering mediation information according to an embodiment of the present invention.
Detailed description
- the intermediary information generally has one or more of the following characteristics:
- the same intermediary will publish a lot of different information. Taking rental housing as an example, an intermediary usually publishes rental information in many different locations.
- the published information contains company information. For example, company address and company contact information.
- the published information contains unreasonable information. Examples include incorrect phone numbers (including cell phone numbers, landline numbers, PHS numbers, etc.), very low prices, and more.
- the embodiment of the present invention modifies the link information extraction portion (link information extraction module) of the search engine based on the general vertical search.
- the search engine in this embodiment mainly includes a web spider (Spider), a link information extraction module (Parser), and a query server.
- the web spider (Spider) and the query server adopt a common processing technology, which is not described in detail herein.
- The link information extraction module is improved to target the characteristics of mediation information. In addition to the web page title and content, it further extracts information from within the content, and the extracted content can be further processed:
- mediation feature information used to identify mediation information, such as a phone number, email address, and price, is extracted;
- the web content is analyzed and processed to find further company information or other mediation information.
- the improved link information extraction module adds the following functions:
- The extraction method is pattern matching: each web page is searched for strings such as "mobile phone", "telephone", "Little Smart" (PHS), "Mobile Phone", and "Cell Phone". Once one is found, the first run of consecutive digits following the string is extracted and taken to be the user's phone number.
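A minimal sketch of this keyword-then-digits extraction follows. The English keyword list is a stand-in for the keywords of the original-language text, and `PHONE_KEYWORDS` and `extract_phone_numbers` are hypothetical names.

```python
import re

# Stand-in keyword list; the real list would hold the source-language terms.
PHONE_KEYWORDS = ["mobile phone", "cell phone", "telephone", "PHS"]

_KEYWORD_RE = re.compile(
    "|".join(re.escape(k) for k in PHONE_KEYWORDS), re.IGNORECASE
)
_DIGITS_RE = re.compile(r"\d+")


def extract_phone_numbers(text):
    """For each keyword hit, return the first run of digits that follows it."""
    numbers = []
    for keyword in _KEYWORD_RE.finditer(text):
        digits = _DIGITS_RE.search(text, keyword.end())
        if digits:
            numbers.append(digits.group())
    return numbers
```

Because only the first consecutive digit run is taken, a number written with separators (for example 010-62345678) would be cut at the first separator; a practical extractor would probably need to normalize such formats first.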
- The extraction method is pattern matching: each web page is searched for strings such as "email box" and "Email". Once one is found, the consecutive characters following the string are extracted, stopping at the first space. The extracted string is taken to be the user's email address.
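The email rule can be sketched in the same way. The keyword alternatives and the optional colon after the keyword are assumptions made for this example, and `extract_emails` is a hypothetical name.

```python
import re

# Keyword, optional colon, then every non-space character up to the first space.
# Longer keywords come first so "email box" is not matched as just "email".
_EMAIL_RE = re.compile(r"(?:email box|e-mail|email)\s*:?\s*(\S+)", re.IGNORECASE)


def extract_emails(text):
    """Return the string that follows each email keyword, cut at whitespace."""
    return _EMAIL_RE.findall(text)
```

Following the description literally, the captured string is whatever precedes the next space, so trailing punctuation would be included unless it is stripped in a later step.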
- In a number starting with 010, the digit that follows must be 5, 6, or 8. Otherwise, all the information corresponding to this number is considered intermediary information.
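One possible reading of this rule is sketched below. It assumes that 010 is the Beijing area code, that the digit immediately after it must be 5, 6, or 8, and (from the 8-digit Beijing example given later in the description) that the local number has 8 digits. The function name and the three-way return value are illustrative choices.

```python
import re

# Assumed rule: area code 010, then an 8-digit local number starting with 5, 6, or 8.
_BEIJING_RE = re.compile(r"^010[568]\d{7}$")


def is_plausible_beijing_number(number):
    """Return True/False for 010 numbers, or None when the rule does not apply."""
    digits = re.sub(r"\D", "", number)  # drop separators such as "-" or spaces
    if not digits.startswith("010"):
        return None
    return bool(_BEIJING_RE.match(digits))
```

A posting whose number yields `False` would be treated as intermediary information under this rule; numbers from other regions would need their own entries.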
- The link information extraction module can further identify intermediary information by analyzing and processing the extracted content. For example, the main body of the webpage can be analyzed: if it contains words such as "company", "company address", "my company", or "large amount of listings", the information is considered intermediary information.
- After the link information extraction module extracts the above information, either only the information determined to be non-intermediary information is indexed, or an index is built for all of it and everything determined to be intermediary information is then deleted from the index database. Indexing uses the generic "inverted index" technique (which is well known in the art and is not described in detail herein).
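Both options (index only non-intermediary information, or index everything and delete afterwards) can be illustrated with a toy inverted index. This is a sketch under simplifying assumptions: whitespace tokenization, an in-memory dictionary instead of an index database, and hypothetical function names.

```python
from collections import defaultdict


def build_index(docs, is_intermediary):
    """Option 1: index only documents that are not flagged as intermediary.

    docs maps doc_id -> text; the result maps word -> set of doc_ids.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        if is_intermediary(text):
            continue
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def remove_from_index(index, doc_ids):
    """Option 2: after indexing everything, delete the flagged documents."""
    doomed = set(doc_ids)
    for word in list(index):
        index[word] -= doomed
        if not index[word]:
            del index[word]
```

For the same set of flagged documents, the two routes end with the same index contents; the difference is only whether the filtering happens before or after the index is built.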
- Because the index entries corresponding to the mediation information have been filtered out of the index database, when the user submits a query request to the query server and the server searches the index database for relevant webpages, the intermediary information is largely absent from the returned search results.
- FIG. 3 is a schematic diagram of the process by which a search engine filters mediation information according to an embodiment of the present invention. As shown in FIG. 3, the process includes the following steps:
- Step 100: Extract mediation feature information (such as a phone number and an email address), specifically including the following information: i. a mobile phone number;
- Step 200: Count the occurrences of identical extracted information.
- the method implemented in this embodiment is to establish a table in the background database of the search engine, the first field is a phone number or Email, and the second field is the number of times of repeated occurrence. After each message is extracted, the table is queried first. If there is already a record, the corresponding number of repetitions is incremented by one; if there is no record, a record is inserted, and the corresponding number of repetitions is set to 1.
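The two-field table and its query-then-update logic might look like the following sketch, which uses an in-memory SQLite database as a stand-in for the search engine's background database. The table name `contact_counts`, the column names, and the function names are hypothetical.

```python
import sqlite3


def open_counter_db():
    """Create the two-field table: contact (phone or email) and repeat count."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contact_counts ("
        "contact TEXT PRIMARY KEY, repeats INTEGER NOT NULL)"
    )
    return conn


def record_contact(conn, contact):
    """Query the table first; increment an existing record or insert a new one."""
    row = conn.execute(
        "SELECT repeats FROM contact_counts WHERE contact = ?", (contact,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO contact_counts (contact, repeats) VALUES (?, 1)",
            (contact,),
        )
        return 1
    conn.execute(
        "UPDATE contact_counts SET repeats = repeats + 1 WHERE contact = ?",
        (contact,),
    )
    return row[0] + 1
```

The function returns the updated count so that a later step can compare it with whatever repetition threshold the judgment condition uses; no threshold value is assumed here.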
- Step 500: Determine whether the mobile phone, telephone, or PHS number is legal.
- The judgment rules are based on the numbering rule tables of the various regions of China. For example, a Beijing telephone number has 8 digits. For numbers that do not comply with the rules, all the posted information corresponding to that mobile phone, telephone, or PHS number is deleted from the index database of the search engine.
- Step 600: Determine whether the extracted webpage content has an intermediary tendency. If the content of the webpage contains "the company" or "large number of listings", or contains multiple different addresses (for example, housing offered in Dongzhimen, Xizhimen, and Zhongguancun at the same time), then this information is not indexed, or this information is deleted from the search engine index.
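A rough sketch of the Step 600 check is given below. The phrase list mirrors the two examples in the text; the small place-name list `KNOWN_PLACES` and the threshold `min_places=2` are assumptions made for illustration, since the source does not say how addresses are recognized or how many count as "multiple".

```python
TENDENCY_PHRASES = ("the company", "large number of listings")
# Hypothetical gazetteer; a real system would need a much fuller address list.
KNOWN_PLACES = ("Dongzhimen", "Xizhimen", "Zhongguancun")


def has_intermediary_tendency(content, min_places=2):
    """Flag content with a telltale phrase or with several distinct addresses."""
    lowered = content.lower()
    if any(phrase in lowered for phrase in TENDENCY_PHRASES):
        return True
    places = {place for place in KNOWN_PLACES if place.lower() in lowered}
    return len(places) >= min_places
```

Content flagged by this check would then either be skipped at indexing time or removed from the index afterwards.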
- Alternatively, information determined to be intermediary information need not be handled immediately: after all the conditions have been evaluated, all the information judged to be mediation information is deleted from the index database of the search engine; or, after all the conditions have been evaluated, only the non-intermediary information is added to the index database for users to query, and the information determined to be intermediary information is not indexed.
- The present invention is not limited to these modes; any approach falls within the scope of the present invention as long as the information judged to be mediation information can be filtered out of the index database.
- the above steps in the present embodiment shown in FIG. 3 are not limited in order, and the mediation feature information is not limited to the phone number or email given in the embodiment, and may be other information such as price.
- the index database record of the search engine can be provided to the query server for the user to query and use.
- the mediation information in the search result can be reduced from 90% before processing to 10% or less.
- The search engine and its method for filtering intermediary information in the embodiments of the present invention can filter out some or all of the intermediary information in the search results, effectively preventing intermediary information from interfering with the user, improving the usability of the search results, and providing users with greater convenience.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a search engine and a method for filtering intermediary information. The method comprises: fetching web pages from the Internet; sending the pages to a web page database; extracting link information; extracting the titles and contents of the web pages from the database; extracting other intermediary feature information; analyzing the extracted intermediary feature information and, if the set intermediary information determination condition is satisfied, determining the information corresponding to the intermediary feature information to be intermediary information; and filtering the intermediary information out of the search results.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2007/001474 WO2008131597A1 (fr) | 2007-04-29 | 2007-04-29 | Search engine and method for filtering intermediary information |
| CN200780052784A CN101849232A (zh) | 2007-04-29 | 2007-04-29 | Search engine and its method for filtering intermediary information |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2007/001474 WO2008131597A1 (fr) | 2007-04-29 | 2007-04-29 | Search engine and method for filtering intermediary information |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2008131597A1 (fr) | 2008-11-06 |
Family
ID=39925170
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2007/001474 WO2008131597A1 (fr) | Search engine and method for filtering intermediary information | 2007-04-29 | 2007-04-29 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN101849232A (fr) |
| WO (1) | WO2008131597A1 (fr) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108062328A (zh) * | 2016-11-08 | 2018-05-22 | 北京国双科技有限公司 | 获取网站自然搜索排名的方法和装置 |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1536483A (zh) * | 2003-04-04 | 2004-10-13 | 陈文中 | 网络信息抽取及处理的方法及系统 |
| US20060136411A1 (en) * | 2004-12-21 | 2006-06-22 | Microsoft Corporation | Ranking search results using feature extraction |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7962510B2 (en) * | 2005-02-11 | 2011-06-14 | Microsoft Corporation | Using content analysis to detect spam web pages |
-
2007
- 2007-04-29 WO PCT/CN2007/001474 patent/WO2008131597A1/fr active Application Filing
- 2007-04-29 CN CN200780052784A patent/CN101849232A/zh active Pending
Non-Patent Citations (1)
| Title |
|---|
| ZHANG MAOYUAN AND ZOU CHUNYAN: "Research for Web Page Filter with Natural Language Processing", COMPUTER & DIGITAL ENGINEERING, vol. 31, no. 3, March 2003 (2003-03-01), pages 11, 24 - 28 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN101849232A (zh) | 2010-09-29 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| WWE | Wipo information: entry into national phase |
Ref document number: 200780052784.0 Country of ref document: CN |
|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07721047 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 07721047 Country of ref document: EP Kind code of ref document: A1 |