WO2010049760A1 - Method to find user generated content web pages. - Google Patents
- Publication number
- WO2010049760A1 (application PCT/IB2008/055370)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- web page
- web
- generated content
- user generated
- web pages
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
Method to select user generated content web pages from a list of web page links created by a web search engine in response to a user request, comprising a) downloading (24) and storing each web page referenced by said list, b) associating (26) with each web page at least domain information, and c) searching (28) each web page and its associated information for a signature representative of a user generated content web page.
Description
METHOD TO FIND USER GENERATED CONTENT WEB PAGES
Field of the invention
The present invention concerns a method to select user generated content web pages from a list of web page links created by a web search engine in response to a user request, and software to practice the same.
Background of the invention
Nowadays, the usual way to search for a web page is to choose some words suited to the object of the search. These words are entered into the query page of an internet search engine such as those offered by Google Inc. or Yahoo Inc.
As a result, the search engine lists a set of web pages by their title and an automatically generated short abstract. A link gives access to the web page.
Search engines use internal, and often secret, algorithms to sort the list of web pages so that the pages expected to be most pertinent appear at the beginning of the list.
However, this method is not very efficient when the search concerns users' opinions about a product or a service.
Indeed, when a user would like to buy a product or a service it is nowadays a common practice to search for the opinions of the prior buyers or users. This information may be found on blogs, wikis, forums and any other web site where a "standard" user may post a message. These opinions around a product or service generate a buzz which has a positive, or negative, impact on the success of the product/service.
It is therefore important, for marketing departments as well as for users, to have a method that efficiently finds the web pages containing opinions on a given product or service while leaving aside "classical" web pages about the product/service, such as pages of merchant web sites, price comparators, etc.
Summary of the invention
To better address one or more of these concerns, in a first aspect of the invention, a method to select user generated content web pages from a list of web page links created by a web search engine in response to a user request comprises a) downloading and storing each web page referenced by said list, b) associating with each web page at least domain information, and c) searching each web page and its associated information for a signature representative of a user generated content web page.
The method therefore has the advantage of selecting only web pages containing user generated content.
In particular embodiments:
- the method further comprises d) searching for a contained web page link and, if found, e) downloading the associated web page and f) storing the associated web page if the associated web page is a positive response to said user request;
- steps e) and f) are executed only if the associated web page is not already stored;
- steps d) to f) are iteratively executed on each new web page found through a link contained in a stored web page;
- the signature comprises, alone or in combination: a set of HTML tags representative of web software for managing user generated content, keyword(s) included in the web page link, a determined domain name, and/or keyword(s) specifically included in user content web pages.
Aspects of these embodiments may be combined or modified as appropriate or desired, however.
In a second aspect of the invention, a computer program product to search for a web page comprises program instructions to execute the steps of the above method when the computer program product is executed on a computer.
In a third aspect of the invention, a system to select user generated content web pages from a list of web page links created by a web search engine in response to a user request comprises:
- means for downloading and means for storing web pages referenced by said list;
- means for associating with each web page at least domain information; and
- means for searching each web page and its associated information for a signature representative of a user generated content web page.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiment described hereafter, where:
- Fig. 1 is a schematic view of a terminal connected to the internet to practice an embodiment of the invention;
- Fig. 2 is a flowchart of a method according to an embodiment of the invention;
- Fig. 3 is a flowchart of a method according to another embodiment of the invention; and
- Fig. 4 is a functional view of a terminal practising an embodiment of the invention.
Detailed description
In reference to Fig. 1, a computer 1 is connected to an Internet network 3. Through the network 3, the computer 1 is connected to a server 5 on which a search engine is running. The man skilled in the art understands that the server 5 symbolizes the infrastructure of search companies such as Google Inc. or Yahoo Inc. In fact, these companies use server farms containing hundreds of computers distributed around the world.
A server 7 is also connected to the internet network 3 and contains a web page which is of interest to the user of the computer 1 but whose address is not known by the computer 1. The web page contains opinions on a product/service of interest to the user of the computer 1 and is a user generated content web page such as a blog, wiki or forum page.
The computer 1 is a classical personal computer. It comprises interface means such as a display 9, a keyboard 11 and a mouse 13 or the like.
It also comprises storage means 15 and processing means 17 such as, for instance, hard disk drives and a motherboard.
The storage means 15 contains a computer software product which, when executed by the processing means 17, makes the computer 1 execute the steps of a method to search for a web page according to an embodiment of the invention.
In reference to Fig. 2, a user of the computer 1 sends, step 20, a request to the search engine of server 5 and receives, step 22, a web page containing a list of web page links created by the web search engine in response to the request.
As is well known by the man skilled in the art, the request comprises a set of lexical units on the subject of interest. The request is entered into an internet search engine, such as the Google engine (www.google.com) or the Yahoo engine (www.yahoo.com).
Step 22 is preferably performed automatically, either by building an HTTP request with the appropriate syntax or by using an Application Programming Interface (API) provided by these companies to automate internet searches with their engine, as long as such use is accepted and/or authorized by these companies.
The computer 1 downloads and stores, step 24, the web pages referenced by the links contained in the received list.
Each web page is stored as usual as a set of files in the storage means 15. Each web page is associated, step 26, with domain information, particularly the domain name of the server which contains the web page.
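As an illustration of step 26, the following minimal Python sketch derives domain information from the URL of a stored page. The function name `domain_info` and the shape of the returned record are assumptions made for this example; the description only requires that at least the domain name be associated with each page.

```python
from urllib.parse import urlsplit


def domain_info(url: str) -> dict:
    """Return a small record of URL-derived information for one page.

    The record layout is hypothetical; only the domain name is
    required by the method as described.
    """
    parts = urlsplit(url)
    return {
        "url": url,
        # hostname is already lower-cased by urlsplit; it is None for
        # URLs without a network location, hence the fallback.
        "domain": parts.hostname or "",
        "path": parts.path,
    }
```

The record can then be stored alongside the downloaded page so that the later signature search has access to both the content and the domain.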
Then, at step 28, each web page and its associated information are searched for a signature representative of a user generated content web page.
A signature, in this context, means a character string or a set of character strings which is representative of a user generated content web page. For instance, a signature may be represented by a regular expression.
Such a signature may be a character string which is specific to server software used by web sites having user generated content. As is well known by the man skilled in the art, forums, blogs and other user content web sites use specific server software. For instance, DotClear (www.dotclear.net) is software often used to create blogs. Such software has specific ways of generating user content web pages, which can be discovered by analyzing specific character strings found in different parts of the inner page structure (html/php/asp and other standards). For instance, <body id="phpbb" class="section-viewtopic ltr"> for a PhpBB forum.
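A signature expressed as a regular expression can be sketched as follows in Python. Only the PhpBB `<body id="phpbb" ...>` marker comes from the description; the table name `HTML_SIGNATURES`, the function name and the exact pattern are assumptions, and a real deployment would need one pattern per supported server software.

```python
import re

# Hypothetical signature table: label -> compiled regular expression.
# The PhpBB entry mirrors the <body id="phpbb" ...> example above.
HTML_SIGNATURES = {
    "phpbb": re.compile(r'<body[^>]*\bid="phpbb"', re.IGNORECASE),
}


def find_html_signatures(html: str, signatures=HTML_SIGNATURES) -> list[str]:
    """Return the labels of all signatures found in the page source."""
    return [label for label, pattern in signatures.items()
            if pattern.search(html)]
```

A page for which the returned list is empty exhibits none of the known HTML signatures; it may still match a URL, domain or content signature.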
A signature may also be a part of the character string forming the URL (Uniform Resource Locator) of the web page. The character string may be a part of the structure, or tree, of the link in a dynamic web site, i.e. a site in which a part of the content is contained in a database and the web page is generated on request by integrating elements of the database within HTML frameworks. For instance, the variable $itemid ending a URL beginning with index.php? is the telltale of a Joomla CMS. The character string may also be a word or an expression often used in a URL to define a user content area. For instance, the URL http://www.domainname/phorum/index.php contains "phorum", which is a marker of a forum area. "Blog", "blogs", "forum" and "forums" are often used to distinguish between the root of the "normal" web site and the subdirectory used to host blog or forum files.
The signature may also be the blog or forum host domain name, or the server software name usually displayed as a copyright notice on each generated web page.
The user content itself may also be used to find a signature, as some language is typical. For instance, an expression such as LOL (for "Laughing Out Loud") is an acronym which is almost only used in forum posts.
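The URL-based signatures can be sketched in Python as below. The word list is taken from the paragraph above; treating the Joomla telltale as "path ends with index.php and the query carries an itemid parameter" is one possible reading of the description, and the names `UGC_PATH_WORDS` and `url_signatures` are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Path segments that commonly mark a user content area (from the text).
UGC_PATH_WORDS = {"blog", "blogs", "forum", "forums", "phorum"}


def url_signatures(url: str) -> list[str]:
    """Return the URL-based signatures exhibited by a link."""
    parts = urlsplit(url)
    segments = [s.lower() for s in parts.path.split("/") if s]
    found = sorted(UGC_PATH_WORDS.intersection(segments))

    # Assumed reading of the Joomla telltale: index.php plus an
    # "itemid" query parameter, compared case-insensitively.
    params = {name.lower() for name in parse_qs(parts.query)}
    if parts.path.lower().endswith("index.php") and "itemid" in params:
        found.append("joomla-itemid")
    return found
```

Matching whole path segments rather than substrings avoids flagging unrelated paths that merely contain the letters "blog" or "forum".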
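A content-based signature of this kind could be approximated by counting typical expressions in the page text. In the Python sketch below, the marker list contains only the "LOL" example from the description; the function name and the idea of returning a count (rather than a yes/no answer) are assumptions, since the text does not say how many occurrences should be required.

```python
import re

# Only "LOL" comes from the description; a real list would be longer.
FORUM_MARKERS = re.compile(r"\b(?:lol)\b", re.IGNORECASE)


def count_forum_markers(text: str) -> int:
    """Count standalone occurrences of typical forum expressions."""
    return len(FORUM_MARKERS.findall(text))
```

The word boundaries keep ordinary words that merely start with the same letters from being counted.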
Thus, only web pages exhibiting the signature are considered, step 30, as containing user content. The method of analyzing each web page for a signature may be applied also to web pages embedded inside other web pages, Fig. 3.
It is useful to look at the links contained in the web pages stored in the computer 1, as these links may refer to web pages which are not referenced by the search engine.
Therefore, each web page stored in the computer 1 is searched for links, step 31.
At step 32, it is verified that a found link does not refer to a page already stored in the computer 1.
If the page is not already stored, it is downloaded at step 34.
Then, at step 36, the downloaded web page is verified as containing information relevant to the user request, i.e., that, if it had been indexed by the search engine, it would have been included in the response list.
If the verification is positive, the downloaded web page is stored with the other web pages, step 38.
Steps 32 to 36 are repeated for all links found in a page, step 40.
The man skilled in the art understands that this search for new web pages embedded in already known web pages may also be practised on the newly found web pages in an iterative manner. The user therefore defines a level of iteration at which the method stops, for instance after having looked at all pages that are at most 3 links away from the pages initially found by the search engine.
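Steps 31 to 40, together with the iteration limit, can be sketched in Python as a breadth-first expansion. This is an illustrative sketch, not the claimed implementation: `fetch` and `is_relevant` are caller-supplied placeholders for the download (step 34) and the relevance check (step 36), whose details the description leaves open, and the `seen` set is a small addition so that rejected pages are not downloaded twice.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def extract_links(base_url, html):
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links and drop #fragments.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def expand(stored, fetch, is_relevant, max_depth=3):
    """Grow `stored` (url -> html) by following links up to max_depth."""
    seen = set(stored)
    frontier = list(stored)
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            for link in extract_links(url, stored[url]):    # step 31
                if link in seen:                            # step 32
                    continue
                seen.add(link)
                html = fetch(link)                          # step 34
                if html is not None and is_relevant(html):  # step 36
                    stored[link] = html                     # step 38
                    next_frontier.append(link)
        frontier = next_frontier
    return stored
```

Only pages that pass the relevance check are stored and explored further, which matches the statement that the iteration runs on new web pages found through links contained in stored web pages.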
To increase the number of pages to analyse for finding user content, the user may send different requests to the search engines, each request having some lexical variation. The results of all these requests are a set of web page link lists. These links are consolidated: identical links found on different lists are merged.
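The consolidation of several result lists can be sketched in a few lines of Python. The function name `consolidate` is hypothetical, and "identical" is taken here to mean exact string equality; any further URL normalization would be an additional design choice not described in the text.

```python
def consolidate(link_lists):
    """Merge several lists of links, keeping each link once.

    The first occurrence determines the position in the result.
    """
    return list(dict.fromkeys(
        link for links in link_lists for link in links
    ))
```

For example, `consolidate([["a", "b"], ["b", "c"]])` yields `["a", "b", "c"]`.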
The storage of web pages is preferably done within a relational database. Thus, a list of parameters may be associated with each web page to help the subsequent analysis.
These parameters may comprise:
- keyword(s) included in the "head" part of the page, i.e. META tags;
- language used;
- description tag;
- title tag;
- HTML content;
- URL information and, particularly, domain name.
The above method is advantageously implemented as a computer program product which comprises instructions to execute the steps of the method when the program product is executed on a computer.
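The relational storage of pages and their associated parameters could look like the following Python sketch. SQLite is used purely as an assumption for the example (the description does not name a database), and the table name `pages`, its columns and the helper names are hypothetical; the columns simply mirror the parameters listed above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    url         TEXT PRIMARY KEY,
    domain      TEXT NOT NULL,
    language    TEXT,
    title       TEXT,
    description TEXT,
    keywords    TEXT,
    html        TEXT NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the page store; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_page(conn, page: dict):
    """Insert one page record; storing the same URL again replaces it."""
    conn.execute(
        "INSERT OR REPLACE INTO pages "
        "(url, domain, language, title, description, keywords, html) "
        "VALUES (:url, :domain, :language, :title, :description, "
        ":keywords, :html)",
        page,
    )
    conn.commit()
```

Using the URL as primary key gives the "already stored" check of step 32 for free: a lookup by URL tells whether a page has been downloaded before.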
From a functional point of view, the computer 1 implementing the disclosed method comprises means 50 for storing web pages and associated information. Typically storage means 50 comprise a database on a hard disk.
Computer 1 also comprises means 52 for sending a request to an internet search engine and means 54 for receiving the result from the internet search engine and downloading the web pages referenced by the list.
Computer 1 further comprises means 56 to associate information with a web page and means 58 to search each web page and its associated information for a signature representative of a user generated content web page.
While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative and exemplary and not restrictive; the invention is not limited to the disclosed embodiment.
Other variations to the disclosed embodiment can be understood and effected by those skilled in the art in practising the claimed invention, from a study of the drawings, the disclosure and the appended claims. In the claims, the word "comprising" does not exclude other elements and the indefinite article "a" or "an" does not exclude a plurality.
Claims
1. Method to select user generated content web pages from a list of web page links created by a web search engine in response to a user request comprising a) downloading (24) and storing each web page referenced by said list, b) associating (26) to each web page at least domain information, and c) searching (28) in each web page and its associated information a signature representative of a user generated content web page.
2. Method according to claim 1, wherein, after being downloaded and for each downloaded web page, the method further comprises d) searching for contained web page link, and, if found, e) downloading the associated web page f) storing the associated web page if the associated web page is a positive response to said user request .
3. Method according to claim 2, wherein steps e) and f) are executed only if the associated web page is not already stored.
4. Method according to claim 2 or 3, wherein steps d) to f) are iteratively executed on each new web page found through a link contained in a stored web page.
5. Method according to any of claims 1 to 4, wherein the signature comprises, alone or in combination:
- a set of HTML tags representative of web software for managing user generated content,
- keyword(s) included into the web page link,
- determined domain name, and/or
- keyword(s) specifically included into user content web page.
6. A computer program product to select user generated content web pages comprising program instructions to execute the steps of the method according to any one of claims 1 to 5 when said computer program product is executed on a computer.
7. Device to select user generated content web pages from a list of web page links created by a web search engine in response to a user request comprising
- means for downloading (54) and means for storing (50) web pages referenced by said list ;
- means for associating (56) to each web page at least domain information ; and
- means for searching (58) in each web page and its associated information a signature representative of a user generated content web page.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IB2008/055370 WO2010049760A1 (en) | 2008-10-30 | 2008-10-30 | Method to find user generated content web pages. |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IB2008/055370 WO2010049760A1 (en) | 2008-10-30 | 2008-10-30 | Method to find user generated content web pages. |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2010049760A1 (en) | 2010-05-06 |
Family
ID=40527550
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB2008/055370 Ceased WO2010049760A1 (en) | 2008-10-30 | 2008-10-30 | Method to find user generated content web pages. |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2010049760A1 (en) |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20080021903A1 (en) * | 2006-07-20 | 2008-01-24 | Microsoft Corporation | Protecting non-adult privacy in content page search |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US7346605B1 (en) | Method and system for searching and monitoring internet trademark usage | |
| US7849081B1 (en) | Document analyzer and metadata generation and use | |
| CN1104696C (en) | System and method for automatically adding informational hypertext links to received documents | |
| JP6517818B2 (en) | Improving Website Traffic Optimization | |
| US9607089B2 (en) | Search and search optimization using a pattern of a location identifier | |
| US8799310B2 (en) | Method and system for processing a uniform resource locator | |
| US20080065611A1 (en) | Method and system for searching and monitoring internet trademark usage | |
| US20080306968A1 (en) | Method and system for extracting, analyzing, storing, comparing and reporting on data stored in web and/or other network repositories and apparatus to detect, prevent and obfuscate information removal from information servers | |
| CN101546309B (en) | Method and equipment for constructing indexes to resource content in computer network | |
| US20110225139A1 (en) | User role based customizable semantic search | |
| CN101305371A (en) | Rank blog documents | |
| GB2555801A (en) | Identifying fraudulent and malicious websites, domain and subdomain names | |
| US20110307479A1 (en) | Automatic Extraction of Structured Web Content | |
| CN102663060B (en) | A method and device for identifying tampered web pages | |
| Kumar | World towards advance web mining: A review | |
| KR20100023630A (en) | Method and system of classifying web page using categogory tag information and recording medium using by the same | |
| Li | [Retracted] Internet Tourism Resource Retrieval Using PageRank Search Ranking Algorithm | |
| Devi et al. | An efficient approach for web indexing of big data through hyperlinks in web crawling | |
| US7886217B1 (en) | Identification of web sites that contain session identifiers | |
| WO2017000659A1 (en) | Enriched uniform resource locator (url) identification method and apparatus | |
| KR20120090131A (en) | Method, system and computer readable recording medium for providing search results | |
| CN101000611A (en) | Method for providing and inquiry information for public by interconnection network | |
| Duan et al. | Cloaker catcher: A client-based cloaking detection system | |
| RU2589856C2 (en) | Method of processing target message, method of processing new target message and server (versions) | |
| WO2010049760A1 (en) | Method to find user generated content web pages. |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 08875894; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 08875894; Country of ref document: EP; Kind code of ref document: A1 |