WO2010049760A1 - Method to find user generated content web pages. - Google Patents

Info

Publication number
WO2010049760A1
Authority
WO
WIPO (PCT)
Prior art keywords
web page
web
generated content
user generated
web pages
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/IB2008/055370
Other languages
French (fr)
Inventor
Eric De Barry
Bertrand Wolf
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ALTERBUZZ
Original Assignee
ALTERBUZZ
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2008-10-30
Filing date
2008-10-30
Publication date
2010-05-06
Application filed by ALTERBUZZ
Priority to PCT/IB2008/055370
Publication of WO2010049760A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Method to select user generated content web pages from a list of web page links created by a web search engine in response to a user request, comprising a) downloading (24) and storing each web page referenced by said list, b) associating (26) with each web page at least domain information, and c) searching (28) in each web page and its associated information for a signature representative of a user generated content web page.

Description

METHOD TO FIND USER GENERATED CONTENT WEB PAGES
Field of the invention
The present invention concerns a method to select user generated content web pages from a list of web page links created by a web search engine in response to a user request, and software to practice the same.
Background of the invention
Nowadays, to search for a web page, the usual method consists of choosing some words suited to the object of the search. These words are entered into the query page of an internet search engine such as those offered by Google Inc. or Yahoo Inc.
As a result, the search engine lists a set of web pages by their title and an automatically generated short abstract. A link gives access to the web page.
Search engines use internal, and often secret, algorithms to sort the list of web pages so that the pages expected to be most pertinent appear at the beginning of the list.
However, this method is not very efficient when the search concerns users' opinions about a product or a service.
Indeed, when a user would like to buy a product or a service, it is nowadays common practice to search for the opinions of prior buyers or users. This information may be found on blogs, wikis, forums and any other web site where a "standard" user may post a message. These opinions around a product or service generate a buzz which has a positive, or negative, impact on the success of the product/service.
It is therefore important for marketing departments as well as for users to have an efficient method for finding the web pages containing opinions on a defined product or service while leaving aside "classical" web pages concerning the product/service, such as pages of merchant web sites, price comparators, etc.
Summary of the invention
To better address one or more concerns, in a first aspect of the invention, a method to select user generated content web pages from a list of web page links created by a web search engine in response to a user request comprises:
a) downloading and storing each web page referenced by said list,
b) associating with each web page at least domain information, and
c) searching in each web page and its associated information for a signature representative of a user generated content web page.
Therefore, the method has the advantage of selecting only web pages containing user generated content.
In particular embodiments:
- the method further comprises d) searching for a contained web page link and, if found, e) downloading the associated web page and f) storing the associated web page if the associated web page is a positive response to said user request;
- steps e) and f) are executed only if the associated web page is not already stored;
- steps d) to f) are iteratively executed on each new web page found through a link contained in a stored web page;
- the signature comprises, alone or in combination:
  - a set of HTML tags representative of web software for managing user generated content,
  - keyword(s) included in the web page link,
  - a determined domain name, and/or
  - keyword(s) specifically included in user content web pages.
Aspects of these embodiments may be combined or modified as appropriate or desired, however.
In a second aspect of the invention, a computer program product to search for a web page comprises program instructions to execute the steps of the above method when the computer program product is executed on a computer.
In a third aspect of the invention, a system to select user generated content web pages from a list of web page links created by a web search engine in response to a user request comprises:
- means for downloading and means for storing web pages referenced by said list;
- means for associating with each web page at least domain information; and
- means for searching in each web page and its associated information for a signature representative of a user generated content web page.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiment described hereafter, where:
- Fig. 1 is a schematic view of a terminal connected to the internet to practice an embodiment of the invention;
- Fig. 2 is a flowchart of a method according to an embodiment of the invention;
- Fig. 3 is a flowchart of a method according to another embodiment of the invention; and
- Fig. 4 is a functional view of a terminal practising an embodiment of the invention.
Detailed description
In reference to Fig. 1, a computer 1 is connected to the Internet network 3. Through the network 3, the computer 1 is connected to a server 5 on which a search engine is running. The man skilled in the art understands that the server 5 symbolizes the infrastructure of search companies such as Google Inc. or Yahoo Inc. In fact, these companies use server farms containing hundreds of computers distributed around the world.
A server 7 is also connected to the internet network 3 and contains a web page which is of interest to the user of the computer 1 but whose address is not known by the computer 1. The web page contains opinions on a product/service of interest to the user of the computer 1 and is a user generated content web page such as a blog, wiki or forum page.
The computer 1 is a classical personal computer. It comprises interface means such as a display 9, a keyboard 11 and a mouse 13 or the like.
It also comprises storage means 15 and processing means 17 such as, for instance, hard disk drives and a motherboard.
The storage means 15 contains a computer software product which, when executed by the processing means 17, makes the computer 1 execute the steps of a method to search for a web page according to an embodiment of the invention.
In reference to Fig. 2, a user of the computer 1 sends, step 20, a request to the search engine of server 5 and receives, step 22, a web page containing a list of web page links created by the web search engine in response to the request.
As is well known by the man skilled in the art, the request comprises a set of lexical units on the subject of interest. The request is entered into an internet search engine, such as the Google engine (www.google.com) or the Yahoo engine (www.yahoo.com).
Step 22 is preferably done automatically, either by building an HTTP request with the adapted syntax or by using an Application Programming Interface (API) provided by these companies to automate internet searches with their engine, as long as such use is accepted and/or authorized by these companies.
The computer 1 downloads and stores, step 24, the web pages referenced by the links contained in the received list.
Each web page is stored as usual as a set of files in the storage means 15. Each web page is associated, step 26, with domain information, particularly the domain name of the server which contains the web page.
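As an illustration of steps 20 to 26, the following minimal Python sketch builds the HTTP request carrying the lexical units and associates a downloaded page with the domain name of its server. The endpoint `https://search.example/search`, the query parameter `q` and the function names are hypothetical; a real search engine defines its own request syntax and conditions of use.

```python
from urllib.parse import urlencode, urlsplit

# Hypothetical endpoint: each real search engine has its own URL and parameters.
SEARCH_ENDPOINT = "https://search.example/search"


def build_search_url(lexical_units, endpoint=SEARCH_ENDPOINT, param="q"):
    """Build the GET request URL that carries the lexical units as the query."""
    return endpoint + "?" + urlencode({param: " ".join(lexical_units)})


def associate_domain(url, html):
    """Pair a downloaded page with the domain name of the server hosting it (step 26)."""
    host = urlsplit(url).hostname or ""  # hostname is already lower-cased
    return {"url": url, "domain": host, "html": html}
```

The record returned by `associate_domain` is what the later signature search would inspect, so the domain name is computed once at storage time.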
Then, at step 28, each web page and its associated information are searched for a signature representative of a user generated content web page.
A signature, in this context, means a character string or a set of character strings which is representative of a user generated content web page. For instance, a signature may be represented by a regular expression.
Such a signature may be a character string which is specific to server software used by web sites having user generated content. As is well known by the man skilled in the art, forums, blogs and other user content web sites use specific server software. For instance, DotClear (www.dotclear.net) is software often used to create blogs. Such software has specific ways of generating user content web pages, which can be discovered by analyzing specific character strings found in different parts of the inner page structure (html/php/asp and other standards). For instance, <body id="phpbb" class="section-viewtopic ltr"> for a PhpBB forum.
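A signature expressed as a regular expression over the page source could be checked as in the sketch below. Only the phpBB pattern is taken from the example above; the table layout and the function name `markup_signatures` are assumptions made for illustration.

```python
import re

# Illustrative signature table. Only the phpBB entry comes from the description;
# further entries would have to be derived from pages of each server software.
MARKUP_SIGNATURES = {
    "phpbb": re.compile(r'<body[^>]*\bid="phpbb"', re.IGNORECASE),
}


def markup_signatures(html, signatures=MARKUP_SIGNATURES):
    """Return the sorted names of all signatures whose pattern occurs in the page source."""
    return sorted(name for name, pattern in signatures.items() if pattern.search(html))
```

A page for which the returned list is empty exhibits none of the listed markup signatures and would have to be tested against the other signature kinds.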
A signature may also be a part of the character string forming the URL (Uniform Resource Locator) of the web page. The character string may be a part of the structure, or tree, of the link in a dynamic web site, i.e. a site in which a part of the content is contained in a database and the web page is generated on request by integrating elements of the database within HTML frameworks. For instance, the variable $itemid ending a URL beginning with index.php? is the telltale of a Joomla CMS. The character string may also be a word or an expression often used in a URL to define a user content area. For instance, the URL http://www.domainname/phorum/index.php contains "phorum", which is a marker of a forum area. "Blog", "blogs", "forum", "forums" are often used as a way to distinguish between the root of the "normal" web site and the subdirectory that is used to host blog or forum files.
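The keyword form of the URL signature might be tested as follows. This is a sketch under assumptions: the keyword list is limited to the words quoted above, a keyword only counts when it forms a whole segment of the host name or path, and the function name `url_signature` is hypothetical. The Joomla-style query variable is not covered here.

```python
import re
from urllib.parse import urlsplit

# A keyword must be delimited by "/", ".", "-", "_" or the ends of the string,
# so that "blogger" does not count as "blog".
URL_KEYWORDS = re.compile(
    r"(?:^|[/.\-_])(blogs?|forums?|phorum)(?=$|[/.\-_])", re.IGNORECASE
)


def url_signature(url):
    """Return the user-content keyword found in the host or path of the URL, or None."""
    parts = urlsplit(url)
    target = (parts.hostname or "") + parts.path
    match = URL_KEYWORDS.search(target)
    return match.group(1).lower() if match else None
```

Restricting the match to whole segments is a design choice that trades some recall for fewer false positives on ordinary words containing "blog" or "forum".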
The signature may also be the blog or forum host domain name, or the server software name usually displayed as a copyright notice on each generated web page.
The user content may also be used to find a signature, as some language is typical. For instance, LOL (for "Laughing Out Loud") is an acronym which is almost only used in forum posts.
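The domain name and content signatures, and the resulting selection of pages, could be sketched as below. The host domains `blogs.example` and `forums.example`, the record fields `domain` and `text`, and the function names are hypothetical; only the expression LOL comes from the description.

```python
import re

# Hypothetical list of domains known to host blogs or forums.
KNOWN_HOST_DOMAINS = ("blogs.example", "forums.example")

# Expressions typical of user posts; only "LOL" is named in the description.
TYPICAL_EXPRESSIONS = re.compile(r"\bLOL\b")


def domain_signature(domain):
    """True if the domain is a known host domain or one of its subdomains."""
    d = domain.lower()
    return any(d == host or d.endswith("." + host) for host in KNOWN_HOST_DOMAINS)


def content_signature(text):
    """True if the page text contains an expression typical of user posts."""
    return TYPICAL_EXPRESSIONS.search(text) is not None


def select_user_content(pages):
    """Keep only the page records exhibiting at least one of the two signatures."""
    return [p for p in pages
            if domain_signature(p["domain"]) or content_signature(p["text"])]
```

In a fuller implementation the same filter would also consult the markup and URL signatures, since the signature kinds may be used alone or in combination.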
Thus, only web pages exhibiting the signature are considered, step 30, as containing user content.
The method of analyzing each web page for a signature may also be applied to web pages reached through links contained in other web pages, Fig. 3.
It is useful to look at the links contained in the web pages stored in the computer 1, as these links may refer to web pages which are not referenced by the search engine.
Therefore, each web page stored in the computer 1 is searched for links, step 31.
At step 32, it is verified that a found link does not refer to a page already stored in the computer 1.
If the page is not already stored, it is downloaded at step 34.
Then, at step 36, the downloaded web page is verified as containing information relevant to the user request, i.e., that, if it had been indexed by the search engine, it would have been included in the response list.
If the verification is positive, the downloaded web page is stored with the other web pages, step 38.
Steps 32 to 36 are repeated for all links found in a page, step 40.
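One pass of steps 31 to 40 over the stored pages might look like the following sketch. The callables `fetch` (returns the page source or `None`) and `is_relevant` (decides whether a page answers the user request) are hypothetical and are passed in by the caller, as the description does not specify how the relevance check of step 36 is performed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag met while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return the absolute URLs of the links contained in a page (step 31)."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


def expand_once(store, fetch, is_relevant):
    """Follow the links of every stored page once; return the newly stored URLs."""
    added = []
    for url, html in list(store.items()):      # snapshot: store grows in the loop
        for link in extract_links(url, html):
            if link in store:                  # step 32: already stored
                continue
            page = fetch(link)                 # step 34: download
            if page is not None and is_relevant(page):  # step 36: relevance check
                store[link] = page             # step 38: store with the others
                added.append(link)
    return added
```

Here `store` is a plain dictionary mapping URLs to page sources; pages that fail the relevance check are not remembered, so they may be downloaded again if another stored page links to them.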
The man skilled in the art understands that this search for new web pages linked from already known web pages may also be practised on the newly found web pages in an iterative manner. Therefore, the user defines a level of iteration to stop the method, for instance after having looked at all pages within 3 links of the pages initially found by the search engine.
To increase the number of pages to analyse for finding user content, the user may send different requests to the search engines, each request having some lexical variation. The results of all these requests are a set of web page link lists. These links are consolidated: identical links found on different lists are merged.
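The consolidation of several result lists and the iteration limit could be sketched as follows. The names `consolidate`, `crawl`, `get_links` and `max_depth` are hypothetical, and `get_links` stands for whatever routine returns the links of a page.

```python
def consolidate(link_lists):
    """Merge several result lists, keeping the first occurrence of each link."""
    return list(dict.fromkeys(link for links in link_lists for link in links))


def crawl(seeds, get_links, max_depth=3):
    """Visit pages breadth-first up to max_depth links away from the seed pages.

    Returns a dict mapping each reached URL to its distance from the seeds.
    """
    depth_of = {url: 0 for url in seeds}
    frontier = list(depth_of)
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for url in frontier:
            for link in get_links(url):
                if link not in depth_of:       # skip pages already known
                    depth_of[link] = depth
                    next_frontier.append(link)
        frontier = next_frontier
    return depth_of
```

A breadth-first order is used so that the recorded distance is the smallest number of links separating a page from the initial result list, which matches the stopping rule of 3 links given as an example.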
The storage of web pages is preferably done within a relational database. Thus, each web page may be associated with a list of parameters to help the subsequent analysis.
These parameters may comprise:
- keyword(s) included in the "head" part of the page, i.e. META tags;
- language used;
- description tag;
- title tag;
- HTML content;
- URL information and, particularly, domain name.
The above method is advantageously implemented as a computer program product which comprises instructions to execute the steps of the method when the program product is executed on a computer.
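The relational storage of pages with their parameters might be sketched with Python's built-in `sqlite3` module as below. The table name `pages`, its column names and the two function names are assumptions; the description only lists the kinds of parameters, not a schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) pages table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url         TEXT PRIMARY KEY,
               domain      TEXT NOT NULL,
               language    TEXT,
               title       TEXT,
               description TEXT,
               keywords    TEXT,
               html        TEXT NOT NULL
           )"""
    )
    return conn


def store_page(conn, page):
    """Insert a page record; return False if its URL was already stored."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO pages "
            "(url, domain, language, title, description, keywords, html) "
            "VALUES (:url, :domain, :language, :title, :description, :keywords, :html)",
            page,
        )
    return cursor.rowcount == 1
```

Using the URL as primary key lets the database itself enforce that a page is stored only once.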
From a functional point of view, the computer 1 implementing the disclosed method comprises means 50 for storing web pages and associated information. Typically, storage means 50 comprise a database on a hard disk.
Computer 1 also comprises means 52 for sending a request to an internet search engine and means 54 for receiving the result from the internet search engine and downloading the web pages referenced by the list.
Computer 1 further comprises means 56 to associate information with a web page and means 58 to search in each web page and its associated information for a signature representative of a user generated content web page.
While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative and exemplary and not restrictive; the invention is not limited to the disclosed embodiment.
Other variations to the disclosed embodiment can be understood and effected by those skilled in the art in practising the claimed invention, from a study of the drawings, the disclosure and the appended claims. In the claims, the word "comprising" does not exclude other elements and the indefinite article "a" or "an" does not exclude a plurality.

Claims

1. Method to select user generated content web pages from a list of web page links created by a web search engine in response to a user request comprising a) downloading (24) and storing each web page referenced by said list, b) associating (26) to each web page at least domain information, and c) searching (28) in each web page and its associated information a signature representative of a user generated content web page.
2. Method according to claim 1, wherein, after being downloaded and for each downloaded web page, the method further comprises d) searching for contained web page link, and, if found, e) downloading the associated web page, and f) storing the associated web page if the associated web page is a positive response to said user request.
3. Method according to claim 2, wherein steps e) and f) are executed only if the associated web page is not already stored.
4. Method according to claim 2 or 3, wherein steps d) to f) are iteratively executed on each new web page found through a link contained in a stored web page.
5. Method according to any of claims 1 to 4, wherein the signature comprises, alone or in combination:
- a set of HTML tags representative of web software for managing user generated content,
- keyword(s) included into the web page link,
- determined domain name, and/or
- keyword(s) specifically included into user content web page.
6. A computer program product to select user generated content web pages comprising program instructions to execute the steps of the method according to any one of claims 1 to 5 when said computer program product is executed on a computer.
7. Device to select user generated content web pages from a list of web page links created by a web search engine in response to a user request comprising
- means for downloading (54) and means for storing (50) web pages referenced by said list;
- means for associating (56) to each web page at least domain information; and
- means for searching (58) in each web page and its associated information a signature representative of a user generated content web page.
PCT/IB2008/055370 2008-10-30 2008-10-30 Method to find user generated content web pages. Ceased WO2010049760A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2008/055370 WO2010049760A1 (en) 2008-10-30 2008-10-30 Method to find user generated content web pages.

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2008/055370 WO2010049760A1 (en) 2008-10-30 2008-10-30 Method to find user generated content web pages.

Publications (1)

Publication Number Publication Date
WO2010049760A1 (en) 2010-05-06

Family

ID=40527550

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2008/055370 Ceased WO2010049760A1 (en) 2008-10-30 2008-10-30 Method to find user generated content web pages.

Country Status (1)

Country Link
WO (1) WO2010049760A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080021903A1 (en) * 2006-07-20 2008-01-24 Microsoft Corporation Protecting non-adult privacy in content page search

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08875894

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08875894

Country of ref document: EP

Kind code of ref document: A1