
WO2000045297A1 - Systems, methods and computer program products for mining data from host computers via the internet - Google Patents


Info

Publication number
WO2000045297A1
Authority
WO
WIPO (PCT)
Prior art keywords
url
web page
root portion
displayed
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US1999/026632
Other languages
French (fr)
Inventor
Henry Nouri
Jose Collazo
Ashutosh Narhari
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SMART ONLINE Inc
Original Assignee
SMART ONLINE Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SMART ONLINE Inc
Priority to AU14757/00A (publication AU1475700A)
Publication of WO2000045297A1
Anticipated expiration
Ceased

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/957Browsing optimisation, e.g. caching or content distillation

Definitions

  • the present invention relates generally to computer networks and, more particularly, to the Internet.
  • the Internet is a worldwide decentralized network of computers having the ability to communicate with each other.
  • the Internet has gained broad recognition as a viable medium for communicating and interacting across multiple networks.
  • the World Wide Web (Web) was created in the early 1990's, and is comprised of server-hosting computers (web servers) connected to the Internet having hypertext documents or web pages stored therewithin. Web pages are accessible by client programs (i.e., web browsers) utilizing the Hypertext Transfer Protocol (HTTP) via a Transmission Control Protocol/Internet Protocol (TCP/IP) connection between a client-hosting device and a server-hosting device.
  • HTTP Hypertext Transfer Protocol
  • TCP/IP Transmission Control Protocol/Internet Protocol
  • Exemplary web browsers include Netscape Navigator ® (Netscape Communications Corporation, Mountain View, CA) and Internet Explorer ® (Microsoft Corporation, Redmond, WA).
  • Web browsers typically provide a graphical user interface for retrieving and viewing Web pages hosted by web servers.
  • a Web page, using a standard page description language known as HyperText Markup Language (HTML), typically displays text and graphics, and can play sound, animation, and video clips.
  • HTML provides basic document formatting and allows a Web page developer to specify hypertext links (typically manifested as highlighted text) to other web servers and files.
  • a web browser reads and interprets the address, called a URL (Uniform Resource Locator) associated with the link, connects the client-hosting device with the web server at that address (also referred to as a "web site"), and makes an HTTP request for the web page identified in the link.
  • the web server then sends the requested web page to the client-hosting device in HTML format which the web browser interprets and displays to the user.
  • a URL gives the type of resource being accessed (e.g., HTTP, GOPHER, WAIS, etc.) and optionally the path of the file sought.
  • resource://host.domain/path/filename, wherein the resource can include "file", "HTTP", "GOPHER", "WAIS", "NEWS", "TELNET", and so forth.
  • a challenge presented by the Internet is how to enable greater simplicity and efficiency in accessing and comprehending information available through the Internet. Because of the immense amount of information available via the Internet and because of the generally unstructured nature of the Internet, searching for information on the Internet can be a daunting exercise. Another challenge presented by the Internet is the difficulty in obtaining and manipulating data from the Internet.
  • Conventional methods of obtaining Internet data include saving a web page currently displayed within a user's web browser.
  • Conventional methods of obtaining multimedia files, such as images, include positioning a mouse cursor over a portion of a displayed web page and "clicking" the right mouse button to save a desired file. Unfortunately, these conventional methods can be inefficient and labor intensive because only the data from a web page currently being displayed can be obtained.
  • Many web sites host multiple files containing related content. For example, a web site may host various web pages containing information about each country in the world.
  • a user may not be able to retrieve all available information contained within the web site unless he or she displays each and every page within his or her web browser.
  • a user may not become aware of all web pages available at a particular web site unless an exhaustive search of the web site is performed.
  • the complexity of many web sites can make an exhaustive search difficult and time consuming to perform.
  • HTML HyperText Markup Language
  • Unfortunately, data displayed in HTML table format may be difficult to convert to other useful formats, such as database and spreadsheet formats.
  • In order to incorporate data displayed within a web page, a user may need to re-type the displayed data within the desired format, or utilize various known cut-and-paste techniques.
  • each file in the set is identified by a unique URL, and wherein each unique URL comprises a root portion common to all files in the set and a variable portion unique to a respective file in the set.
  • a user initially selects a file from a host computer via a client computer in communication with the host computer. For example, a user selects a web page displayed within a web browser and/or a portion thereof.
  • the root portion and variable portion of a URL for the selected file are determined by comparing the URL with a URL of another file at the same host computer. This comparison can be made by users or automatically by various comparison agents. Files within the host computer having a respective URL root portion matching the URL root portion of the selected file are then identified. Each file having a respective URL root portion matching the root portion of the selected file URL and a URL variable portion different from the variable portion of the selected file URL is then automatically downloaded to the user's computer.
  • URLs containing multiple root portions (i.e., non-variable portions) and/or multiple variable portions also can be compared to identify various sets of related files, according to the present invention.
  • the present invention can utilize complex URL agents to compare complex URLs to identify sets of related files within a host computer.
  • data contained within downloaded files can be extracted and arranged within various user-defined formats that are different from the displayed format of the data.
  • a table of data displayed within a web page in HTML format can be easily and quickly converted to a user defined database, spreadsheet, or word processing document format.
  • the present invention is advantageous because users can automatically collect or "mine” information from the Internet without having to access each file containing the desired information.
  • the present invention is adaptable to the mining of all types of data including, but not limited to, text, images, sound, and video.
  • the value to a user of downloaded information can be enhanced by the present invention because a user can easily re-format the data into various user-defined formats.
  • FIG. 1 illustrates a client-server computing environment in which the present invention may be implemented.
  • Fig. 2 illustrates an exemplary HTML document with markup language tags displayed.
  • Fig. 3 schematically illustrates a system for mining data from a set of files within a host computer, according to an embodiment of the present invention.
  • Fig. 4 illustrates operations for mining data from a set of files within a host computer, according to the present invention.
  • Fig. 5A illustrates a web page displayed within a web browser, wherein the web page contains an image file within a portion thereof.
  • Fig. 5B illustrates the image file of Fig. 5A separately displayed within a web browser.
  • Fig. 6 illustrates a user interface for downloading files from a host computer according to an embodiment of the present invention.
  • Fig. 7 illustrates a text file containing the variable portions of a set of URLs having a common root portion.
  • Fig. 8 illustrates the user interface of Fig. 6 during the process of downloading each file within a set to a directory entitled "maps" on a user's client computer.
  • Fig. 9 illustrates a set of files stored within a user's computer that were downloaded via the present invention.
  • Fig. 10 illustrates an exemplary table of data in HTML format as displayed within a web page.
  • Fig. 11A illustrates data from the displayed table of Fig. 10 extracted and arranged in a user-defined database format.
  • Fig. 11B illustrates the user-defined format of the database of Fig. 11A.
  • the present invention may be embodied as a method, data processing system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product on a computer-usable storage medium having computer-usable program code means embodied in the medium. Any suitable computer readable medium may be utilized including, but not limited to, hard disks, CD-ROMs, optical storage devices, or magnetic storage devices.
  • client-server is a model for a relationship between two computer programs in which one program, the client, makes a service request from another program, the server, which fulfills the request.
  • Although the client-server model can be used by programs within a single computer, it is more commonly used in a network where computing functions and data can be distributed more efficiently among many client and server programs at different network locations.
  • client programs typically share the services of a common server program. Both client programs and server programs are often part of a larger program or application.
  • a web browser is a client program that requests services (the sending of web pages or files) from a web server in another computer somewhere on the Internet.
  • client-server environments within which the present invention may operate include public networks, such as the Internet, and private networks often referred to as “Intranets” and “Extranets.”
  • the term “Internet” shall incorporate the terms “Intranet” and “Extranet” and any references to accessing the Internet shall be understood to mean accessing an Intranet and/or an Extranet, as well.
  • the term “computer network” shall incorporate publicly accessible computer networks, such as the Internet, as well as private computer networks.
  • Fig. 1 illustrates a client-server computing environment in which the present invention may be implemented.
  • a remote user's computer 10 has a client application (i.e., a web browser) resident thereon and a host computer 20 has a server application (i.e., a web server) resident thereon.
  • the user's computer 10 preferably includes a central processing unit 11, a display 12, a pointing device 13, a keyboard 14, access to persistent data storage, and a communications link 16 for communicating with the host computer 20.
  • the keyboard 14, having a plurality of keys thereon, is in communication with the central processing unit 11.
  • a pointing device 13, such as a mouse, is also connected to the central processing unit 11.
  • the communications link 16 may be established via a modem 15 connected to traditional phone lines, an ISDN link, a T1 link, a T3 link, via cable television, via an ethernet network, and the like.
  • Modem 15 may also be a wireless modem configured to communicate with the modem 25 of the host computer 20 via wireless communications systems.
  • the communications link 16 also may be made by a direct connection of the user's computer 10 to the host computer 20 or indirectly via a computer network 17, such as the Internet, in communication with the host computer 20.
  • the central processing unit 11 contains one or more microprocessors (not shown) or other computational devices and random access memory (not shown) or its functional equivalent, including but not limited to, RAM, FLASHRAM, and VRAM for storing programs therein for processing by the microprocessor(s) or other computational devices.
  • a portion of the random access memory and/or persistent data storage, referred to as "cache," may be utilized during communications between a user's computer 10 and a host computer 20 to store various data transferred from the host computer.
  • a user's computer 10 has an Intel ® Pentium ® processor (or equivalent) with at least thirty-two megabytes (32 MB) of RAM, and at least five megabytes (5 MB) of persistent computer storage for caching.
  • processors may be utilized to practice the present invention without being limited to those enumerated herein.
  • Although a color display is preferable, a black and white display or standard broadcast or cable television monitor may be used.
  • Exemplary user computers having a web browser resident thereon may include, but are not limited to, an Apple ® , Sun Microsystems ® , IBM ® , or IBM ® -compatible personal computer.
  • a user's computer 10, if an IBM ® or IBM ® -compatible personal computer, preferably utilizes either a Windows ® 3.1, Windows 95 ® , Windows 98 ® , Windows NT ® , UNIX ® , or OS/2 ® operating system.
  • other operating systems may also be utilized without limitation.
  • a terminal not having computational capability, or having limited computational capability may be utilized in accordance with the present invention for accessing a host computer 20 in a client capacity.
  • a web browser resident on a user's computer 10 is a Java ® -enabled browser, such as Netscape Navigator ® , Version 3.0 and higher.
  • a Java ® - enabled browser includes a Java ® virtual machine (JVM) that interprets Java ® bytecode into code that will run on the user's computer.
  • JVM Java ® virtual machine
  • a host computer 20, functioning as a web server, may have a configuration similar to that of a user's computer 10 and may include a central processing unit 21, a display 22, a pointing device 23, a keyboard 24, access to persistent data storage, and a communications link 26 for connecting to the user's computer 10 via a modem 25, or otherwise.
  • Various types of host computers including, but not limited to, mainframe computing systems, mini-computers, and PCs, may be accessed and data downloaded therefrom via the present invention.
  • An HTML document can be comprised of text, images and a variety of objects, each of which is surrounded by various markup language tags that control format attributes and identify different portions of the document (i.e., <tag_name>text</tag_name>).
  • HTML documents are typically written and stored in ASCII text format using a text editor.
  • An exemplary HTML document 30 is illustrated in Fig. 2.
  • the illustrated HTML document 30 includes a header section 32 demarcated by <HEAD> tags 32a, 32b.
  • a body section 34, demarcated by <BODY> tags 34a, 34b, includes various "content" portions 36a, 36b, 36c, 36d. It is these content portions 36a, 36b, 36c, 36d that contain information displayed to a user viewing the HTML document 30 with a web browser.
  • the present invention is preferably implemented as a plurality of modules and agents within a client computer. These modules and agents are preferably written in the Java ® programming language so as to be compatible with most client computing platforms. However, portions of the computer program code for carrying out operations of the present invention could also be written in other object oriented programming languages such as Smalltalk or C++, as well as conventional procedural programming languages, such as the "C" programming language.
  • Java ® is an object-oriented programming language developed by Sun Microsystems, Mountain View, California. Java ® is a portable and architecturally neutral language. Java ® source code is compiled into a machine-independent format that can be run on any machine with a Java ® runtime system known as the Java ® Virtual Machine (JVM) .
  • the JVM is defined as an imaginary machine that is implemented by emulating a processor through the use of software on a real machine. Accordingly, machines running under diverse operating systems, including UNIX ® , Windows NT ® , and Macintosh ® having a JVM can execute the same Java ® program.
  • Java ® source code is compiled into bytecode using a Java ® compiler referred to as javac. Compiled Java ® programs are saved in files with the extension ".class".
  • a client computer 42 is in communication with a host computer 50, such as a web server, via a computer network 60, such as the Internet.
  • a Pattern Entry Module 43 running within the client computer 42, serves the function of collecting information from a user in order to initiate data collection from the host computer 50.
  • a URL List Generator Module 44 creates a list of files to be downloaded from the host computer 50 to the client computer 42.
  • a URL Batch Downloader Module 45 downloads the files identified by the URL List Generator Module 44 and stores the files in their original format within the client computer 42.
  • a Data Parser Module 46 performs various data parsing techniques on the downloaded files according to information provided by a user.
  • a Data Export Module 47 stores parsed data in various formats as defined by a user.
  • a set of URL comparison agents 48 are provided for automatically comparing simple and complex URLs for files within a host computer.
  • the illustrated host computer 50 contains a plurality of files 52.
  • a set 54 of files can be identified from the plurality of files 52 wherein each file in the identified set 54 has at least one common non-variable (root) portion and a variable portion.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks.
  • the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks.
  • Each file in a set is identified by a unique universal resource locator (URL).
  • Each unique URL contains at least one root or non-variable portion common to all files in the set and at least one variable portion unique to a respective file in the set.
  • a file is selected from a host computer via a client computer in communication with the host computer (Block 100).
  • the root portions and variable portions of the URL of the file are determined (Block 200).
  • the identification of URL root and variable portions can be performed directly by a user via the Pattern Entry Module (43, Fig. 3) or automatically via one or more URL comparison agents (48, Fig. 3).
  • Files within the host computer (or accessible thereby) having a respective URL root portion matching the URL root portion of the selected file are then identified (Block 300).
  • Files having a respective URL root portion matching the root portion of the selected file URL and a URL variable portion different from the variable portion of the selected file URL are then downloaded to the client computer (Block 400).
  • Operations, according to the present invention, also include extracting data contained within each downloaded file (Block 500) and arranging the extracted data in a format different from the format of a downloaded file (Block 600).
  • a user has accessed a web page (http://www.cia.gov/cia/publications/factbook/country-frame.html) 70 from a web server (i.e., a host computer) via a web browser 72 (Block 100, Fig. 4).
  • the illustrated web page 70 includes a left frame 70a that displays a web page (http://www.cia.gov/cia/publications/factbook/country.html) 73 and a right frame 70b that displays a web page (http://www.cia.gov/cia/publications/factbook/ag.html) 74.
  • the web page 73 contained within the left frame 70a provides a list of hypertext links to web pages pertaining to various countries.
  • the web page 74 contained within the right frame 70b is a web page for a respective country (Algeria) whose hypertext link has been activated in the web page 73 displayed within the left frame 70a.
  • the illustrated web page 74, displayed within the right frame 70b, includes an image (ag-150.gif) 75 of the selected country, Algeria.
  • the image 75 has the following URL: http://www.cia.gov/cia/publications/factbook/figures/ag-150.gif.
  • a user who desires a copy of the image 75 for the country Algeria can position a mouse cursor over the image 75 and click on the right mouse button to save the image.
  • a user can select a particular file and then automatically download all files similar to the selected file.
  • the present invention searches a host computer for all files that are within a set of which a selected file is a member. Accordingly, a user can select the displayed image 75 of Algeria and automatically download a respective image for each country listed in the web page 73 without having to access or display a web page for each country.
  • a user defines a set of files to download by indicating, within the user interface 78 of Fig. 6, the root and variable portions of a URL (Block 200, Fig. 4).
  • a user could view the URL for various images for the respective countries listed in the web page 73 (Fig. 5A) to determine the root and variable portions of each URL.
  • the URL root portion of each country map image is http://www.cia.gov/cia/publications/factbook/figures/.
  • the variable portion of each URL is an abbreviation of each country.
  • a user can select a web page or a portion of a displayed web page and the URL for the selected web page, or portion thereof, can be automatically analyzed via various agents (48, Fig. 3) to determine one or more root portions that are shared by other files within the host computer (Block 200, Fig. 4).
  • a user identifies a first URL of interest, such as http://www.reports.com/Feb.html, and wants to locate all files in the host computer having the same root portion.
  • An agent can be invoked to locate another URL within the host computer and to begin comparing each URL.
  • the agent locates a URL http://www.reports.com/Jan.html.
  • the two URLs are then parsed and compared by the agent to determine the non-variable (i.e., root) portions and the variable portions.
  • the agent determines that the root portions of each URL are "http://www.reports.com/" and ".html" and that the variable portion in each URL is between the two root portions (i.e., "Jan" and "Feb").
  • Agents according to the present invention can also parse complex URLs containing multiple root and variable portions. For example, a user identifies a first URL of interest, such as http://www.FebReport.com/Feb.html, and wants to locate all files in the host computer having the same root portion. An agent can be invoked to locate another URL within the host computer and to begin comparing the two URLs. The agent locates a URL http://www.JanReport.com/Jan.html. The two URLs are then parsed and compared by the agent to determine the non-variable (i.e., root) portions and the variable portions. The agent determines that the root portions of each URL are "http://www.", "Report.com", and ".html". The agent also determines that the variable portions in each URL are between the first and second root portions and between the second and third root portions (i.e., "Jan" and "Feb").
  • a set of all files within the host computer having a matching root portion (or portions) is identified (Block 300, Fig. 4).
  • the variable portions of each file in the set are identified and stored within a file 79 as illustrated in Fig. 7.
  • the root portion of the URL for the selected file has been entered within the field 78a, entitled “First Part.”
  • the file containing the respective URL variable portions for each file, entitled “delete.txt”, has been entered into the field 78b, entitled “Middle.”
  • the suffix for the URLs within the identified set has been entered within the field 78c, entitled “Last Part.”
  • a user may indicate, via field 78d, the name of a local directory within which to store the files as they are downloaded from the host computer.
  • when the button 78e, entitled “Get Files,” is activated, all files within the set that satisfy the criteria set forth in fields 78a, 78b and 78c are downloaded to the user's client computer (Block 400, Fig. 4).
  • Fig. 8 illustrates the user interface of Fig. 6 during the process of downloading each file within the set to a directory entitled "maps" on the user's client computer.
  • Fig. 9 illustrates the downloaded files stored within a specified directory on the user's client computer.
  • data can be extracted from downloaded HTML files (Block 500, Fig. 4) using various known parsing techniques and arranged in different formats (Block 600, Fig. 4) to suit the needs of a user.
  • FIG. 10 illustrates an exemplary table of data 80 in HTML format as displayed within a web page.
  • a web page containing the displayed table 80 can be downloaded as described above, and the data from the table 80 extracted therefrom and arranged in a different format.
  • data for the first eight entries in the web page table 80 of Fig. 10 has been extracted and arranged in a database 82 (Fig. 11A) having a format 84 as indicated in Fig. 11B.
  • the table columns of Fig. 10 have been converted to fields within the database 82.
  • the foregoing is illustrative of the present invention and is not to be construed as limiting thereof.
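
The URL comparison described above (two URLs from the same host parsed into shared root portions and differing variable portions) can be sketched as follows. This is only an illustrative sketch, not the comparison agent of the specification: the function name split_roots_and_variables is hypothetical, and the character-level matching from Python's difflib is a swapped-in technique that may report short accidental matches when two variable portions happen to share characters.

```python
from difflib import SequenceMatcher


def split_roots_and_variables(url_a, url_b):
    """Split two URLs into shared (root) and differing (variable) portions.

    Hypothetical helper: matching is done character by character, so it is
    only a rough stand-in for the comparison agents described in the text.
    """
    matcher = SequenceMatcher(None, url_a, url_b, autojunk=False)
    roots, variables_a, variables_b = [], [], []
    pos_a = pos_b = 0
    for block in matcher.get_matching_blocks():
        # Text between two matching blocks differs: a variable portion.
        gap_a = url_a[pos_a:block.a]
        gap_b = url_b[pos_b:block.b]
        if gap_a or gap_b:
            variables_a.append(gap_a)
            variables_b.append(gap_b)
        # The matching block itself is a root (non-variable) portion.
        if block.size:
            roots.append(url_a[block.a:block.a + block.size])
        pos_a = block.a + block.size
        pos_b = block.b + block.size
    return roots, variables_a, variables_b


if __name__ == "__main__":
    print(split_roots_and_variables(
        "http://www.reports.com/Feb.html",
        "http://www.reports.com/Jan.html",
    ))
```

For the complex example, this sketch groups the trailing slash with the middle root portion ("Report.com/"), which differs slightly from the way the root portions are listed in the text.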

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Methods, systems and computer program products are provided for mining data from a set of files within a web server, wherein each file in the set is identified by a unique universal resource locator (URL), and wherein each unique URL comprises a root portion common to all files in the set and a variable portion unique to a respective file in the set. A user initially selects a file from a web server. Each file having a respective URL root portion matching the root portion of the selected file URL and a URL variable portion different from the variable portion of the selected file URL is then automatically downloaded to the user's computer. Data contained within downloaded files can be extracted and arranged within various user-defined formats.
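
As a rough illustration of the last step of the abstract (extracting data from a downloaded file and arranging it in a different format), the sketch below converts an HTML table into comma-separated text using only the Python standard library. The names TableExtractor and table_to_csv are hypothetical, the sample table is made up for illustration, and the choice of CSV as the target format is an assumption; the document itself refers more generally to database, spreadsheet and word processing formats.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the cell text of each <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html_text):
    """Return the rows of the HTML table(s) in html_text as CSV text."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


if __name__ == "__main__":
    sample = ("<table><tr><th>Country</th><th>Capital</th></tr>"
              "<tr><td>Algeria</td><td>Algiers</td></tr></table>")
    print(table_to_csv(sample))
```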

Description

SYSTEMS, METHODS AND COMPUTER PROGRAM PRODUCTS FOR MINING DATA FROM HOST COMPUTERS VIA THE INTERNET
Field of the Invention
The present invention relates generally to computer networks and, more particularly, to the Internet .
Background of the Invention
The Internet is a worldwide decentralized network of computers having the ability to communicate with each other. The Internet has gained broad recognition as a viable medium for communicating and interacting across multiple networks. The World Wide Web (Web) was created in the early 1990's, and is comprised of server-hosting computers (web servers) connected to the Internet having hypertext documents or web pages stored therewithin. Web pages are accessible by client programs (i.e., web browsers) utilizing the Hypertext Transfer Protocol (HTTP) via a Transmission Control Protocol/Internet Protocol (TCP/IP) connection between a client-hosting device and a server-hosting device. Exemplary web browsers include Netscape Navigator® (Netscape Communications Corporation, Mountain View, CA) and Internet Explorer® (Microsoft Corporation, Redmond, WA). Web browsers typically provide a graphical user interface for retrieving and viewing Web pages hosted by web servers.
A Web page, using a standard page description language known as HyperText Markup Language (HTML), typically displays text and graphics, and can play sound, animation, and video clips. HTML provides basic document formatting and allows a Web page developer to specify hypertext links (typically manifested as highlighted text) to other web servers and files. When a user selects a particular hypertext link, a web browser reads and interprets the address, called a URL (Uniform Resource Locator), associated with the link, connects the client-hosting device with the web server at that address (also referred to as a "web site"), and makes an HTTP request for the web page identified in the link. The web server then sends the requested web page to the client-hosting device in HTML format which the web browser interprets and displays to the user. A URL gives the type of resource being accessed (e.g., HTTP, GOPHER, WAIS, etc.) and optionally the path of the file sought. For example: resource://host.domain/path/filename, wherein the resource can include "file", "HTTP", "GOPHER", "WAIS", "NEWS", "TELNET", and so forth. Through the Web, users can access various Internet services including, but not limited to, HTTP, GOPHER, TELNET, and FTP.
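The resource://host.domain/path/filename layout described above can be illustrated with a short sketch that splits a URL into those four parts using Python's standard urllib.parse module. The function name describe_url and the dictionary keys are hypothetical labels chosen to mirror the wording of the text; the sample address is the factbook page cited elsewhere in this document.

```python
import posixpath
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into resource type, host, path and filename.

    Hypothetical helper; the keys mirror the layout
    resource://host.domain/path/filename given in the text.
    """
    parts = urlsplit(url)
    # URL paths always use forward slashes, hence posixpath.
    directory, filename = posixpath.split(parts.path)
    return {
        "resource": parts.scheme,
        "host": parts.hostname,
        "path": directory,
        "filename": filename,
    }


if __name__ == "__main__":
    print(describe_url(
        "http://www.cia.gov/cia/publications/factbook/country-frame.html"
    ))
```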
A challenge presented by the Internet is how to enable greater simplicity and efficiency in accessing and comprehending information available through the Internet. Because of the immense amount of information available via the Internet and because of the generally unstructured nature of the Internet, searching for information on the Internet can be a daunting exercise. Another challenge presented by the Internet is the difficulty in obtaining and manipulating data from the Internet. Conventional methods of obtaining Internet data include saving a web page currently displayed within a user's web browser. Conventional methods of obtaining multimedia files, such as images, include positioning a mouse cursor over a portion of a displayed web page and "clicking" the right mouse button to save a desired file. Unfortunately, these conventional methods can be inefficient and labor intensive because only the data from a web page currently being displayed can be obtained.
Many web sites host multiple files containing related content. For example, a web site may host various web pages containing information about each country in the world. However, using conventional data gathering techniques, a user may not be able to retrieve all available information contained within the web site unless he or she displays each and every page within his or her web browser. Furthermore, a user may not become aware of all web pages available at a particular web site unless an exhaustive search of the web site is performed. Unfortunately, the complexity of many web sites can make an exhaustive search difficult and time consuming to perform.
Often, useful data is displayed in a particular format within a web page, such as HTML "table" format. Unfortunately, data displayed in HTML table format may be difficult to convert to other useful formats, such as database and spreadsheet formats. In order to incorporate data displayed within a web page, a user may need to re-type the displayed data within the desired format, or utilize various known cut-and-paste techniques. Unfortunately, re-typing and cut-and-paste techniques can be inefficient and time consuming.
Summary of the Invention
In view of the above discussion, it is an object of the present invention to facilitate the collection of data from Internet web sites without requiring users to manually download files or portions thereof.
It is another object of the present invention to allow Internet users to quickly and easily extract data from web pages and arrange the extracted data into various, user-defined formats.
It is another object of the present invention to allow users to select a web page and then automatically download all web pages, or portions thereof, similar to the selected web page without having to view each web page within a browser.
These and other objects of the present invention are provided by methods, systems and computer program products for mining data from a set of files within a host computer, such as a web server, wherein each file in the set is identified by a unique URL, and wherein each unique URL comprises a root portion common to all files in the set and a variable portion unique to a respective file in the set. A user initially selects a file from a host computer via a client computer in communication with the host computer. For example, a user selects a web page displayed within a web browser and/or a portion thereof.
Next, the root portion and variable portion of a URL for the selected file are determined by comparing the URL with a URL of another file at the same host computer. This comparison can be made by users or automatically by various comparison agents. Files within the host computer having a respective URL root portion matching the URL root portion of the selected file are then identified. Each file having a respective URL root portion matching the root portion of the selected file URL and a URL variable portion different from the variable portion of the selected file URL is then automatically downloaded to the user's computer.
URLs containing multiple root portions (i.e., non-variable portions) and/or multiple variable portions also can be compared to identify various sets of related files, according to the present invention.
The present invention can utilize complex URL agents to compare complex URLs to identify sets of related files within a host computer.
According to another aspect of the present invention, data contained within downloaded files can be extracted and arranged within various user-defined formats that are different from the displayed format of the data. For example, a table of data displayed within a web page in HTML format can be easily and quickly converted to a user defined database, spreadsheet, or word processing document format.
The present invention is advantageous because users can automatically collect or "mine" information from the Internet without having to access each file containing the desired information. The present invention is adaptable to the mining of all types of data including, but not limited to, text, images, sound, and video. Furthermore, the value to a user of downloaded information can be enhanced by the present invention because a user can easily re-format the data into various user-defined formats.
Brief Description of the Drawings
Fig. 1 illustrates a client-server computing environment in which the present invention may be implemented.
Fig. 2 illustrates an exemplary HTML document with markup language tags displayed.
Fig. 3 schematically illustrates a system for mining data from a set of files within a host computer, according to an embodiment of the present invention.
Fig. 4 illustrates operations for mining data from a set of files within a host computer, according to the present invention.
Fig. 5A illustrates a web page displayed within a web browser, wherein the web page contains an image file within a portion thereof.
Fig. 5B illustrates the image file of Fig. 5A separately displayed within a web browser.
Fig. 6 illustrates a user interface for downloading files from a host computer according to an embodiment of the present invention.
Fig. 7 illustrates a text file containing the variable portions of a set of URLs having a common root portion.
Fig. 8 illustrates the user interface of Fig. 6 during the process of downloading each file within a set to a directory entitled "maps" on a user's client computer.
Fig. 9 illustrates a set of files stored within a user's computer that were downloaded via the present invention.
Fig. 10 illustrates an exemplary table of data in HTML format as displayed within a web page.
Fig. 11A illustrates data from the displayed table of Fig. 10 extracted and arranged in a user-defined database format.
Fig. 11B illustrates the user-defined format of the database of Fig. 11A.
Detailed Description of the Invention
The present invention now is described more fully hereinafter with reference to the accompanying drawings, in which preferred embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Like numbers refer to like elements throughout.
As will be appreciated by one of skill in the art, the present invention may be embodied as a method, data processing system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product on a computer-usable storage medium having computer-usable program code means embodied in the medium. Any suitable computer readable medium may be utilized including, but not limited to, hard disks, CD-ROMs, optical storage devices, or magnetic storage devices.
Client-Server Communications
The present invention is preferably practiced within a client-server computer network environment. As is known by those skilled in this art, client-server is a model for a relationship between two computer programs in which one program, the client, makes a service request to another program, the server, which fulfills the request. Although the client-server model can be used by programs within a single computer, it is more commonly used in a network where computing functions and data can be distributed more efficiently among many client and server programs at different network locations.
Many business applications being written today use the client-server model, as does the Internet's core protocol suite, Transmission Control Protocol/Internet Protocol (TCP/IP). Typically, multiple client programs share the services of a common server program. Both client programs and server programs are often part of a larger program or application. Relative to the Internet, a web browser is a client program that requests services (the sending of web pages or files) from a web server in another computer somewhere on the Internet.
As is known to those with skill in this art, client-server environments within which the present invention may operate include public networks, such as the Internet, and private networks often referred to as "Intranets" and "Extranets." The term "Internet" shall incorporate the terms "Intranet" and "Extranet" and any references to accessing the Internet shall be understood to mean accessing an Intranet and/or an Extranet, as well. The term "computer network" shall incorporate publicly accessible computer networks, such as the Internet, as well as private computer networks.
Fig. 1 illustrates a client-server computing environment in which the present invention may be implemented. In the illustrated client-server computing environment, a remote user's computer 10 has a client application (i.e., a web browser) resident thereon and a host computer 20 has a server application (i.e., a web server) resident thereon. The user's computer 10 preferably includes a central processing unit 11, a display 12, a pointing device 13, a keyboard 14, access to persistent data storage, and a communications link 16 for communicating with the host computer 20. The keyboard 14, having a plurality of keys thereon, is in communication with the central processing unit 11. A pointing device 13, such as a mouse, is also connected to the central processing unit 11. The communications link 16 may be established via a modem 15 connected to traditional phone lines, an ISDN link, a T1 link, a T3 link, via cable television, via an Ethernet network, and the like. Modem 15 may also be a wireless modem configured to communicate with the modem 25 of the host computer 20 via wireless communications systems. The communications link 16 also may be made by a direct connection of the user's computer 10 to the host computer 20 or indirectly via a computer network 17, such as the Internet, in communication with the host computer 20.
The central processing unit 11 contains one or more microprocessors (not shown) or other computational devices and random access memory (not shown) or its functional equivalent, including but not limited to, RAM, FLASHRAM, and VRAM for storing programs therein for processing by the microprocessor(s) or other computational devices. A portion of the random access memory and/or persistent data storage, referred to as "cache," may be utilized during communications between a user's computer 10 and a host computer 20 to store various data transferred from the host computer.
Preferably, a user's computer 10 has an Intel® Pentium® processor (or equivalent) with at least thirty-two megabytes (32 MB) of RAM, and at least five megabytes (5 MB) of persistent computer storage for caching. However, it is to be understood that various processors may be utilized to practice the present invention without being limited to those enumerated herein. Although a color display is preferable, a black and white display or standard broadcast or cable television monitor may be used. Exemplary user computers having a web browser resident thereon may include, but are not limited to, an Apple®, Sun Microsystems®, IBM®, or IBM®-compatible personal computer. A user's computer 10, if an IBM® or IBM®-compatible personal computer, preferably utilizes either a Windows® 3.1, Windows 95®, Windows 98®, Windows NT®, UNIX®, or OS/2® operating system. However, other operating systems may also be utilized without limitation. In addition, it is to be understood that a terminal not having computational capability, or having limited computational capability, may be utilized in accordance with the present invention for accessing a host computer 20 in a client capacity.
Preferably, a web browser resident on a user's computer 10 is a Java®-enabled browser, such as Netscape Navigator®, Version 3.0 and higher. As is understood by those skilled in this art, a Java®-enabled browser includes a Java® virtual machine (JVM) that interprets Java® bytecode into code that will run on the user's computer.
A host computer 20, functioning as a web server, may have a configuration similar to that of a user's computer 10 and may include a central processing unit 21, a display 22, a pointing device 23, a keyboard 24, access to persistent data storage, and a communications link 26 for connecting to the user's computer 10 via a modem 25, or otherwise. Various types of host computers including, but not limited to, mainframe computing systems, mini-computers, and PCs, may be accessed and data downloaded therefrom via the present invention.
HTML Documents
An HTML document can be comprised of text, images and a variety of objects, each of which is surrounded by various markup language tags that control format attributes and identify different portions of the document (i.e., <tag_name>text</tag_name>). HTML documents are typically written and stored in ASCII text format using a text editor. An exemplary HTML document 30 is illustrated in Fig. 2. The illustrated HTML document 30 includes a header section 32 demarcated by <HEAD> tags 32a, 32b. A body section 34, demarcated by <BODY> tags 34a, 34b, includes various "content" portions 36a, 36b, 36c, 36d. It is these content portions 36a, 36b, 36c, 36d that contain information displayed to a user viewing the HTML document 30 with a web browser.
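The distinction between the header section and the displayed content portions can be illustrated with a minimal sketch. It assumes Python's standard html.parser module rather than anything named in the disclosure; the class BodyTextCollector, the function body_text, and the sample document are hypothetical.

```python
from html.parser import HTMLParser


class BodyTextCollector(HTMLParser):
    """Collect text found between the <BODY> tags, ignoring the header section."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_body and text:
            self.chunks.append(text)


def body_text(html):
    """Return the content portions of an HTML document as a list of strings."""
    collector = BodyTextCollector()
    collector.feed(html)
    collector.close()
    return collector.chunks


# Hypothetical document with a header section and two content portions.
SAMPLE = (
    "<HTML><HEAD><TITLE>Example</TITLE></HEAD>"
    "<BODY><H1>Heading</H1><P>First paragraph.</P></BODY></HTML>"
)
print(body_text(SAMPLE))
```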
Hardware and Software Environment
The present invention is preferably implemented as a plurality of modules and agents within a client computer. These modules and agents are preferably written in the Java® programming language so as to be compatible with most client computing platforms. However, portions of the computer program code for carrying out operations of the present invention could also be written in other object oriented programming languages such as Smalltalk or C++, as well as conventional procedural programming languages, such as the "C" programming language.
Java® is an object-oriented programming language developed by Sun Microsystems, Mountain View, California. Java® is a portable and architecturally neutral language. Java® source code is compiled into a machine-independent format that can be run on any machine with a Java® runtime system known as the Java® Virtual Machine (JVM). The JVM is defined as an imaginary machine that is implemented by emulating a processor through the use of software on a real machine. Accordingly, machines running under diverse operating systems, including UNIX®, Windows NT®, and Macintosh®, having a JVM can execute the same Java® program. As is known to those skilled in this art, Java® source code is compiled into bytecode using the Java® compiler, referred to as javac. Compiled Java® programs are saved in files with the extension ".class".
Referring now to Fig. 3, an exemplary system 40 for practicing the present invention is schematically illustrated. In the illustrated configuration, a client computer 42 is in communication with a host computer 50, such as a web server, via a computer network 60, such as the Internet. A Pattern Entry Module 43, running within the client computer 42, serves the function of collecting information from a user in order to initiate data collection from the host computer 50. A URL List Generator Module 44 creates a list of files to be downloaded from the host computer 50 to the client computer 42. A URL Batch Downloader Module 45 downloads the files identified by the URL List Generator Module 44 and stores the files in their original format within the client computer 42. A Data Parser Module 46 performs various data parsing techniques on the downloaded files according to information provided by a user. A Data Export Module 47 stores parsed data in various formats as defined by a user. In addition, a set of URL comparison agents 48 is provided for automatically comparing simple and complex URLs for files within a host computer.
The illustrated host computer 50 contains a plurality of files 52. According to the present invention, a set 54 of files can be identified from the plurality of files 52 wherein each file in the identified set 54 has at least one common non-variable (root) portion and a variable portion.
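The role of the URL Batch Downloader Module 45 can be illustrated with a minimal sketch, assuming Python's standard urllib rather than the Java® modules the disclosure prefers. The names fetch_bytes and download_batch are hypothetical, and the fetch parameter is an assumption added so that the loop can be exercised without network access.

```python
import os
import posixpath
from urllib.parse import urlsplit
from urllib.request import urlopen


def fetch_bytes(url):
    """Default fetcher: retrieve one file over the network."""
    with urlopen(url) as response:
        return response.read()


def download_batch(urls, dest_dir, fetch=fetch_bytes):
    """Store each listed file, unmodified, in dest_dir.

    Returns the local paths written. URLs that fail to download or that
    carry no file name are skipped.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        name = posixpath.basename(urlsplit(url).path)
        if not name:
            continue  # nothing to name the local copy after
        try:
            data = fetch(url)
        except OSError:
            continue  # urllib's URLError and HTTPError derive from OSError
        path = os.path.join(dest_dir, name)
        with open(path, "wb") as handle:
            handle.write(data)  # bytes are kept in their original format
        saved.append(path)
    return saved
```

A call such as download_batch(url_list, "maps") would place each retrieved file in a local directory named "maps"; passing a different fetch callable allows the same loop to be tested offline.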
The present invention is described below with reference to flowchart illustrations of methods, apparatus (systems) and computer program products according to an embodiment of the invention. It will be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks.
Referring now to Fig. 4, operations for mining data from a set of files within a host computer, according to the present invention, are illustrated. Each file in a set is identified by a unique uniform resource locator (URL). Each unique URL contains at least one root or non-variable portion common to all files in the set and at least one variable portion unique to a respective file in the set. Initially, a file is selected from a host computer via a client computer in communication with the host computer (Block 100). For the selected file, the root portions and variable portions of the URL of the file are determined (Block 200). The identification of URL root and variable portions can be performed directly by a user via the Pattern Entry Module (43, Fig. 3) or automatically via one or more URL comparison agents (48, Fig. 3).
Files within the host computer (or accessible thereby) having a respective URL root portion matching the URL root portion of the selected file are then identified (Block 300). Files having a respective URL root portion matching the root portion of the selected file URL and a URL variable portion different from the variable portion of the selected file URL are then downloaded to the client computer (Block 400). Operations according to the present invention also include extracting data contained within each downloaded file (Block 500) and arranging the extracted data in a format different from the format of a downloaded file (Block 600).
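The extraction and rearrangement operations (Blocks 500 and 600) can be illustrated with a minimal sketch that pulls the cells out of an HTML table and rewrites them as comma-separated values, a format most spreadsheet and database tools can import. The sketch assumes Python's standard html.parser and csv modules and well-formed closing tags; the names TableExtractor and table_to_csv and the sample table contents are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every table row in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the rows of the HTML table(s) in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical table; the column names are illustrative only.
SAMPLE_TABLE = (
    "<table><tr><th>Country</th><th>Code</th></tr>"
    "<tr><td>Algeria</td><td>ag</td></tr></table>"
)
print(table_to_csv(SAMPLE_TABLE))
```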
Referring now to Figs. 5A-5B, a user has accessed a web page (http://www.cia.gov/cia/publications/factbook/country-frame.html) 70 from a web server (i.e., a host computer) via a web browser 72 (Block 100, Fig. 4). The illustrated web page 70 includes a left frame 70a that displays a web page (http://www.cia.gov/cia/publications/factbook/country.html) 73 and a right frame 70b that displays a web page (http://www.cia.gov/cia/publications/factbook/ag.html) 74. The web page 73 contained within the left frame 70a provides a list of hypertext links to web pages pertaining to various countries. The web page 74 contained within the right frame 70b is a web page for a respective country (Algeria) whose hypertext link has been activated in the web page 73 displayed within the left frame 70a.
The illustrated web page 74, displayed within the right frame 70b, includes an image (ag-150.gif) 75 of the selected country, Algeria. As illustrated in Fig. 5B, the image 75 has the following URL: http://www.cia.gov/cia/publications/fact-book/figures/ag-150.gif. As is known to those skilled in the art, a user who desires a copy of the image 75 for the country Algeria can position a mouse cursor over the image 75 and click on the right mouse button to save the image 75 to his or her client computer. However, to copy each map for each country listed in the web page 73, a user would have to open each country file and repeat the above procedure.
According to one embodiment of the present invention, however, a user can select a particular file and then automatically download all files similar to the selected file. In other words, the present invention searches a host computer for all files that are within a set within which a selected file is a member. Accordingly, a user can select the displayed image 75 of Algeria and automatically download a respective image for each country listed in the web page 73 without having to access or display a web page for each country.
Referring now to Fig. 6, the Pattern Entry Module (43, Fig. 3) has been initiated and a user interface 78 associated therewith is displayed. According to one embodiment of the present invention, a user defines a set of files to download by indicating, within the user interface 78 of Fig. 6, the root and variable portions of a URL (Block 200, Fig. 4). For example, a user could view the URL for various images for the respective countries listed in the web page 73 (Fig. 5A) to determine the root and variable portions of each URL. For the illustrated example, the URL root portion of each country map image is http://www.cia.gov/cia/publications/fact-book/figures/. The variable portion of each URL is an abbreviation for the respective country.
According to another embodiment of the present invention, a user can select a web page or a portion of a displayed web page and the URL for the selected web page, or portion thereof, can be automatically analyzed via various agents (48, Fig. 3) to determine one or more root portions that are shared by other files within the host computer (Block 200, Fig. 4).
For example, a user identifies a first URL of interest, such as http://www.reports.com/Feb.html, and wants to locate all files in the host computer having the same root portion. An agent can be invoked to locate another URL within the host computer and to begin comparing the two URLs. The agent locates a URL http://www.reports.com/Jan.html. The two URLs are then parsed and compared by the agent to determine the non-variable (i.e., root) portions and the variable portions. The agent determines that the root portions of each URL are "http://www.reports.com/" and ".html" and that the variable portion in each URL is between the two root portions (i.e., "Jan" and "Feb").
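The comparison described above can be sketched as a search for the longest shared prefix and the longest shared suffix of the two URLs. This is one possible approach under stated assumptions, not necessarily how the disclosed agents operate; the function name split_simple is hypothetical.

```python
import os


def split_simple(url_a, url_b):
    """Return (leading root, trailing root, (variable_a, variable_b))."""
    # commonprefix compares character by character, so it works on any strings.
    prefix = os.path.commonprefix([url_a, url_b])
    rest_a, rest_b = url_a[len(prefix):], url_b[len(prefix):]
    # Reverse the remainders so the same helper finds the shared suffix.
    suffix = os.path.commonprefix([rest_a[::-1], rest_b[::-1]])[::-1]
    end_a = len(rest_a) - len(suffix)
    end_b = len(rest_b) - len(suffix)
    return prefix, suffix, (rest_a[:end_a], rest_b[:end_b])


result = split_simple(
    "http://www.reports.com/Feb.html",
    "http://www.reports.com/Jan.html",
)
print(result)
```

With the two example URLs, the leading root is "http://www.reports.com/", the trailing root is ".html", and the variable portions are "Feb" and "Jan".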
Agents according to the present invention can also parse complex URLs containing multiple root and variable portions. For example, a user identifies a first URL of interest, such as http://www.FebReport.com/Feb.html, and wants to locate all files in the host computer having the same root portions. An agent can be invoked to locate another URL within the host computer and to begin comparing the two URLs. The agent locates a URL http://www.JanReport.com/Jan.html. The two URLs are then parsed and compared by the agent to determine the non-variable (i.e., root) portions and the variable portions. The agent determines that the root portions of each URL are "http://www.", "Report.com", and ".html". The agent also determines that the variable portions in each URL are between the first and second root portions and between the second and third root portions (i.e., "Jan" and "Feb").
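A comparison that yields several root and variable portions can be sketched with a generic sequence-alignment helper. The sketch assumes Python's standard difflib module, which is a stand-in and not the disclosed agent; the function name split_complex is hypothetical. Because the comparison is character by character, the "/" that follows "Report.com" is shared by both URLs and is therefore grouped into the middle root portion.

```python
from difflib import SequenceMatcher


def split_complex(url_a, url_b):
    """Return (root portions, list of (variable_a, variable_b) pairs)."""
    matcher = SequenceMatcher(None, url_a, url_b, autojunk=False)
    roots, variables = [], []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            roots.append(url_a[a0:a1])  # text shared by both URLs
        else:
            variables.append((url_a[a0:a1], url_b[b0:b1]))  # text that differs
    return roots, variables


roots, variables = split_complex(
    "http://www.FebReport.com/Feb.html",
    "http://www.JanReport.com/Jan.html",
)
print(roots)
print(variables)
```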
Once the root portions of a URL for a specific file are identified, either automatically or by a user, a set of all files within the host computer having a matching root portion (or portions) is identified (Block 300, Fig. 4). The variable portions of each file in the set are identified and stored within a file 79 as illustrated in Fig. 7.
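Once the variable portions are stored one per line in a text file, the full list of URLs in the set can be assembled by joining each variable portion to the root portions. The following is a minimal sketch under that assumption; the function names read_variable_portions and build_url_list are hypothetical, and an in-memory stream stands in for the stored file.

```python
import io


def read_variable_portions(lines):
    """Return the non-empty, whitespace-trimmed lines of a text stream."""
    return [line.strip() for line in lines if line.strip()]


def build_url_list(leading_root, variable_portions, trailing_root):
    """Join each variable portion between the leading and trailing roots."""
    return [leading_root + middle + trailing_root for middle in variable_portions]


# Stand-in for a stored file holding one variable portion per line.
stored = io.StringIO("Jan\nFeb\n\n")
urls = build_url_list(
    "http://www.reports.com/", read_variable_portions(stored), ".html"
)
print(urls)
```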
In the illustrated user interface 78 of Fig. 6, the root portion of the URL for the selected file has been entered within the field 78a, entitled "First Part." The file containing the respective URL variable portions for each file, entitled "delete.txt", has been entered into the field 78b, entitled "Middle." The suffix for the URLs within the identified set has been entered within the field 78c, entitled "Last Part." A user may indicate, via field 78d, the name of a local directory within which to store the files as they are downloaded from the host computer.
Upon activating the button 78e, entitled "Get Files," all files within the set that satisfy the criteria set forth in fields 78a, 78b and 78c are downloaded to the user's client computer (Block 400, Fig. 4). Fig. 8 illustrates the user interface of Fig. 6 during the process of downloading each file within the set to a directory entitled "maps" on the user's client computer. Fig. 9 illustrates the downloaded files stored within a specified directory on the user's client computer.
According to another aspect of the present invention, data can be extracted from downloaded HTML files (Block 500, Fig. 4) using various known parsing techniques and arranged in different formats (Block 600, Fig. 4) to suit the needs of a user. Fig. 10 illustrates an exemplary table of data 80 in HTML format as displayed within a web page. According to the present invention, a web page containing the displayed table 80 can be downloaded as described above, and the data from the table 80 extracted therefrom and arranged in a different format. As illustrated in Figs. 11A-11B, data for the first eight entries in the web page table 80 of Fig. 10 has been extracted and arranged in a database 82 (Fig. 11A) having a format 84 as indicated in Fig. 11B. As illustrated, the table columns of Fig. 10 have been converted to fields within the database 82.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the claims. In the claims, means-plus-function clauses are intended to cover the structures described herein as performing the recited function and not only structural equivalents but also equivalent structures.
Therefore, it is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the following claims, with equivalents of the claims to be included therein.

Claims

THAT WHICH IS CLAIMED IS:
1. A method of mining data from a set of web pages within a host computer connected to a computer network, wherein each web page in the set is identified by a unique uniform resource locator (URL), and wherein each unique URL comprises a root portion common to all web pages in the set and a variable portion unique to a respective web page in the set, the method comprising the steps of: displaying a web page from the set via a client computer in communication with the host computer via the computer network; determining a root portion and variable portion of a URL for the displayed web page; identifying web pages within the host computer having a respective URL root portion matching the URL root portion of the displayed web page; and automatically downloading to the client computer each identified web page having a respective URL root portion matching the root portion of the displayed web page URL and a URL variable portion different from the variable portion of the displayed web page URL.
2. A method according to Claim 1 wherein the step of determining a root portion and variable portion of a URL for the displayed web page comprises the steps of: presenting a user interface via the client computer that allows a user to identify a root portion and variable portion of the URL for the displayed web page; and accepting, via the user interface, a user identification of a root portion and variable portion of the URL for the displayed web page.
3. A method according to Claim 1 wherein the step of determining a root portion and variable portion of a URL for the displayed web page is performed automatically when the web page is displayed via the client computer.
4. A method according to Claim 1 wherein the step of determining the root portion and variable portion of a URL for the displayed web page comprises: identifying a URL for a second web page contained within the host computer; parsing the URL for the displayed web page and the URL for the second web page; and comparing parsed portions of the URL for the displayed web page and the URL for the second web page.
5. A method according to Claim 1 further comprising the steps of: extracting data from each downloaded web page, wherein each downloaded web page has a first format; and arranging the extracted data in a user-defined format that is different from the first format.
6. A method according to Claim 5 wherein the first format is hypertext markup language (HTML) format and wherein the second format is a format selected from the group consisting of database formats, spreadsheet formats, and word processing document formats.
7. A method according to Claim 1 wherein the step of displaying a web page comprises displaying a web page via a web browser on the client computer.
8. A method according to Claim 7 further comprising the step of determining a root portion and variable portion of a URL for a file displayed within the displayed web page.
9. A method according to Claim 8 wherein the file displayed within the displayed web page is an image, audio, or video file.
10. A method according to Claim 8 further comprising: determining the root portion and variable portion of a URL for the file displayed within the displayed web page; identifying files within the host computer having a respective URL root portion matching the URL root portion of the file displayed within the displayed web page; and downloading to the client computer each file having a respective URL root portion matching the root portion of the URL of the file displayed within the displayed web page and a URL variable portion different from the variable portion of the URL of the file displayed within the displayed web page.
11. A method according to Claim 8 wherein the step of determining a root portion and variable portion of a URL for the file displayed within the displayed web page comprises the steps of: presenting a user interface via the client computer that allows a user to identify a root portion and variable portion of the URL for the file displayed within the displayed web page; and accepting, via the user interface, a user identification of a root portion and variable portion of the URL for the file displayed within the displayed web page.
12. A method according to Claim 8 wherein the step of determining a root portion and variable portion of a URL for the file displayed within the displayed web page is performed automatically when the file is displayed within the displayed web page.
13. A method according to Claim 10 further comprising extracting data contained within each downloaded file, wherein each downloaded file has a first format; and arranging the extracted data in a user-defined format that is different from the first format.
14. A method according to Claim 13 wherein the first format is selected from the group consisting of image formats, video formats and audio formats and wherein the second format is a format selected from the group consisting of database formats, spreadsheet formats, and word processing document formats.
15. A method of mining data from a set of web pages within a host computer connected to a computer network, wherein each web page in the set is identified by a unique uniform resource locator (URL), and wherein each unique URL comprises a root portion common to all web pages in the set and a variable portion unique to a respective web page in the set, the method comprising the steps of: displaying a web page from the set via a web browser on a client computer in communication with the host computer via the computer network; determining a root portion and variable portion of a URL for the displayed web page comprising the steps of: presenting a user interface via the client computer that allows a user to identify a root portion and variable portion of the URL for the displayed web page; and accepting, via the user interface, a user identification of a root portion and variable portion of the URL for the displayed web page; identifying web pages within the host computer having a respective URL root portion matching the URL root portion of the displayed web page; automatically downloading to the client computer each identified web page having a respective URL root portion matching the root portion of the displayed web page URL and a URL variable portion different from the variable portion of the displayed web page URL; and extracting data from each downloaded web page, wherein each downloaded web page has a first format.
16. A method according to Claim 15 further comprising the step of arranging the extracted data in a user-defined format that is different from the first format.
17. A method according to Claim 15 wherein the first format is hypertext markup language (HTML) format and wherein the second format is a format selected from the group consisting of database formats, spreadsheet formats, and word processing document formats.
18. A method of mining data from a set of files within a host computer connected to a computer network, wherein each file in the set is identified by a unique uniform resource locator (URL), and wherein each unique URL comprises a root portion common to all files in the set and a variable portion unique to a respective file in the set, the method comprising the steps of: selecting a file from the host computer via a client computer in communication with the host computer via the computer network; determining a root portion and variable portion of a URL for the selected file; identifying files within the host computer having a respective URL root portion matching the URL root portion of the selected file; and automatically downloading to the client computer each identified file having a respective URL root portion matching the root portion of the selected file URL and a URL variable portion different from the variable portion of the selected file URL.
19. A method according to Claim 18 wherein the step of determining a root portion and variable portion of a URL for the displayed web page comprises the steps of: presenting a user interface via the client computer that allows a user to identify a root portion and variable portion of the URL for the displayed web page; and accepting, via the user interface, a user identification of a root portion and variable portion of the URL for the displayed web page.
20. A method according to Claim 18 wherein the step of determining a root portion and variable portion of a URL for the displayed web page is performed automatically when the web page is displayed via the client computer.
21. A method according to Claim 18 further comprising: extracting data from each downloaded file, wherein each downloaded file has a first format; and arranging the extracted data in a user-defined format that is different from the first format.
22. A method according to Claim 21 wherein the first format is hypertext markup language (HTML) format and wherein the second format is a format selected from the group consisting of database formats, spreadsheet formats, and word processing document formats.
23. A method according to Claim 18 wherein the step of selecting a file from the host computer comprises displaying a web page via a web browser on the client computer.
24. A method according to Claim 18 wherein the step of determining the root portion and variable portion of a URL for the selected file comprises: identifying a URL for a second file contained within the host computer; parsing the URL for the selected file and the URL for the second file; and comparing parsed portions of the URL for the selected file and the URL for the second file.
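The parse-and-compare step recited in Claim 24 can be pictured with a minimal sketch. It is an illustration only, not the applicants' implementation: the tokenization rule (runs of letters, runs of digits, single delimiter characters), the function names `parse_url` and `root_and_variable`, and the example host name are all assumptions introduced here. The sketch treats the longest run of leading tokens shared by both URLs as the root portion and whatever remains of each URL as its variable portion.

```python
import re

# Assumed tokenization: runs of letters, runs of digits, or any single other character.
_TOKEN = re.compile(r"[A-Za-z]+|\d+|.", re.DOTALL)


def parse_url(url):
    """Parse a URL into an ordered list of tokens (hypothetical parsing rule)."""
    return _TOKEN.findall(url)


def root_and_variable(selected_url, second_url):
    """Compare the parsed URLs and return (root, selected_variable, second_variable)."""
    selected = parse_url(selected_url)
    second = parse_url(second_url)
    shared = 0
    # Count the leading tokens that are identical in both URLs.
    for a, b in zip(selected, second):
        if a != b:
            break
        shared += 1
    root = "".join(selected[:shared])
    return root, "".join(selected[shared:]), "".join(second[shared:])


if __name__ == "__main__":
    # Hypothetical URLs on a made-up host.
    print(root_and_variable(
        "http://host.example/photos/img001.jpg",
        "http://host.example/photos/img002.jpg",
    ))
```

Comparing whole tokens rather than single characters keeps a numeric counter such as `001` intact in the variable portion instead of splitting it after its shared leading zeros.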
25. A system for mining data from a set of web pages within a host computer connected to a computer network, wherein each web page in the set is identified by a unique uniform resource locator (URL), and wherein each unique URL comprises a root portion common to all web pages in the set and a variable portion unique to a respective web page in the set, comprising: means for displaying a web page from the set via a client computer in communication with the host computer via the computer network; means for determining a root portion and variable portion of a URL for the displayed web page; means for identifying web pages within the host computer having a respective URL root portion matching the URL root portion of the displayed web page; and means for automatically downloading to the client computer each identified web page having a respective URL root portion matching the root portion of the displayed web page URL and a URL variable portion different from the variable portion of the displayed web page URL.
26. A system according to Claim 25 wherein the means for determining a root portion and variable portion of a URL for the displayed web page comprises: means for presenting a user interface via the client computer that allows a user to identify a root portion and variable portion of the URL for the displayed web page; and means for accepting, via the user interface, a user identification of a root portion and variable portion of the URL for the displayed web page.
27. A system according to Claim 25 wherein the means for determining a root portion and variable portion of a URL for the displayed web page comprises means for automatically determining a root portion and variable portion of the web page URL when the web page is displayed via the client computer.
28. A system according to Claim 25 wherein the means for determining the root portion and variable portion of a URL for the displayed web page comprises: means for identifying a URL for a second web page contained within the host computer; means for parsing the URL for the displayed web page and the URL for the second web page; and means for comparing parsed portions of the URL for the displayed web page and the URL for the second web page.
29. A system according to Claim 25 further comprising: means for extracting data from each downloaded web page, wherein each downloaded web page has a first format; and means for arranging the extracted data in a user-defined format that is different from the first format.
30. A system according to Claim 29 wherein the first format is hypertext markup language (HTML) format and wherein the second format is a format selected from the group consisting of database formats, spreadsheet formats, and word processing document formats.
31. A system according to Claim 25 wherein the means for displaying a web page comprises means for displaying a web page via a web browser on the client computer.
32. A system according to Claim 31 further comprising means for determining a root portion and variable portion of a URL for a file displayed within the displayed web page.
33. A system according to Claim 32 wherein the file displayed within the displayed web page is an image, audio, or video file.
34. A system according to Claim 32 further comprising: means for determining the root portion and variable portion of a URL for the file displayed within the displayed web page; means for identifying files within the host computer having a respective URL root portion matching the URL root portion of the file displayed within the displayed web page; and means for downloading to the client computer each file having a respective URL root portion matching the root portion of the URL of the file displayed within the displayed web page and a URL variable portion different from the variable portion of the URL of the file displayed within the displayed web page.
35. A system according to Claim 32 wherein the means for determining a root portion and variable portion of a URL for the file displayed within the displayed web page comprises: means for presenting a user interface via the client computer that allows a user to identify a root portion and variable portion of the URL for the file displayed within the displayed web page; and means for accepting, via the user interface, a user identification of a root portion and variable portion of the URL for the file displayed within the displayed web page.
36. A system according to Claim 32 wherein the means for determining a root portion and variable portion of a URL for the file displayed within the displayed web page comprises means for automatically determining a root portion and variable portion of the file when displayed within the displayed web page.
37. A system according to Claim 34 further comprising: means for extracting data contained within each downloaded file, wherein each downloaded file has a first format; and means for arranging the extracted data in a user-defined format that is different from the first format.
38. A system according to Claim 37 wherein the first format is selected from the group consisting of image formats, video formats and audio formats and wherein the second format is a format selected from the group consisting of database formats, spreadsheet formats, and word processing document formats.
39. A system for mining data from a set of web pages within a host computer connected to a computer network, wherein each web page in the set is identified by a unique uniform resource locator (URL), and wherein each unique URL comprises a root portion common to all web pages in the set and a variable portion unique to a respective web page in the set, comprising: means for displaying a web page from the set via a web browser on a client computer in communication with the host computer via the computer network; means for determining a root portion and variable portion of a URL for the displayed web page comprising: means for presenting a user interface via the client computer that allows a user to identify a root portion and variable portion of the URL for the displayed web page; and means for accepting, via the user interface, a user identification of a root portion and variable portion of the URL for the displayed web page; means for identifying web pages within the host computer having a respective URL root portion matching the URL root portion of the displayed web page; means for automatically downloading to the client computer each identified web page having a respective URL root portion matching the root portion of the displayed web page URL and a URL variable portion different from the variable portion of the displayed web page URL; and means for extracting data from each downloaded web page, wherein each downloaded web page has a first format.
40. A system according to Claim 39 further comprising means for arranging the extracted data in a user-defined format that is different from the first format.
41. A system according to Claim 39 wherein the first format is hypertext markup language (HTML) format and wherein the second format is a format selected from the group consisting of database formats, spreadsheet formats, and word processing document formats.
42. A system for mining data from a set of files within a host computer connected to a computer network, wherein each file in the set is identified by a unique uniform resource locator (URL), and wherein each unique URL comprises a root portion common to all files in the set and a variable portion unique to a respective file in the set, comprising: means for selecting a file from the host computer via a client computer in communication with the host computer via the computer network; means for determining a root portion and variable portion of a URL for the selected file; means for identifying files within the host computer having a respective URL root portion matching the URL root portion of the selected file; and means for automatically downloading to the client computer each identified file having a respective URL root portion matching the root portion of the selected file URL and a URL variable portion different from the variable portion of the selected file URL.
43. A system according to Claim 42 wherein the means for determining a root portion and variable portion of a URL for the displayed web page comprises: means for presenting a user interface via the client computer that allows a user to identify a root portion and variable portion of the URL for the displayed web page; and means for accepting, via the user interface, a user identification of a root portion and variable portion of the URL for the displayed web page.
44. A system according to Claim 42 wherein the means for determining a root portion and variable portion of a URL for the displayed web page comprises means for automatically determining a root portion and variable portion of a web page URL when the web page is displayed via the client computer.
45. A system according to Claim 42 further comprising: means for extracting data from each downloaded file, wherein each downloaded file has a first format; and means for arranging the extracted data in a user-defined format that is different from the first format .
46. A system according to Claim 45 wherein the first format is hypertext markup language (HTML) format and wherein the second format is a format selected from the group consisting of database formats, spreadsheet formats, and word processing document formats.
47. A system according to Claim 42 wherein the means for selecting a file from the host computer comprises means for displaying a web page via a web browser on the client computer.
48. A system according to Claim 42 wherein the means for determining the root portion and variable portion of a URL for the selected file comprises: means for identifying a URL for a second file contained within the host computer; means for parsing the URL for the selected file and the URL for the second file; and means for comparing parsed portions of the URL for the selected file and the URL for the second file.
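The identifying element recited in Claims 25 and 42 (keep each URL whose root portion matches and whose variable portion differs from that of the selected file) can be sketched as a simple filter. This is a hedged illustration under stated assumptions: the claims do not say how candidate URLs on the host are enumerated, so the sketch assumes the caller supplies them (for example, links harvested from the displayed page), and the function name `identify_matching_urls` and the example URLs are hypothetical. Downloading the identified URLs is left to the caller.

```python
def identify_matching_urls(candidate_urls, root, selected_variable):
    """Return candidates that share `root` but differ in their variable portion.

    `candidate_urls` is assumed to be supplied by the caller; how the host is
    enumerated is outside this sketch.
    """
    matches = []
    seen = set()
    for url in candidate_urls:
        if not url.startswith(root):
            continue  # root portion does not match
        variable = url[len(root):]
        if variable == selected_variable:
            continue  # this is the file that was already selected
        if url in seen:
            continue  # skip duplicates so each file is identified once
        seen.add(url)
        matches.append(url)
    return matches


if __name__ == "__main__":
    # Hypothetical candidate list on a made-up host.
    candidates = [
        "http://host.example/photos/img001.jpg",
        "http://host.example/photos/img002.jpg",
        "http://host.example/about.html",
        "http://host.example/photos/img003.jpg",
        "http://host.example/photos/img002.jpg",
    ]
    for url in identify_matching_urls(
        candidates, "http://host.example/photos/img", "001.jpg"
    ):
        print(url)
```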
49. A computer program product for mining data from a set of web pages within a host computer connected to a computer network, wherein each web page in the set is identified by a unique uniform resource locator (URL), and wherein each unique URL comprises a root portion common to all web pages in the set and a variable portion unique to a respective web page in the set, the computer program product comprising a computer usable storage medium having computer readable program code means embodied in the medium, the computer readable program code means comprising: computer readable program code means for displaying a web page from the set via a client computer in communication with the host computer via the computer network; computer readable program code means for determining a root portion and variable portion of a URL for the displayed web page; computer readable program code means for identifying web pages within the host computer having a respective URL root portion matching the URL root portion of the displayed web page; and computer readable program code means for automatically downloading to the client computer each identified web page having a respective URL root portion matching the root portion of the displayed web page URL and a URL variable portion different from the variable portion of the displayed web page URL.
50. A computer program product according to Claim 49 wherein the computer readable program code means for determining a root portion and variable portion of a URL for the displayed web page comprises: computer readable program code means for presenting a user interface via the client computer that allows a user to identify a root portion and variable portion of the URL for the displayed web page; and computer readable program code means for accepting, via the user interface, a user identification of a root portion and variable portion of the URL for the displayed web page.
51. A computer program product according to Claim 49 wherein the computer readable program code means for determining a root portion and variable portion of a URL for the displayed web page comprises computer readable program code means for automatically determining a root portion and variable portion of the web page URL when the web page is displayed via the client computer.
52. A computer program product according to Claim 49 wherein the computer readable program code means for determining the root portion and variable portion of a URL for the displayed web page comprises: computer readable program code means for identifying a URL for a second web page contained within the host computer; computer readable program code means for parsing the URL for the displayed web page and the URL for the second web page; and computer readable program code means for comparing parsed portions of the URL for the displayed web page and the URL for the second web page.
53. A computer program product according to Claim 49 further comprising: computer readable program code means for extracting data from each downloaded web page, wherein each downloaded web page has a first format; and computer readable program code means for arranging the extracted data in a user-defined format that is different from the first format.
54. A computer program product according to Claim 53 wherein the first format is hypertext markup language (HTML) format and wherein the second format is a format selected from the group consisting of database formats, spreadsheet formats, and word processing document formats.
55. A computer program product according to Claim 49 wherein the computer readable program code means for displaying a web page comprises computer readable program code means for displaying a web page via a web browser on the client computer.
56. A computer program product according to Claim 55 further comprising computer readable program code means for determining a root portion and variable portion of a URL for a file displayed within the displayed web page.
57. A computer program product according to Claim 56 wherein the file displayed within the displayed web page is an image, audio, or video file.
58. A computer program product according to Claim 56 further comprising: computer readable program code means for determining the root portion and variable portion of a URL for the file displayed within the displayed web page; computer readable program code means for identifying files within the host computer having a respective URL root portion matching the URL root portion of the file displayed within the displayed web page; and computer readable program code means for downloading to the client computer each file having a respective URL root portion matching the root portion of the URL of the file displayed within the displayed web page and a URL variable portion different from the variable portion of the URL of the file displayed within the displayed web page.
59. A computer program product according to Claim 56 wherein the computer readable program code means for determining a root portion and variable portion of a URL for the file displayed within the displayed web page comprises: computer readable program code means for presenting a user interface via the client computer that allows a user to identify a root portion and variable portion of the URL for the file displayed within the displayed web page; and computer readable program code means for accepting, via the user interface, a user identification of a root portion and variable portion of the URL for the file displayed within the displayed web page.
60. A computer program product according to Claim 56 wherein the computer readable program code means for determining a root portion and variable portion of a URL for the file displayed within the displayed web page comprises computer readable program code means for automatically determining a root portion and variable portion of the file when displayed within the displayed web page.
61. A computer program product according to Claim 58 further comprising: computer readable program code means for extracting data contained within each downloaded file, wherein each downloaded file has a first format; and computer readable program code means for arranging the extracted data in a user-defined format that is different from the first format.
62. A computer program product according to Claim 61 wherein the first format is selected from the group consisting of image formats, video formats and audio formats and wherein the second format is a format selected from the group consisting of database formats, spreadsheet formats, and word processing document formats.
63. A computer program product for mining data from a set of web pages within a host computer connected to a computer network, wherein each web page in the set is identified by a unique uniform resource locator (URL), and wherein each unique URL comprises a root portion common to all web pages in the set and a variable portion unique to a respective web page in the set, the computer program product comprising a computer usable storage medium having computer readable program code means embodied in the medium, the computer readable program code means comprising: computer readable program code means for displaying a web page from the set via a web browser on a client computer in communication with the host computer via the computer network; computer readable program code means for determining a root portion and variable portion of a URL for the displayed web page comprising: computer readable program code means for presenting a user interface via the client computer that allows a user to identify a root portion and variable portion of the URL for the displayed web page; and computer readable program code means for accepting, via the user interface, a user identification of a root portion and variable portion of the URL for the displayed web page; computer readable program code means for identifying web pages within the host computer having a respective URL root portion matching the URL root portion of the displayed web page; computer readable program code means for automatically downloading to the client computer each identified web page having a respective URL root portion matching the root portion of the displayed web page URL and a URL variable portion different from the variable portion of the displayed web page URL; and computer readable program code means for extracting data from each downloaded web page, wherein each downloaded web page has a first format.
64. A computer program product according to Claim 63 further comprising computer readable program code means for arranging the extracted data in a user-defined format that is different from the first format.
65. A computer program product according to Claim 63 wherein the first format is hypertext markup language (HTML) format and wherein the second format is a format selected from the group consisting of database formats, spreadsheet formats, and word processing document formats.
66. A computer program product for mining data from a set of files within a host computer connected to a computer network, wherein each file in the set is identified by a unique uniform resource locator (URL), and wherein each unique URL comprises a root portion common to all files in the set and a variable portion unique to a respective file in the set, the computer program product comprising a computer usable storage medium having computer readable program code means embodied in the medium, the computer readable program code means comprising: computer readable program code means for selecting a file from the host computer via a client computer in communication with the host computer via the computer network; computer readable program code means for determining a root portion and variable portion of a URL for the selected file; computer readable program code means for identifying files within the host computer having a respective URL root portion matching the URL root portion of the selected file; and computer readable program code means for automatically downloading to the client computer each identified file having a respective URL root portion matching the root portion of the selected file URL and a URL variable portion different from the variable portion of the selected file URL.
67. A computer program product according to Claim 66 wherein the computer readable program code means for determining a root portion and variable portion of a URL for the displayed web page comprises: computer readable program code means for presenting a user interface via the client computer that allows a user to identify a root portion and variable portion of the URL for the displayed web page; and computer readable program code means for accepting, via the user interface, a user identification of a root portion and variable portion of the URL for the displayed web page.
68. A computer program product according to Claim 66 wherein the computer readable program code means for determining a root portion and variable portion of a URL for the displayed web page comprises computer readable program code means for automatically determining a root portion and variable portion of the web page URL when the web page is displayed via the client computer.
69. A computer program product according to Claim 66 further comprising: computer readable program code means for extracting data from each downloaded file, wherein each downloaded file has a first format; and computer readable program code means for arranging the extracted data in a user-defined format that is different from the first format.
70. A computer program product according to Claim 69 wherein the first format is hypertext markup language (HTML) format and wherein the second format is a format selected from the group consisting of database formats, spreadsheet formats, and word processing document formats.
71. A computer program product according to Claim 66 wherein the computer readable program code means for selecting a file from the host computer comprises computer readable program code means for displaying a web page via a web browser on the client computer.
72. A computer program product according to Claim 66 wherein the computer readable program code means for determining the root portion and variable portion of a URL for the selected file comprises: computer readable program code means for identifying a URL for a second file contained within the host computer; computer readable program code means for parsing the URL for the selected file and the URL for the second file; and computer readable program code means for comparing parsed portions of the URL for the selected file and the URL for the second file.
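The extracting and arranging elements (for example Claims 21 and 22, 45 and 46, and 69 and 70) recite taking data out of downloaded HTML and arranging it in a different, user-defined format such as a spreadsheet format. The following is one possible sketch, not the method of the application: it assumes the data of interest sits in HTML table cells and that comma-separated values stand in for a spreadsheet format. The class name `TableCellExtractor` and the function name `html_tables_to_csv` are hypothetical; only Python standard-library modules are used.

```python
import csv
import io
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collect the text of <td>/<th> cells, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_tables_to_csv(html_pages):
    """Extract table rows from each downloaded page and arrange them as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for html in html_pages:
        parser = TableCellExtractor()
        parser.feed(html)
        parser.close()
        writer.writerows(parser.rows)
    return out.getvalue()


if __name__ == "__main__":
    page = ("<table><tr><th>Name</th><th>Price</th></tr>"
            "<tr><td>Widget</td><td>9.99</td></tr></table>")
    print(html_tables_to_csv([page]), end="")
```

Real pages often omit closing tags or nest tables, which this sketch does not handle; a production extractor would need a more tolerant parser.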

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU14757/00A AU1475700A (en) 1998-11-20 1999-12-13 Systems, methods and computer program products for mining data from host computers via the internet

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US19679498A 1998-11-20 1998-11-20
US09/196,794 1998-11-20

Publications (1)

Publication Number Publication Date
WO2000045297A1 (en) 2000-08-03

Family

ID=22726820

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/026632 Ceased WO2000045297A1 (en) 1998-11-20 1999-11-12 Systems, methods and computer program products for mining data from host computers via the internet

Country Status (2)

Country Link
AU (1) AU1475700A (en)
WO (1) WO2000045297A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0777187A2 (en) * 1995-11-30 1997-06-04 Matsushita Electric Industrial Co., Ltd. A history display apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AROCENA G O ET AL: "Applications of a Web query language", COMPUTER NETWORKS AND ISDN SYSTEMS,NL,NORTH HOLLAND PUBLISHING. AMSTERDAM, vol. 29, no. 8-13, 1 September 1997 (1997-09-01), pages 1305 - 1316, XP004095326, ISSN: 0169-7552 *
MILLER R C ET AL: "SPHINX: a framework for creating personal, site-specific Web crawlers", COMPUTER NETWORKS AND ISDN SYSTEMS,NL,NORTH HOLLAND PUBLISHING. AMSTERDAM, vol. 30, no. 1-7, 1 April 1998 (1998-04-01), pages 119 - 130, XP004121434, ISSN: 0169-7552 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2363216A (en) * 2000-03-08 2001-12-12 Ibm Publish/subscribe data processing with subscriptions based on changing business concepts
US8140468B2 (en) 2006-06-22 2012-03-20 International Business Machines Corporation Systems and methods to extract data automatically from a composite electronic document
US8453050B2 (en) 2006-06-28 2013-05-28 International Business Machines Corporation Method and apparatus for creating and editing electronic documents
US8650221B2 (en) 2007-09-10 2014-02-11 International Business Machines Corporation Systems and methods to associate invoice data with a corresponding original invoice copy in a stack of invoices
US20110125918A1 (en) * 2009-11-13 2011-05-26 Samsung Electronics Co., Ltd. Adaptive streaming method and apparatus
US9967598B2 (en) * 2009-11-13 2018-05-08 Samsung Electronics Co., Ltd. Adaptive streaming method and apparatus
US10713429B2 (en) 2017-02-10 2020-07-14 Microsoft Technology Licensing, Llc Joining web data with spreadsheet data using examples

Also Published As

Publication number Publication date
AU1475700A (en) 2000-08-18

Similar Documents

Publication Publication Date Title
Berners-Lee et al. The world-wide web
US7885950B2 (en) Creating search enabled web pages
US5892908A (en) Method of extracting network information
US8060518B2 (en) System and methodology for extraction and aggregation of data from dynamic content
US5761673A (en) Method and apparatus for generating dynamic web pages by invoking a predefined procedural package stored in a database
US6564259B1 (en) Systems, methods and computer program products for assigning, generating and delivering content to intranet users
US6907423B2 (en) Search engine interface and method of controlling client searches
US6061686A (en) Updating a copy of a remote document stored in a local computer system
CA2453225C (en) Apparatus for and method of selectively retrieving information and enabling its subsequent display
US8423587B2 (en) System and method for real-time content aggregation and syndication
US6209029B1 (en) Method and apparatus for accessing data sources in a three tier environment
US6789076B1 (en) System, method and program for augmenting information retrieval in a client/server network using client-side searching
US7607085B1 (en) Client side localizations on the world wide web
JP2002540506A (en) Glamor template query system
US20020078014A1 (en) Network crawling with lateral link handling
US7165220B1 (en) Apparatus and method for processing bookmark events for a web page
US6883020B1 (en) Apparatus and method for filtering downloaded network sites
USRE45021E1 (en) Method and software for processing server pages
EP1069515A1 (en) Method and apparatus for web information extraction service
US7877434B2 (en) Method, system and apparatus for presenting forms and publishing form data
JPH0844643A (en) Gateway device
US7975238B2 (en) Identifying previously bookmarked hyperlinks in a received Web page in a World Wide Web network browser system for searching
CA2509154A1 (en) Intermediary server for facilitating retrieval of mid-point, state-associated web pages
WO2000045297A1 (en) Systems, methods and computer program products for mining data from host computers via the internet
WO2001022194A9 (en) A method and system for facilitating research of electronically stored information on a network

Legal Events

Date Code Title Description
ENP Entry into the national phase. Ref country code: AU; Ref document number: 2000 14757; Kind code of ref document: A; Format of ref document f/p: F
AK Designated states. Kind code of ref document: A1; Designated state(s): AE AL AM AT AT AU AZ BA BB BG BR BY CA CH CN CU CZ CZ DE DE DK DK EE EE ES FI FI GB GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SK SL TJ TM TR TT UA UG US UZ VN YU ZA ZW
AL Designated countries for regional patents. Kind code of ref document: A1; Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG
121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
REG Reference to national code. Ref country code: DE; Ref legal event code: 8642