SYSTEMS, METHODS AND COMPUTER PROGRAM PRODUCTS FOR MINING DATA FROM HOST COMPUTERS VIA THE INTERNET
Field of the Invention
The present invention relates generally to computer networks and, more particularly, to the Internet.
Background of the Invention
The Internet is a worldwide decentralized network of computers having the ability to communicate with each other. The Internet has gained broad recognition as a viable medium for communicating and interacting across multiple networks. The World Wide Web (Web) was created in the early 1990's, and is comprised of server-hosting computers (web servers) connected to the Internet having hypertext documents or web pages stored therewithin. Web pages are accessible by client programs (i.e., web browsers) utilizing the Hypertext Transfer Protocol (HTTP) via a Transmission Control Protocol/Internet Protocol (TCP/IP) connection between a client-hosting device and a server-hosting device. Exemplary web browsers include Netscape Navigator® (Netscape Communications Corporation, Mountain View, CA) and Internet Explorer® (Microsoft Corporation, Redmond, WA). Web browsers typically provide a graphical user interface for retrieving and viewing Web pages hosted by web servers.
A Web page, using a standard page description language known as HyperText Markup Language (HTML), typically displays text and graphics, and can play sound, animation, and video clips. HTML provides basic document formatting and allows a Web page developer to specify hypertext links (typically manifested as highlighted text) to other web servers and files. When a user selects a particular hypertext link, a web browser reads and interprets the address, called a URL (Uniform Resource Locator), associated with the link, connects the client-hosting device with the web server at that address (also referred to as a "web site"), and makes an HTTP request for the web page identified in the link. The web server then sends the requested web page to the client-hosting device in HTML format which the web browser interprets and displays to the user. A URL gives the type of resource being accessed (e.g., HTTP, GOPHER, WAIS, etc.) and optionally the path of the file sought. For example: resource://host.domain/path/filename, wherein the resource can include "file", "HTTP", "GOPHER", "WAIS", "NEWS", "TELNET", and so forth. Through the Web, users can access various Internet services including, but not limited to, HTTP, GOPHER, TELNET, and FTP.
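As a minimal sketch of the resource://host.domain/path/filename layout described above, the following Python fragment splits a URL into those parts using the standard urllib.parse module. The function name describe_url and the dictionary keys are illustrative assumptions and do not appear in the original disclosure.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into resource type, host, directory path, and filename."""
    parts = urlsplit(url)
    # Everything before the last "/" is the directory path; the rest is the filename.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "resource": parts.scheme,  # e.g. "http"
        "host": parts.netloc,      # e.g. "www.cia.gov"
        "path": directory,
        "filename": filename,
    }
```

Calling describe_url on a web page URL returns the four components separately, which is the decomposition that later root/variable comparisons rely on.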
A challenge presented by the Internet is how to enable greater simplicity and efficiency in accessing and comprehending information available through the Internet. Because of the immense amount of information available via the Internet and because of the generally unstructured nature of the Internet, searching for information on the Internet can be a daunting exercise. Another challenge presented by the Internet is the difficulty in obtaining and manipulating data from the Internet. Conventional methods of obtaining Internet data include saving a web page currently displayed within a user's web browser. Conventional methods of obtaining multimedia files, such as images, include positioning a mouse cursor over a portion of a displayed web page and "clicking" the right mouse button to save a desired file. Unfortunately, these conventional methods can be inefficient and labor intensive because only the data from a web page currently being displayed can be obtained.
Many web sites host multiple files containing related content. For example, a web site may host various web pages containing information about each country in the world. However, using conventional data gathering techniques, a user may not be able to retrieve all available information contained within the web site unless he or she displays each and every page within his or her web browser. Furthermore, a user may not become aware of all web pages available at a particular web site unless an exhaustive search of the web site is performed. Unfortunately, the complexity of many web sites can make an exhaustive search difficult and time consuming to perform.
Often, useful data is displayed in a particular format within a web page, such as HTML "table" format. Unfortunately, data displayed in HTML table format may be difficult to convert to other useful formats, such as database and spreadsheet formats. In order to incorporate data displayed within a web page, a user may need to re-type the displayed data within the desired format, or utilize various known cut-and-paste techniques. Unfortunately, re-typing and cut-and-paste techniques can be inefficient and time consuming.
Summary of the Invention
In view of the above discussion, it is an object of the present invention to facilitate the collection of data from Internet web sites without requiring users to manually download files or portions thereof. It is another object of the present invention to allow Internet users to quickly and easily extract data from web pages and arrange the extracted data into various, user-defined formats.
It is another object of the present invention to allow users to select a web page and then automatically download all web pages, or portions thereof, similar to the selected web page without having to view each web page within a browser.
These and other objects of the present invention are provided by methods, systems and computer program products for mining data from a set of files within a host computer, such as a web server, wherein each file in the set is identified by a unique URL, and wherein each unique URL comprises a root portion common to all files in the set and a variable portion unique to a respective file in the set. A user initially selects a file from a host computer via a client computer in communication with the host computer. For example, a user selects a web page displayed within a web browser and/or a portion thereof.
Next, the root portion and variable portion of a URL for the selected file are determined by comparing the URL with a URL of another file at the same host computer. This comparison can be made by users or automatically by various comparison agents. Files within the host computer having a respective URL root portion matching the URL root portion of the selected file are then identified. Each file having a respective URL root portion matching the root portion of the selected file URL and a URL variable portion different from the variable portion of the selected file URL is then automatically downloaded to the user's computer.
URLs containing multiple root portions (i.e., non-variable portions) and/or multiple variable portions also can be compared to identify various sets of related files, according to the present invention.
The present invention can utilize complex URL agents to compare complex URLs to identify sets of related files within a host computer.
According to another aspect of the present invention, data contained within downloaded files can be extracted and arranged within various user-defined formats that are different from the displayed format of the data. For example, a table of data displayed within a web page in HTML format can be easily and quickly converted to a user defined database, spreadsheet, or word processing document format.
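The table conversion described above can be illustrated with a short Python sketch that reads the rows of an HTML table and writes them out as comma-separated values, a format most spreadsheet and database tools can import. The class name TableExtractor, the function table_to_csv, and the choice of CSV as the target format are assumptions made for illustration; the disclosure does not prescribe a particular parser or output format.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

Each table column becomes one CSV column, so a downstream tool can map columns to fields of a user-defined format. Nested tables and cells spanning several rows or columns are not handled by this sketch.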
The present invention is advantageous because users can automatically collect or "mine" information from the Internet without having to access each file containing the desired information. The present invention is adaptable to the mining of all types of data including, but not limited to, text, images, sound, and video. Furthermore, the value to a user of downloaded information can be enhanced by the present invention because a user can easily re-format the data into various user-defined formats.
Brief Description of the Drawings
Fig. 1 illustrates a client-server computing environment in which the present invention may be implemented.
Fig. 2 illustrates an exemplary HTML document with markup language tags displayed.
Fig. 3 schematically illustrates a system for mining data from a set of files within a host computer, according to an embodiment of the present invention.
Fig. 4 illustrates operations for mining data from a set of files within a host computer, according to the present invention.
Fig. 5A illustrates a web page displayed within a web browser, wherein the web page contains an image file within a portion thereof.
Fig. 5B illustrates the image file of Fig. 5A separately displayed within a web browser.
Fig. 6 illustrates a user interface for downloading files from a host computer according to an embodiment of the present invention.
Fig. 7 illustrates a text file containing the variable portions of a set of URLs having a common root portion.
Fig. 8 illustrates the user interface of Fig. 6 during the process of downloading each file within a set to a directory entitled "maps" on a user's client computer.
Fig. 9 illustrates a set of files stored within a user's computer that were downloaded via the present invention.
Fig. 10 illustrates an exemplary table of data in HTML format as displayed within a web page.
Fig. 11A illustrates data from the displayed table of Fig. 10 extracted and arranged in a user-defined database format.
Fig. 11B illustrates the user-defined format of the database of Fig. 11A.
Detailed Description of the Invention
The present invention now is described more fully hereinafter with reference to the accompanying drawings, in which preferred embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Like numbers refer to like elements throughout.
As will be appreciated by one of skill in the art, the present invention may be embodied as a method, data processing system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product on a computer-usable storage medium having computer-usable program code means embodied in the medium. Any suitable computer readable medium may be utilized including, but not limited to, hard disks, CD-ROMs, optical storage devices, or magnetic storage devices.
Client-Server Communications
The present invention is preferably practiced within a client-server computer network environment. As is known by those skilled in this art, client-server is a model for a relationship between two computer programs in which one program, the client, makes a service request from another program, the server, which fulfills the request. Although the client-server model can be used by programs within a single computer, it is more commonly used in a network where computing functions and data can be distributed more efficiently among many client and server programs at different network locations.
Many business applications being written today use the client-server model, as does the Internet's main program, "Transmission Control Protocol/Internet Protocol" (TCP/IP). Typically, multiple client programs share the services of a common server program. Both client programs and server programs are often part of a larger program or application. Relative to the Internet, a web browser is a client program that requests services (the sending of web pages or files) from a web server in another computer somewhere on the Internet.
As is known to those with skill in this art, client-server environments within which the present invention may operate include public networks, such as the Internet, and private networks often referred to as "Intranets" and "Extranets." The term "Internet" shall incorporate the terms "Intranet" and "Extranet" and any references to accessing the Internet shall be understood to mean accessing an Intranet and/or an Extranet, as well. The term "computer network" shall incorporate publicly accessible computer networks, such as the Internet, as well as private computer networks.
Fig. 1 illustrates a client-server computing environment in which the present invention may be implemented. In the illustrated client-server computing environment, a remote user's computer 10 has a client application (i.e., a web browser) resident thereon and a host computer 20 has a server application (i.e., a web server) resident thereon. The user's computer 10 preferably includes a central processing unit 11, a display 12, a pointing device 13, a keyboard 14, access to persistent data storage, and a communications link 16 for communicating with the host computer 20. The keyboard 14, having a plurality of keys thereon, is in communication with the central processing unit 11. A pointing device 13, such as a mouse, is also connected to the central processing unit 11. The communications link 16 may be established via a modem 15 connected to traditional phone lines, an ISDN link, a T1 link, a T3 link, via cable television, via an ethernet network, and the like. Modem 15 may also be a wireless modem configured to communicate with the modem 25 of the host computer 20 via wireless communications systems. The communications link 16 also may be made by a direct connection of the user's computer 10 to the host computer 20 or indirectly via a computer network 17, such as the Internet, in communication with the host computer 20.
The central processing unit 11 contains one or more microprocessors (not shown) or other computational devices and random access memory (not shown) or its functional equivalent, including but not limited to, RAM, FLASHRAM, and VRAM for storing programs therein for processing by the microprocessor(s) or other computational devices. A portion of the random access memory and/or persistent data storage, referred to as "cache," may be utilized during communications between a user's computer 10 and a host computer 20 to store various data transferred from the host computer. Preferably, a user's computer 10 has an Intel® Pentium® processor (or equivalent) with at least thirty-two megabytes (32 MB) of RAM, and at least five megabytes (5 MB) of persistent computer storage for caching. However, it is to be understood that various processors may be utilized to practice the present invention without being limited to those enumerated herein. Although a color display is preferable, a black and white display or standard broadcast or cable television monitor may be used. Exemplary user computers having a web browser resident thereon may include, but are not limited to, an Apple®, Sun Microsystems®, IBM®, or IBM®-compatible personal computer. A user's computer 10, if an IBM® or IBM®-compatible personal computer, preferably utilizes either a Windows® 3.1, Windows 95®, Windows 98®, Windows NT®, UNIX®, or OS/2® operating system. However, other operating systems may also be utilized without limitation. In addition, it is to be understood that a terminal not having computational capability, or having limited computational capability, may be utilized in accordance with the present invention for accessing a host computer 20 in a client capacity.
Preferably, a web browser resident on a user's computer 10 is a Java®-enabled browser, such as Netscape Navigator®, Version 3.0 and higher. As is understood by those skilled in this art, a Java®-enabled browser includes a Java® virtual machine (JVM) that interprets Java® bytecode into code that will run on the user's computer.
A host computer 20, functioning as a web server, may have a configuration similar to that of a user's computer 10 and may include a central processing unit 21, a display 22, a pointing device 23, a keyboard 24, access to persistent data storage, and a communications link 26 for connecting to the user's computer 10 via a modem 25, or otherwise. Various types of host computers including, but not limited to, mainframe computing systems, mini-computers, and PCs, may be accessed and data downloaded therefrom via the present invention.
HTML Documents
An HTML document can be comprised of text, images and a variety of objects, each of which is surrounded by various markup language tags that control format attributes and identify different portions of the document (i.e., <tag_name>text</tag_name>). HTML documents are typically written and stored in ASCII text format using a text editor. An exemplary HTML document 30 is illustrated in Fig. 2. The illustrated HTML document 30 includes a header section 32 demarcated by <HEAD> tags 32a, 32b. A body section 34, demarcated by <BODY> tags 34a, 34b, includes various "content" portions 36a, 36b, 36c, 36d. It is these content portions 36a, 36b, 36c, 36d that contain information displayed to a user viewing the HTML document 30 with a web browser.
Hardware and Software Environment
The present invention is preferably implemented as a plurality of modules and agents within a client computer. These modules and agents are preferably written in the Java® programming language so as to be compatible with most client computing platforms. However, portions of the computer program code for carrying out operations of the present invention could also be written in other object oriented programming languages such as Smalltalk or C++, as well as conventional procedural programming languages, such as the "C" programming language.
Java® is an object-oriented programming language developed by Sun Microsystems, Mountain View, California. Java® is a portable and architecturally neutral language. Java® source code is compiled into a machine-independent format that can be run on any machine with a Java® runtime system known as the Java® Virtual Machine (JVM). The JVM is defined as an imaginary machine that is implemented by emulating a processor through the use of software on a real machine. Accordingly, machines running under diverse operating systems, including UNIX®, Windows NT®, and Macintosh®, having a JVM can execute the same Java® program. As is known to those skilled in this art, Java® source code is compiled into bytecode using a Java® compiler referred to as javac. Compiled Java® programs are saved in files with the extension ".class".
Referring now to Fig. 3, an exemplary system 40 for practicing the present invention is schematically illustrated. In the illustrated configuration, a client computer 42 is in communication with a host computer 50, such as a web server, via a computer network 60, such as the Internet. A Pattern Entry Module 43, running within the client computer 42, serves the function of collecting information from a user in order to initiate data collection from the host computer 50. A URL List Generator Module 44 creates a list of files to be downloaded from the host computer 50 to the client computer 42. A URL Batch Downloader Module 45 downloads the files identified by the URL List Generator Module 44 and stores the files in their original format within the client computer 42. A Data Parser Module 46 performs various data parsing techniques on the downloaded files according to information provided by a user. A Data Export Module 47 stores parsed data in various formats as defined by a user. In addition, a set of URL comparison agents 48 is provided for automatically comparing simple and complex URLs for files within a host computer.
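The role of the URL Batch Downloader Module 45 can be sketched in Python as a function that takes a list of URLs, retrieves each one, and writes the bytes unchanged to a local directory. This is only an illustrative sketch: the function names, the injectable fetch parameter, and the rule for deriving a local filename from the URL are assumptions, and the disclosure itself prefers a Java® implementation.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url):
    """Retrieve the raw bytes of a URL (default fetcher; requires network access)."""
    with urlopen(url) as response:
        return response.read()


def batch_download(urls, dest_dir, fetch=fetch_bytes):
    """Download every URL in `urls` into `dest_dir`, keeping the original format.

    `fetch` is injectable so the function can be exercised without a network.
    Returns the list of paths written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Assumption: the local filename is the last path segment of the URL.
        name = url.rsplit("/", 1)[-1] or "index.html"
        path = dest / name
        path.write_bytes(fetch(url))  # bytes are stored as received, unparsed
        saved.append(path)
    return saved
```

Keeping the fetch step separate from the storage step mirrors the separation between the downloader and the later Data Parser Module 46, which operates on the stored files rather than on the network stream.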
The illustrated host computer 50 contains a plurality of files 52. According to the present invention, a set 54 of files can be identified from the plurality of files 52 wherein each file in the identified set 54 has at least one common non-variable (root) portion and a variable portion.
The present invention is described below with reference to flowchart illustrations of methods, apparatus (systems) and computer program products according to an embodiment of the invention. It will be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks.
Referring now to Fig. 4, operations for mining data from a set of files within a host computer, according to the present invention, are illustrated. Each file in a set is identified by a unique Uniform Resource Locator (URL). Each unique URL contains at least one root or non-variable portion common to all files in the set and at least one variable portion unique to a respective file in the set. Initially, a file is selected from a host computer via a client computer in communication with the host computer (Block 100). For the selected file, the root portions and variable portions of the URL of the file are determined (Block 200). The identification of URL root and variable portions can be performed directly by a user via the Pattern Entry Module (43, Fig. 3) or automatically via one or more URL comparison agents (48, Fig. 3).
Files within the host computer (or accessible thereby) having a respective URL root portion matching the URL root portion of the selected file are then identified (Block 300). Files having a respective URL root portion matching the root portion of the selected file URL and a URL variable portion different from the variable portion of the selected file URL are then downloaded to the client computer (Block 400). Operations according to the present invention also include extracting data contained within each downloaded file (Block 500) and arranging the extracted data in a format different from the format of a downloaded file (Block 600).
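The selection rule of Blocks 300 and 400 (same root portion, different variable portion) can be expressed as a small filter. The Python sketch below assumes that a list of candidate URLs on the host is already available, for example from links found in an index page; how the candidates are discovered, the function names, and the optional suffix argument are illustrative assumptions rather than details taken from the disclosure.

```python
def variable_portion(url, root, suffix=""):
    """Return the part of `url` between `root` and `suffix`, or None if they do not match."""
    fits = (
        url.startswith(root)
        and url.endswith(suffix)
        and len(url) >= len(root) + len(suffix)
    )
    if not fits:
        return None
    return url[len(root):len(url) - len(suffix)]


def select_set(candidate_urls, selected_url, root, suffix=""):
    """Keep candidates sharing the root (and suffix) whose variable portion differs
    from that of the selected file."""
    selected_var = variable_portion(selected_url, root, suffix)
    chosen = []
    for url in candidate_urls:
        var = variable_portion(url, root, suffix)
        if var is not None and var != selected_var:
            chosen.append(url)
    return chosen
```

The selected file itself is excluded because the user already has it; every other member of the set is returned for downloading.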
Referring now to Figs. 5A-5B, a user has accessed a web page (http://www.cia.gov/cia/publications/factbook/country-frame.html) 70 from a web server (i.e., a host computer) via a web browser 72 (Block 100, Fig. 4). The illustrated web page 70 includes a left frame 70a that displays a web page (http://www.cia.gov/cia/publications/factbook/country.html) 73 and a right frame 70b that displays a web page (http://www.cia.gov/cia/publications/factbook/ag.html) 74. The web page 73 contained within the left frame 70a provides a list of hypertext links to web pages pertaining to various countries. The web page 74 contained within the right frame 70b is a web page for a respective country (Algeria) whose hypertext link has been activated in the web page 73 displayed within the left frame 70a.
The illustrated web page 74, displayed within the right frame 70b, includes an image (ag-150.gif) 75 of the selected country, Algeria. As illustrated in Fig. 5B, the image 75 has the following URL: http://www.cia.gov/cia/publications/fact-book/figures/ag-150.gif. As is known to those skilled in the art, a user who desires a copy of the image 75 for the country Algeria can position a mouse cursor over the image 75 and click on the right mouse button to save the image 75 to his or her client computer. However, to copy each map for each country listed in the web page 73, a user would have to open each country file and repeat the above procedure.
According to one embodiment of the present invention, however, a user can select a particular file and then automatically download all files similar to the selected file. In other words, the present invention searches a host computer for all files that are within a set within which a selected file is a member. Accordingly, a user can select the displayed image 75 of Algeria and automatically download a respective image for each country listed in the web page 73 without having to access or display a web page for each country.
Referring now to Fig. 6, the Pattern Entry Module (43, Fig. 3) has been initiated and a user interface 78 associated therewith is displayed. According to one embodiment of the present invention, a user defines a set of files to download by indicating, within the user interface 78 of Fig. 6, the root and variable portions of a URL (Block 200, Fig. 4). For example, a user could view the URL for various images for the respective countries listed in the web page 73 (Fig. 5A) to determine the root and variable portions of each URL. For the illustrated example, the URL root portion of each country map image is http://www.cia.gov/cia/publications/fact-book/figures/. The variable portion of each URL is an abbreviation of each country.
According to another embodiment of the present invention, a user can select a web page or a portion of a displayed web page and the URL for the selected web page, or portion thereof, can be automatically analyzed via various agents (48, Fig. 3) to determine one or more root portions that are shared by other files within the host computer (Block 200, Fig. 4).
For example, a user identifies a first URL of interest, such as http://www.reports.com/Feb.html, and wants to locate all files in the host computer having the same root portion. An agent can be invoked to locate another URL within the host computer and to begin comparing the two URLs. The agent locates a URL http://www.reports.com/Jan.html. The two URLs are then parsed and compared by the agent to determine the non-variable (i.e., root) portions and the variable portions. The agent determines that the root portions of each URL are "http://www.reports.com/" and ".html" and that the variable portion in each URL is between the two root portions (i.e., "Jan" and "Feb").
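One way such an agent might perform the simple comparison above is to take the longest common prefix and the longest common suffix of the two URLs as the root portions and treat whatever lies between them as the variable portion. The Python sketch below is a hedged illustration of that idea only; the function name split_simple is hypothetical and the disclosure does not specify the comparison algorithm.

```python
import os


def split_simple(url_a, url_b):
    """Compare two URLs and return (leading_root, trailing_root, var_a, var_b)."""
    # os.path.commonprefix compares strings character by character.
    prefix = os.path.commonprefix([url_a, url_b])
    rest_a, rest_b = url_a[len(prefix):], url_b[len(prefix):]
    # Common suffix = common prefix of the reversed remainders, so it can
    # never overlap the leading root already removed.
    suffix = os.path.commonprefix([rest_a[::-1], rest_b[::-1]])[::-1]
    var_a = rest_a[:len(rest_a) - len(suffix)]
    var_b = rest_b[:len(rest_b) - len(suffix)]
    return prefix, suffix, var_a, var_b
```

Applied to the two URLs of the example, this yields the root portions "http://www.reports.com/" and ".html" with the variable portions "Feb" and "Jan". A single prefix/suffix pair cannot describe URLs with more than one variable portion.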
Agents according to the present invention can also parse complex URLs containing multiple root and variable portions. For example, a user identifies a first URL of interest, such as http://www.FebReport.com/Feb.html, and wants to locate all files in the host computer having the same root portion. An agent can be invoked to locate another URL within the host computer and to begin comparing the two URLs. The agent locates a URL http://www.JanReport.com/Jan.html. The two URLs are then parsed and compared by the agent to determine the non-variable (i.e., root) portions and the variable portions. The agent determines that the root portions of each URL are "http://www.", "Report.com/", and ".html". The agent also determines that the variable portions in each URL are between the first and second root portions and between the second and third root portions (i.e., "Jan" and "Feb").
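For URLs with several root and variable portions, one possible approach is to align the two strings and treat every sufficiently long matching block as a root portion, with the gaps between blocks as variable portions. The sketch below uses Python's standard difflib.SequenceMatcher for the alignment. It is an assumption-laden illustration, not the comparison agent of the disclosure: the function name and the min_root_len threshold (used to ignore accidental one-character matches inside variable portions) are hypothetical.

```python
from difflib import SequenceMatcher


def split_roots_and_variables(url_a, url_b, min_root_len=2):
    """Return (roots, vars_a, vars_b) for two URLs that follow the same pattern."""
    matcher = SequenceMatcher(None, url_a, url_b, autojunk=False)
    roots, vars_a, vars_b = [], [], []
    pos_a = pos_b = 0  # end of the last accepted root in each URL
    for block in matcher.get_matching_blocks():
        at_end = block.size == 0  # the final block is a zero-length sentinel
        if not at_end and block.size < min_root_len:
            continue  # too short to count as a root; stays inside the gap
        gap_a, gap_b = url_a[pos_a:block.a], url_b[pos_b:block.b]
        if gap_a or gap_b:
            vars_a.append(gap_a)
            vars_b.append(gap_b)
        if not at_end:
            roots.append(url_a[block.a:block.a + block.size])
            pos_a, pos_b = block.a + block.size, block.b + block.size
    return roots, vars_a, vars_b
```

For the FebReport/JanReport example this produces three root portions and two variable positions, each holding "Feb" in one URL and "Jan" in the other. With only two sample URLs, variable portions that happen to share characters can be split differently, so comparing additional URLs from the same host would make the result more reliable.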
Once the root portions of a URL for a specific file are identified, either automatically or by a user, a set of all files within the host computer having a matching root portion (or portions) is identified (Block 300, Fig. 4). The variable portions of each file in the set are identified and stored within a file 79 as illustrated in Fig. 7.
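Given a stored list of variable portions such as the file 79, the full URL of each file in the set can be rebuilt by joining the root portion, one variable portion, and any trailing root portion. The Python sketch below assumes one variable portion per line in the file; the function name build_urls, the one-per-line layout, and the sample abbreviations other than "ag" are illustrative assumptions. The trailing portion "-150.gif" is inferred from the image name ag-150.gif used in the example.

```python
import io


def build_urls(root, variable_lines, suffix=""):
    """Join root + variable portion + suffix for each non-blank line."""
    urls = []
    for line in variable_lines:
        variable = line.strip()
        if variable:  # skip blank lines in the stored list
            urls.append(root + variable + suffix)
    return urls


# Stand-in for the stored text file; "fr" and "us" are hypothetical entries.
variable_file = io.StringIO("ag\nfr\n\nus\n")
map_urls = build_urls(
    "http://www.cia.gov/cia/publications/fact-book/figures/",
    variable_file,
    "-150.gif",
)
```

Any iterable of lines works as variable_lines, so an opened text file can be passed in place of the in-memory stand-in.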
In the illustrated user interface 78 of Fig. 6, the root portion of the URL for the selected file has been entered within the field 78a, entitled "First Part." The file containing the respective URL variable portions for each file, entitled "delete.txt", has been entered into the field 78b, entitled "Middle." The suffix for the URLs within the identified set has been entered within the field 78c, entitled "Last Part." A user may indicate, via field 78d, the name of a local directory within which to store the files as they are downloaded from the host computer. Upon activating the button 78e, entitled "Get Files," all files within the set that satisfy the criteria set forth in fields 78a, 78b and 78c are downloaded to the user's client computer (Block 400, Fig. 4). Fig. 8 illustrates the user interface of Fig. 6 during the process of downloading each file within the set to a directory entitled "maps" on the user's client computer. Fig. 9 illustrates the downloaded files stored within a specified directory on the user's client computer.
According to another aspect of the present invention, data can be extracted from downloaded HTML files (Block 500, Fig. 4) using various known parsing
techniques and arranged in different formats (Block 600, Fig. 4) to suit the needs of a user. Fig. 10 illustrates an exemplary table of data 80 in HTML format as displayed within a web page. According to the present invention, a web page containing the displayed table 80 can be downloaded as described above, and the data from the table 80 extracted therefrom and arranged in a different format. As illustrated in Figs. 11A-11B, data for the first eight entries in the web page table 80 of Fig. 10 has been extracted and arranged in a database 82 (Fig. 11A) having a format 84 as indicated in Fig. 11B. As illustrated, the table columns of Fig. 10 have been converted to fields within the database 82.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the claims. In the claims, means-plus-function clauses are intended to cover the structures described herein as performing the recited function and not only structural equivalents but also equivalent structures. Therefore, it is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the following claims, with equivalents of the claims to be included therein.