WO2002046970A2 - Systeme permettant de repondre a un besoin d'information par des techniques d'appariement approfondies - Google Patents
Systeme permettant de repondre a un besoin d'information par des techniques d'appariement approfondies Download PDFInfo
- Publication number
- WO2002046970A2 WO2002046970A2 PCT/US2001/046542 US0146542W WO0246970A2 WO 2002046970 A2 WO2002046970 A2 WO 2002046970A2 US 0146542 W US0146542 W US 0146542W WO 0246970 A2 WO0246970 A2 WO 0246970A2
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- terms
- query
- documents
- contexts
- match
- Prior art date
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9538—Presentation of query results
Definitions
- the present invention relates to an apparatus and accompanying methods for a system that accepts a query and fulfills an information need expressed by the query based on a body of information such as a collection of documents.
- the invention has particular utility in connection with text indexing and retrieval systems, such as retrieval of information from the World Wide Web.
- Search engines may also permit the formulation of Boolean queries, which allow words in the query to be combined using logical operations such as AND, OR, and NOT.
- Boolean queries allow words in the query to be combined using logical operations such as AND, OR, and NOT.
- the query Clinton AND congress AND (NOT Hillary) selects documents that contain the word Clinton and the word congress but that do not contain the word Hillary.
- search engines retrieves documents that contain the exact phrase Bill Clinton while rejecting documents that contain the word Bill and/or the word Clinton separately.
- Examples of search engines offering these capabilities are search engines used with the World Wide Web such as AltaVistaTM, LycosTM, InktomiTM, InfoSeekTM, NorthernLightTM, HotBotTM, MSN SearchTM, GoogleTM and Yahoo ! TM.
- Additional search engines include those used for searching documents found in databases, digital libraries or other information sources such as InfoSeek UltraSeek ServerTM.
- the result of a search using search engines such as those mentioned above is a list of relevant documents, generally displayed in some order, for example, from the most relevant document to the least relevant document.
- search engines rank the documents according to some metric. Typically, the ranking will first show documents containing the highest number of keywords.
- Fig. 1 shows the result of a search on HotBotTM (http://www.hotbot.com for the query presidential candidates.
- Fig. 1 shows the first 10 documents ranked from the most relevant document to the least relevant document.
- Each of the results consists of a description of a document. Such description includes the title of the document, a description of the document, and its Internet Uniform Resource Locator (URL).
- URL Internet Uniform Resource Locator
- the search will return documents that contain Alexander Heard and Packard Bell since such documents contain the words Alexander and Bell. Such documents are unlikely to be relevant to a search for documents about Alexander Bell. With most conventional search engines, as described above, it is not possible to search for the phrase Alexander followed by any word and followed by Bell, e.g., Alexander _ Bell. Such a query should match documents that contain the string Alexander Graham Bell. To provide more flexibility in searching, some search and query systems offer proximity operators. These operators allow a user to request that the words in the query be within a certain distance of each other in a document. For example, a query using the NEAR operator (a feature of the Alta Vista search engine) retrieves documents that contain the search terms within 10 words of each other. Thus a user seeking to learn the middle name of Alexander Bell could enter a query such as
- This query will retrieve documents containing any of the strings Alexander Bell, Alexander Graham Bell, Alexander Heard and Packard Bell, Alexander heard the bell, Alexander Frederick Edward Bell, among others.
- these strings contain several types of words such as verbs, determiners, etc., in addition to names.
- the query results still suffer from a lack of specificity.
- Certain search and query systems such as DIALOG ® allow a further refinement in that the user may request that the search terms occur within a given distance of each other within a document. For example, the (3W) operator requests that the search terms occur within three words of each other. This option offers more control over the query results but still retrieves documents based on the positional relationship of the search terms within the document, regardless of the actual content of the intervening text. Although reducing the number of intervening terms may reduce the number of irrelevant documents retrieved, in the case of many queries it may also result in failure to retrieve relevant documents.
- the query structure available in traditional search and query systems is not conducive to searching for information that may lie outside the bounds of the search terms.
- a user who has forgotten the last name of George Washington Carver an agricultural chemist and holder of U. S . Patent Nos. 1 ,522, 176 and 1,541,478) could conduct a search using the specified terms George and Washington. While yielding hundreds of documents related to George Washington, such a search would be unlikely to provide the desired information about George Washington Carver.
- identifying documents that contain phrases requires either indexing the phrases as single terms or identifying documents that contain the individual terms, locating the terms, and determining their positional relationship to each other. Indexing phrases as single terms is generally limited to phrases of two or at most three terms, and it is not feasible to index all or even a majority of phrases. Also, in order to provide information about the terms adjacent to or surrounding a phrase of interest, prior art systems must access the document that contains the phrase.
- conventional searching systems suffer from a major drawback in that they typically provide only names, titles, URLs, file names, or identifiers of documents in response to a query.
- conventional Web search engines provide only the name of a document containing a search term, together with a uniform resource locator (URL) for that document. The user must then access and examine the document in order to locate the desired information.
- URL uniform resource locator
- results are returned on a document by document basis. The order of the documents is frequently based on the number of occurrences of the search terms within each document.
- Traditional search and query systems do not perform an assessment of the results of the query across multiple documents. Thus the results provided by traditional search and query systems do not consider the information content of the database as a whole.
- a search and query system meeting the needs described above also offered the ability to wholly or partially specify the ordering of terms as they appear within the text that may be considered matches for the partially unspecified query. Additionally, it is desirable that such a system not disregard potential matching strings simply because terms not matching the fully or partially specified terms of the query intervene between the fully or partially specified terms of the query. Finally, when there are multiple different strings that match the partially unspecified query, there is a need to present those strings in a relevant order. In particular, there is a need for a search and query system that ranks the results based not on the contents of the individual documents that contain the search terms, but rather on the content of multiple documents. In other words, there is a need for a search system that considers the query in relation to the information available within a large body of information such as a large collection of documents rather than in relation to single documents.
- the present invention provides a system and method for fulfilling an information need using an extended matching technique.
- the invention extends the matching capabilities presented in Applicants' co-pending U.S. Patent Application entitled “System for Fulfilling an Information Need” (hereinafter "the Information Need application”).
- the extended matching technique processes queries containing a partially unspecified portion.
- the technique allows for the identification of matches in documents in which the matching terms need not appear in the same relative order as in the query and in which there may be intervening words between the matching terms.
- Such queries are referred to as unordered queries.
- the technique also allows for queries in which the order of some or all of the terms is specified, referred to as ordered queries.
- a second implementation encodes a partially or wholly unspecified query using alternative data structures such as trees, graphs, or finite state machines.
- finite state transducers FSTs
- Weights associated with each arc of the FST may be used to accumulate a score for a match that reflects the difference(s) between the query and the matching context.
- the query is additionally converted into a Boolean expression to optimize search efficiency.
- Another aspect of the invention employs novel index data structures including index data structures that identify terms within documents that satisfy restrictions associated with partially unspecified terms of the query. Matches for the query and/or the partially unspecified terms of the query are identified by intersecting lists of documents.
- the invention addresses the needs described above by providing new approaches to the task of gathering information related to an information need based on a large body of information.
- the invention is a system (e.g., a method, apparatus, and computer-executable process steps) for fulfilling an information need.
- the invention identifies mformation to provide a result for a query containing an unspecified portion in addition to or instead of a specified portion.
- the unspecified portion of the query corresponds to an information need, for example, an item of information sought by a user.
- the system allows a user to enter a query such as _Gates, and receive back matches including the name Bill.
- a query containing an unspecified portion can more explicitly express the need.
- the query structure of the present invention facilitates accurate expression of the information need by permitting restrictions on the unspecified portion as described below.
- the invention then addresses the need by identifying matches for the query within documents.
- the invention allows for specification of tightened or relaxed ordering of terms within text strings considered matching the query, which will be referred to as ordered and unordered queries, respectively. Additionally, some embodiments of the invention allow acceptance of text strings as matches despite the presence of terms intervening among the terms of the query.
- the present invention returns the matches themselves or relevant portions of the matches (e.g., portions that correspond to the unspecified portion of the query).
- Document locations and/or identifiers and other information can optionally be returned in addition to the matches, and the documents themselves can be ranked.
- the invention ranks the results based on the occurrence of matches across a plurality of documents rather than on a document by document basis. The invention has particular utility in obtaining results for queries by searching documents on the World Wide Web.
- the invention is a system (i.e., a method, apparatus, and computer-executable process steps) for obtaining a result for a query that contains one or more unspecified portions.
- the unspecified portion can be wholly unspecified, as in the query Alexander _ Bell.
- the unspecified portion is only partially unspecified, in which case it includes a restriction that defines a category of terms that are acceptable matches for the unspecified portion.
- the restriction may indicate that an unspecified portion of a query must meet a category criterion, a morphological or syntactic criterion, or a criterion defined by a computer program.
- the restriction may indicate that an unspecified portion of a query must be a name, date, noun phrase, etc.
- the query Microsoft _ [NUM] includes a partially unspecified portion with a restriction indicating that only phrases in which the word Microsoft is followed by a number should be identified as a valid match.
- the query Microsoft _ [NPJ includes a partially unspecified portion with a restriction indicating that only phrases in which the word Microsoft is followed by a noun phrase (e.g., Windows operating system, computer corporation, etc.) should be identified as valid matches.
- a partially unspecified portion is indicated by a single underscore character (in some cases followed by an associated restriction)
- the matching text may comprise multiple terms.
- the convention that a partially unspecified term is represented by an underscore character followed by a restriction will be adopted herein.
- the underscore and its associated restriction are generally separated by a blank within this document.
- the system receives a query containing an unspecified portion.
- the system identifies matches for the query within a body of information.
- the body of information comprises a body of text, which may be organized in the form of discrete documents.
- the invention preferably identifies a plurality of matches for the query and ranks the identified matches.
- the ranking is based on any of a variety of criteria. In preferred embodiments of the invention the ranking reflects the number of times an instance of the match is identified within the portion of the body of information that is searched.
- the matches are output, preferably in an order based on the ranking, to provide a result for the query. Additional information, such as a score for the match, can also be provided as part of the results.
- a result for a query containing an unspecified portion is obtained based on the contents of documents in a database.
- the system stores, in memory, an index identifying documents that contain terms.
- the index may also store the locations of the documents within a database (e.g., the URLs of the documents when the database comprises pages on the World Wide Web).
- the index also stores information identifying terms that satisfy restrictions and/or computer programs that define restrictions. For example, the index may store terms that meet the restriction [PROPER NAME], terms that meet the restriction [COUNTRY], terms that meet the restriction [NP] (noun phrase), terms that meet the restriction [BASEBALL PLAYER] among others.
- the system when it receives a query containing an unspecified portion, it converts the query into a Boolean expression and applies searching techniques to identify documents and preanalyzed contexts (text strings) from the index that satisfy the Boolean expression, e.g., containing some or all of the terms in the query. The system then locates matches for the query by converting the query to a FST that is matched against the content of the identified contexts through the Boolean matching process.
- the query includes an unspecified portion containing a restriction
- the system uses the information identifying terms that satisfy the particular restriction and/or the computer programs that define restrictions to determine whether a potential match is indeed a match. In certain preferred embodiments, this information has already been encoded into the preanalyzed contexts in the form of weighted arcs in a finite state automata representation of the preanalyzed context.
- the invention assigns a penalty or cost to a particular match reflecting the dissimilarity of term ordering between the query and the potentially matching context. In other embodiments, such a penalty or cost is assigned if intervening terms appear between the query terms as they appear in a potentially matching context. A threshold penalty or cost may be established above which a potentially matching context may be rejected.
- the system provides (i.e., outputs) some or all of the matches as a result for the query.
- the result may include a portion of a match that corresponds to a wholly or partially unspecified portion of the query rather than the entire match.
- the result also includes identifiers and/or locations for the documents containing the matches and, optionally, additional information about the match.
- the result can be provided to a user by displaying the matches or portions thereof, the document identifiers or locations, etc., in an appropriate format on a display screen. If a particular match appears in multiple documents, the documents in which that match is found may themselves be ranked, e.g., based on the number of times an instance of the match is located within a document.
- the system locates matches it preferably accumulates information related to the matches and, optionally, information related to the documents that contain the matches.
- a located match is assigned a score, and the match is stored in a match list together with the score.
- the score preferably reflects the number of times an instance of the match is identified among a plurality of documents.
- the system ranks the located matches. The ranking is preferably based on the content of a plurality of documents, e.g., the number of times an instance of the match is located among a plurality of documents. Additional information that is accumulated as the matches are located may also be used to assign a score to a match and/or to rank the matches.
- the system outputs the results of the query in an order based on the ranking.
- the system Prior to the processing of any queries, the system stores information (e.g., in one or more indices) identifying a set of contexts for a term.
- the contexts correspond to linguistically analyzed text strings containing the given term, the strings occurring within documents in the database.
- the contexts are stored as finite state automata. Indications of which contexts are associated with a given document, and information relating to restrictions that are satisfied by given contexts may also be stored.
- the system locates matches for the query within the set of contexts rather than searching for matches within the documents themselves, thereby providing an opportunity for faster and more efficient processing of the query.
- the system locates matches among the contexts it also accumulates information related to the matches.
- a located match is assigned a score, and the match is stored in a match list together with the score.
- the score reflects the number of occurrences of the match among a plurality of contexts (and therefore among a plurality of documents).
- the system ranks the located matches, preferably based on the content of a plurality of contexts (and therefore also on the content of a plurality of documents). For example, the ranking can be based on the number of times an instance of a match is found across a plurality of contexts.
- the system provides (i.e., outputs) the results of the query, preferably in an order based on the ranking.
- the results include identifiers and/or locations of the documents containing contexts that match the query and, optionally, additional information about the matches and/or documents.
- the invention is a system for fulfilling an information need by searching contexts for one or more terms that appear in a query.
- the contexts are obtained from a body of information such as a collection of documents.
- the query may consist of fully specified terms or partially and/or unspecified terms.
- the results for the query can comprise contexts themselves or portions thereof, in addition to or instead of document identifiers or locations.
- a feature of the invention in this aspect is that rather than searching documents containing query terms, the system searches contexts that contain the terms.
- the invention is a system for creating an index identifying contexts for terms, the contexts occurring within documents in a database.
- the system first selects a document in the database and then selects a term within the document.
- the system identifies contexts for the term within the document, the contexts corresponding to strings within the document that contain the term. .
- the system further stores information identifying the document and associating the document identifier with the context.
- a given term may have multiple contexts within several document
- the system stores information indicating the number of documents that contain the term and, for each document, the number of contexts containing the term.
- the system selects another term within the document and identifies contexts for that term within the document.
- the context identification process continues for the selected document until a set of contexts has been identified for a plurality of terms (e.g., all the terms in the document or all the terms excluding determiners, etc.). The entire process is repeated for a plurality of documents in the database.
- information about the contexts is also stored. Such information includes factors that may provide an indication of the significance of the context, such as the position of the context within the document, the age of the document in which the context appears, or the co-occurrence of certain words within the context.
- the contexts are stored as finite state automata. Information about a context is stored within the finite state automaton that represents the context.
- the system does not store information identifying the document in which a context appears.
- the system selects a document in the database and then selects a term within the document.
- the system identifies contexts for the term within the document, the contexts corresponding to strings within the document that contain the term.
- Each identified context is added to a set of contexts for that term.
- the context identification process continues for the selected document until a set of contexts has been identified for a plurality of terms in the document. The entire process is repeated for a plurality of documents in the database.
- the system stores information indicating the number of documents that contain the term.
- the contexts are stored as finite state automata. Such information includes factors that may provide an indication of the significance of the context, such as the position of the context within the document, the age of the document in which the context appears, or the co-occurrence of certain words within the context.
- the contexts are stored as finite state automata. Information about a context is stored within the finite state automaton that represents the context.
- the invention is a data structure identifying subsets of preanalyzed contexts for terms, each subset of contexts occurring within a document in a database.
- the data structure includes a term array, a document array, and a context array.
- the context array contains subsets of preanalyzed contexts, each subset consisting of preanalyzed contexts in which the term occurs within a document.
- the context array may also include, for each subset of contexts, information indicating the number of contexts in the subset.
- the document array includes information identifying the documents within which the term appears.
- the context array contains information for accessing, within the context array, the subset of contexts for the given term and document.
- the term array includes the number of documents containing the term and information for accessing, within the document array, the document identifiers for documents that contain the term.
- the contexts are stored as finite state automata.
- the invention is a data structure identifying contexts for terms, the contexts occurring within documents in a database.
- the data structure includes a term array and a context array.
- the context array contains a set of contexts in which the term occurs in documents in the database.
- the term array contains the number of contexts stored in the context array for the given term and also contains information for accessing the set of contexts for that term within the context array.
- the contexts are stored as finite state automata.
- Fig. 1 is an example of a search result obtained using a traditional search system.
- Fig. 2 shows a representative embodiment of a networked computer system that may be used to implement the present invention.
- Fig. 3 is a perspective view of a query server of in the present invention.
- Fig. 4 shows the architecture of the query server shown in Fig. 2.
- Fig. 5 shows a portion of the architecture of the indexing computer shown in Fig. 2.
- Fig. 6 shows an example of search results obtained in accordance with the
- Fig. 7 shows an example of search results for the queries Microsoft Windows _ and _
- Fig. 8 shows an example of search results for the query Microsoft _ 2000 in accordance with the Information Need inventions.
- Fig. 9 shows an example of search results for the query Microsoft Windows _ [NUM] in which the unspecified portion of the query is restricted to match only a number.
- Fig. 10 shows an example of search results for the query Windows _ [NUM]_ in which the unspecified portion includes a partially unspecified term and a wholly unspecified term.
- Fig. 11 shows an example of search results for the query Alexander Graham Bell invented the _ [NP] in which the unspecified portion of the query is restricted to match only a noun phrase.
- Fig. 12 shows a screen dump illustrating results for the query _ [WHO] French President
- Fig. 13 shows a screen dump illustrating results for the query _ [WHEN] born
- Fig. 14 shows a screen dump illustrating results for the query _ [CITY] born
- Benjamin Franklin Fig. 15 shows an alternative user interface that allows a user to select the question word Wliere to express an information need.
- Fig. 16 shows an alternative user interface that allows a user to select the question word Wliere to express an information need.
- Fig. 17 is an illustration of data structures according to the present inventtion.
- Fig. 18 is a flow diagram of an embodiment of the extended matching process.
- Fig. 19 is a flow diagram of an embodiment of the conversion process for converting a query into a Boolean expression.
- Figs. 20A-F illustrations of finite state machine representations of example query NHlfM] invent and _ invent in any order.
- Figs. 21A-B shows screen dumps of alternative user interfaces for the present invention.
- Fig. 22 shows a screen dump of search results for the query win U.S. open on the
- Fig. 23 is a flow diagram for a process that stores contexts for terms within an index.
- Fig. 24 is an illustration of a finite state automaton for Alexander Graham Bell.
- Fig. 25 is an illustration of a finite state automaton for Alexander Bell invented the telephone.
- Fig. 26 is a flow diagram of a process for generating FSA representations of text strings.
- Fig. 27 is a flow diagram of an alternate embodiment of the extended matching process. Detailed Description of Certain Embodiments
- FIG. 2 shows a representative embodiment of a networked computer system 1 that may be used to implement the present invention.
- a searching site 2 is logically connected, via a network such as the Internet, to one or more client computer systems 3.
- Client computer system 3 can comprise any available computer but is typically a personal computer equipped with a processor, memory, display, keyboard, mouse, storage devices, and appropriate interfaces for these components.
- Client system 3 accepts a user-generated query and transmits the query to searching site 2.
- client system 3 operates a Web browser, and the query is entered into the Web browser that transmits the query to searching site 2.
- Searching site 2 may include one or more query servers 4, which may be logically connected to one another for example through a local area network, intranet, or the like.
- Query servers 4 may be connected to one or more peripheral storage devices 5.
- Searching site 2 also includes indexing computer 6, which is preferably also logically linked to the network and to query server(s) 4.
- indexing computer 6 collects documents and analyzes the documents to produce an index, as described in more detail below.
- query server 4 includes a local area network connection 7 for interfacing to additional query server(s), indexing computer(s) 6, peripheral storage device(s), etc.
- Query server 4 also includes a general network connection 8 for interfacing to a network such as the Internet, and fax/modem connection 9 for interfacing with other remote sources.
- query server 4 includes display screen 10 for displaying information, keyboard 11 for inputting text and user commands, mouse 12 for positioning a cursor on display screen 10 and for inputting user commands, disk drive 13 for reading from and writing to floppy disks installed therein, and CD-ROM drive 14 for accessing information stored on CD-ROM.
- query server 4 may also have one or more peripheral storage devices attached thereto.
- Fig.4 shows the internal structure of query server 4.
- query server 4 includes memory 15, which comprises one or more computer-readable media, such as a computer hard disk.
- memory 15 comprises a hard disk, but it may comprise other storage media such as floppy disks, CD-ROMs or read/write CDs, or other devices known in the art.
- Memory 15 stores data 16, applications, and an operating system 17.
- Processor 25 preferably comprises a microprocessor or the like for executing applications out of RAM 24.
- processor 25 accesses applications (or other data) stored on a floppy disk via disk drive interface 21 and accesses applications (or other data) stored on a CD-ROM via CD-ROM drive interface 22.
- query server 4 may be controlled and/or altered using keyboard 11 or mouse 12, commands from which are transmitted to processor 25 via keyboard interface 19 and mouse interface 20, respectively. Such tasks may also be controlled via commands transmitted to query server 4 via network connections 7, 8, or 9.
- Output results from some applications running on query server 4 may be processed by display interface 18 and then displayed to a user on display 10.
- display interface 18 preferably comprises a display processor for forming images based on data provided by processor 25 over computer bus 23, and for outputting those images to display 10. Certain applications may provide outputs to network interfaces 7, 8, or 9.
- query server 4 also includes index storage area 26 and document storage area 27.
- query server 4 optionally includes peripheral storage devices 5 that may mclude index and/or document storage areas 26 and 27.
- data stored in memory 15 includes category lists 28, dictionaries 29, and index or indices 50 which are discussed further below.
- Information retrieval application 30 includes query module 32, match engine 34, and ranking module 36, which together comprise computer-executable process steps to accept a query and satisfy an information need expressed by the query based on a body of information such as a collection of documents in a database.
- information retrieval application 30 receives a query and identifies matches for the query that occur within the body of information.
- the application locates a match it gathers information about the match, assigns a score to the match, and stores the match and the associated score.
- the score is based on the number of times an instance of the match is identified within the body of information.
- information retrieval application 30 uses the score to rank the matches.
- the results for the query comprise the matches themselves or portions thereof.
- the results may include a list of documents in which a given match is found. Additional information such as the location of the documents and/or other information about the matches or the documents may be provided in the results. A detailed description of these process steps is provided below.
- a user-generated query is entered into a client system and transmitted to the searching site 2.
- queries may also be entered directly into query server 4 or may be generated by an application program stored either on the query server or elsewhere.
- the body of information searched by information retrieval application 30 comprises a collection of Web pages.
- the invention is not limited to searching Web documents and, in fact, can be used in conjunction with a variety of different types of documents and databases.
- the invention will be described with respect to searching based on text/character strings, the invention is not limited to this either. That is, the invention may also be used to provide results for queries based on images or other types of data from any type of database.
- applications stored in memory 15 may include results page generator 38, which comprises computer- executable process steps to format and output search results in a variety of ways.
- information obtained by information retrieval application 30 in response to the query can be provided to another application stored either on the query server or a connected computer, to a user of query server 4, or to a user at a remote location via network connection 7.
- Indexing computer 6 comprises a processor 39, bus 40, RAM 41, memory 42, and optionally peripheral storage 43.
- Memory 42 comprises one or more computer- readable media such as a hard disk.
- Indexing computer 6 preferably includes a general network connection (not shown) for interfacing to a network such as the Internet and also a local area network connection (not shown) for interfacing to query servers 4.
- Applications stored in memory 42 include indexer module 44, which comprises computer-executable process steps to construct an index identifying documents containing terms.
- applications include context indexer module 45, which comprises computer-executable process steps to store contexts for terms.
- Other applications may include Web robot 46.
- Web pages are used as a source of documents to be indexed by indexer module 44 and/or context indexer module 45.
- Web robot 46 may periodically search the Web and retrieve Web pages.
- Web robot 46 can comprise any Web robot, of which a number are commercially available.
- Memory 42 can mclude document storage area 47, although of course documents can also be stored on peripheral storage 43.
- Data stored in memory 43 preferably includes category lists 48 and dictionaries 49. In preferred embodiments of the invention these are identical to category lists 28 and dictionaries 29 stored on query server(s) 4.
- query servers 4 and indexing computer 6 access the same category lists and dictionaries.
- Index or indices 50 generated by indexing computer 6 are also preferably stored in memory 43.
- indexer module 44 is based upon the indexer described in "The SMART Retrieval System: Experiments in Automatic Document Processing” by Gerald Salton (Prentice-Hall, Inc. (1971)) and "A Theory of Indexing” also by Gerald Salton (J. W. Arrowsmith, Ltd. (1975)). The contents of these two documents are hereby incorporated by reference into the subject application as if set forth herein in full. Indexer module 44 and context indexer module 46 are described in more detail below.
- Indices generated by indexer module 44 and/or context indexer module 46 may be copied to query servers 4 or to any storage device to which query servers 4 have access as may the documents themselves or portions thereof, either separately or as part of the index. It is noted that the foregoing description is intended by way of example and is not intended to be limiting. A variety of computer configurations are consistent with the present invention. The two functions of indexing and query fulfillment (i.e., obtaining a result for a query) may be divided in any of a number of ways among one or more computers.
- the Information Need invention provides a new approach to the task of gathering information related to a query.
- the invention allows the processing of a query containing an unspecified portion in addition to or instead of a specified portion.
- the unspecified portion of the query corresponds to an information need, for example, an item of information sought by a user.
- a query containing an unspecified portion according to the present invention can more explicitly express the need.
- the invention then addresses the need by identifying matches for the query that occur within a body of information such as a collection of documents.
- the present invention returns the matches themselves or portions of the matches that correspond to the unspecified portions of the query.
- Document locations and/or identifiers and other information can optionally be returned in addition to the matches.
- the matches are ranked according to any of a variety of criteria.
- a general feature of the ranking criteria is that they are based on the content of a plurality of documents.
- One such criterion is the number of times an instance of the match is identified across all or a subset of documents in which the match occurs.
- ranking criteria of the present invention differ from criteria used in traditional search and query systems. Such systems may rank a given document based on the number of occurrences of the query term(s) within that document. In other words, in traditional systems documents in which the query terms appear with a high frequency receive a higher ranking than documents in which the query terms occur less frequently. In contrast, the present invention ranks matches for queries rather than ranking documents that contain query terms, and the ranking is based on the content of multiple documents.
- a score e.g., a score based on the number of times an instance of the match is identified
- ranking the different matches based on the score are not limited to situations in which the matches are located within a plurality of documents. Instead, the strategies are also applicable to anybody of information regardless of whether the information is in the form of discrete documents.
- a query having the query structure of the present invention as described below i.e., a query containing one or more wholly unspecified and/or one or more partially unspecified terms, may be used to identify matches within any body of information, including a single document as well as within a plurality of documents.
- the present Information Need invention operates on partially unspecified queries, i.e., queries that contain an unspecified portion.
- the unspecified portion of a query may be represented by a special symbol such as an underscore within the query and may include one or more unspecified terms.
- an unspecified term may be wholly or partially unspecified.
- the query Windows _ contains an unspecified portion. This query consists of two terms, of which the first is the specified (fully defined) term Windows. The second term (represented by an underscore) is unspecified and can represent any word, number, or other character sequence.
- the unspecified portion of this query consists of a single unspecified term.
- a query containing an unspecified portion is matched to a character string when the following conditions are met: (1) For each of the specified terms in the query, an equivalent (matching) term is located at the corresponding position in the character string.
- a term (any term) is located at the corresponding position in the string.
- the phrases Windows 95, Windows 98, Windows NT and Windows 2000 are all matches for the query Windows _.
- Fig. 6 shows results for the query Windows _ according to one embodiment of the Information Need invention.
- the matches are ranked based on the number of documents that contain the match and are presented according to this order. Since the match Windows 95 occurs in the most documents (five) it is presented at the top of the list. The match Windows 2000 occurs in only one document and is therefore presented at the end of the list. Information identifying the document(s) in which a particular match occurs is also provided in the results.
- the specified portion of the query can be of arbitrary length, i.e., can contain any number of terms.
- Fig. 7 shows an example of results for the query Microsoft Windows _ in which the specified portion consists of two specified terms. Such a query matches the phrases Microsoft Windows 95, Microsoft Windows NT, and Microsoft Windows browser among others.
- An unspecified portion can occur at any position within the query. In the preceding example the unspecified portion (a single unspecified term) follows the specified portion.
- Fig. 7 further shows an example of results for the query _ 2000 in which the unspecified portion (a single unspecified term) precedes the specified portion. Matches for this query include year 2000, Windows 2000, and election 2000 among others.
- the query _ Windows _ shows exemplary results for the query Microsoft _ 2000 in which the specified portion contains two terms that are separated by the unspecified portion (a single unspecified term).
- the query _ Windows _ consists of two unspecified terms separated by the specified term Windows.
- This query matches the phrases Microsoft Windows 95, Microsoft Windows 98, installing Windows 95, and using Windows 98, among others.
- a term in a character string is equivalent to a specified term in a query if it is identical to the specified term.
- a term may be considered as equivalent to a specified term even if the term is not identical to the specified term but is instead related to the specified term in some manner.
- morphologically related words such as leave and left may be considered as equivalent.
- an unspecified term can represent any character sequence. Such an unspecified term is referred to herein as a wholly unspecified term.
- an unspecified portion of a query can contain one or more partially unspecified terms in addition to or instead of one or more wholly unspecified terms.
- a partially unspecified term includes a restriction that defines a particular set of character sequences that can match the term. For example, a restriction on a partially unspecified term may require that a matching term be a number, a proper name, a noun phrase, etc.
- the queries Microsoft Windows _ [NUM] and
- a term (any term) is located at the corresponding position in the string.
- a term or a group of terms that satisfies the restriction associated with the partially unspecified term is located at the corresponding position in the string.
- Fig. 9 shows an example of results for the query Microsoft Windows _
- the Information Need invention also allows a query to contain one or more wholly unspecified terms in addition to one or more partially specified terms. For example, the query
- Windows _ [NUM] _ contains a partially unspecified term (with an associated restriction that requires that a matching term be a number) in addition to a wholly unspecified term.
- Fig. 10 shows an example of results for this query. As shown in the Figure, the query matches the phrases Windows 95 drivers, Windows 2000 software, and Windows 98 information among others.
- the set of terms represented by a partially unspecified term can be defined by characteristics a member must possess to satisfy the associated restriction. Examples of restrictions include categories such as proper name, location, country, date, unit of measurement, company name, baseball players, etc. Thus certain restrictions are best defined by lists of terms that fall into certain categories.
- a morphological criterion defines a requirement that must be met by a term in order to satisfy the restriction.
- a morphological criterion defines a feature of word structure, function, or inflection.
- a morphological restriction is used to indicate that a matching term must contain a particular unit (morpheme) such as the unit slow in the words slowly, slowest, etc.
- a mo ⁇ hological restriction may indicate a category such as a part of speech (e.g., noun, verb, preposition, adjective). Words falling within the category satisfy the restriction.
- the three phrases Microsoft wins, Microsoft loses, and Microsoft settles are examples of matches for the partially unspecified query Microsoft _ [VERB] since wins, loses, and settles each fall within the part of speech category VERB and therefore satisfy the restriction that the partially unspecified term be a verb.
- a morphological restriction is also used to indicate inflection, i.e., the variation in form of a single word for grammatical purposes, as with take, takes, taken, took, etc.
- the morphological restriction [PAST TENSE] indicates that a matching term must be the past tense form of a verb.
- a syntactic criterion defines a feature of sentence structure such as a grammatical unit from which sentences are constructed.
- phrasal categories such as noun phrase, verb phrase, or prepositional phrase.
- Fig. 11 shows an example of results for the query Alexander Graham Bell invented _ [NPJ in which [NP] is a restriction requiring that only a noun phrase can match the partially unspecified portion of the query.
- [NP] is a restriction requiring that only a noun phrase can match the partially unspecified portion of the query.
- the three strings the telephone, telephones, and many useful devices are noun phrases that match the partially unspecified term _[NP].
- the Information Need invention allows either a single term that occurs within a document or a group of terms that occurs within a document to match a partially unspecified term depending upon the nature of the restriction.
- a restriction can also be expressed as a computer program that is invoked to determine whether a particular term or group of terms satisfies the restriction. Further details regarding this form of restriction are provided below.
- a partially unspecified query of the Information Need and present inventions are distinct from a query in which a term contains a wildcard character.
- a wildcard character is a character whose pu ⁇ ose is to allow variations in length or spelling of a single term.
- such wildcard characters may occur as part of a term that is otherwise fully specified.
- the asterisk in the term col*r permits the recognition of either color or colour.
- the asterisk in comput* permits recognition of computer, computers, computation, etc.
- a wildcard character permits variation within the term that contains the wildcard character, but the variation is constrained by the defined letters in the term.
- a query in which a term includes a wildcard character is, in fact, fully specified with the exception that certain variations in length or spelling are permitted within an individual term including the wildcard.
- an entire term or terms can be unspecified.
- a wholly unspecified term in a query matches any term, without requiring the presence of specific letters.
- a match for a partially unspecified term is constrained by the restriction associated with the partially unspecified term rather than by a requirement for the presence of specific letters.
- the inventions also encompasses the use of wildcard characters within terms that are otherwise fully specified.
- query preprocessor 32 accepts a natural language question and converts the question into a partially unspecified query. A match for the query constitutes an answer for the question.
- applications stored on indexing computer 6 include indexer module 44.
- an indexing phase is performed in which terms that occur within documents are stored in a data structure known as an index 50.
- the index 50 identifies documents that contain the term.
- text strings containing the term known as contexts are also stored.
- certam embodiments of the inventions the number of instances of a term within a document is also stored.
- any collection of documents to which the inventions have access can be used as a source of documents.
- the documents are stored in document storage area 47.
- the invention does not necessarily store the actual full text of the documents but may instead store a representation of the documents, e.g., using numbers to represent text. Although for pu ⁇ oses of description it will be assumed that in those embodiments of the inventions in which matches are located within documents, the inventions store the full text, it is to be understood that this need not be the case. Furthermore, in descriptions of searching and matching provided below, the actual operations may be carried out on numerical representations of character strings rather than on the character strings themselves. In addition to storing the documents themselves, information about the documents such as the URL (for Web pages), date of creation, etc., is preferably also stored. In certain embodiments of the inventions document storage area 47 resides in memory 43, although this need not be the case. Portions of document storage area 47 may be located on different storage media and devices. In certain preferred embodiments of the inventions the documents are World Wide Web pages. Web documents may be identified by Web robot 46.
- indexing is used broadly to refer to any of a number of processes by which information associating a term T with a body of information containing the term is stored on a computer-readable medium. For example, indexing may identify documents in which term T occurs. As described below, indexing may identify character strings or contexts in which a term T occurs. Since for pu ⁇ oses of the present invention a body of information containing a term is typically a body of text, these two phrases will be used interchangeably.
- An index comprises the terms and actual information identifying, for a term T, the bodies of text that contain T. In certain embodiments of the invention, an index also comprises a body of text that contains T.
- indexer module 44 uses traditional techniques of indexing such as the one described by Salton (mentioned above) to create an index of terms, e.g., words that appear within documents.
- the index contains an identifier for documents in which the term appears.
- the index also includes data identifying the location(s) of a given term within the document(s) in which the term occurs. In certain situations, depending upon factors such as the size and number of the documents, the inclusion of such information greatly speeds the process of searching for matches.
- the indexed terms preferably include single words and numbers and also certain groups of words or phrases.
- index 50 is stored in memory 42 of indexing computer 6 and is copied to memory 15 of query server 4 at appropriate intervals (e.g., after updating).
- the invention is not limited to storing the index there.
- the index can be stored in peripheral storage comprising additional hard disks or, in general, on any computer-readable medium.
- portions of the index may reside on different devices and/or media, and part or all of the index can be copied or transferred from one device to another as appropriate. Construction and manipulation of indices containing thousands of terms and referencing thousands of documents is well known to those of ordinary skill in the art.
- indexer module 44 also indexes restrictions that may be associated with partially unspecified terms in a query. In other words, for each restriction, indexer module 44 identifies documents that contain a string that satisfies the restriction. Indexing of restrictions may occur concurrently with indexing of specified terms or in a separate phase. In the former case, as indexer module 44 identifies a term to be indexed, it also determines which, if any, restrictions the term satisfies. For example, if indexer module 44 encounters the term Microsoft within a document DI , it recognizes that Microsoft satisfies the restrictions [NOUN], [NAME], and [COMPANY NAME] among others.
- indexer module 44 stores the information that a string that meets the [NOUN] restriction occurs within document DI .
- Indexer module 44 also stores the information that a string that meets the [NAME] and [COMPANY NAME] restrictions occurs within document DI .
- indexer module 44 encounters the term 1890 within a document D2, it recognizes that 1890 satisfies the restrictions [NUM] and [DATE] among others.
- indexer module 44 stores the information that a string that satisfies the [NUM] and [DATE] restrictions occurs in document D2.
- indexer module 44 first selects a restriction and then extracts from a document those terms or groups of terms that satisfy the restriction. Indexer module 44 repeats this process for each restriction.
- indexer module may use a combination of these or other approaches or select a particular approach depending upon the situation.
- the restrictions that are indexed may include any or all of those described in the section on query structure and features, including those that are satisfied by groups of terms in addition to those that are satisfied by a single term.
- indexer module 44 identifies documents that contain terms or groups of terms that satisfy the restrictions [NP] (noun phrase), [VP] (verb phrase), [PAST TENSE], etc.
- indexer module 44 indexes both terms that occur within documents and also restrictions satisfied by strings within documents.
- the index identifies documents containing the term.
- the index identifies documents containing a term or group of terms that satisfies the restriction.
- indexer module 44 uses a variety of techniques to determine whether a string satisfies a restriction and/or to extract strings that satisfy restrictions from documents. For example, in the case of restrictions such as [COMPANY NAME], [COUNTRY], etc., it is possible to enumerate a list of terms that satisfy the restriction. In such a case indexer module 44 may simply compare a term T that occurs within a document with predefined category lists 48 that include terms that satisfy the restriction to determine whether term T satisfies the restriction. Depending upon the restriction, category lists 48 may be prepared by a human being.
- indexer module 44 may refer to a stored dictionary or dictionaries 49 containing information regarding word form, part of speech, compound words and phrases, etc.
- category lists 48 and dictionaries 49 as are needed by indexer module 44 are stored in memory 43, at least during the indexing phase.
- indexer module 44 may use any of a variety of known techniques of text analysis.
- mdexer module 44 invokes a computer program to determine whether a given term satisfies the restriction.
- the computer program accepts the term as input and provides as output an indication of whether the term satisfies the restriction.
- One example is the restriction [PRIMENUM], which is satisfied by the infinite set of prime numbers. Such a set cannot be enumerated, and the most straightforward way to recognize terms that satisfy the restriction is through the use of a program that is able to determine whether a number is a prime.
- the following final query example illustrates the use of this restriction.
- the query _ [NAME] was born in _ [PRIMENUM] identifies individuals who were born in a year that is a prime number. Results for this query include, among others, Charlie Chaplin was born in 1889 since Charlie Chaplin satisfies the restriction [NAME] and 1889 is a prime number. Such a query would not match Agatha Christie was born in 1890 because, although Agatha Christie is a name, 1890 is not a prime number.
- indexer module 44 stores the location of the terms within the documents. For example, if the term 'Microsoft' appears in three locations LI, L2, and L3 within document DI, then indexer module 44 stores the fact that a term satisfying the [NOUN] restriction appears within document DI at locations LI, L2, and L3. Indexer module 44 also stores the fact that a term satisfying the [NAME] and [COMPANY NAME] restrictions appears in DI at LI, L2, and L3). When a restriction is satisfied by a string containing more than a single term, indexer module 44 stores both the beginning and ending locations of the string.
- the number of instances of a term within a document is also stored.
- mdexer module 44 identifies and indexes terms that appear within documents, it may also store various types of information about the terms. Such information may include the position of the term within a document, features of the format of the term within the document, etc. For example, if a term appears within a title or within a bulleted list in document DI, indexer module 44 may include such information when storing the fact that the term occurs within DI . Format features include such characteristics as font size, boldface, italics, color, etc. For example, if the term icrosoft' occurs within the title in boldface in document DI, embedded within the text of document D4, and in italics within document D7, an entry for the term Microsoft may appear (in schematic form) as follows: Microsoft DI bold, title
- indexer module 44 indexes restrictions as well as terms, i.e., indexer module identifies documents that contain terms that satisfy the indexed restrictions.
- a restriction entry may also include information regarding the position and format of term(s) that satisfy the restriction.
- the entries for the restriction [COMPANY NAME] may appear (in schematic form) as follows: Company Name DI Microsoft bold, title
- restriction entries also include the locations within documents of term(s) that satisfy the restriction. It is noted that in the preferred embodiment of the invention in which the documents are Web pages, location and format features of terms within documents are typically encoded in the form of HTML tags within the documents themselves. This fact simplifies extraction of location and format features for terms.
- the partially unspecified query Agatha Christie was born _ [LOCATION] would be matched by the string Agatha Christie was born in England (since in England satisfies the restriction [LOCATION]) but would not be matched by the string Agatha Christie was born and raised in England since there are no terms in the query that correspond to the terms and and raised in the latter string.
- the word match can be used to refer either to the entire string that matches the query as described above or to simply that portion of the entire matching string that corresponds to a particular partially unspecified term in a partially unspecified query.
- the string Agatha Christie was born in England is a match for the query above.
- the string in England is also referred to as a match since it corresponds to the partially unspecified term _ [LOCATION], i.e., it fulfills the [LOCATION] restriction.
- the word "match" will be assumed to refer to just the portion of a matching string that corresponds to a partially unspecified term rather than to a matching string that also includes fully specified terms from the query.
- the extended matching technique modifies the criteria for a match.
- the extended matching technique in an extended match the relative order of (1) the term or group of terms in the extended match that correspond(s) to a partially unspecified term in the query and (2) the terms or term synonyms in the extended match that correspond to the fully specified terms in the query need not be preserved.
- the partially unspecified term indicated by the restriction [LOCATION] follows the fully specified terms in the query.
- the methods presented in the Information Need application would only identify a valid match in a passage of text in which a term or group of terms that satisfies the restriction (e.g., in England) follow terms that match the fully specified terms.
- a valid match would be identified in a passage such as the following:
- Unordered Query the extended matching technique allows for a match in cases in which the query terms (both fully specified and partially unspecified terms) all appear within a single document (or, as discussed below, in certain embodiments of the invention within a single context) but where they appear in any order relative to one another. In other words, there are no requirements that the terms in the document that match the terms in the query have any particular order.
- the document or context may contain intervening words separating the terms that match the fully specified terms in the query. There may also be intervening words separating the terms that match the partially unspecified term(s) and those that match the fully specified term(s).
- the extended matching technique allows the user to specify that some or all of the terms in the query must appear in a particular order in order for a match to occur.
- the following examples illustrate the operation of the Unordered Query embodiments of the extended matching technique.
- the extended matching technique will identify locations that appear either before or after the fully specified term Smithsonian whether or not there are intervening words between a string that fulfills the [LOCATION] restriction and the term Smithsonian. Thus the extended matching technique will return as a match the location Washington, D. C. from any of the following three passages of text (among others):
- the Smithsonian is one of the most popular tourist attractions in Washington, D.C.
- Orderered Query the extended matching technique allows a requirement that some or all of the terms in the query appear in the given order, without any intervening terms, within the document or context. (Of course a match for a partially unspecified term may be a group of terms rather than a single term.) The terms whose order must be maintained will be indicated herein by placing them within quotation marks, though of course a variety of other forms of annotation may be used.
- the order requirement may also encompass one or more partially unspecified terms and/or a mixture of fully specified terms and partially unspecified terms.
- the ordered query "French president _ [NHUM] " will only identify matches where the terms French and president appear in the specified order with a human name adjacent to president. Thus only the first passage above contains a match for this query.
- functionally an ordered query in which all terms appear within quotation marks should return the same results as a similar query using the technology described in the Information Need application. This type of query is referred to as a fully ordered query.
- the various query forms represent a tradeoff in terms of specificity and breadth.
- an ordered query in which all terms are ordered may be most likely to return an accurate result (e.g., a match that correctly fulfill the infonnation need) but at the expense of missing some matches that may correctly fulfill the need.
- the unordered query may offer the greatest likelihood of finding any potentially correct information. Note that although in certain embodiments of the invention the user is permitted to specify whether the query should be ordered or unordered, in other preferred embodiments the system may automatically generate one or more queries including ordered and/or unordered queries from an input query.
- the query results are ranked.
- the ranking may take into account, in addition to the criteria discussed in that application, other criteria such as whether the match was identified using an ordered or unordered query.
- the ranking may consider, for example, the degree of order imposed by the query, the number of intervening terms between the matching terms in the document or context, the number of instances of the match within all documents or contexts, etc. Additional ranking criteria are described in the Information Need application.
- the extended matching technique while expanding the set of possible matches for a partially unspecified query, may increase the likelihood of returning an inappropriate result for the query, in certain preferred embodiments of the invention the extended matching technique using unordered or partially ordered queries is only employed if use of the more rigid fully ordered query fails to identify a sufficiently large set of matches.
- a system according to the present invention may allow any of a variety of approaches for expressing a particular query. For example, different selections can be made for the words or abbreviations that are used to indicate restrictions. As will be evident to one of ordinary skill in the art, the particular choice of words or abbreviations that designate a particular restriction is not significant provided that consistency is maintained between the indexing of terms that satisfy the restriction and the restriction as ultimately used in performing the matching process. Thus it may be easier for a user to employ a question word (e.g., WHO, WHAT, WHEN, WHERE) rather than a corresponding restriction (e.g., NHUM, NOUN PHRASE, DATE, LOCATION).
- a question word e.g., WHO, WHAT, WHEN, WHERE
- a corresponding restriction e.g., NHUM, NOUN PHRASE, DATE, LOCATION
- the query _ [LOCATION] born Benjamin Franklin could be expressed as _ [WHERE] born Benjamin Franklin.
- the invention may accept input queries containing question words and convert them into queries containing the appropriate restriction. (Note that the preceding lists are nonexhaustive and are intended to be representative only.)
- FIG. 12 shows a screen dump illustrating results for the query _ [WHO] French President (which could also have been expressed as _ [NHUM] French President according to the query structure of the Information Need application).
- Fig. 13 shows a screen dump illustrating results for the query _ [WHEN] bom Benjamin Franklin (which could have been expressed as _ [DATE] born Benjamin Franklin).
- Fig. 14 shows a screen dump illustrating results for the query _ [CITY] born Benjamin Franklin.
- Fig. 15 shows an alternative user interface that allows a user to select the question word Where to express an information need. The query generated by this selection is presented following the word Search in the figure.
- Fig. 16 shows an alternative user interface that allows a user to select the question word Where to express an information need. The query generated by this selection is presented following the word Search in the figure.
- Fig. 16 presents the user with the option to refine the search, e.g., by applying narrower restrictions such as [CITY], [COUNTRY], etc.
- narrower restrictions such as [CITY], [COUNTRY], etc.
- a wide range of more detailed categories of restrictions may also be available to the user in some embodiments.
- the results optionally include a passage of text containing the match and/or a document identifier (in this case a link to the document).
- a drop-down window allows a user to select among a large number of category restrictions that assist in building a query.
- matches are returned as answers to the query JNHUM] win U.S. open with an indication of how relevant the system believes the answer to be.
- contexts for each match appear in this interface without a document identifier, but with a hyperlink to the complete document containing the context.
- An option presented to a user in this interface is to select to display more documents, if any exist, associated with a particular match.
- Fig. 21B illustrates the results when a user chooses to view additional documents associated with match Pete Sampras.
- Fig. 22 illustrates the results for running the query win U.S. open on the GOOGLETM search engine system.
- an indexing module indexes documents that will serve as an information source.
- the indexing module undertakes a linguistic analysis of the documents.
- terms that satisfy any of a variety of restrictions e.g., PERSONAL NAME, COMPANY NAME, LOCATION, ADDRESS, CITY, DATE, etc.
- the indexing module identifies terms in text that fall into various categories.
- the indexing module also identifies and indexes terms that satisfy mo ⁇ hological or syntactic criteria or that satisfy a computer program. For example, the indexing module identifies and indexes terms that meet criteria such as NOUN PHRASE, VERB, PREPOSITIONAL PHRASE, etc.
- Fig. 23 illustrates the preparation of an index that stores contexts.
- the context index is initialized to be empty.
- context indexer module 46 selects a new document (i.e., a document that has not yet been processed) from a database.
- a new document D i.e., a document that has not yet been processed
- context indexer 38 proceeds to step 1320, in which a new term T in document D (i.e., a term for which a context set from document D has not yet been obtained) is selected.
- a new term T has been found context indexer module 45 proceeds to step 1330.
- context indexer module 45 searches for a context C for term T.
- context indexer module 45 proceeds through decision step 1335 to step 1340, in which context C is added to the set of contexts CS for term T in association with a unique identifier for document D. At this step information about the context such as its position within a document may also be stored. Step 1340 is discussed in more detail below.
- step 1345 context indexer module 45 searches for the next context C for term T within document D. Steps 1335 through 1345 are repeated sequentially until no more contexts for T within D can be found. Processing then returns to step 1320, in which a next term T is selected. Steps 1325 through 1345 are repeated until a new term can no longer be found within document D.
- Steps 1310 through 1345 are repeated until no new documents can be found, at which point the context index is complete and processing stops.
- context indexer module 45 does not obtain a context set for certain terms, e.g., common words or words that have a low information content such as a, the, it, etc. Such words can be screened out at steps 1320.
- context indexer module 45 may store contexts that satisfy restrictions.
- a string such as Alexander Bell invented the telephone hi 1876 can serve as a context for the specified term telephone. Since the telephone is a noun phrase, Alexander Bell invented the telephone in 1876 is also a context that satisfies the restriction [NP].
- the string is also a context for the restrictions [PROPER NAME], [VERB], [DETERMINER], among others, since it contains terms or groups of terms that satisfy these restrictions.
- the context set CS for a term T actually consists of multiple subsets of contexts, each subset consisting of contexts found within a single document.
- Such a context set will be referred to as a type I context set.
- the contexts are stored without a document identifier. In such a case a single context set exists for each term.
- Such a context set will be referred to as a type II context set.
- the data structures in which the context sets are stored are described in further detail below.
- a finite state automaton is a structure having a finite number of states including an initial state and at least one final state (also referred to as an accept state).
- the states are connected by a plurality of arcs.
- the arcs have an input, which defines a requirement that must be met in order to traverse the arc, i.e., to proceed from one state to another.
- the automaton is driven from state to state by a sequence of inputs. If the automaton is in a final state (accept state) after processing the elements in the input string, then the FSA has accepted the input.
- Fig, 24 shows an example of an FSA that represents a context for the string
- the FSA includes the terms Alexander, Graham, and Bell.
- the FSA includes information about the restrictions that are satisfied by these terms.
- the FSA contains the information that Alexander Graham Bell is a proper name.
- the FSA further contains the information that in this case Alexander is a first name, Graham is a middle name, and Bell is a last name. Assuming that the leftmost state is an initial state and the rightmost state is a final state, this FSA accepts the input string Alexander Graham Bell. Depending upon which states are considered as initial and final states, the FSA may accept other inputs.
- Fig. 25 shows an FSA that represents a context for the term telephone.
- the labels on the arcs of the FSA represent information about the restrictions that are satisfied by various portions of this context. For example, by examining the FSA it can be determined that invented satisfies the restriction [VERB] and that the string the telephone satisfies the restriction [NP] (noun phrase). In this manner, any of the restrictions used in partially unspecified queries, including mo ⁇ hological and syntactic restrictions as well as computer programs, can be expressed in the form of arcs within an FSA. Storing the contexts as FSAs facilitates the rapid and efficient processing of queries containing partially unspecified terms. In particular, the matching process is streamlined since the FSA contains all information needed to determine whether terms in the context satisfy restrictions associated with partially unspecified terms in the query.
- Fig. 26 shows the execution of step 1340 in a case that the contexts are stored as FSAs.
- Context indexer module 45 passes a character string 51 consisting of an identified context to automaton generation module 52, which is comprised of computer-executable process steps to generate an FSA from an input character string.
- the context is provided to automaton generation module 52, which produces an FSA 53 that represents the character string.
- converting the character string into an FSA involves first parsing the string to identify individual terms.
- the FSA generated by automaton generator 52 includes a plurality of states connected by arcs, with one arc corresponding to each term in the input text. The states and arcs are arranged such that the arc that connects the first two states corresponds to the first term in the character string, and arcs connecting successive states correspond to successive terms in the character string.
- automaton generation module 52 has constructed an FSA for a context, information indicating which portions of the context satisfy the various restrictions (categories, mo ⁇ hological, syntactic, etc.) discussed above is inco ⁇ orated. This task is performed by category recognition module 54, compound words and lexical phrases module 56, mo ⁇ hology module 58, and syntactic phrase module 60.
- FSA 53 is provided to category recognition module 54.
- Category recognition module 54 identifies terms that satisfy various restrictions that are defined as categories (e.g., COUNTRY, CITY, COMPANY NAME) and adds a new arc labeled with the restriction for each term or group of terms that satisfies the restriction.
- Category recognition module 54 uses stored category lists 28 described above for this pu ⁇ ose.
- FSA 55 generated by category recognition module 54, is provided to compound words and lexical phrases module 56, which identifies compound words such as air base and lexical phrases such as according to, once in a blue moon, etc.
- compound words or phrases although comprised of multiple terms, typically behave as a single part of speech and therefore satisfy restrictions such as [NOUN], [PREPOSITION], etc.
- Compound words and lexical phrases module 56 adds a new arc labeled with the appropriate restriction for each group of terms that satisfies the restriction. The arc begins at the state from which the arc corresponding to the first term in the group originates.
- Compound words and lexical phrases module 56 preferably uses a stored dictionary 29 containing such words and phrases along with their associated part of speech.
- An example of such a dictionary is provided as Appendix C in applicants' co-pending application (Ser. No. 09/084,535, filed May 26, 1998) entitled "Spelling and Grammar Checking System". The contents of this application are inco ⁇ orated herein by reference in their entirety.
- FSA 57, generated by compound words and lexical phrases module 56 is input to mo ⁇ hology module 58, which adds mo ⁇ hological analyses of each term to the FSA based on entries in a stored mo ⁇ hological dictionary 29.
- Such analyses preferably include, as a minimum, a part of speech for each term.
- mo ⁇ hology module 58 adds an arc labeled with the appropriate part of speech for each term.
- the analyses may mclude an indication of the tense and/or the base form of a term.
- Mo ⁇ hology module 58 adds appropriate arcs to the FSA to represent each of these items.
- FSA 59 generated by mo ⁇ hology module 58, is then input to syntactic phrase module 60, which identifies phrases such as noun phrases, verb phrases, etc. that occur within the context and adds an appropriate arc for each phrase to the FSA.
- FSA 61 generated by syntactic phrase module 60, includes arcs for category restrictions, compound words and lexical phrases, mo ⁇ hological restrictions, and syntactic restrictions satisfied by a term or group of terms in the context.
- step 1340 preferably comprises the construction of the FSA by automaton generation module 52, category recognition module 54, compound words and lexical phrases module 56, mo ⁇ hology module 58, and syntactic phrase module 60.
- automaton generation module 52 category recognition module 54
- compound words and lexical phrases module 56 mo ⁇ hology module 58
- syntactic phrase module 60 syntactic phrase module 60.
- context indexer module 46 may store information about the context that may be used in assigning a score to a match that is located within the context. Such information may include features such as the position of the context within a document. In embodiments of the invention in which the contexts are represented as FSAs, such information is assigned to the FSA. In a preferred embodiment, the information extracted by the indexing module is used in creating index data structures (52, 53, 56, and 58 as illustrated in Fig. 17) employed in finding contexts containing fully specified and partially specified portions of a query.
- an array FT 52 For each word appearing in the documents, an array FT 52 is created with a number of entries corresponding to the number of key terms that appear in the documents plus the number of category restrictions the system is able to accommodate.
- a term T is transformed into a unique numerical identifier i through the use of the hashing technique.
- Information relevant to term T preferably the number of documents containing term T, is stored at location i in array FT 52.
- the location of indexed information for that term can be readily determined by obtaining the hashed value for T.
- hashing techniques and algorithms have long been known in the art. For example, consider the query Congress MNHUM] and suppose that the hashed value for the term Congress is 157. It is then known how many documents contain the term Congress by reviewing the contents of the 157 th entry in FT 52.
- Term array 53 contains the same number of entries as array FT 52 and is similarly accessed by hashing the term T and accessing the contents of the entry corresponding to the resulting hash value.
- the term array 53 contains context identifiers (or pointers) to positions in document array 58 corresponding to the beginning of information related to the term and contexts 54 for the term in as many documents in which the term appears. As shown in Fig. 17, at the i th position in term array 53 is a pointer which points to entry sizeO in document array 58. SizeO is an indication of the number of contexts 54 in which term T appears in a first document docO.
- document array 58 Also in document array 58 are document identifiers, such as docO and docl, for the documents in which a given term T appears. Additionally, a number of pointers to context array 56 equal to the number of contexts 54 in a given document for a given term follow the document identifier.
- the context array 56 contains all the contexts 54 for each term found in the documents. The contents of each context array entry may be any representation of the context for the term, but as discussed is preferably a finite state machine.
- sets of documents corresponding to a given term T are sorted by size, with smaller documents appearing in the array before larger documents.
- sets of context identifiers (or pointers) corresponding to a given document are similarly arranged by size from smallest to largest. The significance of these sort orders to efficient searching and matching will be described below.
- a context 54 for a term T is a set of surrounding terms in which T appears, e.g., a character string including T.
- the context 54 can consist of a predetermined number of preceding and following terms including T, a sentence or paragraph containing T, etc.
- the invention stores a context array 56 that includes, for each term T, a set of contexts in which the term appears within documents.
- a document array 58 can optionally include identifiers for documents in which a context appears, an indication of the number of contexts for a given term in a given document, and pointers to the contexts 54 in the context array 56.
- the match engine locates matches that occur within contexts 54 rather than searching for matches within the documents themselves.
- Fig. 18 illustrates the operation of match engine 34 and ranking module 36 in a preferred embodiment of the invention.
- match engine 34 generates a list of matches (match list) for the query.
- match list refers to the collection of identified matches (and possibly associated scores) is merely for descriptive pu ⁇ oses and is not intended to reflect the choice of any particular data structure. Instead, any convenient format may be used to store the identified matches, or the matches can be output as they are identified.
- the list of matches includes the instantiations of the partially unspecified query that match text in documents.
- Such a list of matches consists of sequences of words that realize the partially unspecified query, with information useful for ranking the sequences of words and/or the documents in which those sequences have been found.
- match engine 34 receives a partially unspecified query Q.
- the match list is initialized to be empty.
- the partially unspecified query is converted by conventional means to a Boolean expression.
- Fig. 19 illustrates the process of converting the partially unspecified query into a Boolean expression. (For a discussion of Boolean processing, see Managing Gigabytes, Compressing and Indexing Documents and Images, Witten et al, 1994, Van Norrrand Reinhold, New York Chapter 4, section 4.3 and Boolean query processing, pagesl36-141.)
- the partially unspecified query is received.
- step 601 "noise" words are eliminated from the query, noise words including determiners, articles, prepositions, and auxiliaries. For example, given the query JWHO] invented the telephone, the word the would be eliminated at step 601 leaving _ WHO] invented telephone.
- mo ⁇ hological variants for each term in the query is added as an OR expression. For the given example, this would yield _ WHO] (invent OR invents OR invented OR inventing) (telephone OR telephones).
- step 603 synonyms for each term are added to the query as OR expressions. This expands the potential number of matches, as some synonyms may have a greater number of associated contexts than the original term. For the given example, this could add the mo ⁇ hological variants of the word discover, among others.
- step 604 all the terms of the query are combined with an AND operator, and the resulting Boolean expression is returned to the matching process.
- the resulting Boolean expression could appear as JWHO] AND ( invent OR invents OR invented OR inventing OR discover OR discovers OR discovered OR discovering) AND (telephone OR telephones).
- step 503 the partially unspecified query is converted to a graph G. It is in this conversion step that flexibility in extending matching is added with respect to order relaxation and allowance of intervening terms in potential matching contexts.
- the contexts are encoded using FSAs.
- partially unspecified queries are preferably encoded using FSAs.
- an FST is an FSA in which both an input (represented, for example, by a label above the arc) and an output are associated with each arc.
- the output may be a weight or score that is associated with traversal of the arc.
- the invention may maintain a cumulative score by adding the score associated with each arc that is traversed.
- An FST allowing traversal in any order between pairs of terms may be generated by assigning weights to flat graphs between permutations of pairs, then taking the union of the automata.
- Figs. 20A-F depicts numerous graphs illustrating different aspects of the present invention's flexibility, as compared with the Information Need invention.
- the figures are for exemplary pu ⁇ oses only and are not intended to indicate particular weights that must be assigned to arcs or any particular structure of the FST.
- an FSA or an FST may be used to represent all possible orders of the string.
- Fig. 20A which is an exact order finite state machine representation of partially unspecified query JNHUMJ invent, that transitions are pairs, wherein the first element is the term, and the second indicates whether the match ought to be collected ("1") or not ("0").
- Fig. 20B illustrates an example of an FST in accordance with the present invention that represents all possible orders for the query JNHUM] invent.
- the transitions are triples, wherein the first element is again the term, the second element 60 the collection indicator, and the third element 62 in transition is a weight.
- the arcs associated with the most similar order all have a weight of 0, and thus if the query represented by this FST operates on a context that includes a human name followed by the term invent, the score will be 0. However, if the query represented by this FST operates on a context that includes the term invent preceding a human name, the resulting score will be 10, indicating a permutation from the original order.
- Weights are summed in transition, with lower resulting weights indicating a greater likelihood that the context will accurately reflect the original order and fulfill the information need of the partially unspecified query.
- Adding a loop (arcs originating from and arriving at the same state) to this FST structure, as shown in Fig. 20C, allows for the appearance of any intervening words.
- the question mark matches any other word than the words on the outgoing edges.
- These loops can be assigned weights so that the overall score of a match found in a context in which there are many intervening terms between the terms appearing in the query will be assigned a higher score. Note that a numerical score is not required. For example, the output could be simply an indication of whether the order has or has not been permuted.
- FIG. 20D illustrates a FST in accordance with the present invention that represents the query JNHUM] invent with a synonym added. Again, arc weights are summed as an indicator of how close a potential matching context is to the initial partially unspecified query.
- Fig. 20E illustrates a FSA representing JNHUM] invent and allowing terms to appear in any order and have intervening words.
- Fig. 20F depicts a FSA representing _ invent which allows the term invent to be preceded or followed by any term, as represented by the question mark.
- match engine 34 uses the data structures (52, 53, 56, 58 in Fig.
- a document identifier from document array 58 that satisfies the Boolean expression.
- a first document identifier for any given term T may be located by accessing the pointer to document array 58 contained in the entry of term array 53 corresponding to the hashed value of the term T, and subsequent document identifiers for the term T may be identified by knowing that documents in which term T appears are arranged in order (from smallest to largest) in document array 58, and discovering the number of documents in which term T appears.
- match engine 34 attempts to identify a context identifier (or pointer) that satisfies the Boolean expression constructed in step 502.
- the context identifiers corresponding to term T in a given document D are similarly arranged in an increasing size order in document array 58.
- the number of context identifiers present for a term T in document D is also available in document array 58.
- the match engine 34 attempts to select a context identifier corresponding to a context in context array 56 which contains preferably all the specified terms and fully or partially unspecified portions of the query.
- the match engine 34 checks that each fully specified term in the query appears in the context. Then the match engine determines whether the selected context contains a term (or group of terms) that matches a partially unspecified term in the query (i.e., a term satisfies the restriction associated with a partially unspecified term in the query. ) In the case of partially ordered queries, the match engine additionally checks to make sure that the appropriate relative order is maintained. Note that for queries that contain more than one partially unspecified term this process must be performed for each term.
- match engine 34 determines whether a context identifier associated satisfying the Boolean expression has been found. If no such context identifier is found, match engine 34 proceeds to decision block 513 to determine whether additional documents are needed. If a context identifier is found, match engine 34 proceeds to step 508.
- match engine 34 seeks document identifiers associated with documents that contain both terms. Rather than intersecting the lists of document identifiers, the match engine 34 employs a comparison approach that leads to results faster and involves iteratively comparing individual document identifiers of the first list to document identifiers of the second list. Because of the arrangement of document identifiers in document array 58, certain comparisons may be effectively eliminated. Match engine 34 compares the first document identifier of Congress, doc5, sequentially to the document identifiers of JNHUM].
- match engine 34 need only compare the next document identifier of Congress, doc7, to document identifiers of JNHUM] greater than or equal to doc5, and immediately finds a document identifier, doc7, satisfying the Boolean expression. Having found such a document identifier, match engine 34 proceeds to step 506 to attempt to find a context identifier of a context in doc7 which satisfies the Boolean expression. As described earlier, the number of context identifiers for each term in doc7 is stored in document array 58, and the context identifiers are sorted in ascending order.
- step 508 context content C associated with the found context identifier is extracted from the context array 56.
- step 509 the FST constructed in step 503 representing the partially unspecified query is applied to the context content, which itself is preferably a finite state machine.
- match engine 34 determines whether a match has been located. If not, processing proceeds to step 512. If, on the other hand, a match has been located then in step 511 match engine 34 accumulates information about the match and adds the match to the match list. Step 511 is discussed in more detail below. At this point it is noted that in addition to storing the matches, match engine 34 stores a score for each match, preferably as part of the match list. At decision step 512, match engine 34 determines whether more contexts are needed.
- Match engine 34 may consider the set of matches already located. For example, if match engine 34 has already located a particular match multiple times within document D (and especially if match engine 34 has not located any other distinct matches within document D), it may not be worthwhile to continue searching for matches within D. In such a case match engine 34 may determine at step 512 that no more contexts from document D are needed. Match engine 34 may also use information regarding the matches located thus far across a plurality of documents. For example, the total number of matches or the total number of different matches found thus far may be compared with a predetermined number or with a number selected by a user or may base the determination on the length of time already spent finding matches.
- match engine 34 may use a combination of these or other methods to determine whether more contexts are needed. If match engine 34 determines that more contexts are needed, processing returns to step 506. Once all matches for the query within document D have been located or match engine 34 determines that no more contexts are needed, processing progresses to decision step 513.
- match engine 34 determines whether more documents need to be searched. As when deciding whether more matches are needed, a variety of criteria may be employed to determine whether more documents are needed. The determination may be based in part or entirely on the contents of the match list assembled thus far. For example, if the match list already contains a predetermined number of matches, match engine 34 may determine that no more documents need to be examined. Match engine 34 may also use information regarding the documents already searched for matches or the time spent searching. For example, it may be desirable to search a predetermined number of documents regardless of how many matches are already stored in the match list. If more documents are needed, processing returns to step 504, where another document containing potential matches for the query is selected as described above.
- match engine 34 resumes selecting document identifiers in step 504.
- match engine 34 would then compare the next document identifier of term Congress, doc8, to doc7 of term JNHUM].
- Steps 505 through 513 are repeated until either no more documents and contexts containing potential matches are found or until no more documents and contexts are needed.
- match engine 34 proceeds to step 514.
- ranking module 36 ranks the assembled match list.
- the match list includes a score for each match.
- ranking module 36 may simply rank the matches based on the score.
- ranking module 36 may take additional information into account in assigning a ranking to the matches in the match list. If redundant matches have not been eliminated at an earlier step, this task may be performed in step 514.
- the matches are also placed in an order based on the ranking although ordering may be performed instead by results page generator 38.
- match engine 34 outputs the match list.
- the output can be provided to results page generator 38, to another computer program, etc. It is noted that ranking is included in preferred embodiments of the invention, but it is not a requirement that the matches are ranked. Therefore ranking step 514 is optional.
- the match engine 34 may operate directly upon the preanalyzed contexts stored in the data structures.
- Fig. 27 illustrates such an alternative matching process.
- the match list includes a score for each of the matches listed therein.
- step 511 may simply add each newly located match to the match list, regardless of whether the identical match has been previously located and already appears in the match list, i.e., whether the newly identified match represents a distinct match that has not been previously identified or represents an instance of a match that has been previously identified.
- the score for the match may reflect factors such as the position and/or format of the match within a document.
- match engine 34 determines whether a newly located match already appears in the match list and, if so, does not add another copy of the match to the match list. Instead, match engine 34 updates the score for the match to reflect the fact that the match has been located again. In such a case the score for a match reflects the number of times an instance of the match is located across a plurality of documents. Of course the score for the match may also reflect factors such as the position and/or format of the match within a document (e.g., whether the match appears in a title, in bold face, etc.). Features of the document in which a match is found such as the age or source of the document can also be used in assigning a score to the match.
- match engine 34 maintains a document score for documents in which a match is identified. Additional criteria may be used to determine the score for a match.
- match engine 34 may perform a normalization process based on the occurrence of particular words in the match across a plurality of documents in the database. The normalization process considers the frequency with which a portion of a match that matches the unspecified portion of the query occurs across a plurality of documents. If the portion of the match that matches the unspecified portion of the query occurs commonly, then it is relatively less likely that the match contains specific information relevant to the query.
- the query _ Windows Phrases such as Microsoft Windows that include specific information related to
- Windows may appear very frequently within certain documents. However, in a large body of text with diverse content such as the set of Web pages, it is likely that phrases such as the windows, some windows, many windows, etc., will be more common. Since matches that contain common but information poor words such as the, some, etc. are relatively unlikely to contain desired information, the score for such matches should be low relative to the score for matches that do not contain such words. Normalization helps to ensure that this is the case.
- match engine 34 normalizes the score by dividing it by a weighting factor that reflects the number of occurrences of the portion of the match that corresponds to the unspecified portion of the query among a plurality of documents. This information is available in the index of terms.
- match engine 34 determines that Microsoft appears much less frequently among all the indexed documents than the common words the and some. Thus the weighting factor for Microsoft is much lower than the weighting factor for the other terms. Therefore, after dividing by the weighting factor the final score for Microsoft Windows will be greater than the score for the matches containing the common words.
- Different methods of performing normalization based on the occurrence of terms within the documents are also within the scope of the invention. Additional criteria such as the presence of certain terms whose co-occurrence is of particular significance can also be used to modify the score for a match.
- Match engine 34 preferably keeps track of the number of instances of each distinct match within each document. Such information is stored and is used, for example, to rank the documents that contain the match as described in more detail below. This information can also be used in assigning a score to a match. For example, in certain embodiments of the invention one factor that is taken into consideration in assigning a score to a match is the relative distribution of instances of the match among documents. In particular, if instances of a match appear a large number of times within one or a few documents, the match may be less significant than if instances of the match appear within a large number of documents, even if the number of instances of the match within each of these documents is relatively small.
- the match may be less significant than if the same match appears 5 times within 20,000 individual documents.
- information accumulated about a match such as the number of instances of the match within a document, may be used to assign or modify a score associated with the match in order to take such factors into account.
- the score can also reflect factors such as the position and/or format of the match within a document or other information accumulated about the match. For example, when another instance of a match that already appears within the match list is located, the score for the match may be incremented by a certain amount A. If the newly located instance of the match occurs in boldface, the score may be incremented by an amount A+B.
- the score can also reflect features associated with the document in which a match appears. For example, the score may reflect such factors as the age of the document or the document's source.
- the match list may be maintained in an order based, for example, on the content of the matches or on their scores.
- a match list entry for a query Q includes an entire string that matches Q along with an associated score for the match. However, in certain embodiments of the invention only that portion of a match that corresponds to the unspecified portion of the query is stored. In other words, match engine 34 extracts from a located match M that portion that corresponds to the unspecified portion and stores only that portion in the match list rather than storing the entire matching string. In this case for a query such as
- match list consists of those portions of the matching strings that satisfy the restriction [CITY], e.g., Chartres, Ulm, Cologne, Durham.
- the match list also includes a score for each match reflecting the number of times the match is identified among a plurality of documents.
- a match list for the query may appear as follows in schematic form: ⁇ (Chartres, 125) ( Ulm, 100) (Cologne, 75) (Durham 50) ⁇
- a query that contains multiple wholly or partially unspecified terms, it is possible that only a portion of a located match corresponding to one or more of the terms is of interest.
- matches for the query _ [NAME] won _ [NP] in the 1976 Olympics might include Nadia Comaneci won several gold medals in the 1976 Olympics; Danielle Debernard won a bronze medal in women's alpine skiing in the 1976 Olympics; John Naber won five medals in swimming in the 1976 Olympics, etc.
- a user may be interested only in the names of athletes who won medals medal in the 1976 Olympics, rather than in the specific details.
- the portion of the query of interest indicated by _ [NAME] can be designated so that only the portion of a match that satisfies the restriction is added to the match list.
- _ [NAME] is the designated portion, would include Nadia Comaneci, Danielle Debernard, and John Naber among others.
- _ [NP] allows the capture of a wide variety of matches since the phrases several gold medals, a bronze medal in women's alpine skiing, and five medals in swimming all satisfy the restriction [NP]. This breadth maximizes the use of available information.
- the user's specific information need (the names of medal winners) can be optimally satisfied without providing additional, possibly unwanted, details.
- the portion of a query that is of particular interest can be designated in any of a number of ways. For example, a character such as a question mark can be used instead of an underscore character to designate wholly or partially unspecified terms of particular interest.
- match engine 34 assigns a score to each located match and adds the match to the match list.
- the score is simply a running total of the number of occurrences of the match within the documents or context sets being searched.
- match engine 34 searches the match list to determine whether that match has been previously added to the match list. If so, another copy of the match is not added to the match list but rather the score for the match is incremented.
- the score reflects the frequency of the match across a plurality of documents .
- Ranking module 36 uses the score to order the match list. In the simplest case, the ranking correlates directly with the score. In such a case the ranked match list may begin with the match with the highest score and progress to matches with lower scores. Thus a match that was located 500 times would appear before a match that was located 400 times, which would in turn appear before a match that was located 200 times. In certain embodiments of the invention the ranking is implicit in the order of the match list itself after the process of identifying matches is complete.
- match engine 34 can reorder the match list on the fly, as new matches are identified or as additional instances of previously identified matches are identified.
- traditional search and queiy systems locate documents that contain query term(s) and rank the documents based on the number of occurrences of the term(s) within the documents.
- the present invention provides information (i.e., matches) that is relevant to a query rather than merely providing identifiers for documents that contain the query terms and thus may contain the desired information.
- the ranking of the matches reflects the occurrence of the information among a plurality of documents rather than within a single document.
- the ranking feature provides a basis on which the user and/or the system can assign a confidence level to the information.
- results page generator 38 After the match list is ranked (or without ranking in those embodiments of the invention that do not include ranking), it is passed to results page generator 38.
- the results page generator processes the results for display to a user or for input to another computer program.
- matches are formatted and presented in an order corresponding to the ranked match list. Otherwise matches can be presented in any of a variety of orders.
- Information about the match such as the actual score and/or an indication of the frequency of the match relative to other matches is displayed in certain embodiments of the invention.
- the system displays only that portion of a match that corresponds to one or more unspecified terms in the query.
- results page generator 38 formats the results as a Web page for presentation by a Web browser.
- one or more document identifiers are displayed below or adjacent to the matches that occur within the documents.
- the document identifier is preferably the URL of the Web page in which the match occurs. The URL may serve as a link to the Web page, thereby allowing the user to conveniently access the Web page if desired.
- Information about documents such as document language, age of the document, etc., may also be presented in the results.
- the document information is displayed adjacent to or below the match itself, so that the user may readily determine the identity of documents that contain a particular match.
- the match occurs in more than one document, the set of documents that contain a particular match (or a portion of the set) can be displayed.
- the documents that contain a particular match are preferably displayed in an order corresponding to the number of instances of the match within the document. In other words, in addition to ranking the matches, the documents that contain a given match are themselves ranked. Thus for any given match, documents that contain a greater number of instances of the match are presented above documents that contain fewer instances of the match.
- the actual number of instances of a given match within a document can also be presented.
- such information may be stored by match engine 34 during the process of match accumulation.
- this information may be obtained after the process of matching is complete, e.g., by ranking module 36 or results page generator 38.
- other criteria such as the document's age or source, and/or the position or features of matches identified within the document can be used in addition to or instead of the number of matches identified within the document in order to rank the document.
- the documents are displayed in a format suggestive of a relationship between the documents and the match that they contain.
- the documents may be listed beneath the match, adjacent to the match, or identified with an arrow or line extending from the match to the document list.
- the results may include a fragment of text that includes the match and that also includes additional term(s) on either side of the match as they occur in a document that contains the match.
- the invention may include information about the ranking of the match.
- the results may include the number of documents in which the match was located, or a numerical score reflecting the relative level of confidence in the match versus other matches. Such information may help the user decide which, among multiple matches, may best fulfill the information need.
- the extended matching techniques described herein which implement unordered and partially ordered queries in addition to fully ordered queries, expand the range of options for identifying matches beyond those described in the Information Need application. It is to be understood that in general all the scoring and ranking tools, etc., described in that application are applicable here, in addition to others. Furthermore, the techniques described herein are applicable to queries generated in the course of answering natural language questions as described in the relevant application.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2002220219A AU2002220219A1 (en) | 2000-12-05 | 2001-12-05 | System for fulfilling an information need using extended matching techniques |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US25160800P | 2000-12-05 | 2000-12-05 | |
| US60/251,608 | 2000-12-05 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2002046970A2 true WO2002046970A2 (fr) | 2002-06-13 |
| WO2002046970A3 WO2002046970A3 (fr) | 2004-02-26 |
Family
ID=22952674
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2001/046542 WO2002046970A2 (fr) | 2000-12-05 | 2001-12-05 | Systeme permettant de repondre a un besoin d'information par des techniques d'appariement approfondies |
Country Status (2)
| Country | Link |
|---|---|
| AU (1) | AU2002220219A1 (fr) |
| WO (1) | WO2002046970A2 (fr) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2010062737A2 (fr) | 2008-11-03 | 2010-06-03 | Microsoft Corporation | Extraction utilisant un co-emplacement généralisé de phrases |
| US10241716B2 (en) | 2017-06-30 | 2019-03-26 | Microsoft Technology Licensing, Llc | Global occupancy aggregator for global garbage collection scheduling |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5519608A (en) * | 1993-06-24 | 1996-05-21 | Xerox Corporation | Method for extracting from a text corpus answers to questions stated in natural language by using linguistic analysis and hypothesis generation |
| WO1998025217A1 (fr) * | 1996-12-04 | 1998-06-11 | Quarterdeck Corporation | Procede et dispositif d'interrogation en langue naturelle et recherche semantique d'une information dans une base de donnees |
-
2001
- 2001-12-05 WO PCT/US2001/046542 patent/WO2002046970A2/fr not_active Application Discontinuation
- 2001-12-05 AU AU2002220219A patent/AU2002220219A1/en not_active Abandoned
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2010062737A2 (fr) | 2008-11-03 | 2010-06-03 | Microsoft Corporation | Extraction utilisant un co-emplacement généralisé de phrases |
| EP2347354A4 (fr) * | 2008-11-03 | 2016-09-21 | Microsoft Technology Licensing Llc | Extraction utilisant un co-emplacement généralisé de phrases |
| US10241716B2 (en) | 2017-06-30 | 2019-03-26 | Microsoft Technology Licensing, Llc | Global occupancy aggregator for global garbage collection scheduling |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2002046970A3 (fr) | 2004-02-26 |
| AU2002220219A1 (en) | 2002-06-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US6859800B1 (en) | System for fulfilling an information need | |
| US20020123994A1 (en) | System for fulfilling an information need using extended matching techniques | |
| KR100666064B1 (ko) | 인터랙티브 검색 쿼리 개선 시스템 및 방법 | |
| US6584470B2 (en) | Multi-layered semiotic mechanism for answering natural language questions using document retrieval combined with information extraction | |
| US6678694B1 (en) | Indexed, extensible, interactive document retrieval system | |
| Clarke et al. | Question answering by passage selection | |
| US6665666B1 (en) | System, method and program product for answering questions using a search engine | |
| US8645405B2 (en) | Natural language expression in response to a query | |
| US8452766B1 (en) | Detecting query-specific duplicate documents | |
| CA2698105C (fr) | Identification de relations semantiques dans un discours rapporte | |
| US20040117352A1 (en) | System for answering natural language questions | |
| US20040167889A1 (en) | Information retrieval | |
| US20070005344A1 (en) | Concept matching system | |
| WO2002048921A1 (fr) | Method and apparatus for searching a database and providing relevance feedback | |
| EP0960376A2 (fr) | Systeme et procede de traitement et de recuperation de texte | |
| US20060259510A1 (en) | Method for detecting and fulfilling an information need corresponding to simple queries | |
| WO1998025217A1 (fr) | Procede et dispositif d'interrogation en langue naturelle et recherche semantique d'une information dans une base de donnees | |
| JP2001184358A (ja) | カテゴリ因子による情報検索装置,情報検索方法およびそのプログラム記録媒体 | |
| KR20030006201A (ko) | 홈페이지 자동 검색을 위한 통합형 자연어 질의-응답시스템 | |
| JP3617096B2 (ja) | 関係表現抽出装置および関係表現検索装置、関係表現抽出方法、関係表現検索方法 | |
| KR20020072092A (ko) | 단락 단위의 실시간 응답 색인을 이용한 자연어 질의-응답검색시스템 | |
| US7127450B1 (en) | Intelligent discard in information access system | |
| CA2914398A1 (fr) | Identification de relations semantiques dans un discours rapporte | |
| US8478732B1 (en) | Database aliasing in information access system | |
| WO2002046970A2 (fr) | Systeme permettant de repondre a un besoin d'information par des techniques d'appariement approfondies |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
|
| AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
| DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
| REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
| 122 | Ep: pct application non-entry in european phase | ||
| NENP | Non-entry into the national phase |
Ref country code: JP |
|
| WWW | Wipo information: withdrawn in national office |
Country of ref document: JP |