A method and system for archiving and retrieving data in an electronic document management system
The present invention relates to a method of managing electronic data within a computer system. More particularly, the invention relates to a method for archiving and retrieving electronic documents. Another object of the invention relates to an electronic document management system and computer programs used to implement the aforesaid method.
A further object of the invention relates to a user interface allowing users to search and retrieve electronic documents easily within a pool of electronic documents.
Written publications are now available in digitized form, i.e. as electronic documents. With the recent increases in computer interconnection and storage capacity, it is now easy to collect and store many documents, but still difficult to organize them in a usable manner. For the end-user, the goal is twofold: the first is information retrieval, where one wants to select document excerpts answering a specific question (all the relevant ones, and only those); the second is knowledge capturing, where one wants to cover all aspects of information around a specific topic. The aim of the present invention is to provide users of a computer system with a solution for both information retrieval and knowledge capturing, which can operate on large collections of electronic documents, and which allows easy management of their appropriate organization.
The method of the present invention is mainly targeted at, but is not limited to, unstructured electronic documents such as collections of texts in natural language (newswires, articles, business reports, electronic mail, workgroup notes or memos, etc.). It can also be used for other types of binary data such as images, sounds or video files.
Large collections of documents have existed for a long time and have specific organization schemes: encyclopaedias, thesauri, libraries, archives and computer file servers are good examples of such collections.
For an efficient electronic document management system, two requirements for information retrieval must be fulfilled. The first is that the storage of pieces of information is not replicated, meaning that a single file or document shall be recorded in only one location. The second is that the pieces of information should be easily found and accessed.
The problems of information retrieval and knowledge capturing have been solved to a certain extent, as indicated below, but the existing solutions still present many drawbacks.
The most common existing organization scheme is the simple hierarchy, as used in libraries (think of departments, rooms, shelves, sections, finally leading a person to a single book), or in computer file servers (servers, discs, directories, subdirectories, then individual files). The simple hierarchy always meets the first requirement of non-replication since one document is uniquely identified by its location within the hierarchy. Moreover, the simple hierarchy is user-friendly, since it exhibits a clear organization to the user at any level within the organization; after locating a document, one generally remembers its location and a subsequent access is quicker. However, as soon as a large number of documents are stored, or there are too many levels in the hierarchy, it becomes difficult to locate a document, or even impossible if the search criteria do not fit the hierarchy scheme (how each level is further partitioned into sublevels). As this is often the case, libraries or file servers are usually organized according to a topical taxonomy, but they provide extra indexes to search for information according to other taxonomies, for instance authors, publishers, years of publication, etc.
In a simple hierarchy, a set of documents is partitioned according to a predefined criterion, for instance the topic, and each partition is further partitioned until all documents on each topic are contained in their appropriate partition.
The partitioning can be refined and specialized at will, until there is only one document per partition.
In an index, all documents are listed according to one particular aspect, or attribute. The index is sorted, and refers to the actual documents but does not contain the documents themselves.
Knowledge can be defined as pieces of information put together with the appropriate relations. Relations include increasing levels of generalization; therefore detailed information is made approachable. Knowledge always relies on information, but requires a level of abstraction. One known method to capture knowledge into repositories is to group pieces of information (typically stored within electronic documents) in contextual "containers", where some attributes are shared by all the components of the container. A particular document usually contains several pieces of information and, therefore, will be relevant to several context containers. Containers for knowledge capture aggregate documents based on similar values of their attributes. For example, a document that defines some values for attribute X and attribute Y is automatically presented in both the container for such values of attribute X and the container for such values of attribute Y. The above-mentioned document will also be present in a smaller container that aggregates all documents that define both attributes X and Y.
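By way of illustration only, the container grouping described above may be sketched as follows. The document identifiers, the attribute values and the function name are hypothetical and are not part of the known method; Python is used merely as a convenient notation.

```python
from collections import defaultdict

# Hypothetical documents: each maps attribute names to a set of values.
documents = {
    "doc1": {"X": {"a"}, "Y": {"b"}},
    "doc2": {"X": {"a"}},
    "doc3": {"Y": {"b"}},
}

def build_containers(docs):
    """Group document ids by each (attribute, value) pair they define."""
    containers = defaultdict(set)
    for doc_id, attributes in docs.items():
        for name, values in attributes.items():
            for value in values:
                containers[(name, value)].add(doc_id)
    return containers

containers = build_containers(documents)
x_container = containers[("X", "a")]   # documents with value "a" for X
y_container = containers[("Y", "b")]   # documents with value "b" for Y
# The smaller container aggregates documents defining both values.
xy_container = x_container & y_container
```

In this sketch a document defining both attributes appears in both single-attribute containers and in their intersection, without its content being replicated.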
Technological advances are now challenging the current methods for information retrieval and knowledge capturing. The amount of information available online, for example on the INTERNET, is such that it fails to fit any representation to the human mind. Moreover, the information is also highly volatile, and thus its relevance is limited in time, although it can be very important at first. Regarding the volatility of information, a system must be able to file documents at least faster than their inflow rate. In the current situation, this means that documents must be automatically filed without human intervention because such human intervention for classifying new arrivals is too slow.
We showed that simple hierarchies make a convenient and efficient manner to organize documents. However simple hierarchies as explained above suffer from several drawbacks that can be summarized as follows:
First, the same criterion is used to partition the whole hierarchy. For example, in a topical classification system, one cannot further partition a given topic according to a different topic. This implies partitioning into further subtopics.
Secondly, a document shall exist only at a single place. Even if a document has a scope spanning several topics, it has to be located in only one partition, generally that of the most important topic, and it will not appear in the partitions related to the other, less important topics.
Third, for large to very large collections of documents, the navigation becomes difficult. In large collections with a small number of partitions, each partition contains too many documents on the same topic. However, increasing the number of partitions leads to over-specialized topics. Lastly in the hierarchy scheme, the partitioning criterion is fixed and eternal.
For instance, a chemistry library could be organized by topics, as well as by compounds, by processes, etc. The appropriate organization scheme depends on the effective usage of the library, which is personal and goal-oriented.
On the other hand, indexes are efficient but also have drawbacks, which are the following:
An index is global to the collection of documents. By way of example, it is useless, time consuming and inefficient to search through a global author index quoting hundreds of thousands of authors, when searching a very specific subject, for which only a dozen people are regular authors. Indexes have no context. An index only provides references to the actual documents, and one cannot determine the context of the documents quoted (which topic, other related documents, etc.) from the index itself. Moreover, neighboring entries in an index are always irrelevant, since they refer to unrelated material.
Furthermore, an index is a flat listing, which means that it does not encompass any kind of inherent structure, which would ease its representation. Consequently, one cannot extract knowledge from an index, because an index does not derive from the content of the documents indexed. Finally, when searching in a large index, or when searching simultaneously in several indexes, the result yields either too many documents or no document returned at all.
Existing search engines usually do not indicate the probable number of documents returned before the query is actually executed and the results listed. They do not allow a progressive refinement of the query in order to decrease the number of hits. Using current search engines, it is easy to issue a query that won't return any matching document (the more specific the query, the greater the probability of not getting an answer). A good search engine should restrain the user from making useless queries with no matching documents. The aim of the present invention is to provide a method and a system that obviate the above-mentioned drawbacks and allows efficient organization and retrieval of documents within large collections of documents or other electronic data.
This goal is achieved by a method and a system having the characteristics recited in claims 1 and 5.
Thanks to this method and system, the user is provided with a synthetic representation of a large collection of documents and a graphical display of a simple, small hierarchy.
The proposed system supports a partially machine-automated assignment of document attributes, which can be used later to create the ad-hoc organization of documents mentioned above. The user interface combines the use of indexes and simple hierarchies and allows the user to create partial indexes located within partial, ad-hoc hierarchies, both defined on the criteria that he/she determines.
This provides the power and flexibility of indexes, along with the visual representation of hierarchies.
Further advantages will become evident from the following detailed description and the accompanying drawings, in which:
Figure 1 is a schematic representation of the user interface for accessing the data where the definition view is displayed.
Figure 2 is a schematic representation of the user interface where the content view is displayed.
Figure 3 is a schematic representation of a structured tree of workspaces.
The method and system of the present invention use a methodological approach to information retrieval and knowledge capturing with the objective of providing a synthetic representation of a large collection of documents using a graphical display of a simple, small hierarchy. This shows all the content of the repository but still allows progressive refinement from large to smaller document collections, down to the individual document.
The quality of information extracted from an electronic document management system is determined by the accuracy of the queries used to extract information. Designing appropriate queries takes time, whereas their execution is nowadays extremely quick. The system solves this problem by permanently storing predefined search queries so that their execution requires only a simple action.
Often a new query has to be issued that is a minor modification or a refinement of an existing (and effective) query. The system allows slight changes to existing predefined queries, without the hassle of recreating queries from scratch. In a dynamic electronic document management system, it is vital to be notified of newly arrived material, whether the system periodically updates results online or notifies the user when documents of interest arrive offline. By using predefined queries and real-time updates, and assuming that the incoming information flow can be submitted to such queries, the system allows automatic sorting and organizing of incoming documents.
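As a minimal sketch of how an incoming document might be submitted to stored queries, the following assumes that each query is a list of (attribute, value) pairs that must all be present. The query names, attribute values and function names are hypothetical and serve only as an illustration.

```python
# Hypothetical stored queries: name -> list of (attribute, required value).
stored_queries = {
    "reports": [("Document Type", "Report")],
    "it_reports": [
        ("Document Type", "Report"),
        ("Topic", "Information Technology"),
    ],
}

def matches(document, criteria):
    """True when the document carries every required attribute value."""
    return all(
        value in document.get(attribute, set())
        for attribute, value in criteria
    )

def file_incoming(document, queries):
    """Return the names of all stored queries the incoming document satisfies."""
    return sorted(
        name for name, criteria in queries.items()
        if matches(document, criteria)
    )

# An incoming document described by attribute -> set of values.
incoming = {
    "Document Type": {"Report"},
    "Topic": {"Information Technology", "Europe"},
}
filed_under = file_incoming(incoming, stored_queries)
```

Because the queries are stored in advance, filing a new document in this sketch requires no human intervention: the document is simply tested against each stored query on arrival.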
The installation used for implementing the above-mentioned functions will be briefly described in generic terms, as the components needed are state of the art in computer technology. The installation comprises at least one server computer having a repository or mass storage capabilities as well as computing and communications means (for example through a local, wide or public network) to interact with at least one client workstation having input means and display means. The information or documents to be archived, organized and searched may be added and stored in the repository of the server either with or without human intervention.
In the case of human intervention, documents coming from different sources are read from an ad-hoc medium or digitized and then stored in the repository of the server by an operator who interacts with a computer connected to the server or directly with the server itself. Documents may also come from a communication link and be automatically analyzed by an appropriate program that reads and extracts relevant information from the incoming document and then stores the document and its attributes in the repository of the server.
In the electronic document management system object of the invention, each document is defined by a content, a set of attributes and their values. The document's content is a block of binary data encoded in any application's native document format. The content is used for uploading, downloading, viewing and editing the document within the electronic document management system.
Attributes store extra information describing the document's format, names, dates, ownership, as well as content. Content attributes would typically label topics, subject areas, concepts or categories. Some attributes may be dedicated to the user, allowing personal settings.
By way of example, a word processing document may have the following attributes: contribution date, word count, topic. Attributes are chosen within a predefined list, and the possible values depend on the attribute; they are typically names, dates, numbers, currencies, etc. When an attribute has several values, it is repeated several times, once with each particular value.
Table 1
In the above example, the word processing document has a set of three attributes describing the document. The attribute "Topic" has two distinct values in the example shown.
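Purely as an illustration of the attribute model described above, a document's attributes may be represented as a list of (attribute, value) pairs in which a multi-valued attribute is repeated. The values shown are invented for this sketch and are not those of Table 1; the helper name is likewise hypothetical.

```python
from datetime import date

# Attributes as (attribute, value) pairs. All values are invented.
attributes = [
    ("Contribution Date", date(2001, 3, 15)),
    ("Word Count", 1250),
    ("Topic", "Chemistry"),   # a multi-valued attribute is repeated,
    ("Topic", "Processes"),   # once for each of its values
]

def values_of(attributes, name):
    """Return every value recorded for the named attribute, in order."""
    return [value for attr, value in attributes if attr == name]

topics = values_of(attributes, "Topic")
distinct_attributes = {attr for attr, _ in attributes}
```

Here the document has three distinct attributes, one of which carries two values.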
Logical operations can be applied to attributes and their values. This allows selecting or rejecting a document based on logical, Boolean criteria.
Table 2
The above table is an example of a set of three criteria selecting the document described in the first table above. The combination operator between the three criteria is by default the logical "AND", meaning that all criteria must be fulfilled in order to select documents.
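The default "AND" combination of criteria may be sketched as follows. The operator labels, the sample criteria and the rule that a criterion holds when any value of a repeated attribute passes the test are assumptions made for this illustration; they are not taken from Table 2.

```python
import operator
from datetime import date

# Operator labels are hypothetical names chosen for this sketch.
OPERATORS = {
    "is": operator.eq,
    "after": operator.gt,
    "greater than": operator.gt,
}

def satisfies(attributes, criterion):
    """A criterion holds if any value of its attribute passes the test."""
    name, op, target = criterion
    return any(
        OPERATORS[op](value, target)
        for attr, value in attributes
        if attr == name
    )

def selects(attributes, criteria):
    """Criteria are combined with a logical AND: all must be fulfilled."""
    return all(satisfies(attributes, c) for c in criteria)

document = [
    ("Contribution Date", date(2001, 3, 15)),
    ("Word Count", 1250),
    ("Topic", "Chemistry"),
    ("Topic", "Processes"),
]
criteria = [
    ("Topic", "is", "Chemistry"),
    ("Word Count", "greater than", 1000),
    ("Contribution Date", "after", date(2001, 1, 1)),
]
selected = selects(document, criteria)
```

Adding a criterion that the document does not fulfil, for example a topic it does not carry, would cause the document to be rejected.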
The documents are stored in the repository of the server in a flat data structure without any notion of hierarchy. This flat data structure constitutes a document pool in which each document is stored with a unique identifier, its content and a set of attributes that describe it. Consequently, the only method for retrieving a particular document is to select it by using appropriate criteria, as shown in the second table above.
Once this document pool is created and fed with new incoming documents, either automatically or with the intervention of an operator or both, the user may start to search through the document pool by means of a computer program that displays a user interface built around an object, called a workspace, where documents are aggregated according to a user-defined series of criteria. A workspace is used to extract documents from the document pool. The workspace may be displayed on the user's screen, and the user may interact with it by inputting data through input means such as a keyboard or a pointing device. A workspace according to the present invention comprises two views: the definition view, where the user states or inputs criteria, and the content view, where matching documents are retrieved from the document pool.
Figure 1 shows a workspace W12 with its two distinct views, which can be selected by choosing the appropriate thumbnail. The definition view is selected and displayed in figure 1. In this definition view, two criteria are already set. The definition view, as seen in table 2, uses three columns for displaying the attribute, the operator, and the value of each criterion. In each column, the user may either enter the relevant information by typing it on the keyboard or click with the mouse on a particular field. In the latter case, the user is presented with a list of possible values that may be used to form the query.
In the definition view, criteria are defined in a sequence. The topmost criterion (document type in the example shown) is entered first and must be completed before the user can proceed to the next. Within each criterion, the user first defines the attribute, then the operator, and lastly the value. Although criteria are combined using a logical AND operator (which would allow their order to be permuted), there is a reason for a sequential definition: each time a criterion is entered, the number of matching documents is reported to the user in the rightmost part of the window containing the definition view. This allows refinement of the workspace by adding more criteria when there are too many matches. Within a criterion, after the operator is chosen, a list of possible values is automatically computed and presented to the user. The selection of irrelevant values leading to a void criterion (i.e. no matches) is thereby avoided.
In the example of Figure 1, documents matching the two criteria (1) "Document Type is Report" AND (2) "Publication Date after Thursday" are listed. Note that the first criterion returned 11,219 matches, too many for the user, who decided to further refine the workspace by adding the second criterion. This reduced the number of matches to 412, a more reasonable figure.
Figure 2 shows the workspace when the content view is selected. In this window, the reference to each document is displayed with some additional information (such as the type of the document and the date last modified) that is configurable by the user. In the example shown in figure 2, two documents are displayed. In this content view, one may notice that Eric and Tom are authors of two of the 412 documents. At this stage, if the user decided to refine the workspace by adding a third criterion based on the attribute "Author", he would be presented with a choice list of all the authors who wrote the 412 documents (among them Eric and Tom), but not of all the authors of all documents within the pool.
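The reporting of match counts and the narrowing of the choice list described above may be sketched as follows. The three-document pool, the third author name and the function names are hypothetical, and only the "is" operator is modelled in this simplified illustration.

```python
# Hypothetical pool: document id -> {attribute: set of values}.
pool = {
    1: {"Document Type": {"Report"}, "Author": {"Eric"}},
    2: {"Document Type": {"Report"}, "Author": {"Tom"}},
    3: {"Document Type": {"Memo"}, "Author": {"Anna"}},
}

def matching(pool, criteria):
    """Ids of documents satisfying every (attribute, value) criterion."""
    return {
        doc_id for doc_id, attrs in pool.items()
        if all(value in attrs.get(name, set()) for name, value in criteria)
    }

def possible_values(pool, criteria, attribute):
    """Values of the attribute found among the current matches only."""
    found = set()
    for doc_id in matching(pool, criteria):
        found |= pool[doc_id].get(attribute, set())
    return sorted(found)

criteria = [("Document Type", "Report")]
# Count reported to the user after the first criterion is entered.
count_after_first = len(matching(pool, criteria))
# Choice list offered for a further criterion on "Author".
author_choices = possible_values(pool, criteria, "Author")
```

Since every offered value is drawn from a document that already matches the earlier criteria, any choice made from the list yields at least one match, so a void criterion cannot be formed in this sketch.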
A workspace automatically implements some of the requirements stated at the beginning of this description. The definition view is stored within the system, so the query can be reused later on without having to be recreated. Using an interactive Graphical User Interface (GUI), it is easy to alter existing workspaces for a slight modification of the query. The content view is periodically refreshed, providing an up-to-date view of the document pool even while the definition view remains the same.
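A minimal sketch of a stored definition with a refreshable content view is given below. The class and method names are hypothetical, only equality criteria are modelled, and the refresh is invoked explicitly here rather than by a periodic timer.

```python
class Workspace:
    """Keeps a stored definition; the content view is recomputed on refresh."""

    def __init__(self, criteria):
        self.criteria = list(criteria)  # definition view, kept between refreshes
        self.content = []               # content view: ids of matching documents

    def refresh(self, pool):
        """Re-run the stored criteria against the current state of the pool."""
        self.content = sorted(
            doc_id for doc_id, attrs in pool.items()
            if all(
                value in attrs.get(name, set())
                for name, value in self.criteria
            )
        )
        return self.content

pool = {1: {"Document Type": {"Report"}}}
workspace = Workspace([("Document Type", "Report")])
first_view = workspace.refresh(pool)

pool[2] = {"Document Type": {"Report"}}  # a new document arrives in the pool
second_view = workspace.refresh(pool)    # same definition, updated content
```

The definition is entered once and reused at each refresh, which is what allows the content view to follow the pool without the query being recreated.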
Practical user experience indicates that many workspaces stored in the system have many similar criteria and differ only in a few attribute values. For instance, suppose that one is interested in recent company reports in the Information Technology industry for the European market, for several major companies. Several workspaces will be created, which differ only in the company name (all other criteria are the same). The situation quickly becomes frustrating when adding a new workspace, and unmanageable when some common attribute needs to be changed.
To solve this problem, the workspace is designed with an inheritance mechanism, so that the criteria of a child workspace are inherited from its parent workspace, where they have already been defined. Any modification to the parent workspace's definition immediately affects all its child workspaces. In the example above, all common attributes would be defined in the parent workspace, while the company-specific ones would be defined in a set of child workspaces that inherit the common criteria.
By way of example, if the definition view of a parent workspace comprises three criteria, and the user creates a child workspace from the above-mentioned workspace, the child workspace will inherit these three criteria and give the user the possibility of adding more criteria in the child workspace. Only the newly defined criteria can be modified in the child workspace; the inherited criteria are fixed and cannot be altered by the user. The inheritance mechanism described above is not limited to one parent workspace and a set of child workspaces. It is implemented as a hierarchical tree structure, where a parent workspace can also have a parent, until the root workspace, which does not define any criterion, is reached. Therefore, most workspaces within the system are linked together in a hierarchical tree structure, where each defines a set of local criteria and inherits the cascade of all local definitions of all its parent workspaces, in a chain.
Figure 3 shows an example of a tree structure of workspaces. W1 is the root workspace; it does not define any query. Workspaces W11, W12 and W13 are its first-level children, defining the first criteria. Workspaces W121 and W122 are children of W12, adding their local criteria to the inheritance mechanism.
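The inheritance chain of Figure 3 may be sketched as follows. The workspace names follow the figure, while the class, the method name and the criteria values are hypothetical and chosen only to illustrate the cascade of local definitions.

```python
class Workspace:
    """A node in the workspace tree with local criteria and an optional parent."""

    def __init__(self, name, parent=None, local_criteria=()):
        self.name = name
        self.parent = parent
        self.local_criteria = list(local_criteria)

    def effective_criteria(self):
        """Inherited criteria (root first) followed by the local ones."""
        inherited = self.parent.effective_criteria() if self.parent else []
        return inherited + self.local_criteria

# Criteria values are invented for illustration.
w1 = Workspace("W1")  # root workspace: defines no criterion
w12 = Workspace("W12", w1, [("Industry", "is", "Information Technology")])
w121 = Workspace("W121", w12, [("Company", "is", "Company A")])
w122 = Workspace("W122", w12, [("Company", "is", "Company B")])

# A change to the parent's definition is immediately visible in all children,
# because the children compute their criteria from the parent on demand.
w12.local_criteria.append(("Market", "is", "Europe"))
```

In this sketch the children never copy the parent's criteria, so a single modification of W12 is reflected in both W121 and W122 the next time their criteria are evaluated.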
Linking workspaces in a tree makes it easy to locate them, as the root workspace is always accessible in the graphical user interface. By opening it, all its first-level child workspaces are exhibited, and the user may access further child workspaces by selecting them in the tree structure, for example with a pointing device.
This method of archiving and retrieving documents offers several advantages over prior art methods. First, a graphical display of a simple, small hierarchy is provided that shows all the content of the repository, but still allows progressive refinement from large to smaller document collections, down to the individual document. The organization is ad-hoc and can be personalized by the user, unlike in a library or in an index. Secondly, the queries set up by the user are permanently stored in the definition view of the workspace, which allows the user to re-execute them without re-entering the criteria.
The program that drives the user interface periodically updates the content views of all the workspaces in the system, thus providing a real-time view as opposed to the snapshot view that is commonly found in other document management systems.
It will be appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described hereinabove. Rather, the present invention may include combinations and sub-combinations of the various features disclosed as well as modifications and extensions thereof which fall under the scope of the following claims.