System and method for electronic document handling
Field of the invention
The present invention relates to the field of electronic document handling.
Prior art
In recent years, the Internet and electronic media suitable for publishing purposes have developed rapidly. This requires publishers to move from a situation optimized for print production to an environment optimized for handling electronic information. The applicant of the present invention, Kluwer Academic
Publishers (KAP), developed such an environment for the publication of scientific articles. The basic publication cycle for scientific articles is divided into four sequential processes:
1. Submission
2. Review
3. Production
4. Publication
On the one hand publishers seek ways to increase the efficiency and speed of production of scientific articles while minimizing the costs.
This requires the combination of a workflow engine that controls the production process and a document management system that stores the electronic articles and maintains the corresponding meta-data. Efficient logistics, storage, re-use and management of electronic documents requires these to be in medium-neutral formats such as SGML/XML; see: Flynn, P.: Understanding SGML/XML Tools, Kluwer Academic Publishers, 1998.
At KAP, a system termed "XPS" (XML Production System) controls the production workflow from an accepted manuscript to its XML (half-)product. At the same time, publishers have to implement sophisticated WEB publishing or delivery systems that fit the production engine and fulfil customer needs. KAP implemented a system termed "KAPIS" (Kluwer Academic Publishers Information System) for, among other things, WEB delivery of all articles published by KAP.
It is necessary to convert partially structured or unstructured electronic information to structured information according to the SGML/XML standard and a corresponding
DTD. To achieve this publishers tend to provide authors with templates. Authors write their documents using this template. A dedicated conversion then translates the information in the template into SGML/XML.
This template-driven approach can lead to problems because:
- authors do not like to be forced to write according to too many instructions;
- incorrect use or even abuse of templates causes the (one-to-one) translation to fail;
- different templates have to be written for different word-processors and/or format versions, which causes huge maintenance costs.
Another approach is to provide external contractors with received manuscripts and request them to convert the accepted manuscripts to SGML/XML for KAP during the production process.
However, if the incoming manuscript would be already in XML before the production process, an expensive part (in terms of money and time) of the production process would be eliminated. This would reduce production costs a great deal and facilitate rapid on-line publication. Moreover, since the manuscript would be already available in
SGML/XML even before the actual review process, it would also ensure fast service to and from authors via the Internet.
One of the problems related to the conversion of a document received in a certain format to a document in, e. g., SGML/XML is to determine text classes for different text parts like affiliation, keywords, heading, sections, and references.
Summary of the invention
A first objective of the present invention is, therefore, to provide an arrangement for electronic document handling (EDH) that automatically determines such text classes for the text parts of the document.
To obtain this object, the present invention provides
A computer arrangement for electronic document handling comprising a processor and a memory connected to the processor, the computer arrangement being arranged to carry out the following steps:
a) receiving a first document and storing the document in the memory;
b) dividing the first document into one or more text parts;
c) for each text part, calculating how many times each one of a plurality of predetermined variables is present;
d) using a predetermined classification algorithm, by using at least said times each one of the plurality of variables is present as parameters, to classify each text part as belonging to one of a plurality of predetermined text classes.
Such steps can be easily carried out by modern computer arrangements. Thus, the problem of determining text classes in a document has been converted into, essentially, a calculating method that can be tackled by a computer. Therefore, the computer arrangement thus defined is a strong tool for determining text classes in documents.
Preferably, each predetermined variable is an entity within a text part that can be distinguished from other entities.
In an important embodiment, in step d) also the position of each text part in the first document is used to classify each text part.
A second object of the present invention is to provide an arrangement that automatically converts the submitted electronic document in any format to a predetermined document format, preferably, SGML/XML according to a specific document type definition (DTD).
To that end, the invention provides an arrangement as defined above, wherein the first document has a predetermined first format as identified by first formatting codes, and the arrangement is arranged to carry out the following step: converting at least some of the first formatting codes into second formatting codes from a standard format to render a second document having the standard format, which is, preferably, either SGML or XML.
The invention also relates to a method for electronic document handling using a computer arrangement comprising a processor and a memory connected to the processor, the method comprising the following steps:
a) receiving a first document and storing the document in the memory;
b) dividing the first document into one or more text parts;
c) for each text part, calculating how many times each one of a plurality of predetermined variables is present;
d) using a predetermined classification algorithm, by using at least the times each one of the plurality of variables is present as parameters, to classify each text part as belonging to one of a plurality of predetermined text classes.
Moreover, the present invention relates to a computer program product for electronic document handling that can be loaded by a computer arrangement comprising a processor and a memory connected to the processor, the computer program, after being loaded, providing the computer arrangement with the capacity to carry out the following steps:
a) receiving a first document and storing the document in the memory;
b) dividing the first document into one or more text parts;
c) for each text part, calculating how many times each one of a plurality of predetermined variables is present;
d) using a predetermined classification algorithm, by using at least the times each one of the plurality of variables is present as parameters, to classify each text part as belonging to one of a plurality of predetermined text classes.
Finally, the present invention relates to a data carrier provided with a computer program product as defined above.
Brief description of the drawings
The invention will be explained with reference to some drawings which are intended for illustration purposes only and not to limit the scope of the present invention, which is only limited by the scope of the annexed claims.
Figure 1 schematically shows a publication cycle for a scientific article;
Figure 2 is a general overview of a computer arrangement that can be used to carry out the present invention;
Figure 3 schematically shows an example of text parts to which a text class is assigned in a scientific article;
Figure 4 schematically shows an electronic document handling scheme in accordance with the invention;
Figure 5 shows a flow diagram of a conversion of a document received in a predetermined format to a basic format, e. g., ASCII or a variant thereof.
Figure 6 shows a flow diagram of a basic filter to determine variable values;
Figure 7 shows a flow diagram of a basic filter to replace a suggested text class with an exact text class;
Figure 8 shows a flow diagram of a basic filter to create a train data set;
Figure 9 shows a flow diagram of a basic filter to train a classification algorithm with the train data set;
Figure 10 shows a flow diagram of a basic filter to create a new data set;
Figure 11 shows a flow diagram of a basic filter to classify the text parts in the new data set with the trained classification algorithm;
Figure 12 shows a flow diagram of a basic filter to replace the suggested class attribute for the <TEXT> element with the classified text class;
Figure 13 shows a flow diagram with filters and post-classification;
Figure 14 schematically shows how different objects (text parts) can be discriminated by using discriminant functions;
Figure 15 schematically shows how different objects (text parts) can be classified in different classes by using discriminant functions.
Description of preferred embodiment
Figure 1 shows a schematic overview of a publication cycle for scientific articles. First of all, an author transmits a draft article to the publisher who starts a review process.
After having reviewed and accepted the draft, the production stage can start. After production, the article is published. Then, the article will be read by others which may result in new research, and possibly new articles.
As will be explained in detail hereinafter, the electronic document handling (EDH) system of the invention provides:
- a submission engine for authors to easily submit electronic manuscripts for reviewing;
- a review assignment engine to facilitate scientific editors to assign submitted electronic manuscripts rapidly to available reviewers or referees;
- the possibility to monitor the reviewing process (on-line);
- rapid generation of first proofs before or during the production process;
- rapid on-line publication of electronic articles (pre-prints, forth-comings, etc.).
In general, submitted manuscripts can be:
- structured, if they are in SGML/XML according to a specific DTD;
- partially structured, which means they are in SGML/XML according to other DTDs, in word-processor formats according to a template, in LaTeX according to specific stylefiles, etc.;
- unstructured, in case they are written in word-processor formats not according to a template, in LaTeX not according to specific stylefiles, etc.
In accordance with the invention, EDH automatically converts partially structured or unstructured electronic information to structured information according to a predetermined document format, like the SGML/XML standard and a corresponding
DTD.
As will be explained below, the rationale for the proposed solution is the use of inherent structure of scientific articles. The inherent structure is deduced from a text analysis and translated to a logical structure.
Text analysis
In general text analysis focuses on determining both the abstract structure and the semantic structure of documents. The semantic structure is concerned with text meaning and with the manner in which the total meaning is built-up from the meaning of individual pieces. Examples of subjects concerning semantic structure are automatic keyword classification to generate thesauri, document classification, text categorization, sub-topic structure definitions and text summarization.
The abstract structure specifies how different pieces of text fit together and how these are assembled in the full document. The abstract structure relates to the hierarchy of a document, e.g., as determined by headings and sections. The abstract structure of a document consists of:
- a logical structure, which is the composition of document objects into successively larger text components;
- a lay-out structure, which is the composition of document objects into successively larger physical units, e.g., bold, underlining, italic, indentations, etc.
For retrieval purposes abstract structure and semantic structure are related, i. e., semantic document structure necessarily depends on a reasonable decomposition of text in abstract structure elements; see: Rijsbergen van, C. J.: Information Retrieval, 2nd Ed.,
London, Butterworths, 1979.
The document analysis and recognition community has been dealing with document structure for several years, Bunke, H., and Wang, P. S. P.: Handbook of Optical
Character Recognition and Document Analysis, World Scientific Publishing Company, 1997. One of the research topics is to define a generic document model. This is a definition of how a specific class of documents is structured. Attempts have been made to define classes of documents and then extract a logical structure for this class. The logical structure is the basis for a generic document model. On the other hand the electronic document production community has adopted the SGML/XML language that provides the DTD mechanism. A DTD can be seen as some kind of logical structure for an SGML/XML instance. It actually forces the logical structure on the document.
Recently, developments to combine both approaches have been initiated. First, classes of documents within a set of documents are generated. Then, within a class of documents, text parts that are related are searched for. Defining nodes between related text parts generates a logical structure for a class of documents. Finally, a DTD is developed from that logical structure. Classification algorithms used for this purpose, such as K-nearest neighbors or hierarchical Bayesian clustering, calculate similarities between documents or between text parts by means of their semantics; see: Hearst, M. A., Text-tiling: Segmenting Text into Multi-paragraph Subtopic Paragraphs, Computational Linguistics, 23(1), 33, 1997; Makoto, I., and Takenobu, T., Cluster-Based Text Categorization: A comparison of Category Search Strategies, TR0016, Dep. Comp. Sci., Tokyo Institute of Technology, Okayama, Japan, 1995; Merkle, D., Content-based Document Classification with Highly Compressed Input Data, in: International Conference on Artificial Neural Networks, 1995.
To generate nodes between related text parts, M-gram statistics are used. The relation is searched for on the basis of the context of a text part; see Brugger, R., Bapst, F., and Ingold, R.: A DTD Extension for Document Structure Recognition, in: 7th International Conference on Electronic Publishing, Springer Verlag, 1998.
The new EDH approach
In developing the EDH application, the inventors used a new approach. The semantics or context of text parts were not used at all. Instead, a set of variables that describe text parts was defined. Hence, the content of a text part is replaced by a set of numbers that characterize the content. The rationale for this approach is that scientific articles already have an inherent structure. For example, the use of punctuation marks in text parts that belong to different text classes differs significantly. Other variables taken into account may be: full stops, commas, capitals, small letters, numbers, etc. By describing a text part with a set of such variables one can use statistical methods that, e.g., calculate a discriminating measure between text parts that belong to different text classes. This method is explained below with reference to figure 3.
However, first of all an arrangement by means of which such a method can be carried out will be explained with reference to figure 2.
In figure 2, an overview is given of a computer arrangement that can be used to carry out the method according to the invention. The arrangement comprises a processor 1 for carrying out arithmetic operations.
The processor 1 is connected to a plurality of memory components, including a hard disk 5, Read Only Memory (ROM) 7, Electrically Erasable Programmable Read Only Memory (EEPROM) 9, and Random Access Memory (RAM) 11. Not all of these memory types need necessarily be provided. Moreover, these memory components need not be located physically close to the processor 1 but may be located remote from the processor 1.
The processor 1 is also connected to means for inputting instructions, data etc. by a user, like a keyboard 13, and a mouse 15. Other input means, such as a touch screen, a track ball and/or a voice converter, known to persons skilled in the art may be provided too.
A reading unit 17 connected to the processor 1 is provided. The reading unit 17 is arranged to read data from and possibly write data on a data carrier like a floppy disk 19 or a CDROM 21. Other data carriers may be tapes, DVD, etc., as is known to persons skilled in the art.
The processor 1 is also connected to a printer 23 for printing output data on paper, as well as to a display 3, for instance, a monitor or LCD (Liquid Crystal Display) screen, or any other type of display known to persons skilled in the art.
The processor 1 may be connected to a communication network 27, for instance, the
Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area
Network (WAN), etc. by means of I/O means 25. The processor 1 may be arranged to communicate with other communication arrangements through the network 27.
The processor 1 may be implemented as a stand-alone system, or as a plurality of parallel operating processors each arranged to carry out subtasks of a larger computer program, or as one or more main processors with several subprocessors. Parts of the functionality of the invention may even be carried out by remote processors communicating with processor 1 through the network 27.
One of the crucial parts of EDH is that it automatically does a classification of the input document. The input document can be transmitted to EDH, e. g., through the Internet.
Then, a user sends his document (in format X) to the EDH application running on the computer arrangement shown in figure 2. The EDH application receives the document and processes it on the fly (i.e., strips off formatting codes, calculates variable values, classifies, puts XML tags in, etc.). When EDH is ready, it delivers the resulting structured XML to the user. Hence, there is no human intervention during the EDH process. Communications are preferably done through the communication network 27.
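By way of illustration, the overall flow can be sketched in a few lines of Python. This is a toy stand-in for the actual Omnimark implementation: the rule-based classifier and all function names are assumptions made for this sketch only.

import re

# Toy sketch of the EDH flow: strip formatting codes, split into text
# parts, compute variable values, classify, and wrap in tags.
def strip_formatting(doc: str) -> str:
    # stand-in for part 1: remove simple RTF-like control words
    return re.sub(r"\\[a-z]+\d* ?", "", doc)

def count_variables(part: str) -> dict:
    return {"commas": part.count(","),
            "full_stops": part.count("."),
            "digits": sum(c.isdigit() for c in part)}

def classify(v: dict) -> str:
    # trivial placeholder for the trained classification algorithm
    return "References" if v["commas"] + v["digits"] > 5 else "Section"

def edh(doc: str) -> str:
    parts = [p for p in strip_formatting(doc).splitlines() if p.strip()]
    return "\n".join(f'<TEXT class="{classify(count_variables(p))}">{p}</TEXT>'
                     for p in parts)

print(edh("\\b Introduction\\b0\nSmith, J., 1998. On X. J. of Y, 12(3), 45."))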
Figure 3 shows an example of different text parts 30 in a scientific article: a title 31, an affiliation 33, an abstract 35, key words 37, one or more sections 39,41, and references 43.
Figure 4 depicts a possible implementation scheme for EDH, which consists, essentially, of three parts. Each of these parts is described in detail by means of flow charts on the following pages.
The first part is a conversion step 52 of a document in a certain format X (50) to a format here termed "ASCII+" (54). Preferably, the submitted format is converted to ASCII, i.e., all formatting codes and codes that only partially structure the input manuscript should be removed. However, EDH does not remove all formatting codes in this part. The reasons are that:
- in the second part of EDH the inherent structure of the manuscript is analyzed (both bold and italics formatting codes apparently contribute to this structure);
- basic mathematics, chemical formulas, etc. can be generated with formatting codes such as subscript and superscript;
- some coding will not only represent format information but also directly identify the content, e.g., in the case of footnotes.
Therefore, instead of removing these formatting codes, EDH translates these latter codes directly into equivalent coding according to a predetermined DTD, referred to as KAP's scientific article DTD. To that effect, in one embodiment EDH used public domain Omnimark software, developed by R. Geimer and called "Rtf2Xml" (at the time of filing, downloadable from www.omnimark.com). All other programs in the EDH application were written in Omnimark by the inventors.
The second part is the application of a set of basic filters, step 56, of which the most important one is a classification filter. The filters disassemble the manuscript into text parts and classify these into text classes. From these classified text parts the XML instance is built. There are 5 basic classes, as indicated in 58.
The third part is the application of a filter, which corrects the classification result for small errors with a post-classification program, step 60. This part also generates a final granularity in the XML instance, 62.
Below, the first, second, and third parts (steps 52,56,60) will be described in detail.
First part (step 52)
Step 521, "Open document in format X": the document is opened. In the current version, format X stands for RTF. In the near future other formats will be included.
Step 522, "Convert format X to XML": this conversion step uses Rtf2Xml version 0.5 (public domain) Omnimark code. This program:
a) generates an XML header with general information such as filename, date, author name, etc.;
b) converts RTF formatting code tags into STRING elements (the formatting information is captured in attributes of the STRING element);
c) converts RTF tags that represent structure codes, such as paragraphs and lists with items, into corresponding XML tags.
Flags that invoke extraction of figures (after which these are saved on the file system) can be set. Rtf2Xml translates RTF codes directly into XML. However, the resulting
XML instance has no (or minimal) logical structure yet. Therefore, additional
Omnimark programs were written to add a logical structure to the document step-by-step.
Use is made of a program called "TransCal", which is part of the Rtf2Xml package referred to above. It is called after Rtf2Xml and scans the output of the Rtf2Xml program for table elements. The content of table elements is then translated into CALS table element tags. CALS is a table standard known to persons skilled in the art and available from http://www.oasis-open.org/cover/tr9502.html.
In step 523 it is tested whether or not there are figures present in the original document.
When the 'extract-figures' flag is set in Rtf2Xml, figures are exported to the file system, step 524. Then, a FIGDATA element is inserted in the document to replace the RTF code for the figure.
In step 525, it is checked whether or not the original document comprises tables. If present, the program TransCal translates table contents into CALS tags. Then, in step 526, CALS table tags are translated into equivalent DTD elements, as shown in Table 1. The program then 'exports' the complete DTD TABLE element (and of course its content) to the file system and replaces the TABLE element with an ENTITY-INSERT element (and its corresponding attributes). It will be seen that finally EDH renders the resulting XML instance to PDF or HTML. During the rendering, EDH imports the tables from the file system and processes these on-the-fly.
The program continues with step 527: "Remove XML tags not defined in the DTD".
This program takes as input the output of TransCal. XML tags that are the result of that conversion are translated to DTD equivalents whenever this is possible. This is mainly true for formatting tags, as shown in Table 2. All other formatting tags such as font size information or underline attributes are removed. Codes that identify the content are translated directly into equivalent DTD elements, as shown in Table 3. The FIGDATA element is translated into an ENTITY-INSERT element (with corresponding attributes).
Table 1: Translation of figure and table elements

RTF2XML/TRANSCAL                                DTD
Element   Attribute   Value                     Element         Attribute    Value
FIGDATA   -           -                         ENTITY-INSERT   Eid          File name
                                                                Eid-type     To-be-defined
                                                                Alt          To-be-defined
TABLE     -           -                         ENTITY-INSERT   Eid          Table name
                                                                Eid-type     To-be-defined
                                                                Alt          To-be-defined
TGROUP    -           -                         TABMATTER       Rule-style   Kap
COLSPEC   -           -                         -               -            -
THEAD     -           -                         TABHEAD         -            -
TBODY     -           -                         TABBODY         -            -
ROW       -           -                         ROW             -            -
          Rowsep      No|yes                    HSEP            -            -
ENTRY     Valign      top|middle|bottom         CELL            Valign       top|middle|bottom
          Align       left|right|center|        Halign          left|right|center|
                      justify|char                              justify|char

Definitions:
FIGDATA = element name to specify that a figure is exported (specific figure value)
TABLE = element name for a table
TGROUP = element name that defines the actual table
COLSPEC = element name that specifies a column name
THEAD = element name for the table head
TBODY = element name for the table body
ROW = element name for a table row
ENTRY = element name for a row entry in the table
ENTITY-INSERT = element name to specify that a figure or table was exported (specifies the figure/table name)
TABMATTER = element name that defines the actual table
TABHEAD = element name for the table head
TABBODY = element name for the table body
ROW = element name for the table row
HSEP = element name that specifies a horizontal separator (line)
CELL = element name for a row entry in the table
The definitions of the attributes can be found in CALS.
Table 2: Translation of format elements

RTF2XML                                DTD
Element   Attribute     Value          Element
STRING    Subscript     On             SUB
STRING    Superscript   On             SUP
STRING    Italic        On             EMPHASIS
STRING    Bold          On             STRONG

Definitions:
STRING = element name that specifies a line of content
SUB = element name that specifies subscript formatted content
SUP = element name that specifies superscript formatted content
EMPHASIS = element name that specifies italic formatted content
STRONG = element name that specifies bold formatted content
Table 3: Translation of structure elements

RTF2XML      DTD
Element      Element   Attribute   Value
-            LIST(1)   Type        Ord or Unord
PNTEXT/P     ITEM      -           -
FOOTNOTE     NOTE      -           -

(1) Wrapped around the ITEM elements.

Definitions:
PNTEXT/P = element name that defines an item in a list
FOOTNOTE = element name that defines a footnote
LIST = element name that defines a list
ITEM = element name that defines an item
NOTE = element name that defines a note
In step 527, all remaining XML code that has no equivalent in the DTD is removed.
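The tag translations of Tables 2 and 3 amount to a simple mapping from RTF2XML output to DTD elements. A minimal Python sketch follows; the inventors implemented this in Omnimark, so this mapping-driven form is an illustrative assumption rather than the actual code.

# Translation of formatting attributes on STRING elements per Table 2;
# unmapped formatting (e.g. font size, underline) is dropped while the
# content itself is kept.
FORMAT_MAP = {
    "subscript": "SUB",
    "superscript": "SUP",
    "italic": "EMPHASIS",
    "bold": "STRONG",
}

def translate_string_element(attribute: str, content: str) -> str:
    element = FORMAT_MAP.get(attribute.lower())
    if element is None:
        return content
    return f"<{element}>{content}</{element}>"

print(translate_string_element("Italic", "in vitro"))   # <EMPHASIS>in vitro</EMPHASIS>
print(translate_string_element("Underline", "kept"))    # formatting removed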
Second part (step 56)
Now part 1 is finished and part 2 can start. Part 2 (step 56) comprises the following subparts:
- part 2a: basic filter to determine variable values, figure 6;
- part 2b-1: basic filter to replace the suggested text class (suggest:raw attribute value) with the exact text class, figure 7;
- part 2b-2: basic filter to create a "train" data set, figure 8;
- part 2b-3: basic filter to train the classification algorithm with the train data set, figure 9;
- part 2c-1: basic filter to create a "new" data set, figure 10;
- part 2c-2: basic filter to classify the text parts in the new data set with the trained classification algorithm, figure 11;
- part 2c-3: basic filter to replace the suggested class attribute for the TEXT element with the classified text class, figure 12.
As shown in figure 6, part 2a starts in step 530 with the output of part 1 (i.e., figure 5).
If End Of File (EOF) is reached for the document, step 531, the program proceeds with
Part 2b in case of a training session or proceeds with Part 2c in case of a classification process, as shown in step 532.
If EOF is not reached, the program continues with step 533. If in step 533 it is established that the current element is an ENTITY-INSERT element, the program jumps back to step 531. Else, the program continues with step 534.
In step 534, "Extract text part n", the procedure scans the document for text parts, i.e., chunks of bytes terminated by a predetermined end-of-bytes code (in this case an end-of-line), and for ENTITY-INSERT elements.
In step 535, a text part (with all its current tags) is wrapped into a temporary TEXT element tag.
In step 536, it is tested whether the value in the TEXT element content has been calculated for the last variable from the range of predetermined variables. If so, the program proceeds with step 537. If not, the program proceeds with step 539.
In step 539, the content of the text part is analyzed by a separate function, which is called "CountIntegerFunctions". This function calculates values for a set of variables that are defined to describe a text part.
As observed above, scientific articles already have an inherent structure. For example, bibliographic references have more punctuation than headings, and keywords are single words or short word sequences most often separated by commas. This means that it is possible to discriminate between text parts that belong to different text classes in a way that is not dependent on their content or context. By describing a text part with a set of variables (in practice this means that the content of a text part is replaced with a series of numbers) one can use statistical methods that calculate a discriminating measure between text parts. EDH uses four sets of variables that describe text parts:
Words:
1. Italic
2. Bold
3. Short words such as 'the' and 'a(n)'
4. Common words (a short thesaurus used in lexicographical analysis)
5. Affiliation-type words
6. References-type words
7. Heading-type words
In one test, 100 scientific articles were analyzed and words that often appeared in affiliations, bibliographical references and headings, respectively, were selected. These words form short thesauri (variables 5,6, and 7).
Characters:
1. Full-stop
2. Colon
3. Semi-colon
4. Comma
5. Upper case characters
6. Lower case characters
7. Digits
8. Brackets such as [ and {
9. Quotes
10. Math signs such as + and =
11. Others (everything that is not covered by 1-10)

Strings:
1. Position of the string (text part) in the document
Denominators:
1. Total number of characters in a string
2. Total number of words in a string
3. Total number of words in a manuscript
Sub-functions in CountIntegerFunctions analyze each text part that was previously created. For each of the variables described above, CountIntegerFunctions has a sub-function. This sub-function counts the value of that variable in the text part and returns it.
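Below is a minimal Python sketch of such counting sub-functions; the tiny thesauri are illustrative stand-ins for the ones derived from the analyzed articles, and the exact variable order is an assumption for this sketch.

# Illustrative sub-functions: each entry of the returned vector is the
# count of one predetermined variable in the text part.
AFFILIATION_WORDS = {"university", "department", "institute"}
REFERENCE_WORDS = {"vol", "pp", "eds", "journal"}

def describe_text_part(text: str, position: int) -> list[int]:
    words = text.lower().split()
    return [
        sum(w in AFFILIATION_WORDS for w in words),  # affiliation-type words
        sum(w in REFERENCE_WORDS for w in words),    # references-type words
        text.count("."),                             # full stops
        text.count(","),                             # commas
        sum(c.isupper() for c in text),              # upper case characters
        sum(c.islower() for c in text),              # lower case characters
        sum(c.isdigit() for c in text),              # digits
        position,                                    # position in the document
    ]

print(describe_text_part("Dept. of Physics, University of X", position=2))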
If the last variable has been analyzed, the program jumps to step 537 in which it suggests a text class value. TEXT elements contain two additional attributes, suggest:raw and suggest:classification. Suggest:raw gives a raw indication of what the class could be, based on the counting only. Suggest:classification is filled in later by means of a statistical program. Their value can, e.g., be one of the following five classes: 1. Affiliation 2. Keywords 3. Heading 4. Section 5. References.
The classification algorithm classifies each text part as belonging to one of these basic classes.
The value for suggest:classification is initially set to 'none', while CountIntegerFunctions will give an estimate for suggest:raw.
After step 537, the program proceeds with step 538 in which the calculated variable values and the estimated suggest:raw value are assigned to attributes of the TEXT element. Then, the program returns to step 531.
The result of steps 530 through 539 is a document that contains text parts wrapped in
TEXT tags and ENTITY-INSERT elements.
Part 2b-1, shown in figure 7, starts with the output of part 2a via step 532.
In step 540, "Get element TEXT", a TEXT element is obtained.
Part 2b is intended for training purposes. Hence, the exact text class for a text part must be known. Therefore, the suggest:raw attribute value of the element from step 540 is extracted. Its value is compared to the actual content of the TEXT element. When the value is not correct, it is replaced with a correct value, step 541. This is a manual procedure and has to be done only once, to create a 'train' data set.
In step 542 it is checked whether the last element has been processed. If not, the program returns to step 540. If so, the program proceeds with step 543, in which the document is saved and stored.
Figure 8 shows a basic filter to create a "train" data set. In step 550, "Initialize train data set", a file is opened in which the train data can be stored.
In step 551 it is checked whether or not the last document is processed. If so, the train data set file is closed, saved and stored on the file system. If not, the program continues with step 553"Open document": the next document in the series of documents that is used to create the train data set is opened. The documents used for this purpose are, preferably, in ASCII+ format, which is the result of part 1 and part 2a.
Step 554 checks whether the End Of File (EOF) is reached. If so, the program continues with step 551. If not, the program continues with step 555, "Get TEXT element", and obtains the next TEXT element from the document that is open.
In step 556, the variable values v and the text class value c are extracted from the attributes of the TEXT element.
These attributes for each TEXT element are appended to the train data set file that is open, as shown in step 557.
In this manner a train data set file is created with

r = n1 + n2 + ... + nm

rows, where m is the number of documents and ni is the number of text parts in document i. Each row represents a text part described by v columns of variables. Column v + 1 is a column that describes the text class.
Thus, strings are translated into vectors of numbers, e.g., the number of commas, capitals, etc. per line.
The result of figure 8 will be used to train a classification algorithm, as shown in figure 9. In step 560, "Open train data set", the file that was created and stored in part 2b-2 is opened.
In step 561, "Get row r", row r is retrieved from the train data set file.
In step 562, by using or interpreting the variable values v the classification algorithm estimates a value for the text class.
In step 563, the estimated text class value from step 562 is compared with the exact text class value that is in column v + 1 of the train data set file. This comparison is translated into an error value.
In step 564, if the last row from the train data set is processed, the program proceeds with step 565. If not, the program returns to step 561.
In step 565, it is checked whether or not the classification algorithm has converged, i. e. the error value calculated in step 563 does not decrease anymore or has reached a certain threshold. If so, the program continues with step 568. If not, the program continues with step 566.
In step 568, "Save and store the train parameters", the train parameters of the trained and converged classification algorithm are saved and stored in the file system.
In step 566, the train parameters of the classification algorithm are adjusted. Then, the program goes to the first row of the train data set file and proceeds with step 561.
Classification algorithms suitable for carrying out the steps of figure 9 are, among others, Discriminant Analysis and Artificial Neural Networks. At KAP a basic Linear Discriminant Analysis program written in the R language was used (R is an S-Plus dialect that is distributed as public domain software by the Comprehensive R Archive Network; see http://www.ci.tuwien.ac.at/R).
Linear Discriminant Analysis (LDA) is a statistical method to find c classes within a group of m objects described by n variables; see Table 4.
Table 4

                 Variables
         Var 1   Var 2   ...   Var n
Obj 1    x11     x12     ...   x1n
Obj 2    x21     x22     ...   x2n
...
Obj m    xm1     xm2     ...   xmn
In order to do this, linear combinations of the variables are created. These linear combinations are called the discriminant functions (DF), e.g.

DF1 = w11·var1 + w12·var2 + ... + w1n·varn
DF2 = w21·var1 + w22·var2 + ... + w2n·varn
...
DFm = wm1·var1 + wm2·var2 + ... + wmn·varn

where wij, i = 1, 2, 3, ..., m; j = 1, 2, 3, ..., n, are weighing factors. In an iterative approach (that uses an eigenvalue decomposition of the object/variable matrix) those weighing factors that give an optimal discrimination between classes of objects are found. DF1 and DF2 can be plotted against each other (i.e., after filling in the values x) for all objects to visualize the discrimination, as visualized in figure 14. Figure 14 shows that groups of objects are clustered remote from other groups of objects.
These different groups appear to relate to different classes.
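By way of illustration, the following Python sketch fits discriminant functions on a toy object/variable matrix. The embodiment described here used an R program; scikit-learn's LinearDiscriminantAnalysis is used as a modern stand-in, and all data values are invented for the example.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy object/variable matrix: two variables (e.g. commas and lower case
# characters per text part) for four objects of two known classes.
X = np.array([[0.0, 9.0],
              [1.0, 7.0],
              [7.0, 3.0],
              [9.0, 2.0]])
y = np.array(["section", "section", "references", "references"])

lda = LinearDiscriminantAnalysis()
scores = lda.fit(X, y).transform(X)  # DF values; one DF for two classes
print(scores.ravel())                # the two classes form separate clusters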
To create the train data set, the following steps may be carried out. Take D documents. For each document:
- extract the text parts (objects);
- calculate the value x for each of the n variables in that text part;
- append the vector with variable values for this object to the train data set.
Because these are train documents, the text class for each object is known and can be assigned to a vector. The train data set then looks like, e.g.:
(see table 5)
Table 5

                 Variables                     Text class
         Var 1   Var 2   ...   Var n           Class
Obj 1    x11     x12     ...   x1n             section
Obj 2    x21     x22     ...   x2n             section
Obj 3    x31     x32     ...   x3n             section
Obj 4    x41     x42     ...   x4n             keyword
Obj 5    x51     x52     ...   x5n             keyword
Obj 6    x61     x62     ...   x6n             references
Obj 7    x71     x72     ...   x7n             references
Etc.
Finding weighing factors with LDA for this train data set might result in a sufficient discrimination between the text classes, as shown in figure 15. Figure 15 shows three different groups of objects relating to different classes: one relates to sections, another one to references and still another one to keywords.
In part 2c new documents will be classified using the trained algorithm. This will be explained in detail below.
Figure 10 shows a basic filter to create a new data set.
In step 570, a file in which the new data can be stored is opened.
In step 571, the document of which the text parts have to be classified into text classes is opened. The documents used for this purpose are, preferably, in ASCII+ format, which is the result of Part 1 and Part 2a.
If in step 572, it is established that the End Of File (EOF) is reached the program proceeds with step 573. If not it continues with step 574.
In step 574 a TEXT element is obtained from the document.
In step 575, variable values v and text class value c from the attributes of the TEXT element are extracted.
In step 576, these attributes for each TEXT element are appended to the new data set file. In this manner, a new data set file is created with r' rows. Each row represents a text part described by v columns of variables. Column v + 1 is a column that describes the suggested text class.
After the End Of File has been reached, the new data set is saved and stored, in step 573.
Now, the r' text parts in the new data set are to be classified with the trained classification algorithm. Figure 11 shows a flow diagram to do so.
In step 580, the new data set file is opened.
In step 581, the train parameters file from part 2b is opened.
In step 582, it is checked whether or not the processed row from the new data set is the last row. If not, the process continues with step 584. If so, the process goes to step 583.
In step 584, a row from the new data set is obtained. With the classification algorithm and the train parameter configuration which was opened in step 581 the variable values in the current row are classified into a text class, as shown in step 585. This classification is done in the following way.
For an extracted text part (object), the values xij (i = 1, 2, ..., m; j = 1, 2, ..., n) for each of the n variables in the text part are calculated. A vector comprising these variable values for this text part is appended to the new data set. The values of DF1, DF2, etc. for the text parts in this new data set are calculated using the weighing factors wij that were obtained during the training session. Now, using this vector defined by DF1, DF2, ..., it can be established in which area of the (multidimensional) space this text part is located. From the training session it is known how different text classes are located in this (multidimensional) space; cf. figure 15, which shows an example of a two-dimensional space.
Then, it may, e.g., be established that the text part concerned is a section.
After step 585, the program returns to step 582.
In step 583, the text classes that are the result of this classification are saved and stored.
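A self-contained Python sketch of this classification step, reusing the toy training data of the earlier LDA example (all values are invented for illustration):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Re-fit the toy model of the training example, then classify rows of a
# "new" data set by projecting them with the stored weighing factors.
X_train = np.array([[0.0, 9.0], [1.0, 7.0], [7.0, 3.0], [9.0, 2.0]])
y_train = np.array(["section", "section", "references", "references"])
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

X_new = np.array([[8.0, 1.0],   # many commas, few lower case: reference-like
                  [0.5, 8.0]])  # few commas, much lower case: section-like
print(lda.predict(X_new))       # e.g. ['references' 'section']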
Figure 12 shows a flow diagram of a basic filter to replace the suggested class attribute for TEXT element with the classified text class.
In step 590, the classified text class file from part 2c-2, i. e. step 583 in figure 11, is opened. Then, in step 591, the document D is opened.
In step 592, it is checked whether or not the TEXT element from the current processed document D is the last one. If so, the process proceeds with step 593, i. e. it jumps to part 3 (figure 13). Else, the process continues with step 594 in which the next TEXT element is obtained from the current document D.
In step 595, the suggest:raw attribute of the current TEXT element is replaced with the classified text class value from the classified text class file (output of step 583) and, then, the process returns to step 592.
Third part (step 60)
Now, part 3 will be described with reference to the flow diagram of figure 13. The input of the flow diagram of figure 13 is coming from step 593 in figure 12.
Each TEXT element in document D is now accompanied by a calculated text class value. In step 600, "Replace TEXT element with DTD text class element", the TEXT element tag of each text part is replaced with the corresponding DTD element tag according to the calculated class value, as shown in Table 6.
Table 6: Elements for text classes

Text class    DTD
              Element       Attribute   Value
Affiliation   AUTHORGROUP   -           -
Keywords      KEYWORDS      Type        Keywords or JEL-code
Heading       STITLE        -           -
Section       PAR-F         -           -
References    BIBUNSTRUCT   -           -
Document D has text parts that are wrapped in one of the text class elements AUTHORGROUP, KEYWORDS, STITLE, PAR-F or BIBUNSTRUCT. According to the DTD, these text class elements are sub-elements of elements higher in the hierarchy or have sub-elements themselves. Therefore, three additional actions are necessary: steps 601, 602, 603.
In step 601, DTD header element tags are added in accordance with the following logical structure:
ARTICLE (version="firstproof" versionno="0" aid="unknown" aid-type="unknown" language="english" translit="latin1" doi-id="unknown" refers-to="unknown" refers-to-type="unknown" prod-dept="unknown" contractor="unknown")
  PRODUCTHOOK (pid="unknown" pid-type="unknown" journalcode="unknown" issn="unknown" medium="unknown" volume="unknown" last-volume="unknown" issue="unknown" first-page="unknown" last-page="unknown" publisher-id="KAP" style="unknown" styleversion="unknown")
  ARTICLETYPE (type="research-article")
  EN-TITLE
  AUTHORGROUP
  (KEYWORDS (type="keywords|jel-code")
    KEYWORD+)
  STITLE+
  PAR-F+
  (BIBLIST
    BIBUNSTRUCT+ (id=NUMBER)
      PAR-S+)
The top element in the DTD is ARTICLE. The element ARTICLE has sub-elements according to the content model: <!ELEMENT article (producthook+, style-adjust*, copyright?, articletype, doctopic*, front-matter, body?, back-matter?)>
The top element ARTICLE and the header elements PRODUCTHOOK and
ARTICLETYPE are required elements. Hence these are added to the document structure. The corresponding required attributes, such as version, for these elements are added as well. Their value is set to 'unknown' if no value can be given at this stage.
The header elements STYLE-ADJUST, COPYRIGHT and DOCTOPIC are optional elements and are not inserted in the document structure.
Step 602: now that the header elements are inserted, the remaining sub-elements of the ARTICLE element should be introduced. This is actually done in step 603. However, some pre-requisites are set prior to step 603. First, the EN-TITLE element of FRONT-MATTER is introduced. EN-TITLE is the main title of the document and it is introduced by taking the first text part in the document that is wrapped in an STITLE element. Then the content of the KEYWORDS element is analyzed to isolate the separate keywords. These isolated keywords are each wrapped in a KEYWORD element. Finally, the BIBUNSTRUCT elements are wrapped in one BIBLIST element.
This filter also uses a post-classification algorithm. For each text part, CountIntegerFunctions calculated its position in the document. Dividing this value by the total number of text parts in the manuscript yields its relative position. The post-classification algorithm checks whether the text class calculated by the classification algorithm for a text part is valid with respect to its relative position. E.g., if a text part was classified as a reference but its relative position is near the start of the document, this will probably be a wrong classification. The post-classification then suggests an alternative, valid class for this text part.
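A minimal sketch of such a position-based validity check; the 0.5 thresholds and the fallback class are assumptions for illustration, not values taken from the described embodiment.

def post_classify(text_class: str, position: int, total_parts: int) -> str:
    # relative position of the text part in the manuscript, in [0, 1]
    relative_position = position / total_parts
    if text_class == "References" and relative_position < 0.5:
        return "Section"      # references near the start are implausible
    if text_class == "Affiliation" and relative_position > 0.5:
        return "Section"      # affiliations near the end are implausible
    return text_class

print(post_classify("References", position=2, total_parts=40))  # -> Section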
In step 603, final granularity is added: this program restructures the elements in such a way that each STITLE precedes its corresponding PAR-F elements. In this manner
SECTION elements are generated, as defined below: (SECTION
STITLE
PAR-F+)
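A minimal Python sketch of this grouping step; the input and output shapes are assumptions for illustration only.

# Group each STITLE with the PAR-F parts that follow it into a SECTION.
def build_sections(parts: list[tuple[str, str]]) -> list[dict]:
    sections, current = [], None
    for tag, text in parts:
        if tag == "STITLE":
            current = {"STITLE": text, "PAR-F": []}
            sections.append(current)
        elif tag == "PAR-F" and current is not None:
            current["PAR-F"].append(text)
    return sections

parts = [("STITLE", "Introduction"), ("PAR-F", "First paragraph."),
         ("PAR-F", "Second paragraph."), ("STITLE", "Methods"),
         ("PAR-F", "Third paragraph.")]
print(build_sections(parts))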
By analyzing the content of SECTION elements and replacing the SECTION tags where appropriate final granularity is added. E. g., initially an abstract will be classified as a section. After analyzing the content of this section it may become clear that it is an abstract and the SECTION tags are replaced with EN-ABSTRACT tags. In addition the elements are assigned to the FRONT-MATTER, BODY or BACK-MATTER element.
Finally, the corresponding attributes are put into place, in accordance with the following logical structure:

ARTICLE (version="firstproof" versionno="0" aid="unknown" aid-type="unknown" language="english" translit="latin1" doi-id="unknown" refers-to="unknown" refers-to-type="unknown" prod-dept="unknown" contractor="unknown")
  PRODUCTHOOK (pid="unknown" pid-type="unknown" journalcode="unknown" issn="unknown" medium="unknown" volume="unknown" last-volume="unknown" issue="unknown" first-page="unknown" last-page="unknown" publisher-id="KAP" style="unknown" styleversion="unknown")
  ARTICLETYPE (type="research-article")
  FRONT-MATTER
    EN-TITLE (id=NUMBER)
    AUTHORGROUP (id=NUMBER)
    (EN-ABSTRACT (id=NUMBER)
      PAR-F+ (id=NUMBER))
    (KEYWORDS (id=NUMBER type="keywords|jel-code")
      KEYWORD+ (id=NUMBER))
  BODY
    (SECTION (id=NUMBER)
      STITLE (id=NUMBER)
      PAR-F+ (id=NUMBER))+
  BACK-MATTER
    (ACKNOWLEDGMENTS (id=NUMBER)
      PAR-F+ (id=NUMBER))
    (BIBLIST (id=NUMBER)
      (BIBUNSTRUCT (id=NUMBER)
        PAR-S+ (id=NUMBER))+)
A working EDH prototype has been developed that takes a Word RTF manuscript of a scientific article and turns it, without human intervention, into an XML instance with high-level structure.
The data set used could be re-classified with an accuracy of 93.5%.
The following materials and methods were used in the prototype:
Data:
- 100 MS Word manuscripts of scientific articles
  - covering most scientific article subjects
  - containing both inserted tables and figures
  - not containing inserted equation boxes
- the DTD

Software:
- Windows 95 operating system
- Apache http server, 1.3.6
- Omnimark 5.0, Windows version
- R 6.4.2, public domain Windows 95 version

Hardware:
- Intel Pentium II computer, 64 Mb RAM, approx. 6 Gb hard disk.
Using the values for the variables as such gives a biased classification. E.g., it is obvious that a paragraph in the 'Introduction' has more lower case characters than a heading. To compensate for this, the variables were scaled:
- words variable values are divided by the total number of words in a string;
- characters variable values are divided by the total number of characters in a string;
- the string variable value is divided by the total number of strings in a manuscript.
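A short Python sketch of this scaling; the function signature and argument names are assumptions for illustration.

# Scale raw counts as described above: word counts by the number of
# words in the string, character counts by the number of characters in
# the string, and the position by the number of strings in the manuscript.
def scale_variables(word_counts, char_counts, position,
                    words_in_string, chars_in_string, strings_in_manuscript):
    scaled = [v / words_in_string for v in word_counts]
    scaled += [v / chars_in_string for v in char_counts]
    scaled.append(position / strings_in_manuscript)
    return scaled

print(scale_variables([2, 0], [5, 1], 3,
                      words_in_string=8, chars_in_string=55,
                      strings_in_manuscript=60))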
Training example
Classification consists of assigning a text class to a set of unclassified text parts. First the EDH classification algorithm has to be trained on a set of text parts of which the text classes are known and hence assigned. For this purpose a train data set must be generated.
Therefore, the EDH application was run with the data set of the 100 scientific articles referred to above. 8040 text parts could be generated from this data set. The algorithms described as part 1 and part 2b appended all the Words, Characters and Strings variable values and the suggest:raw value for each text part to a file. The suggest:raw class values were manually corrected. This resulted in a train data set with 8040 text parts described by 23 variable values and an exact text class. Then, the classification algorithm was trained with this train data set. In a training session, examples from the train set are offered to the classification algorithm. It calculates a value for the text class on the basis of the variable values. Then, the calculated text class value is compared to the correct text class value.
If it is wrong, the algorithm must adjust its parameters to improve its calculation in a next cycle. This process continues until the result cannot be improved further. Obviously, for training purposes the EDH algorithm stops here.
The EDH prototype referred to above automatically converts a scientific article in RTF to an XML instance according to the DTD. However, the invention is not restricted to this example. The concept can be extended in several directions:
- Other formats. The majority of submitted articles is in popular formats such as DOC and WPD. These formats are easily converted to RTF and hence can be dealt with using the EDH prototype approach. However, there are large communities that deliver documents authored in LaTeX. This is especially true for documents that contain many mathematical sections. A LaTeX to XML/MathML conversion can be introduced in the EDH concept.
- Other documents. The EDH prototype concentrates on scientific articles. However, the EDH approach is also suitable for other kinds of documents. In the same way, it must be possible to use the EDH approach for documents conforming to a DTD other than the scientific article DTD.
The module-type architecture described shows that the EDH concept is easily transferred to conversion types other than scientific articles in RTF to XML according to the DTD.
Therefore, the EDH concept provides a fast and elegant way to develop automatic conversion tools for electronic documents into XML that can be incorporated in the publication cycle of any publishing company.
Summary
Above, an architecture of an electronic document handling (EDH) system has been described in detail. This EDH system automatically converts electronic manuscripts submitted as, e.g., an RTF document into, e.g., XML according to a predefined scientific article
Document Type Definition (DTD). To this end, the application first strips original formatting codes from the manuscript. It then cuts the manuscript into text parts. The text parts are classified as belonging to a text class defined in, e. g., the scientific article
DTD. By way of example, a Linear Discriminant Analysis (LDA) algorithm was used for this classification. A subset of a huge archive of scientific articles was taken and used as example articles to 'train' the LDA algorithm.
In this way, the algorithm was configured to recognize structure in the example scientific articles and to assign text classes to text parts successfully. LDA is now ready to recognize structure in unclassified documents. As a result of the classification of all the text parts in text classes, the text parts were wrapped in the corresponding element tags from the DTD.
This fast and inexpensive generation of XML documents from electronic input documents makes EDH of potential use for any publishing company.
References
Brugger, R., Bapst, F., and Ingold, R.: A DTD Extension for Document Structure Recognition. In: 7th International Conference on Electronic Publishing, Springer Verlag, 1998.
Bunke, H., and Wang, P. S. P.: Handbook of Optical Character Recognition and
Document Analysis, World Scientific Publishing Company, 1997.
Flynn, P.: Understanding SGML/XML Tools, Kluwer Academic Publishers, 1998.
Hearst, M. A., Text-tiling: Segmenting Text into Multi-paragraph Subtopic Paragraphs,
Computational Linguistics, 23(1), 33, 1997.
Makoto, I., and Takenobu, T., Cluster-Based Text Categorization: A comparison of
Category Search Strategies, TR0016, Dep. Comp. Sci., Tokyo Institute of Technology,
Okayama, Japan, 1995.
Merkle, D., Content-based Document Classification with Highly Compressed Input
Data. In: International Conference on Artificial Neural Networks, 1995.
Rijsbergen van, C. J.: Information Retrieval, 2nd Ed., London, Butterworths, 1979.
Glossary
ASCII: American Standard Code for Information Interchange, which defines the 7-bit coded character set used on most computers. This set contains the 52 letters of the Latin alphabet (26 upper case and 26 lower case), digits 0-9 and some punctuation. ISO 646 International reference version.
Attribute: Additional item of information about an element, contained within a start-tag, in the form of a name and a value separated by an equals sign.
Classification: Assignment of an object to a class.
Class: Group of related objects.
DTD: Document Type Definition. The formal definition of the elements (with their attributes), entities and notations, which make up a specific type of document in SGML or XML.
EDH: Electronic Document Handling.
Element: A named component part of the text or structure of an SGML or
XML document entity. The names given to elements and their potential location in the mark-up hierarchy are declared in the DTD.
KAP: Kluwer Academic Publishers
KAPIS: Kluwer Academic Publishers Information System
LaTeX: (LAmport TeX) A system to prepare a document based on the TeX language which was originally developed by L. Lamport at SRI International.
Users may concentrate on the logical structure of a document instead of on format codes.
Linear Discriminant Analysis (LDA):
Algorithm to classify objects described by variable vectors. A matrix with m rows (vectors) of n variables and a known class c is decomposed. The decomposition aims to find linear combinations of the n variables that result in a maximum discrimination between the known classes in the matrix. The linear combinations of variables are called the linear discriminants. Once the optimal decomposition is found for this matrix with known classes, the linear discriminants may be used to classify objects in a matrix with unknown classes.
Mark-up: Symbols or characters used with text to indicate significance or action.
PDF: Portable Document Format.
RTF: Rich Text Format. Language devised by Microsoft for describing the appearance of text, so that it can be passed between different makes and models of a word processor.
SGML: Standard Generalized Mark-up Language. ISO 8879: 1986.
Tag: Piece of code used to indicate the start (start-tag) and the end (end tag) of an element. In both SGML and XML, tags are indicated by angled brackets.
Text class: Class with related text parts.
Text part: Separate document entity (indicated by a starting point and an end point, e.g., a predetermined end-of-bytes code).
Variable: Entity to describe a specific characteristic of an object.
XML: Extensible Mark-up Language. Fundamentally a subset of SGML that reduces some of the complexity of SGML without reducing its functionality.
XPS: XML Production System