
WO2002056197A1 - System and method for electronic document handling - Google Patents


Info

Publication number
WO2002056197A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
document
predetermined
variables
sep
Prior art date
Application number
PCT/NL2001/000013
Other languages
French (fr)
Inventor
Michaël Leonard Maria BECKERS
Barend Jan De Jong
Original Assignee
Kluwer Academic Publishers B.V.
Priority date
Filing date
Publication date
Application filed by Kluwer Academic Publishers B.V. filed Critical Kluwer Academic Publishers B.V.
Priority to PCT/NL2001/000013
Publication of WO2002056197A1


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computer arrangement and method for electronic document handling with a processor (1) and a memory (5, 7, 9, 11) connected to the processor, the computer arrangement being arranged to carry out the following steps: (a) receiving a first document and storing said document in said memory (5, 7, 9, 11); (b) dividing the first document into one or more text parts; (c) for each text part, calculating how many times each one of a plurality of predetermined variables is present; (d) using a predetermined classification algorithm, by using at least said times each one of said plurality of variables is present as parameters, to classify each text part as belonging to one of a plurality of predetermined text classes.

Description


  



  System and method for electronic document handling
Field of the invention
The present invention relates to the field of electronic document handling.



  Prior art
In recent years, the Internet and electronic media suitable for publishing have developed rapidly. This requires publishers to move from a situation optimized for print production to an environment optimized for handling electronic information. The applicant of the present invention, Kluwer Academic Publishers (KAP), developed such an environment for the publication of scientific articles. The basic publication cycle for scientific articles is divided into four sequential processes:
1. Submission
2. Review
3. Production
4. Publication
Publishers seek ways to increase the efficiency and speed of production of scientific articles while minimizing the costs.

This requires the combination of a workflow engine that controls the production process and a document management system that stores the electronic articles and maintains the corresponding meta-data. Efficient logistics, storage, re-use and management of electronic documents requires these to be in medium-neutral formats such as SGML/XML; see: Flynn, P.: Understanding SGML/XML Tools, Kluwer Academic Publishers, 1998.



At KAP, a system termed "XPS" (XML Production System) controls the production workflow from an accepted manuscript to its XML (half-) product. At the same time, publishers have to implement sophisticated WEB publishing or delivery systems that fit the production engine and fulfill customer needs. KAP implemented a system termed "KAPIS" (Kluwer Academic Publishers Information System) for, among other things, WEB delivery of all articles published by KAP.
It is necessary to convert partially structured or unstructured electronic information to structured information according to the SGML/XML standard and a corresponding DTD. To achieve this, publishers tend to provide authors with templates. Authors write their documents using such a template. A dedicated conversion then translates the information in the template into SGML/XML.

This template-driven approach can lead to problems because:
- authors do not like to be forced to write according to too many instructions;
- incorrect use or even abuse of templates causes the (one-to-one) translation to fail;
- different templates have to be written for different word processors and/or format versions, which causes huge maintenance costs.



  Another approach is to provide external contractors with received manuscripts and request them to convert the accepted manuscripts to SGML/XML for KAP during the production process.



However, if the incoming manuscript were already in XML before the production process, an expensive part (in terms of money and time) of the production process would be eliminated. This would reduce production costs a great deal and facilitate rapid on-line publication. Moreover, since the manuscript would already be available in SGML/XML even before the actual review process, it would also ensure fast service to and from authors via the Internet.



  One of the problems related to the conversion of a document received in a certain format to a document in, e. g., SGML/XML is to determine text classes for different text parts like affiliation, keywords, heading, sections, and references.



  Summary of the invention
A first objective of the present invention is, therefore, to provide an arrangement for electronic document handling (EDH) that automatically determines such text classes for the text parts of the document.



To obtain this object, the present invention provides a computer arrangement for electronic document handling comprising a processor and a memory connected to the processor, the computer arrangement being arranged to carry out the following steps: a) receiving a first document and storing the document in the memory; b) dividing the first document into one or more text parts; c) for each text part, calculating how many times each one of a plurality of predetermined variables is present; d) using a predetermined classification algorithm, by using at least said times each one of the plurality of variables is present as parameters, to classify each text part as belonging to one of a plurality of predetermined text classes.



Such steps can easily be carried out by modern computer arrangements. Thus, the problem of determining text classes in a document has been converted into an essentially computational method that can be tackled by a computer. The computer arrangement thus defined is therefore a strong tool for determining text classes in documents.
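Such a calculation-based classification can be sketched end to end. The five features and the nearest-centroid rule below are illustrative assumptions only; the invention leaves the exact variable set and the classification algorithm open.

```python
from collections import Counter

def feature_vector(text_part):
    """Step (c): count a few predetermined variables in one text part.
    This particular feature set is a hypothetical example."""
    chars = Counter(text_part)
    return [
        chars["."],                             # full stops
        chars[","],                             # commas
        sum(c.isupper() for c in text_part),    # upper case characters
        sum(c.isdigit() for c in text_part),    # digits
        len(text_part.split()),                 # number of words
    ]

def classify(vec, centroids):
    """Step (d): one possible 'predetermined classification algorithm':
    assign the class whose centroid is nearest in feature space."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda cls: sqdist(vec, centroids[cls]))

def handle_document(document, centroids):
    # Steps (a)/(b): the received document is split into text parts,
    # here simply at line breaks.
    parts = [p for p in document.split("\n") if p.strip()]
    return [(p, classify(feature_vector(p), centroids)) for p in parts]
```

With centroids learned from labelled examples (a "train" data set, in the terms used later in this document), handle_document returns a text class per text part.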



  Preferably, each predetermined variable is an entity within a text part that can be distinguished from other entities.



  In an important embodiment, in step d) also the position of each text part in the first document is used to classify each text part.



  A second object of the present invention is to provide an arrangement that automatically converts the submitted electronic document in any format to a predetermined document format, preferably, SGML/XML according to a specific document type definition (DTD).



To that end, the invention provides an arrangement as defined above, wherein the first document has a predetermined first format as identified by first formatting codes, and the arrangement is arranged to carry out the following step: converting at least some of the first formatting codes into second formatting codes from a standard format to render a second document having the standard format, which is preferably either SGML or XML.



The invention also relates to a method for electronic document handling using a computer arrangement comprising a processor and a memory connected to the processor, the method comprising the following steps: a) receiving a first document and storing the document in the memory; b) dividing the first document into one or more text parts; c) for each text part, calculating how many times each one of a plurality of predetermined variables is present; d) using a predetermined classification algorithm, by using at least the times each one of the plurality of variables is present as parameters, to classify each text part as belonging to one of a plurality of predetermined text classes.



Moreover, the present invention relates to a computer program product for electronic document handling, that can be loaded by a computer arrangement comprising a processor and a memory connected to the processor, the computer program after being loaded providing the computer arrangement with the capacity to carry out the following steps: a) receiving a first document and storing the document in the memory; b) dividing the first document into one or more text parts; c) for each text part, calculating how many times each one of a plurality of predetermined variables is present; d) using a predetermined classification algorithm, by using at least the times each one of the plurality of variables is present as parameters, to classify each text part as belonging to one of a plurality of predetermined text classes.



  Finally, the present invention relates to a data carrier provided with a computer program product as defined above. 



  Brief description of the drawings
The invention will be explained with reference to some drawings which are intended for illustration purposes only and not to limit the scope of the present invention, which is only limited by the scope of the annexed claims.



  Figure 1 schematically shows a publication cycle for a scientific article;
Figure 2 is a general overview of a computer arrangement that can be used to carry out the present invention;
Figure 3 schematically shows an example of text parts to which a text class is assigned in a scientific article;
Figure 4 schematically shows an electronic document handling scheme in accordance with the invention;
Figure 5 shows a flow diagram of a conversion of a document received in a predetermined format to a basic format, e.g., ASCII or a variant thereof.



  Figure 6 shows a flow diagram of a basic filter to determine variable values;
Figure 7 shows a flow diagram of a basic filter to replace a suggested text class with an exact text class;
Figure 8 shows a flow diagram of a basic filter to create a train data set;
Figure 9 shows a flow diagram of a basic filter to train a classification algorithm with the train data set;
Figure 10 shows a flow diagram of a basic filter to create a new data set;
Figure 11 shows a flow diagram of a basic filter to classify the text parts in the new data set with the trained classification algorithm;
Figure 12 shows a flow diagram of a basic filter to replace the suggested class attribute for the <TEXT> element with the classified text class;
Figure 13 shows a flow diagram with filters and post-classification;



  Figure 14 schematically shows how different objects (text parts) can be discriminated by using discrimination functions;
Figure 15 schematically shows how different objects (text parts) can be classified in different classes by using discrimination functions.



  Description of preferred embodiment
Figure 1 shows a schematic overview of a publication cycle for scientific articles. First of all, an author transmits a draft article to the publisher who starts a review process.
After having reviewed and accepted the draft, the production stage can start. After production, the article is published. The article will then be read by others, which may result in new research and possibly new articles.



  As will be explained in detail hereinafter, the electronic document handling (EDH) system of the invention provides:
- a submission engine for authors to easily submit electronic manuscripts for reviewing;
- a review assignment engine to facilitate scientific editors to assign submitted electronic manuscripts rapidly to available reviewers or referees;
- the possibility to monitor the reviewing process (on-line);
- rapid generation of first proofs before or during the production process;
- rapid on-line publication of electronic articles (pre-prints, forth-comings, etc.).
In general, submitted manuscripts can be:
- structured, if they are in SGML/XML according to a specific DTD;
- partially structured, which means they are in SGML/XML according to other DTDs, word-processor formats according to a template, LaTeX according to specific stylefiles, etc.;
- unstructured, in case they are written in word-processor formats not according to a template, LaTeX not according to specific stylefiles, etc.



  In accordance with the invention, EDH automatically converts partially structured or unstructured electronic information to structured information according to a predetermined document format, like the SGML/XML standard and a corresponding
DTD.



  As will be explained below, the rationale for the proposed solution is the use of the inherent structure of scientific articles. The inherent structure is deduced from a text analysis and translated to a logical structure.



  Text analysis
In general, text analysis focuses on determining both the abstract structure and the semantic structure of documents. The semantic structure is concerned with text meaning and with the manner in which the total meaning is built up from the meaning of individual pieces. Examples of subjects concerning semantic structure are automatic keyword classification to generate thesauri, document classification, text categorization, sub-topic structure definitions and text summarization.



  The abstract structure specifies how different pieces of text fit together and how these are assembled into the full document. The abstract structure relates to the hierarchy of a document, e.g., as determined by headings and sections. The abstract structure of a document consists of:
- a logical structure, which is the composition of document objects into successively larger text components;
- a lay-out structure, which is the composition of document objects into successively larger physical units, e.g. bold, underlining, italic, indentations, etc.



  For retrieval purposes abstract structure and semantic structure are related, i.e., semantic document structure necessarily depends on a reasonable decomposition of text into abstract structure elements; see: Van Rijsbergen, C. J.: Information Retrieval, 2nd Ed., London, Butterworths, 1979.



  The document analysis and recognition community has been dealing with document structure for several years; see Bunke, H., and Wang, P. S. P.: Handbook of Optical Character Recognition and Document Analysis, World Scientific Publishing Company, 1997. One of the research topics is to define a generic document model. This is a definition of how a specific class of documents is structured. Attempts have been made to define classes of documents and then extract a logical structure for each class. The logical structure is the basis for a generic document model. On the other hand, the electronic document production community has adopted the SGML/XML language that provides the DTD mechanism. A DTD can be seen as a kind of logical structure for an SGML/XML instance. It actually forces the logical structure on the document.



  Recently, developments to combine both approaches have been initiated. First, classes of documents within a set of documents are generated. Then, within a class of documents, text parts that are related are searched for. Defining nodes between related text parts generates a logical structure for a class of documents. Finally, a DTD is developed from that logical structure. Classification algorithms used for this purpose, such as K-nearest neighbors or hierarchical Bayesian clustering, calculate similarities between documents or between text parts by means of their semantics; see: Hearst, M. A., Text-tiling: Segmenting Text into Multi-paragraph Subtopic Paragraphs, Computational Linguistics, 23 (1), 33, 1997; Makoto, I., and Takenobu, T., Cluster Based Text Categorization: A comparison of Category Search Strategies, TR0016, Dep. Comp. Sci., Tokyo Institute of Technology, Okayama, Japan, 1995; Merkle, D., Content-based Document Classification with Highly Compressed Input Data, in: International Conference on Artificial Neural Networks, 1995.
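For contrast with the content-free approach of the present invention, the content-based algorithms cited above compare text parts by their semantics. A minimal K-nearest-neighbors sketch over term-frequency vectors; the data and the function names here are hypothetical, not taken from the cited works:

```python
import math
from collections import Counter

def tf_vector(text):
    # Term-frequency vector over lower-cased whitespace tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(n * b[t] for t, n in a.items())
    na = math.sqrt(sum(n * n for n in a.values()))
    nb = math.sqrt(sum(n * n for n in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(text, labelled, k=3):
    """labelled: list of (text, class) pairs. The query is assigned the
    majority class among its k most similar labelled texts."""
    query = tf_vector(text)
    ranked = sorted(labelled, key=lambda tc: cosine(query, tf_vector(tc[0])),
                    reverse=True)
    votes = Counter(cls for _, cls in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because similarity is computed from the words themselves, this family of methods depends on content, which is precisely what the EDH approach described below avoids.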



  To generate nodes between related text parts, M-gram statistics is used. The relation is searched for on the basis of the context of a text part; see Brugger, R., Bapst, F., and Ingold, R.: A DTD Extension for Document Structure Recognition, in: 7th International Conference on Electronic Publishing, Springer Verlag, 1998.



  The new EDH approach
In developing the EDH application, the inventors used a new approach. The semantics or context of text parts were not used at all. Instead, a set of variables that describe text parts was defined. Hence, the content of a text part is replaced by a set of numbers that characterize the content. The rationale for this approach is that scientific articles already have an inherent structure. For example, the punctuation between text parts that belong to different text classes differs significantly. Other variables taken into account may be: full stops, commas, capitals, small letters, numbers, etc. By describing a text part with a set of such variables one can use statistical methods that, e.g., calculate a discriminating measure between text parts that belong to different text classes. This method is explained below with reference to figure 3. However, first of all an arrangement by means of which such a method can be carried out will be explained with reference to figure 2.
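A minimal illustration of this rationale: punctuation counts alone, with no access to meaning, already separate a bibliographic reference from a heading. The four counts below are a simplified assumption, not the full EDH variable set, and both example strings are invented.

```python
def punctuation_profile(text_part):
    # Replace the content of a text part by numbers that characterize it.
    return {
        "full_stops": text_part.count("."),
        "commas": text_part.count(","),
        "capitals": sum(c.isupper() for c in text_part),
        "digits": sum(c.isdigit() for c in text_part),
    }

# A heading and a bibliographic reference (both invented examples):
heading = punctuation_profile("Materials and Methods")
reference = punctuation_profile(
    "Jones, A. B., and Lee, C.: Some Results, J. Res., 12, 345, 1998.")
```

Every count in the reference profile exceeds its counterpart in the heading profile, so a discriminating measure computed from such profiles can separate the two classes without reading either text.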



  In figure 2, an overview is given of a computer arrangement that can be used to carry out the method according to the invention. The arrangement comprises a processor 1 for carrying out arithmetic operations.



  The processor 1 is connected to a plurality of memory components, including a hard disk 5, Read Only Memory (ROM) 7, Electrically Erasable Programmable Read Only Memory (EEPROM) 9, and Random Access Memory (RAM) 11. Not all of these memory types need necessarily be provided. Moreover, these memory components need not be located physically close to the processor 1 but may be located remote from the processor 1.



  The processor 1 is also connected to means for inputting instructions, data etc. by a user, like a keyboard 13, and a mouse 15. Other input means, such as a touch screen, a track ball and/or a voice converter, known to persons skilled in the art may be provided too.



  A reading unit 17 connected to the processor 1 is provided. The reading unit 17 is arranged to read data from, and possibly write data on, a data carrier like a floppy disk 19 or a CDROM 21. Other data carriers may be tapes, DVD, etc., as is known to persons skilled in the art.



  The processor 1 is also connected to a printer 23 for printing output data on paper, as well as to a display 3, for instance, a monitor or LCD (Liquid Crystal Display) screen, or any other type of display known to persons skilled in the art.



  The processor 1 may be connected to a communication network 27, for instance, the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), etc. by means of I/O means 25. The processor 1 may be arranged to communicate with other communication arrangements through the network 27.



  The processor 1 may be implemented as a stand-alone system, or as a plurality of parallel operating processors each arranged to carry out subtasks of a larger computer program, or as one or more main processors with several subprocessors. Parts of the functionality of the invention may even be carried out by remote processors communicating with processor 1 through the network 27.



  One of the crucial parts of EDH is that it automatically does a classification of the input document. The input document can be transmitted to EDH, e. g., through the Internet.



  Then, a user sends his document (in format X) to the EDH application running on the computer arrangement shown in figure 2. The EDH application receives the document and processes it on the fly (i.e., strips off formatting code, calculates variable values, classifies, puts XML tags in, etc.). When EDH is ready, it delivers the resulting structured XML to the user. Hence, there is no human intervention during the EDH process. Communications are preferably done through the communication network 27.



  Figure 3 shows an example of different text parts 30 in a scientific article: a title 31, an affiliation 33, an abstract 35, key words 37, one or more sections 39,41, and references 43.



  Figure 4 depicts a possible implementation scheme for EDH, which consists, essentially, of three parts. Each of these parts is described in detail by means of flow charts on the following pages.



  The first part is a conversion step 52 of a document in a certain format X (50) to a format here termed "ASCII+" (54). Preferably, the submitted format is converted to ASCII, i.e. all formatting code and codes that only partially structure the input manuscript should be removed. However, EDH does not remove all formatting codes in this part. The reasons are that:
- in the second part of EDH the inherent structure of the manuscript is analyzed (both bold and italics formatting code apparently contribute to this structure);
- basic mathematics, chemical formulas etc. can be generated with formatting codes such as subscript and superscript;
- some coding will not only represent format information but also directly identify the content, e.g. in the case of footnotes.
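The "ASCII+" idea, stripping all formatting except the few codes that carry structure, can be sketched as follows. The token representation and the KEEP set are assumptions for illustration; the actual EDH filter operates on RTF input.

```python
# Hypothetical code names; structure-bearing formatting survives, the rest
# is dropped.
KEEP = {"b", "i", "sub", "sup", "footnote"}

def to_ascii_plus(tokens):
    """tokens: list of (formatting_code_or_None, text). Codes in KEEP are
    retained as simple tags; all others (fonts, colours, margins, ...) are
    removed, leaving "ASCII+": plain text plus a few structural codes."""
    out = []
    for code, text in tokens:
        if code is None or code not in KEEP:
            out.append(text)
        else:
            out.append(f"<{code}>{text}</{code}>")
    return "".join(out)
```

A chemical formula keeps its subscript, while a pure-layout code such as a font size disappears without losing the text it covered.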



  Therefore, instead of removing these formatting codes, EDH translates these latter codes directly into equivalent coding according to a predetermined DTD, referred to as KAP's scientific article DTD. To that effect, in one embodiment EDH used public domain Omnimark software, developed by R. Geimer and called "Rtf2Xml" (at the time of filing, downloadable from www.omnimark.com). All other programs in the EDH application were written in Omnimark by the inventors.



  The second part is the application of a set of basic filters, step 56, of which the most important one is a classification filter. The filters disassemble the manuscript into text parts and classify these into text classes. From these classified text parts the XML instance is built. There are 5 basic classes, as indicated in 58.



  The third part is the application of a filter, which corrects the classification result for small errors with a post-classification program, step 60. This part also generates a final granularity in the XML instance, 62.



  Below, the first, second, and third parts (steps 52,56,60) will be described in detail.



  First part (step 52)
Step 521 "Open document in format X": the document is opened. In the current version, format X stands for RTF. In the near future other formats will be included.



  Step 522 "Convert format X to XML": This conversion step uses Rtf2Xml version 0.5 (public domain) Omnimark code. This program: a) generates an XML header with general information such as filename, date, author name etc.; b) converts RTF formatting code tags into STRING elements (the formatting information is captured in attributes of the STRING element); c) converts RTF tags that represent structure codes, such as paragraphs and lists with items, into corresponding XML tags.
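A toy imitation of step b), the part most relevant to the later classification: runs of text following RTF formatting toggles become STRING elements whose attributes capture the formatting. Real RTF conversion, as performed by Rtf2Xml, is far more involved; this sketch handles only the \b and \i control words.

```python
import re

def rtf_runs_to_strings(rtf):
    """Convert \\b / \\i toggled runs of text into STRING elements whose
    attributes record the active formatting (a sketch, not Rtf2Xml)."""
    out, bold, italic = [], "off", "off"
    # Split on the control words, keeping them; \s? eats the single space
    # that conventionally terminates an RTF control word.
    for token in re.split(r"(\\b0?|\\i0?)\s?", rtf):
        if token == "\\b":
            bold = "on"
        elif token == "\\b0":
            bold = "off"
        elif token == "\\i":
            italic = "on"
        elif token == "\\i0":
            italic = "off"
        elif token:
            out.append(f'<STRING bold="{bold}" italic="{italic}">{token}</STRING>')
    return "".join(out)
```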



  Flags that invoke extraction of figures (after which these are saved on the file system) can be set. Rtf2Xml translates RTF codes directly into XML. However, the resulting XML instance has no (or minimal) logical structure yet. Therefore, additional Omnimark programs were written to add a logical structure to the document step-by-step.



  Use is made of a program called "TransCal" which is part of the Rtf2Xml package referred to above. It is called after Rtf2Xml and scans the output of the Rtf2Xml program for table elements. The content of table elements is then translated into CALS table element tags. CALS is a table standard as known to persons skilled in the art and available from http://www.oasis-open.org/cover/tr9502.html.



  In step 523 it is tested whether or not there are figures present in the original document.



  When the 'extract-figures' flag is set in Rtf2Xml, figures are exported to the file system, step 524. Then, a FIGDATA element is inserted in the document to replace the RTF code for the figure.



  In step 525, it is checked whether or not the original document comprises tables. If present, the program TransCal translates table contents into CALS tags. Then, in step 526, CALS table tags are translated into equivalent DTD elements, as shown in Table 1. The program then 'exports' the complete DTD TABLE element (and of course its content) to the file system and replaces the TABLE element with an ENTITY-INSERT element (and its corresponding attributes). It will be seen that finally EDH renders the resulting XML instance to PDF or HTML. During the rendering, EDH imports the tables from the file system and processes these on-the-fly.



  The program continues with step 527: "Remove XML tags not defined in the DTD". This program takes as input the output of TransCal. XML tags that are the result of that conversion are translated to DTD equivalents whenever this is possible. This is mainly true for formatting tags, as shown in Table 2. All other formatting tags, such as font size information or underline attributes, are removed. Codes that identify the content are translated directly into equivalent DTD elements, as shown in Table 3. The FIGDATA element is translated into an ENTITY-INSERT element (with corresponding attributes).



Table 1: Translation of figure and table elements

  RTF2XML/TRANSCAL                          DTD
  Element   Attribute  Value                Element         Attribute   Value
  FIGDATA   -          -                    ENTITY-INSERT   Eid         File name
                                                            Eid-type    To-be-defined
                                                            Alt         To-be-defined
  TABLE     -          -                    ENTITY-INSERT   Eid         Table name
                                                            Eid-type    To-be-defined
                                                            Alt         To-be-defined
  TGROUP    -          -                    TABMATTER       Rule-style  Kap
  COLSPEC   -          -                    -               -           -
  THEAD     -          -                    TABHEAD         -           -
  TBODY     -          -                    TABBODY         -           -
  ROW       -          -                    ROW             -           -
            Rowsep     No|yes               HSEP            -           -
  ENTRY     Valign     Top|middle|bottom    CELL            Valign      Top|middle|bottom
            Align      Left|right|center|   Halign          Left|right|center|
                       justify|char                         justify|char

Definitions:
FIGDATA = element name to specify that a figure is exported (specific figure value)
TABLE = element name for a table
TGROUP = element name that defines the actual table
COLSPEC = element name that specifies a column name
THEAD = element name for the table head
TBODY = element name for the table body
ROW = element name for a table row
ENTRY = element name for a row entry in the table
ENTITY-INSERT = element name to specify that a figure or table was exported (specifies figure/table name)
TABMATTER = element name that defines the actual table
TABHEAD = element name for the table head
TABBODY = element name for the table body
ROW = element name for the table row
HSEP = element name that specifies a horizontal separator (line)
CELL = element name for a row entry in the table
The definitions of the attributes can be found in CALS.



Table 2: Translation of format elements

  RTF2XML                              DTD
  Element   Attribute     Value        Element
  STRING    Subscript     On           SUB
  STRING    Superscript   On           SUP
  STRING    Italic        On           EMPHASIS
  STRING    Bold          On           STRONG

Definitions:
STRING = element name that specifies a line of content
SUB = element name that specifies subscript formatted content
SUP = element name that specifies superscript formatted content
EMPHASIS = element name that specifies italic formatted content
STRONG = element name that specifies bold formatted content
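The translation of Table 2 is naturally table-driven. A sketch of how the mapping might be applied (attribute names lower-cased here for convenience; this is not KAP's actual Omnimark code):

```python
# Mapping copied from Table 2: a STRING element whose formatting
# attribute is "on" becomes the corresponding DTD element.
FORMAT_MAP = {
    "subscript": "SUB",
    "superscript": "SUP",
    "italic": "EMPHASIS",
    "bold": "STRONG",
}

def translate_string_element(attributes, content):
    """Rewrite one STRING element into its DTD equivalent(s); attributes
    with no DTD counterpart (font size, underline, ...) are dropped."""
    for attr, element in FORMAT_MAP.items():
        if attributes.get(attr) == "on":
            content = f"<{element}>{content}</{element}>"
    return content
```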
Table 3: Translation of structure elements

  RTF2XML     DTD
  Element     Element    Attribute   Value
  -           LIST (1)   Type        Ord or Unord
  PNTEXT/P    ITEM       -           -
  FOOTNOTE    NOTE       -           -

  (1) Wrapped around the ITEM elements.

Definitions:
PNTEXT/P = element name that defines an item in a list
FOOTNOTE = element name that defines a footnote
LIST = element name that defines a list
ITEM = element name that defines an item
NOTE = element name that defines a note
In step 527, all remaining XML code that has no equivalent in the DTD is removed.



  Second part (step 56)
Now part 1 is finished and part 2 can start. Part 2 (step 56) comprises the following subparts:
- part 2a: basic filter to determine variable values, figure 6;
- part 2b-1: basic filter to replace the suggested text class (suggest: raw attribute value) with the exact text class, figure 7;
- part 2b-2: basic filter to create a "train" data set, figure 8;
- part 2b-3: basic filter to train the classification algorithm with the train data set, figure 9;
- part 2c-1: basic filter to create a "new" data set, figure 10;
- part 2c-2: basic filter to classify the text parts in the new data set with the trained classification algorithm, figure 11;
- part 2c-3: basic filter to replace the suggested class attribute for the TEXT element with the classified text class, figure 12.



  As shown in figure 6, part 2a starts in step 530 with the output of Part 1 (i.e., figure 5).



If End Of File (EOF) is reached for the document, step 531, the program proceeds with Part 2b in case of a training session, or proceeds with Part 2c in case of a classification process, as shown in step 532.



  If EOF is not reached, the program continues with step 533. If in step 533 it is established that the current element is an ENTITY-INSERT element, the program jumps back to step 531. Else, the program continues with step 534.



  In step 534, "Extract text part n", the procedure scans the document for chunks of bytes terminated by a predetermined end of bytes (in this case an end-of-line), i.e. text parts, and for ENTITY-INSERT elements.



  In step 535, a text part (with all its current tags) is wrapped into a temporary TEXT element tag.
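Steps 534 and 535 can be sketched as follows; the end-of-line delimiter and the element names follow the text above, the rest is an assumption:

```python
def wrap_text_parts(ascii_plus):
    """Split the document at end-of-line bytes and wrap every non-empty
    chunk (with the tags it already carries) in a temporary TEXT element;
    ENTITY-INSERT lines pass through untouched."""
    wrapped = []
    for chunk in ascii_plus.split("\n"):
        if not chunk.strip():
            continue
        if chunk.lstrip().startswith("<ENTITY-INSERT"):
            wrapped.append(chunk)
        else:
            wrapped.append(f"<TEXT>{chunk}</TEXT>")
    return wrapped
```

Each resulting TEXT element is then the unit on which the variable counts of step 539 are computed.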



  In step 536, it is tested whether or not the value for the variable in the TEXT element content has been calculated for the last variable from the range of predetermined variables. If so, the program proceeds with step 537. If not, the program proceeds with step 539.



  In step 539, the content of the text part is analyzed by a separate function, which is called "CountIntegerFunctions". This function calculates values for a set of variables that are defined to describe a text part.



  As observed above, scientific articles already have an inherent structure. For example, bibliographic references have more punctuation than headings, keywords are single words or short word sequences most often separated by commas, etc. This means that it is possible to discriminate between text parts that belong to different text classes in a way that is not dependent on their content or context. By describing a text part with a set of variables (in practice this means that the content of a text part is replaced with a series of numbers) one can use statistical methods that calculate a discriminating measure between text parts. EDH uses four sets of variables that describe text parts:
Words:
1. Italic
2. Bold
3. Short words such as 'the' and 'a(n)'
4. Common words (a short thesaurus used in lexicographical analysis)
5. Affiliation type words
6. References type words
7. Heading type words

In one test, 100 scientific articles were analyzed and words that often appeared in affiliations, bibliographical references and headings, respectively, were selected. These words form short thesauri (variables 5, 6 and 7).

Characters:
1. Full-stop
2. Colon
3. Semi-colon
4. Comma
5. Upper case characters
6. Lower case characters
7. Digits
8. Brackets such as [ and {
9. Quotes
10. Math signs such as + and =
11. Others (everything that is not covered by 1-10)

Strings:
1. Position of the string (text part) in the document

Denominators:
1. Total number of characters in a string
2. Total number of words in a string
3. Total number of words in a manuscript
Sub functions in CountIntegerFunctions analyze each text part that was previously created. For each of the variables described above, CountIntegerFunctions has a sub function. This sub function counts the value of that variable in the text part and returns it.
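A minimal sketch of such per-variable sub functions is given below. The function and variable names are assumptions, and only four of the variables are shown.

```python
# Illustrative sub functions: each one counts the value of one variable
# (comma, digit, upper case, short word) in a text part and returns it.
def count_commas(text):
    return text.count(",")

def count_digits(text):
    return sum(c.isdigit() for c in text)

def count_upper(text):
    return sum(c.isupper() for c in text)

def count_short_words(text):
    return sum(w.lower() in {"the", "a", "an"} for w in text.split())

SUB_FUNCTIONS = {
    "comma": count_commas,
    "digit": count_digits,
    "upper": count_upper,
    "short_word": count_short_words,
}

def count_integer_functions(text_part: str) -> dict:
    """Replace the content of a text part with a series of numbers."""
    return {name: f(text_part) for name, f in SUB_FUNCTIONS.items()}

values = count_integer_functions("The Author, A. B., 1999")
```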



  If the last variable has been analyzed the program jumps to step 537, in which it suggests a text class value. TEXT elements contain two additional attributes, suggest:raw and suggest:classification. Suggest:raw gives a raw indication of what the class could be, based on the counting only. Suggest:classification is filled in later by means of a statistical program. Their value can, e.g., be one of the following five classes: 1. Affiliation 2. Keywords 3. Heading 4. Section 5. References
The classification algorithm classifies each text part as belonging to one of these basic classes.



  The value for suggest:classification is initially set to 'none', while CountIntegerFunctions will give an estimate for suggest:raw.



  After step 537, the program proceeds with step 538, in which the calculated variable values and the estimated suggest:raw value are assigned to attributes of the TEXT element. Then, the program returns to step 531.



  The result of steps 530 through 539 is a document that contains text parts wrapped in
TEXT tags and ENTITY-INSERT elements.



  Part 2b-1, shown in figure 7, starts with the output of part 2a via step 532.



  In step 540, "Get element TEXT", a TEXT element is obtained.



  Part 2b is intended for training purposes. Hence, the exact text class for a text part must be known. Therefore, the suggest:raw attribute value for the element from step 540 is extracted. Its value is compared to the actual content of the TEXT element. When the value is not correct it is replaced with a correct value, step 541. This is a manual procedure and has to be done only once to create a 'train' data set.
In step 542 it is checked whether the last element has been processed. If so, the program proceeds with step 543. If not, the document is saved and stored.



  Figure 8 shows a basic filter to create a "train" data set. In step 550, "Initialize train data set", a file is opened in which the train data can be stored.
In step 551 it is checked whether or not the last document has been processed. If so, the train data set file is closed, saved and stored on the file system. If not, the program continues with step 553, "Open document": the next document in the series of documents that is used to create the train data set is opened. The documents used for this purpose are, preferably, in ASCII+ format, which is the result of part 1 and part 2a.



  Step 554 checks whether the End Of File (EOF) is reached. If so, the program continues with step 551. If not, the program continues with step 555, "Get TEXT element", and obtains a next TEXT element from the document that is open.



  In step 556, variable values v and a text class value c are extracted from the attributes of the TEXT element.



  These attributes for each TEXT element are appended to the train data set file that is open, as shown in step 557.
In this manner a train data set file is created with r = n1 + n2 + ... + nm rows, where m is the number of documents and ni is the number of text parts in document i. Each row represents a text part described by v columns of variables. Column v + 1 is a column that describes the text class.



  Thus, strings are translated into vectors of numbers, e.g., the number of commas, capitals, etc. per line.



  The result of figure 8 will be used to train a classification algorithm, as shown in figure 9. In step 560, "Open train data set", the file that was created and stored in part 2b-2 is opened.



  In step 561, "Get row r", row r is retrieved from the train data set file.



  In step 562, by using or interpreting the variable values v the classification algorithm estimates a value for the text class.



  In step 563, the estimated text class value from step 562 is compared with the exact text class value that is in column v + 1 of the train data set file. This comparison is translated into an error value.



  In step 564, if the last row from the train data set is processed, the program proceeds with step 565. If not, the program returns to step 561.



  In step 565, it is checked whether or not the classification algorithm has converged, i.e. the error value calculated in step 563 does not decrease anymore or has reached a certain threshold. If so, the program continues with step 568. If not, the program continues with step 566.



  In step 568, "Save and store the train parameters", the train parameters of the trained and converged classification algorithm are saved and stored in the file system.



  In step 566, the train parameters of the classification algorithm are adjusted. Then, the program goes to the first row of the train data set file and proceeds with step 561.
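The loop of steps 561 through 566 can be summarized as follows. This is a schematic sketch, not the actual KAP implementation; the `classify` and `adjust` callbacks are placeholders for the classification algorithm and its parameter update.

```python
# Schematic training loop (figure 9): estimate a class for every train row,
# translate the comparison with the exact class into an error value, and
# adjust the train parameters until the error converges.
def train(rows, classify, adjust, params, threshold=0.01, max_epochs=100):
    prev_error = float("inf")
    for _ in range(max_epochs):
        # Steps 561-563: classify each row and count misclassifications.
        error = sum(classify(v, params) != c for v, c in rows) / len(rows)
        # Step 565: converged when the error no longer decreases or is
        # below a certain threshold; step 568 would then save `params`.
        if error <= threshold or error >= prev_error:
            return params, error
        prev_error = error
        params = adjust(params, rows)   # step 566: adjust train parameters
    return params, prev_error
```

With a toy one-variable threshold classifier the loop returns immediately when the initial parameters already separate the rows.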



  Classification algorithms suitable for carrying out the steps of figure 9 are, among others, Discriminant Analysis and Artificial Neural Networks. At KAP a basic Linear Discriminant Analysis program written in the R language was used (R is an S-Plus dialect that is distributed as public domain software by the Comprehensive R Archive Network; see http://www.ci.tuwien.ac.at/R).



  Linear Discriminant Analysis (LDA) is a statistical method to find c classes within a group of m objects described by n variables; see table 4.



  Table 4

               Variables
  Objects    Var1   Var2   ...   Var n
  Obj 1      x11    x12    ...   x1n
  Obj 2      x21    x22    ...   x2n
  ...
  Obj m      xm1    xm2    ...   xmn
In order to do this, linear combinations of the variables are created. These linear combinations are called the discriminant functions (DF), e.g.



  DF1 = w11.var1 + w12.var2 + ... + w1n.var n
  DF2 = w21.var1 + w22.var2 + ... + w2n.var n
  ...
  DFm = wm1.var1 + wm2.var2 + ... + wmn.var n

where wij, i = 1, 2, 3, ..., m; j = 1, 2, 3, ..., n, are weighing factors. In an iterative approach (that uses an eigenvalue decomposition of the object/variable matrix) those weighing factors that give an optimal discrimination between classes of objects are found. DF1 and DF2 can be plotted against each other (i.e. after filling in the values x) for all objects to visualize the discrimination, as visualized in figure 14. Figure 14 shows that groups of objects are clustered remote from other groups of objects. These different groups appear to relate to different classes.
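Evaluating the discriminant functions for one object then amounts to a set of weighted sums over its variable values. The weighing factors below are made up for illustration only.

```python
# Each DFi is a linear combination of the n variable values x with the
# trained weighing factors w[i][j].
def discriminant_values(x, w):
    """x: the n variable values of one object; w: m rows of n factors."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

w = [[0.5, -1.0],   # weighing factors of DF1
     [2.0, 0.25]]   # weighing factors of DF2
dfs = discriminant_values([4.0, 2.0], w)   # DF1 = 0.0, DF2 = 8.5
```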



  To create the train data set, the following steps may be carried out.



  Take D documents. For each document:
- extract the text parts (objects)
- calculate the value x for each of the n variables in that text part
- append the vector with variable values for this object to the train data set

Because these are train documents the text class for each object is known and can be assigned to the vector. The train data set then looks like, e.g., table 5.
Table 5

               Variables                    Text class
  Objects    Var1   Var2   ...   Var n     Class
  Obj 1      x11    x12    ...   x1n       section
  Obj 2      x21    x22    ...   x2n       section
  Obj 3      x31    x32    ...   x3n       section
  Obj 4      x41    x42    ...   x4n       keyword
  Obj 5      x51    x52    ...   x5n       keyword
  Obj 6      x61    x62    ...   x6n       references
  Obj 7      x71    x72    ...   x7n       references
  Etc.
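The train-set construction just outlined can be sketched as follows. The flat-text document format, the callable list of variables and the per-part class lookup are all assumptions of this sketch.

```python
# For each train document: extract the text parts (objects), calculate the
# value of each variable, and append the vector plus the known text class
# as one row of the train data set.
def append_to_train_set(rows, documents, variable_funcs, known_class):
    for doc in documents:
        for part in doc.split("\n"):
            if not part.strip():
                continue
            vector = [f(part) for f in variable_funcs]
            rows.append(vector + [known_class(part)])
    return rows

rows = append_to_train_set(
    [],
    ["Introduction\nSome text, with commas."],
    [len, lambda s: s.count(",")],   # two stand-in variables
    lambda part: "section",          # class is known for train documents
)
```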



  Finding weighing factors with LDA for this train data set might result in a sufficient discrimination between the text classes, as shown in figure 15. Figure 15 shows three different groups of objects relating to different classes: one relates to sections, another one to references and still another one to keywords.



  In part 2c new documents will be classified using the trained algorithm. This will be explained in detail below.



  Figure 10 shows a basic filter to create a new data set.



  In step 570, a file in which the new data can be stored is opened.



  In step 571, the document of which the text parts have to be classified into text classes is opened. The documents used for this purpose are, preferably, in ASCII+ format, which is the result of Part 1 and Part 2a.



  If in step 572, it is established that the End Of File (EOF) is reached the program proceeds with step 573. If not it continues with step 574.



  In step 574 a TEXT element is obtained from the document.



  In step 575, variable values v and a text class value c are extracted from the attributes of the TEXT element.



  In step 576, these attributes for each TEXT element are appended to the new data set file. In this manner, a new data set file is created with r' rows. Each row represents a text part described by v columns of variables. Column v + 1 is a column that describes the suggested text class.



  After the End Of File has been reached, the new data set is saved and stored, in step 573.



  Now, the r' text parts in the new data set are to be classified with the trained classification algorithm. Figure 11 shows a flow diagram to do so.



  In step 580, the new data set file is opened.



  In step 581, the train parameters file from part 2b is opened.



  In step 582, it is checked whether or not the processed row from the new data set is the last row. If not, the process continues with step 584. If so, the process goes to step 583.



  In step 584, a row from the new data set is obtained. With the classification algorithm and the train parameter configuration which was opened in step 581, the variable values in the current row are classified into a text class, as shown in step 585. This classification is done in the following way.



  For an extracted text part (object), the values xij (i = 1, 2, ..., m; j = 1, 2, ..., n) for each of the n variables in the text part are calculated. A vector comprising these variable values for this text part is appended to this new data set. The values of DF1, DF2, etc. for the text parts in this new data set are calculated using the weighing factors wij that were obtained during the training session. Now, using this vector defined by DF1, DF2, ..., it can be established in which area of the (multidimensional) space this text part is located. From the training session it is known how different text classes are located in this (multidimensional) space, cf. figure 15, which shows an example of a two-dimensional space. Then, it may, e.g., be established that the text part concerned is a section.
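Locating the (DF1, DF2, ...) vector in the space learned during training can be approximated by a nearest-centroid lookup. The class centroids below are illustrative stand-ins for the class regions found during the training session, not values from the prototype.

```python
import math

# Classify an object by finding the class whose region (represented here by
# a centroid in discriminant space) lies closest to the object's DF vector.
def classify_by_df(df_vector, centroids):
    """centroids: mapping text class -> centroid in discriminant space."""
    return min(centroids, key=lambda c: math.dist(df_vector, centroids[c]))

centroids = {
    "section": (0.0, 0.0),
    "references": (5.0, 5.0),
    "keyword": (-4.0, 3.0),
}
label = classify_by_df((4.4, 4.9), centroids)   # nearest to "references"
```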



  After step 585, the program returns to step 582.



  In step 583, the text classes that are the result of this classification are saved and stored.



  Figure 12 shows a flow diagram of a basic filter to replace the suggested class attribute for a TEXT element with the classified text class.



  In step 590, the classified text class file from part 2c-2, i.e. step 583 in figure 11, is opened. Then, in step 591, the document D is opened.



  In step 592, it is checked whether or not the TEXT element from the currently processed document D is the last one. If so, the process proceeds with step 593, i.e. it jumps to part 3 (figure 13). Else, the process continues with step 594, in which the next TEXT element is obtained from the current document D.



  In step 595, the suggest:raw attribute of the current TEXT element is replaced with the classified text class value from the classified text class file (output of step 583) and, then, the process returns to step 592.



  Third part (step 60)
Now, part 3 will be described with reference to the flow diagram of figure 13. The input of the flow diagram of figure 13 comes from step 593 in figure 12.
Each TEXT element in document D is now accompanied by a calculated text class value. In step 600, "Replace TEXT element with DTD text class element", the TEXT element tag of each text part is replaced with the corresponding DTD element tag according to the calculated class value, as shown in Table 6.



  Table 6. Elements for text classes

  Text class     DTD Element     Attribute    Value
  Affiliation    AUTHORGROUP     -            -
  Keywords       KEYWORDS        type         Keywords or JEL-code
  Heading        STITLE          -            -
  Section        PAR-F           -            -
  References     BIBUNSTRUCT     -            -
Document D has text parts that are wrapped in either one of the text class elements AUTHORGROUP, KEYWORDS, STITLE, PAR-F or BIBUNSTRUCT. According to the DTD these text class elements are sub-elements of elements higher in the hierarchy or have sub-elements themselves. Therefore, three additional actions are necessary: steps 601, 602 and 603.



  In step 601, DTD header element tags are added in accordance with the following logical structure:

ARTICLE (version="firstproof" versionno="0" aid="unknown" aid-type="unknown"
    language="english" translit="latin1" doi-id="unknown" refers-to="unknown"
    refers-to-type="unknown" prod-dept="unknown" contractor="unknown")
  PRODUCTHOOK (pid="unknown" pid-type="unknown" journalcode="unknown"
      issn="unknown" medium="unknown" volume="unknown" last-volume="unknown"
      issue="unknown" first-page="unknown" last-page="unknown"
      publisher-id="KAP" style="unknown" styleversion="unknown")
  ARTICLETYPE (type="research-article")
  EN-TITLE
  AUTHORGROUP
  (KEYWORDS (type="keywords|jel-code")
    KEYWORD+)
  STITLE+
  PAR-F+
  (BIBLIST
    BIBUNSTRUCT+ (id=NUMBER)
    PAR-S+)
The top element in the DTD is ARTICLE. The element ARTICLE has sub-elements according to the content model

<!ELEMENT article (producthook+, style-adjust*, copyright?, articletype, doctopic*, front-matter, body?, back-matter?)>
The top element ARTICLE and the header elements PRODUCTHOOK and ARTICLETYPE are required elements. Hence these are added to the document structure. The corresponding required attributes, such as version, for these elements are added as well. Their value is set to 'unknown' if no value can be given at this stage.



  The header elements STYLE-ADJUST, COPYRIGHT and DOCTOPIC are optional elements and are not inserted in the document structure.



Step 602: now that the header elements are inserted, the remaining sub-elements of the ARTICLE element should be introduced. This is actually done in step 603. However, some prerequisites are set prior to step 603. First, the EN-TITLE element of FRONT-MATTER is introduced. EN-TITLE is the main title of the document and it is introduced by taking the first text part in the document that is wrapped in an STITLE element. Then the content of the KEYWORDS element is analyzed to isolate the separate keywords. These isolated keywords are each wrapped in a KEYWORD element. Finally, the BIBUNSTRUCT elements are wrapped in one BIBLIST element.



  This filter also uses a post-classification algorithm. For each text part, CountIntegerFunctions calculated its position in the document. Dividing this value by the total number of text parts in the manuscript returns its relative position. The post-classification algorithm checks whether the text class calculated by the classification algorithm for a text part is valid with respect to its relative position. E.g., if a text part was classified as a reference but its relative position is near the start of the document, this will probably be a wrong classification. The post-classification then suggests an alternative class for this text part, which is valid.
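A hedged sketch of such a post-classification check follows. The expected position ranges and the fallback class are invented for illustration; they are not the values used in the prototype.

```python
# Assumed plausible relative-position ranges per text class; a class whose
# relative position falls outside its range is considered suspect.
EXPECTED_RANGES = {
    "affiliation": (0.0, 0.2),
    "keywords": (0.0, 0.3),
    "references": (0.7, 1.0),
}

def post_classify(text_class, position, total_parts):
    """Validate a classified text class against its relative position."""
    rel = position / total_parts
    low, high = EXPECTED_RANGES.get(text_class, (0.0, 1.0))
    if low <= rel <= high:
        return text_class
    # Implausible position: suggest the catch-all 'section' class instead.
    return "section"

# A "references" part near the start of the document is probably wrong.
suggestion = post_classify("references", 2, 100)
```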



  In step 603, final granularity is added: this program restructures the elements in such a way that each STITLE precedes its corresponding PAR-F elements. In this manner SECTION elements are generated, as defined below:

(SECTION
  STITLE
  PAR-F+)

By analyzing the content of SECTION elements and replacing the SECTION tags where appropriate, final granularity is added. E.g., initially an abstract will be classified as a section. After analyzing the content of this section it may become clear that it is an abstract, and the SECTION tags are replaced with EN-ABSTRACT tags. In addition the elements are assigned to the FRONT-MATTER, BODY or BACK-MATTER element.
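The grouping performed in step 603 can be sketched as follows. The element names are taken from the text, while the flat `(tag, text)` representation is an assumption of this sketch.

```python
# Pair each STITLE with the PAR-F elements that follow it, generating one
# SECTION per heading.
def build_sections(elements):
    """elements: list of (tag, text) tuples in document order."""
    sections, current = [], None
    for tag, text in elements:
        if tag == "STITLE":
            current = {"STITLE": text, "PAR-F": []}
            sections.append(current)
        elif tag == "PAR-F" and current is not None:
            current["PAR-F"].append(text)
    return sections

flat = [("STITLE", "Introduction"), ("PAR-F", "p1"), ("PAR-F", "p2"),
        ("STITLE", "Methods"), ("PAR-F", "p3")]
sections = build_sections(flat)   # two SECTION groups
```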



  Finally the corresponding attributes are put into place, in accordance with the following logical structure:

ARTICLE (version="firstproof" versionno="0" aid="unknown" aid-type="unknown"
    language="english" translit="latin1" doi-id="unknown" refers-to="unknown"
    refers-to-type="unknown" prod-dept="unknown" contractor="unknown")
  PRODUCTHOOK (pid="unknown" pid-type="unknown" journalcode="unknown"
      issn="unknown" medium="unknown" volume="unknown" last-volume="unknown"
      issue="unknown" first-page="unknown" last-page="unknown"
      publisher-id="KAP" style="unknown" styleversion="unknown")
  ARTICLETYPE (type="research-article")
  FRONT-MATTER
    EN-TITLE (id=NUMBER)
    AUTHORGROUP (id=NUMBER)
    (EN-ABSTRACT (id=NUMBER)
      PAR-F+ (id=NUMBER))
    (KEYWORDS (id=NUMBER type="keywords|jel-code")
      KEYWORD+ (id=NUMBER))
  BODY
    (SECTION (id=NUMBER)
      STITLE (id=NUMBER)
      PAR-F+ (id=NUMBER))+
  BACK-MATTER
    (ACKNOWLEDGMENTS (id=NUMBER)
      PAR-F+ (id=NUMBER))
    (BIBLIST (id=NUMBER)
      (BIBUNSTRUCT (id=NUMBER)
        PAR-S+ (id=NUMBER))+)

A working EDH prototype has been developed that takes a Word RTF manuscript of a scientific article and turns this without human intervention into an XML instance with high-level structure.
A working EDH prototype has been developed that takes a Word RTF Manuscript of a scientific article and turns this without human intervention into an XML instance with high-level structure.

   The data set used could be re-classified with an accuracy of 93.5%.



  The following materials and methods were used in the prototype:
Data:
- 100 MS Word manuscripts of scientific articles covering most scientific article subjects
  - containing both inserted tables and figures
  - not containing inserted equation boxes
- DTD

Software:
- Windows 95 operating system
- Apache http server, 1.3.6
- Omnimark 5.0, Windows version
- R 6.4.2, public domain Windows 95 version

Hardware:
- Intel Pentium II computer, 64 Mb RAM, approx. 6 Gb hard disk.



  Using the values for the variables as such gives a biased classification. E.g., it is obvious that a paragraph in the 'Introduction' has more lower case characters than a heading. To compensate for this the variables were scaled:
- words variable values are divided by the total number of words in a string
- characters variable values are divided by the total number of characters in a string
- the string variable value is divided by the total number of strings in a manuscript
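The scaling rules above can be sketched as follows. The variable grouping is taken from the text; the function and argument names are illustrative.

```python
# Scale raw counts so that text parts of different lengths are comparable:
# word counts by the words per string, character counts by the characters
# per string, and the string position by the strings in the manuscript.
def scale_variables(word_counts, char_counts, position,
                    n_words, n_chars, n_strings):
    return {
        "words": [v / n_words for v in word_counts],
        "chars": [v / n_chars for v in char_counts],
        "string": position / n_strings,
    }

scaled = scale_variables([2, 1], [10, 3], position=5,
                         n_words=8, n_chars=40, n_strings=50)
# e.g. 10 upper-case characters in a 40-character string scale to 0.25
```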
Training example
Classification consists of assigning a text class to a set of unclassified text parts. First the EDH classification algorithm has to be trained on a set of text parts of which the text classes are known and hence assigned. For this purpose a train data set must be generated.

   Therefore, the EDH application was run with the data set of the 100 scientific articles referred to above. 8040 text parts could be generated from this data set. The algorithms described as part 1 and part 2b appended all the Words, Characters and Strings variable values and the suggest:raw value for each text part to a file. The suggest:raw class values were manually corrected. This resulted in a train data set with 8040 text parts described by 23 variable values and an exact text class. Then, the classification algorithm was trained with this train data set. In a training session examples from the train set are offered to the classification algorithm. It calculates a value for the text class on the basis of the variable values. Then, the calculated text class value is compared to the correct text class value. If it is wrong the algorithm must adjust its parameters to improve its calculation in a next cycle. This process continues until the result cannot be improved further. Obviously, for training purposes the EDH algorithm stops here.



  The EDH prototype referred to above automatically converts a scientific article in RTF to an XML instance according to the DTD. However, the invention is not restricted to this example. The concept can be extended in several directions:
- Other formats. The majority of submitted articles are in popular formats such as DOC and WPD. These formats are easily converted to RTF and hence can be dealt with using the EDH prototype approach. However, there are large communities that deliver documents authored using LaTeX. This is especially true for documents that contain many mathematics sections. A LaTeX to XML/MathML conversion can be introduced in the EDH concept.



  - Other documents. The EDH prototype concentrates on scientific articles. However, the EDH approach is also suitable for other kinds of documents. In the same way it must be possible to use the EDH approach for documents with a DTD other than the one described here.



  The module-type architecture described shows that the EDH concept is easily adapted to conversion types other than scientific articles in RTF to XML according to the DTD.



  Therefore, the EDH concept provides a fast and elegant way to develop automatic conversion tools for electronic documents into XML that can be incorporated in the publication cycle of any publishing company.



  Summary
Above, an architecture of an electronic document handling (EDH) system has been described in detail. This EDH system automatically converts electronic manuscripts submitted as, e.g., RTF documents into, e.g., XML according to a predefined scientific article Document Type Definition (DTD). To this end, the application first strips original formatting codes from the manuscript. It then cuts the manuscript into text parts. The text parts are classified as belonging to a text class defined in, e.g., the scientific article DTD. By way of example, a Linear Discriminant Analysis (LDA) algorithm was used for this classification. A subset of a huge archive of scientific articles was taken and used as example articles to 'train' the LDA algorithm.

   In this way, the algorithm was configured to recognize structure in the example scientific articles and to assign text classes to text parts successfully. LDA is now ready to recognize structure in unclassified documents. As a result of the classification of all the text parts in text classes, the text parts were wrapped in the corresponding element tags from the DTD.



  This fast and inexpensive generation of XML documents from electronic input documents makes EDH of potential use for any publishing company. 



  References

Brugger, R., Bapst, F., and Ingold, R.: A DTD Extension for Document Structure Recognition. In: 7th International Conference on Electronic Publishing, Springer Verlag, 1998.



  Bunke, H., and Wang, P. S. P.: Handbook of Optical Character Recognition and
Document Analysis, World Scientific Publishing Company, 1997.



  Flynn, P.: Understanding SGML/XML Tools, Kluwer Academic Publishers, 1998.



  Hearst, M. A.: Text-tiling: Segmenting Text into Multi-paragraph Subtopic Paragraphs, Computational Linguistics, 23(1), 33, 1997.



  Makoto, I., and Takenobu, T.: Cluster-Based Text Categorization: A Comparison of Category Search Strategies, TR0016, Dep. Comp. Sci., Tokyo Institute of Technology, Okayama, Japan, 1995.



  Merkle, D., Content-based Document Classification with Highly Compressed Input
Data, In : International Conference on Artificial Neural Networks, 1995.



  Rijsbergen van, C. J.: Information Retrieval, 2nd Ed., London, Butterworths, 1979.



  Glossary
ASCII: American Standard Code for Information Interchange, which defines the 7-bit coded character set used on most computers. This set contains the 52 letters of the Latin alphabet (26 upper case and 26 lower case), digits 0-9 and some punctuation. ISO 646 International reference version.



  Attribute: Additional item of information about an element, contained within a start-tag, in the form of a name and a value separated by an equals sign.



  Classification: Assignment of an object to a class.



  Class: Group of related objects.



  DTD: Document Type Definition. The formal definition of the elements  (with their attributes), entities and notations, which make up a specific type of document in SGML or XML.



  EDH: Electronic Document Handling.



  Element: A named component part of the text or structure of an SGML or
XML document entity. The names given to elements and their potential location in the mark-up hierarchy are declared in the DTD.



  KAP: Kluwer Academic Publishers

KAPIS: Kluwer Academic Publishers Information System

LaTeX: (LAmport TeX) A system to prepare a document based on the TeX language, which was originally developed by L. Lamport at SRI International.



   Users may concentrate on the logical structure of a document instead of on format codes.



  Linear Discriminant Analysis (LDA):
Algorithm to classify objects described by variable vectors. A matrix with m rows (vectors) of n variables and a known class c is decomposed. The decomposition aims to find linear combinations of the n variables that result in a maximum discrimination between the known classes in the matrix. The linear combinations of variables are called the linear discriminants. Once the optimal decomposition is found for this matrix with known classes, the linear discriminants may be used to classify objects in a matrix with unknown classes.



  Mark-up: Symbols or characters used with text to indicate significance or action.



  PDF: Portable Document Format.



  RTF: Rich Text Format. Language devised by Microsoft for describing the appearance of text, so that it can be passed between different makes and models of a word processor.



  SGML: Standard Generalized Mark-up Language. ISO 8879: 1986.



  Tag: Piece of code used to indicate the start (start-tag) and the end (end tag) of an element. In both SGML and XML, tags are indicated by angled brackets.



  Text class: Class with related text parts.



  Text part: Separate document entity (to be indicated by a starting point and an end point, e.g., a predetermined end-of-byte code).

Variable: Entity to describe a specific characteristic of an object.



  XML: Extensible Mark-up Language. Fundamentally a subset of SGML that reduces some of the complexity of SGML without reducing its functionality.



   XPS: XML Production System

Claims

1. A computer arrangement for electronic document handling comprising a processor (1) and a memory (5,7,9,11) connected to the processor, the computer arrangement being arranged to carry out the following steps: (a) receiving a first document and storing said document in said memory (5,7,9,11); (b) dividing the first document in one or more text parts; (c) for each text part, calculating how many times each one of a plurality of predetermined variables is present; (d) using a predetermined classification algorithm, by using at least said times each one of said plurality of variables is present as parameters, to classify each text part as belonging to one of a plurality of predetermined text classes.
2. An arrangement according to claim 1, wherein each predetermined variable is an entity within a text part that can be distinguished from other entities.
3. An arrangement according to claim 1 or 2, wherein in step (d) also position of each text part in said first document is used.
4. An arrangement according to any of the claims 1-3, wherein said plurality of variables comprises at least two variables of the following sets of variables: Words: 1. Italic 2. Bold 3. Short words such as 'the' and 'a(n)' 4. Common words (a short thesaurus used in lexicographical analysis) 5. Affiliation type of words 6. References type of words 7. Heading type of words; Characters: 1. Full-stop 2. Colon 3. Semi-colon 4. Comma 5. Upper case characters 6. Lower case characters 7. Digits 8. Brackets such as [ and { 9. Quotes 10. Math signs such as + and = 11. Others (everything that is not covered by 1-10); Strings: 1. Position of the string in the document; Denominators: 1. Total number of characters in a string 2. Total number of words in a string 3. Total number of words in a manuscript.
5. An arrangement according to any of the claims 1 to 4, wherein said plurality of text classes comprises two or more text classes of the following set of text classes: 1. Affiliation 2. Keywords 3. Heading 4. Section 5. References.
6. An arrangement according to any of the preceding claims, wherein the predetermined algorithm provides first, raw indications of a text class for each one of the text parts based on said times each one of said plurality of variables is present only.
7. An arrangement according to any of the preceding claims, wherein the predetermined algorithm calculates a predetermined number of discriminant functions DFi in accordance with the following equations:
DF1 = w11.var1 + w12.var2 + ... + w1n.var n
DF2 = w21.var1 + w22.var2 + ... + w2n.var n
...
DFm = wm1.var1 + wm2.var2 + ... + wmn.var n
where wij, i = 1, 2, 3, ..., m; j = 1, 2, 3, ..., n, are predetermined weighing factors and vari are values of said predetermined variables; and afterwards, said predetermined algorithm determines said text class by establishing in which part of a multidimensional space a vector defined by the calculated values of said discriminant functions is located, said multidimensional space being ordered in different text classes.
8. An arrangement according to claim 7, wherein, if the predetermined algorithm has provided first, raw indications of a text class for each one of the text parts based on said times each one of said plurality of variables is present only, said first, raw indications are replaced with said text class as determined with said vector.
9. An arrangement according to any of the preceding claims, wherein each one of said classified text parts is wrapped in a predetermined one of a plurality of text element tags.
10. An arrangement according to any of the preceding claims, wherein said predetermined algorithm carries out a post-classification step in which for each text part its relative position in the first document is determined and when said relative position does not correspond to an expected relative position, said expected relative position being associated with said text class as established, another classification is established in accordance with said relative position.
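The post-classification step of claim 10 can be illustrated as a position check. The table of expected relative positions below is entirely hypothetical, chosen only to show the mechanism: if a text part's relative position falls outside the range expected for its class, it is reassigned to a class whose range does contain that position.

```python
# Hypothetical expected relative positions, as fractions of the document
# (0.0 = start, 1.0 = end); the real ranges would be chosen per document type.
EXPECTED = {
    "Heading":    (0.0, 0.2),
    "Section":    (0.1, 0.9),
    "References": (0.8, 1.0),
}

def post_classify(text_class, rel_pos):
    """Keep text_class if rel_pos falls in its expected range; otherwise
    reassign to the first class whose expected range contains rel_pos."""
    lo, hi = EXPECTED[text_class]
    if lo <= rel_pos <= hi:
        return text_class
    for other, (olo, ohi) in EXPECTED.items():
        if olo <= rel_pos <= ohi:
            return other
    return text_class  # no better match: keep the original classification
```

Under these illustrative ranges, a part classified as "Heading" but located at relative position 0.95 would be reclassified as "References".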
11. An arrangement according to any of the preceding claims, wherein said first document has a predetermined first format as identified by first formatting codes, and said arrangement is arranged to carry out the following step: converting at least some of said first formatting codes into second formatting codes from a standard format to render a second document having said standard format.
12. An arrangement according to claim 11, wherein said standard format is either SGML or XML.
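Claims 9, 11, and 12 together describe wrapping each classified text part in a text element tag and emitting a document in a standard format such as XML. A minimal sketch using Python's standard library follows; the tag names are hypothetical, since the actual tag set would come from the target SGML/XML document type definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from text classes to element tags (placeholder names).
TAGS = {"Heading": "heading", "Section": "section", "References": "references"}

def wrap(classified_parts):
    """Wrap each (text, class) pair in its element tag and serialize the
    result as a single XML document under a root element."""
    root = ET.Element("article")
    for text, text_class in classified_parts:
        ET.SubElement(root, TAGS.get(text_class, "unknown")).text = text
    return ET.tostring(root, encoding="unicode")
```

For example, wrap([("Intro", "Heading")]) yields '<article><heading>Intro</heading></article>'.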
13. A method of electronic document handling using a computer arrangement comprising a processor (1) and a memory (5,7,9,11) connected to the processor, the method comprising the following steps: (a) receiving a first document and storing said document in said memory (5,7,9,11); (b) dividing the first document into one or more text parts; (c) for each text part, calculating how many times each one of a plurality of predetermined variables is present; (d) using a predetermined classification algorithm, by using at least said times each one of said plurality of variables is present as parameters, to classify each text part as belonging to one of a plurality of predetermined text classes.
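Steps (a)-(d) of claim 13 can be chained into a small pipeline. The sketch below is illustrative only: it splits on blank lines, counts a few of the variables named in the claims (digits, upper-case characters, commas, words), and uses a toy rule-based stand-in for the claimed classification algorithm.

```python
import re

def split_parts(document):
    """Step (b): divide the document into text parts (here: blank-line blocks)."""
    return [p for p in re.split(r"\n\s*\n", document) if p.strip()]

def count_variables(part):
    """Step (c): count occurrences of a few illustrative variables."""
    return {
        "digits": sum(ch.isdigit() for ch in part),
        "upper":  sum(ch.isupper() for ch in part),
        "commas": part.count(","),
        "words":  len(part.split()),
    }

def classify(counts):
    """Step (d): toy stand-in for the claimed classification algorithm."""
    if counts["digits"] > counts["words"]:
        return "References"
    if counts["upper"] and counts["words"] <= 6:
        return "Heading"
    return "Section"

def handle(document):
    """Steps (a)-(d) chained: each text part paired with its text class."""
    return [(p, classify(count_variables(p))) for p in split_parts(document)]
```

With this toy rule set, a short all-caps block is classified as a heading, while a longer prose block falls through to the generic section class.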
14. A method according to claim 13, wherein each predetermined variable is an entity within a text part that can be distinguished from other entities.
15. A method according to claim 13 or 14, wherein in step (d) also the position of each text part in said first document is used.
16. A method according to any of the claims 13-15, wherein said plurality of variables comprises at least two variables of the following sets of variables:
Words: 1. Italic 2. Bold 3. Short words such as 'the' and 'a(n)' 4. Common words (a short thesaurus used in lexicographical analysis) 5. Affiliation type of words 6. References type of words 7. Heading type of words
Characters: 1. Full-stop 2. Colon 3. Semi-colon 4. Comma 5. Upper case characters 6. Lower case characters 7. Digits 8. Brackets such as [ and { 9. Quotes 10. Math signs such as + and = 11. Others (everything that is not covered by 1-10)
Strings: 1. Position of the string in the document
Denominators: 1. Total number of characters in a string 2. Total number of words in a string 3. Total number of words in a manuscript
17. A method according to any of the claims 13 to 16, wherein said plurality of text classes comprises two or more text classes of the following set of text classes: 1. Affiliation 2. Keywords 3. Heading 4. Section 5. References
18. A method according to any of the claims 13 through 17, wherein the predetermined algorithm provides first, raw indications of a text class for each one of the text parts based only on said times each one of said plurality of variables is present.
19. A method according to any of the claims 13 through 18, wherein the predetermined algorithm calculates a predetermined number of discriminant functions DFi in accordance with the following equations:
DF1 = w11.var1 + w12.var2 + ... + w1n.varn
DF2 = w21.var1 + w22.var2 + ... + w2n.varn
...
DFm = wm1.var1 + wm2.var2 + ... + wmn.varn
where wij, i = 1, 2, 3, ..., m, j = 1, 2, 3, ..., n, are predetermined weighing factors and vari are values of said predetermined variables; and afterwards, said predetermined algorithm determines said text class by establishing in which part of a multidimensional space a vector defined by the calculated values of said discriminant functions is located, said multidimensional space being ordered in different text classes.
20. A method according to claim 19, wherein, if the predetermined algorithm has provided first, raw indications of a text class for each one of the text parts based on said times each one of said plurality of variables is present only, said first, raw indications are replaced with said text class as determined with said vector.
21. A method according to any of the claims 13 through 20, wherein each one of said classified text parts is wrapped in a predetermined one of a plurality of text element tags.
22. A method according to any of the claims 13 through 21, wherein said predetermined algorithm carries out a post-classification step in which for each text part its relative position in the first document is determined and when said relative position does not correspond to an expected relative position, said expected relative position being associated with said text class as established, another classification is established in accordance with said relative position.
23. A method according to any of the claims 13 through 22, wherein said first document has a predetermined first format as identified by first formatting codes, and the method comprises the following step: converting at least some of said first formatting codes into second formatting codes from a standard format to render a second document having said standard format.
24. A method according to claim 23, wherein said standard format is either SGML or XML.
25. A computer program product for electronic document handling, that can be loaded by a computer arrangement comprising a processor (1) and a memory (5,7,9,11) connected to the processor, the computer program after being loaded providing said computer arrangement with the capacity to carry out the following steps: (a) receiving a first document and storing said document in said memory (5,7,9,11); (b) dividing the first document into one or more text parts; (c) for each text part, calculating how many times each one of a plurality of predetermined variables is present; (d) using a predetermined classification algorithm, by using at least said times each one of said plurality of variables is present as parameters, to classify each text part as belonging to one of a plurality of predetermined text classes.
26. A data carrier provided with a computer program product according to claim 25.
*****
PCT/NL2001/000013 2001-01-10 2001-01-10 System and method for electronic document handling WO2002056197A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/NL2001/000013 WO2002056197A1 (en) 2001-01-10 2001-01-10 System and method for electronic document handling

Publications (1)

Publication Number Publication Date
WO2002056197A1 true WO2002056197A1 (en) 2002-07-18

Family

ID=19760733

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NL2001/000013 WO2002056197A1 (en) 2001-01-10 2001-01-10 System and method for electronic document handling

Country Status (1)

Country Link
WO (1) WO2002056197A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2336698A (en) * 1998-04-24 1999-10-27 Dialog Corp Plc The Automatic content categorisation of text data files using subdivision to reduce false classification
US5983170A (en) * 1996-06-25 1999-11-09 Continuum Software, Inc System and method for generating semantic analysis of textual information
WO2000026795A1 (en) * 1998-10-30 2000-05-11 Justsystem Pittsburgh Research Center, Inc. Method for content-based filtering of messages by analyzing term characteristics within a message

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008028018A1 (en) 2006-08-30 2008-03-06 Amazon Technologies, Inc. Automated classification of document pages
JP2010503075A (en) * 2006-08-30 2010-01-28 アマゾン テクノロジーズ,インク. Automatic classification of document pages
US8306326B2 (en) 2006-08-30 2012-11-06 Amazon Technologies, Inc. Method and system for automatically classifying page images
US9594833B2 (en) 2006-08-30 2017-03-14 Amazon Technologies, Inc. Automatically classifying page images

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP