WO2017009900A1

WO2017009900A1 - Document processing system and document processing method

Info

Publication number: WO2017009900A1
Application number: PCT/JP2015/069910
Authority: WO
Inventors: 高橋　寿一; 真岩山; 新庄　広; 義行小林; 太亮尾崎; 久雄間瀬; 彬童
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2015-07-10
Filing date: 2015-07-10
Publication date: 2017-01-19
Anticipated expiration: 2018-01-10
Also published as: JPWO2017009900A1; JP6496025B2

Abstract

In this document processing system, a storage area stores a learning dictionary in which are registered the physical structure of an element included in a paragraph to be extracted from document data and the information set present in the region of that element. A processor divides the document data into at least one region; acquires the information set present in each divided region; calculates a physical structure match degree indicating how closely the region containing the acquired information set matches the physical structure of the element registered in the learning dictionary; calculates an information set match degree indicating how closely the acquired information set matches the information set registered in the learning dictionary; determines, based on the calculated physical structure match degree and information set match degree, whether the divided region is a paragraph to be extracted; and, if it is, registers the acquired information set in the learning dictionary.

Description

Document processing system and document processing method

The present invention relates to a document processing system that extracts information from document data.

Document data contains lists in which at least one element forms a set. For example, a delivery specification contains multiple elements such as a code number, a product name, and specifications. There is demand to extract the information in such lists and store it in a database, for purposes such as checking inventory or managing defect information.

Documents to be stored in a database are either structured or unstructured. A structured document is one in which logical elements such as the title, chapters and sections, body text, tables, and figures are laid out in a predetermined structure. For such a document, the information of each logical element can be extracted simply by following that structure, so extracting its elements is easy.

An unstructured document, by contrast, does not lay out its logical elements in a predetermined structure. Extracting the elements of such a document requires analyzing its text.

Japanese Patent Laid-Open No. 11-250041 (hereinafter, Cited Document 1) describes a technique for extracting the information of each element of an unstructured document. The publication states: "Composed of means 1 for extracting layout objects and structure from a document image; means 3 for extracting logical objects such as paragraphs, lists, mathematical formulas, programs, and annotations, based on typography, from the text regions extracted from the document image; means 5 for extracting multiple possible reading orders among the objects; and means 4 for extracting a logical structure by applying a predefined model to the logical objects. By extracting primary and secondary information even from diverse multi-page documents composed of characters, photographs, graphics, tables, and the like, and making it convertible to various electronic formats, it enables automatic construction of document management systems and effective use of various computer applications" (see the abstract).

Japanese Patent Laid-Open No. 11-250041

The technique described in Patent Document 1 requires that a layout model for extracting information from a document image, key words, and the like be defined for each document.

Accordingly, an object of the present invention is to provide a document processing apparatus that extracts information element by element from an unstructured document without a layout model, key words, and the like for extracting information from a document image having to be defined in advance for each document.

A representative example of the present invention is a document processing system that extracts information sets from input document data and comprises a processor and a storage area. The storage area stores a learning dictionary in which are registered the physical structure of an element, which is a region that is included in a paragraph to be extracted from the document data and in which information to be extracted exists, and the information set present in the region of that element. The processor divides the input document data into at least one region and acquires the information set present in that region; calculates a physical structure match degree indicating the degree of match between the region in which the acquired information set exists and the physical structure of the element registered in the learning dictionary; calculates an information set match degree indicating the degree of match between the acquired information set and the information set registered in the learning dictionary; and determines, based on the calculated physical structure match degree and information set match degree, whether the region is the paragraph to be extracted. When the region is the paragraph to be extracted, the processor registers the acquired information set in the learning dictionary as an information set present in the region of the element corresponding to the information set for which the two match degrees were calculated, and stores in the storage area output information that associates the acquired information set with that element.

The effect obtained by a representative aspect of the invention disclosed in this application is, briefly, as follows: it is possible to provide a document processing apparatus that extracts information element by element from an unstructured document without key words and the like for extracting information from document data having to be defined in advance for each document.

Problems, configurations, and effects other than those described above will become clear from the following description of the embodiments.

A block diagram of the document processing system of this embodiment.
An explanatory diagram outlining the paragraph designation reception process of this embodiment.
An explanatory diagram outlining the document processing of this embodiment.
An explanatory diagram of the block dictionary of this embodiment.
An explanatory diagram of the repetition table of this embodiment.
A flowchart outlining the paragraph designation reception process and the document processing of this embodiment.
A flowchart of the paragraph designation reception process of this embodiment.
A flowchart of the repeated portion extraction process of this embodiment.
A flowchart of the repetition candidate portion extraction process of this embodiment.
A flowchart of the element match determination process of this embodiment.
An explanatory diagram of the process of dividing the region of the paragraph designated by the user using the X-Y cut method in this embodiment.
An explanatory diagram of the process of dividing the entire region of the document data using the X-Y cut method in this embodiment.
A flowchart of the match degree calculation process of this embodiment.
An explanatory diagram of the process of dividing the entire region of the document data using the X-Y cut method in the document processing of this embodiment.
An explanatory diagram of the repeated portion display screen of this embodiment.
An explanatory diagram of the processing result display screen of this embodiment.

FIG. 1 is a block diagram of the document processing system 100 of this embodiment.

The document processing system 100 extracts, element by element, the information sets contained in input document data (for example, a document image). An information set is a concept that covers the character strings, figures, and the like to be extracted.

The document processing system 100 has a processor 101, a memory 102, a secondary storage device 103, an input interface (IF) 104, an output interface (IF) 105, and a network interface 106, which are connected to one another via a bus 107.

The processor 101 executes various kinds of arithmetic processing. The secondary storage device 103 is a non-volatile, non-transitory storage medium that stores various programs and data. The memory 102 is a volatile, temporary storage medium into which the programs and data stored in the secondary storage device 103 are loaded; the processor 101 executes the loaded programs and reads and writes the loaded data.

Input devices such as a keyboard and a mouse are connected to the input IF 104. An output device such as a display is connected to the output IF 105. The network IF 106 is an interface that connects the document processing system 100 to a network.

The processor 101 has a paragraph designation reception processing unit 111 and a document processing unit 112.

The paragraph designation reception processing unit 111 executes the paragraph designation reception process. In this process, the unit accepts from the user, for a given piece of document data, the designation of a paragraph and of the elements in that paragraph; registers information on the designated paragraph and elements in the block dictionary 121; and registers the information present in those elements in the word dictionary 122. The process also extracts the regions of the same document data in which the designated paragraph repeats (repeated portions), registers information on the extracted repeated portions and the elements they contain in the output DB 123, and registers the information sets present in those elements in the word dictionary 122. Details of the paragraph designation reception process are described with reference to FIG. 6 and FIGS. 7 to 12.

The document processing unit 112 executes document processing. Referring to the block dictionary 121 and the word dictionary 122, document processing takes document data other than the data on which the paragraph designation reception process was executed, extracts from the regions corresponding to the designated paragraph the elements corresponding to the designated elements, and extracts the information present in those elements. Details of the document processing are described with reference to FIGS. 6, 13, and 14.

Programs corresponding to the paragraph designation reception processing unit 111 and the document processing unit 112 are stored in the memory 102, and the processor 101 implements the two units by executing these programs.

The memory 102 stores a block dictionary 121, a word dictionary 122, an output database (DB) 123, and a repetition table 124.

The block dictionary 121 registers the physical structure of each paragraph that contains elements to be extracted, together with the physical structure and logical structure of the elements in that paragraph. The physical structure of a paragraph is, for example, information from which the area of the paragraph can be determined. The physical structure of an element is its position within the paragraph, and the logical structure of an element is, for example, the notation format and attribute of the information present in it. Details of the block dictionary 121 are described with reference to FIG. 4.

The word dictionary 122 registers each element in association with the information sets (words, figures, and the like) present in it. The output DB 123 registers each element extracted from document data in association with the information sets present in it. The repetition table 124 registers information on the repeated portions of the paragraph designated in the paragraph designation reception process. Details of the repetition table 124 are described with reference to FIG. 5.

FIG. 2 is an explanatory diagram outlining the paragraph designation reception process of this embodiment.

First, the paragraph designation reception processing unit 111 accepts from the user the designation of a paragraph to be extracted from the input document data (201). Designating a paragraph means designating its region: for a rectangular paragraph, for example, the upper-left and lower-right coordinates are designated, and for a polygonal paragraph, the coordinates of each vertex.

On accepting the paragraph designation, the paragraph designation reception processing unit 111 registers the physical structure of the designated paragraph in the block dictionary 121 (202).

The paragraph designation reception processing unit 111 then accepts the designation of the elements contained in the designated paragraph (203). Designating an element means designating its region: for a rectangular element, for example, the upper-left and lower-right coordinates are designated, and for a polygonal element, the coordinates of each vertex.

On accepting an element designation, the paragraph designation reception processing unit 111 registers the physical structure of the designated element in the block dictionary 121. It also extracts the information set present in the designated element, identifies the element's notation format and attribute from the extracted information set, and registers them in the block dictionary 121 as the element's logical structure. It then registers the extracted information set in the word dictionary 122 and the output DB 123 (204). In FIG. 2, the information sets present in the elements are character strings such as "055-3223", "handbag", and "In this bag...". Besides character strings, an element may contain figures and the like; an information set is a concept that covers character strings, figures, and the like.

After step 204, "055-3223" is registered under element 1 of the word dictionary 122 and "handbag" under element 2. In the output DB 123, "055-3223" is registered under element 1, "handbag" under element 2, and "In this bag..." under element 3.
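The registration in step 204 can be sketched in Python as follows. This is a minimal illustration only: the dictionaries keyed by element number, the `register` function, and the `dictionary_elements` parameter are hypothetical names, not part of the described system. The text does not say why element 3 is absent from the word dictionary 122, so the sketch simply takes the set of elements to register there as an input.

```python
from collections import defaultdict

def register(word_dict, output_db, extracted, dictionary_elements):
    """Append every extracted information set to the output DB, and to the
    word dictionary only for the elements listed in dictionary_elements."""
    for element, info in extracted.items():
        output_db[element].append(info)
        if element in dictionary_elements:
            word_dict[element].append(info)

word_dict = defaultdict(list)   # stand-in for the word dictionary 122
output_db = defaultdict(list)   # stand-in for the output DB 123

# Information sets of the paragraph designated in FIG. 2.
extracted = {1: "055-3223", 2: "handbag", 3: "In this bag..."}
register(word_dict, output_db, extracted, dictionary_elements={1, 2})
```

Calling `register` again with the information sets of a repeated portion would append to the same lists, which matches the way later steps add entries to both stores.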

Next, the paragraph designation reception processing unit 111 extracts the repeated portions of the designated paragraph, based on the physical structure and logical structure of the designated paragraph registered in the block dictionary 121 (205). A repeated portion is a region whose physical structure is similar to that of the designated paragraph and that contains elements whose physical structure is similar to that of the designated elements. In FIG. 2, the region for "boots" and the region for "gloves", both located below the designated paragraph, are extracted as repeated portions.

The paragraph designation reception processing unit 111 extracts the information set present in each element of the extracted repeated portions and registers it in the word dictionary 122 and the output DB 123 (206).

After step 206, "015-0001" and "015-0149" have been added to element 1 of the word dictionary 122, and "boots" and "gloves" to element 2. In the output DB 123, "015-0001" and "015-0149" have been added to element 1, "boots" and "gloves" to element 2, and "Made of vinyl..." and "Uses yarn..." to element 3.

The user therefore only has to designate one paragraph and the elements it contains for the repeated portions of that paragraph to be extracted; there is no need to designate each repeated portion individually. Moreover, because the information in the elements of the extracted repeated portions is registered in the word dictionary 122 automatically, the user is spared the work of building the word dictionary 122.

FIG. 3 is an explanatory diagram outlining the document processing of this embodiment.

Document processing extracts the information sets present in the designated elements from document data other than the data handled by the paragraph designation reception process.

First, the document processing unit 112 extracts from the input document data the extraction region candidates for the paragraph registered in the block dictionary 121 (301). An extraction region candidate is a region whose physical structure is similar to that of the designated paragraph.

Next, referring to the block dictionary 121 and the word dictionary 122, the document processing unit 112 calculates the total match degree between the designated paragraph and each extraction region candidate and, if the calculated total match degree is at or above a threshold, extracts the information sets present in the elements of that candidate (302). The total match degree is calculated from three component match degrees: (i) the match between the in-paragraph positions of the elements of the designated paragraph and the positions, within the candidate, of the information sets the candidate contains; (ii) the match between the information sets corresponding to the elements of the designated paragraph and the information sets in the candidate; and (iii) the match between the notation formats and attributes corresponding to the elements of the designated paragraph and those of the information sets in the candidate.
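The threshold decision in step 302 can be illustrated with a short Python sketch. The passage above names the three component match degrees but does not give the formula that combines them, so the weighted average, the default weights, the threshold value, and the function names below are all assumptions made for illustration.

```python
def total_match(position, information, notation, weights=(1.0, 1.0, 1.0)):
    """Combine three component match degrees, each in [0, 1], into one score.

    A weighted average is assumed here; the text only says that the total
    is calculated "based on" the three components.
    """
    scores = (position, information, notation)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def should_extract(position, information, notation, threshold=0.6):
    """Return True when the total match degree reaches the (assumed) threshold."""
    return total_match(position, information, notation) >= threshold

# A candidate whose layout and notation match but whose words are all new,
# loosely modelled on the "folding umbrella" region of FIG. 3.
umbrella_ok = should_extract(position=1.0, information=0.0, notation=1.0)
```

With these assumed values the candidate is accepted even though none of its words are in the word dictionary, because the position and notation components carry the score over the threshold.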

The document processing unit 112 then registers the information sets extracted in step 302 in the word dictionary 122 and the output DB 123 (303).

"Folding umbrella" in the document data shown in FIG. 3 is not among the entries registered in the word dictionary 122 in FIG. 2, but because the total match degree of the "folding umbrella" region is at or above the threshold, the information sets present in each element of that region are extracted. "331-0120" is added to element 1 of the word dictionary 122 and "folding umbrella" to element 2. In the output DB 123, "331-0120" is added to element 1, "folding umbrella" to element 2, and "One push..." to element 3.

In this way, merely by designating a paragraph and its elements, the user can also extract the information sets present in the corresponding elements of other document data, so the desired elements can be extracted from unstructured documents with less work for the user.

FIG. 4 is an explanatory diagram of the block dictionary 121 of this embodiment.

The block dictionary 121 includes a paragraph table 400, element tables 410, and a coordinate table 420.

The paragraph table 400 includes an upper-left coordinate 401, a lower-right coordinate 402, a partial region count 403, an element count 404, and pointers to elements 405.

The upper-left coordinate 401 holds the upper-left coordinate of the paragraph designated by the user, and the lower-right coordinate 402 holds its lower-right coordinate. The partial region count 403 holds the number of regions contained in the designated paragraph. The element count 404 holds the number of elements the user designated in that paragraph. The pointers to elements 405 hold a pointer to each element table 410.

If the paragraph designated by the user is a polygon, the paragraph table 400 registers the coordinates of each vertex of the polygon instead of the upper-left coordinate 401 and the lower-right coordinate 402.

The block dictionary 121 contains as many element tables 410 as the number registered in the element count 404 of the paragraph table 400. Each element table 410 includes polygon coordinates 411, center coordinates 412, a notation format 413, and an attribute 414.

The polygon coordinates 411 hold a pointer to the coordinate table 420, in which the vertex coordinates of the element designated by the user are registered. The center coordinates 412 hold the center coordinates of the designated element, expressed with the center of the paragraph containing the element as the origin. Any value from which the element's position within the paragraph can be determined may be used in place of the center coordinates.
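As an illustration of the center coordinates 412, the following Python sketch expresses an element's center relative to the center of its paragraph. The text does not say how the center of a polygon is computed; the sketch assumes the center of the bounding box, and the function names and sample coordinates are hypothetical.

```python
def bbox_center(vertices):
    """Center of the axis-aligned bounding box of a list of (x, y) vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)

def relative_center(paragraph_vertices, element_vertices):
    """Element center with the paragraph center taken as the origin."""
    px, py = bbox_center(paragraph_vertices)
    ex, ey = bbox_center(element_vertices)
    return (ex - px, ey - py)

# A rectangle can be given by its upper-left and lower-right corners alone.
paragraph = [(0, 0), (200, 100)]
element = [(10, 10), (70, 30)]
print(relative_center(paragraph, element))  # (-60.0, -30.0)
```

Because the result is relative to the paragraph rather than to the page, the same value can be compared against a repeated portion or a region of another document that sits at a different page position.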

The notation format 413 holds the notation format of the information set present in the element designated by the user. A notation format is expressed as, for example, "NNN-NNNN", "AAAAAAAAA", or "Text". In "NNN-NNNN", each "N" stands for any single digit and "-" for a hyphen. In "AAAAAAAAA", each "A" stands for any letter of the alphabet. "Text" means that some text is present as the information set.
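One way to derive such a notation format from a character string is sketched below in Python. The text defines what "N", "A", "-", and "Text" mean but not how a string is mapped to a format, so the rule used here (fall back to "Text" as soon as any other character appears) and the function name `notation_format` are assumptions. The sketch also assumes ASCII input.

```python
def notation_format(text):
    """Map ASCII digits to 'N' and ASCII letters to 'A', keep hyphens,
    and fall back to 'Text' for anything else (assumed rule)."""
    pattern = []
    for ch in text:
        if ch.isascii() and ch.isdigit():
            pattern.append("N")
        elif ch.isascii() and ch.isalpha():
            pattern.append("A")
        elif ch == "-":
            pattern.append("-")
        else:
            return "Text"
    return "".join(pattern) if pattern else "Text"

code_format = notation_format("055-3223")
```

Two strings with the same notation format, such as "055-3223" and "331-0120", can then be treated as matching in form even when the strings themselves differ.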

The attribute 414 holds the attribute of the information set present in the element designated by the user. An attribute is expressed as, for example, "uppercase alphabetic string", "text", or "image". "Uppercase alphabetic string" means that the information set present in the element is a character string made up of uppercase letters of the alphabet. "Text" means that the information set present in the element is some text. "Image" means that the information set present in the element is an image.

The information sets that the word dictionary 122 registers for an element may instead be registered as one entry of the element table 410. That is, the word dictionary 122 may be included in the block dictionary 121; the block dictionary 121 and the word dictionary 122 are collectively called the learning dictionary.
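The tables of FIG. 4 could be modelled roughly as the Python data classes below. This is a sketch under assumptions: the class and field names are hypothetical, pointers are replaced by ordinary object references, the coordinate table 420 is folded into a list of vertices, and the optional `words` field reflects the variant in which the word dictionary entries are held in the element table. The sample values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class ElementTable:                 # element table 410
    polygon: List[Point]            # 411: vertices (coordinate table 420)
    center: Point                   # 412: relative to the paragraph center
    notation_format: str            # 413: e.g. "NNN-NNNN" or "Text"
    attribute: str                  # 414: e.g. "text" or "image"
    words: List[str] = field(default_factory=list)  # optional word entries

@dataclass
class ParagraphTable:               # paragraph table 400
    upper_left: Point               # 401
    lower_right: Point              # 402
    partial_region_count: int       # 403
    elements: List[ElementTable] = field(default_factory=list)  # 405

    @property
    def element_count(self) -> int:  # 404, derived from the element list
        return len(self.elements)

description = ElementTable(
    polygon=[(10, 40), (190, 40), (190, 90), (10, 90)],
    center=(0.0, 15.0),
    notation_format="Text",
    attribute="text",
)
paragraph = ParagraphTable((0, 0), (200, 100), partial_region_count=3,
                           elements=[description])
```

Deriving the element count from the list, rather than storing it separately, is a design choice of this sketch that keeps the two values from drifting apart.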

FIG. 5 is an explanatory diagram of the repetition table 124 of this embodiment.

The repetition table 124 includes a list table 500, a paragraph table 510, an element table 520, and a coordinate table 530.

The list table 500 includes a repetition count 501 and pointers to paragraph tables 502.

The repetition count 501 holds the number of regions extracted as repeated portions. The pointers to paragraph tables 502 hold pointers to the paragraph tables 510, in which information on the regions extracted as repeated portions is registered.

The paragraph table 510 is the same as the paragraph table 400 shown in FIG. 4, except that it registers information on a region extracted as a repeated portion, so its description is omitted.

The element table 520 is the same as the element table 410 shown in FIG. 4, except that it registers information on the elements contained in a region extracted as a repeated portion, so its description is omitted. The coordinate table 530 registers the vertex coordinates of the elements contained in the regions extracted as repeated portions.

　図６は、本実施例の段落指定受付処理の概略及び文書処理の概略のフローチャートである。 FIG. 6 is a flowchart of the outline of the paragraph designation receiving process and the outline of the document process of this embodiment.

　まず、段落指定受付処理の概略について説明する。 First, the outline of the paragraph designation receiving process will be described.

　段落指定受付処理部１１１は、文書データの入力を受け付ける（６０１）。段落指定受付処理部１１１は、入力を受け付けた文書データの領域からユーザによる段落の指定を受け付け、指定を受け付けた段落に含まれる要素の指定を受け付ける（６０２）。 The paragraph designation reception processing unit 111 receives input of document data (601). The paragraph designation acceptance processing unit 111 accepts designation of a paragraph by the user from the area of the document data that has been accepted, and accepts designation of an element included in the paragraph that has received the designation (602).

　次に、段落指定受付処理部１１１は、指定を受け付けた段落の物理構造、及び指定を受け付けた要素の物理構造を解析する（６０３）。具体的には、段落指定受付処理部１１１は、指定を受け付けた段落の左上座標及び右下座標を特定する。また、段落指定受付処理部１１１は、指定を受け付けた要素の各頂点の座標及び要素の中心座標を特定する。 Next, the paragraph designation reception processing unit 111 analyzes the physical structure of the designated paragraph and the physical structure of the designated elements (603). Specifically, the paragraph designation reception processing unit 111 identifies the upper left coordinates and the lower right coordinates of the designated paragraph. In addition, the paragraph designation reception processing unit 111 identifies the coordinates of each vertex and the center coordinates of each designated element.

　次に、段落指定受付処理部１１１は、指定を受け付けた要素に存在する情報集合の論理構造を解析する（６０４）。具体的には、段落指定受付処理部１１１は、指定を受け付けた要素に存在する情報集合の表記形式及び属性を特定する。 Next, the paragraph designation reception processing unit 111 analyzes the logical structure of the information set existing in the element that has received the designation (604). Specifically, the paragraph designation reception processing unit 111 identifies the notation format and attributes of the information set existing in the element that has received the designation.

　次に、段落指定受付処理部１１１は、ステップ６０３の処理で解析した段落及び要素の物理構造、並びにステップ６０４の処理で解析した要素の論理構造をブロック辞書１２１に登録する（６０５）。具体的には、段落指定受付処理部１１１は、ブロック辞書１２１の段落テーブル４００の左上座標４０１に、ステップ６０３の処理で特定した段落の左上座標を登録し、右下座標４０２に、ステップ６０３の処理で特定した段落の右下座標を登録する。また、段落指定受付処理部１１１は、段落テーブル４００の要素数４０４に指定を受け付けた要素の数を登録し、要素へのポインタ４０５に各要素テーブル４１０へのポインタを登録する。 Next, the paragraph designation reception processing unit 111 registers, in the block dictionary 121, the physical structure of the paragraph and elements analyzed in step 603 and the logical structure of the elements analyzed in step 604 (605). Specifically, the paragraph designation reception processing unit 111 registers the upper left coordinates of the paragraph identified in step 603 in the upper left coordinates 401 of the paragraph table 400 of the block dictionary 121, and registers the lower right coordinates of the paragraph identified in step 603 in the lower right coordinates 402. In addition, the paragraph designation reception processing unit 111 registers the number of designated elements in the element number 404 of the paragraph table 400, and registers a pointer to each element table 410 in the pointer 405 to the element.

　また、段落指定受付処理部１１１は、要素テーブル４１０の多角形座標４１１に、ステップ６０３の処理で特定した要素の頂点の座標を登録し、中心座標４１２に、ステップ６０３の処理で特定した要素の中心座標を登録する。また、段落指定受付処理部１１１は、要素テーブル４１０の表記形式４１３にステップ６０４の処理で特定した要素の表記形式を登録し、属性４１４にステップ６０４の処理で特定した要素の属性を登録する。 In addition, the paragraph designation reception processing unit 111 registers the coordinates of the vertices of the element identified in step 603 in the polygon coordinates 411 of the element table 410, and registers the center coordinates of the element identified in step 603 in the center coordinates 412. The paragraph designation reception processing unit 111 also registers the notation format of the element identified in step 604 in the notation format 413 of the element table 410, and registers the attribute of the element identified in step 604 in the attribute 414.

　次に、段落指定受付処理部１１１は、全ての繰り返し部分に対してステップ６０７～６１０の処理を実行する（６０６）。まず、段落指定受付処理部１１１は、繰り返し部分の各要素の情報集合を抽出する（６０７）。そして、段落指定受付処理部１１１は、ステップ６０７の処理で抽出した情報集合を単語辞書１２２に登録し（６０８）、ステップ６０７の処理で抽出した情報集合を出力ＤＢ１２３に登録する（６０９）。そして、全ての繰り返し部分に対してステップ６０７～６０９の処理が実行されていれば、段落指定受付処理を終了し、全ての繰り返し部分に対してステップ６０７～６０９の処理が実行されていなければ、ステップ６０６の処理に戻る。 Next, the paragraph designation reception processing unit 111 executes the processing of steps 607 to 610 for all repeated portions (606). First, the paragraph designation reception processing unit 111 extracts the information set of each element of the repeated portion (607). Then, the paragraph designation reception processing unit 111 registers the information set extracted in step 607 in the word dictionary 122 (608), and registers the same information set in the output DB 123 (609). If the processing of steps 607 to 609 has been executed for all repeated portions, the paragraph designation reception process ends; otherwise, the processing returns to step 606.

　次に、文書処理について説明する。 Next, document processing will be described.

　まず、文書処理部１１２は、文書データの入力を受け付け（６１１）、入力された文書データの最初のページに移動する（６１２）。次に、文書処理部１１２は、入力された文書データから文字及び線分の情報を抽出する（６１３）。次に、文書処理部１１２は、入力された文書データの全体の領域を、Ｘ－Ｙカット法を用いて少なくとも一つの領域に分割する（６１４）。Ｘ－Ｙカット法を用いた領域の分割の詳細については図１４で説明する。 First, the document processing unit 112 accepts input of document data (611), and moves to the first page of the input document data (612). Next, the document processing unit 112 extracts character and line segment information from the input document data (613). Next, the document processing unit 112 divides the entire area of the input document data into at least one area using the XY cut method (614). Details of the area division using the XY cut method will be described with reference to FIG. 14.

　次に、文書処理部１１２は、ブロック辞書１２１を参照して、抽出領域候補抽出処理を実行する（６１５）。抽出領域候補抽出処理は、ブロック辞書１２１に登録された段落に類似する領域を抽出領域候補として抽出する処理である Next, the document processing unit 112 refers to the block dictionary 121 and executes extraction region candidate extraction processing (615). The extraction area candidate extraction process is a process of extracting an area similar to a paragraph registered in the block dictionary 121 as an extraction area candidate.

　次に、文書処理部１１２は、ブロック辞書１２１及び単語辞書１２２を参照して、一致度算出処理を実行する（６１６）。一致度算出処理は、繰り返し候補部分の要素とブロック辞書１２１に登録された要素との一致度を算出し、一致度が閾値以上であれば、繰り返し候補部分の要素に存在する情報集合を抽出し、抽出した情報集合を要素毎に出力ＤＢ１２３に登録する処理であり、詳細は図１３で説明する。 Next, the document processing unit 112 refers to the block dictionary 121 and the word dictionary 122 and executes a matching degree calculation process (616). The matching degree calculation process calculates the degree of matching between an element of a repetition candidate part and an element registered in the block dictionary 121 and, if the degree of matching is equal to or greater than a threshold, extracts the information set existing in the element of the repetition candidate part and registers the extracted information set in the output DB 123 for each element. Details will be described with reference to FIG. 13.

　次に、文書処理部１１２は、抽出した領域と抽出した要素の情報集合の領域を含む処理結果表示画面１６００（図１６参照）を表示し（６１７）、文書処理を終了する。処理結果表示画面１６００の詳細は図１６で説明する。 Next, the document processing unit 112 displays a processing result display screen 1600 (see FIG. 16) that includes the extracted area and the areas of the information sets of the extracted elements (617), and ends the document processing. Details of the processing result display screen 1600 will be described with reference to FIG. 16.

　次に、段落指定受付処理の詳細について図７～図１２を用いて説明する。 Next, details of the paragraph designation reception process will be described with reference to FIGS. 7 to 12.

　図７は、本実施例の段落指定受付処理のフローチャートである。 FIG. 7 is a flowchart of the paragraph designation receiving process of this embodiment.

　まず、段落指定受付処理部１１１は、ユーザからの段落指定受付処理の実行要求の入力を受け付け、段落指定受付処理用のユーザインタフェースを起動し（７０１）、文書データの入力を受け付ける（７０２）。そして、段落指定受付処理部１１１は、ブロック辞書１２１を初期化する（７０３）。 First, the paragraph designation acceptance processing unit 111 accepts an input of an execution request for a paragraph designation acceptance process from a user, activates a user interface for the paragraph designation acceptance process (701), and accepts input of document data (702). Then, the paragraph designation reception processing unit 111 initializes the block dictionary 121 (703).

　次に、段落指定受付処理部１１１は、入力された文書データに繰り返し部分があるか否かの入力をユーザから受け付け、ユーザから受け付けた入力が文書データに繰り返し部分があることを示すか否かを判定する（７０４）。 Next, the paragraph designation reception processing unit 111 receives an input from the user as to whether or not the input document data includes a repeated portion, and determines whether or not the received input indicates that the document data includes a repeated portion (704).

　ステップ７０４の処理で、ユーザから受け付けた入力が文書データに繰り返し部分がないことを示すと判定された場合（７０４：Ｎｏ）、段落指定受付処理部１１１は段落指定受付処理を終了する。 If it is determined in step 704 that the input received from the user indicates that there is no repetitive portion in the document data (704: No), the paragraph designation acceptance processing unit 111 ends the paragraph designation acceptance process.

　一方、ステップ７０４の処理で、ユーザから受け付けた入力が文書データに繰り返し部分があることを示すと判定された場合（７０４：Ｙｅｓ）、段落指定受付処理部１１１は、入力された画像データの領域のうちユーザが抽出を所望する段落の領域の指定を受け付ける（７０５）。例えば、ユーザは、マウス等を用いて段落の領域を指定する。 On the other hand, if it is determined in step 704 that the input received from the user indicates that the document data includes a repeated portion (704: Yes), the paragraph designation reception processing unit 111 accepts designation of the paragraph area that the user desires to extract from the area of the input image data (705). For example, the user designates the paragraph area using a mouse or the like.

　次に、段落指定受付処理部１１１は、ステップ７０５の処理で指定を受け付けた段落の左上座標及び右下座標を段落テーブル４００の左上座標４０１及び右下座標４０２に登録する（７０６）。 Next, the paragraph designation reception processing unit 111 registers the upper left coordinates and lower right coordinates of the paragraph for which the designation has been received in step 705 as the upper left coordinates 401 and the lower right coordinates 402 of the paragraph table 400 (706).

　次に、段落指定受付処理部１１１は、ステップ７０５の処理で指定を受け付けた段落の領域に含まれ、ユーザが抽出を所望する要素の領域の指定を受け付ける（７０７）。例えば、ユーザは、マウス等を用いて要素の領域を指定する。 Next, the paragraph designation acceptance processing unit 111 accepts designation of an area of an element that is included in the paragraph area that has been designated in step 705 and that the user desires to extract (707). For example, the user designates an element area using a mouse or the like.

　次に、段落指定受付処理部１１１は、ステップ７０７の処理で指定を受け付けた要素の領域の頂点の座標を座標テーブル４２０に登録し、当該座標が登録された座標テーブル４２０へのポインタを要素テーブル４１０の多角形座標４１１に登録する（７０８）。 Next, the paragraph designation reception processing unit 111 registers the coordinates of the vertices of the element area designated in step 707 in the coordinate table 420, and registers a pointer to the coordinate table 420 in which the coordinates are registered in the polygon coordinates 411 of the element table 410 (708).

　次に、段落指定受付処理部１１１は、ステップ７０７の処理で指定を受け付けた要素の領域の中心座標を算出し、算出した中心座標を要素テーブル４１０の中心座標４１２に登録する（７０９）。 Next, the paragraph designation reception processing unit 111 calculates the center coordinates of the area of the element that has been specified in step 707, and registers the calculated center coordinates in the center coordinates 412 of the element table 410 (709).
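
The description does not state how the center coordinates are derived from the vertices registered in step 708. As one possible reading, the following sketch takes the center of the axis-aligned bounding box of the vertices; the function name and this choice of center are assumptions made for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) coordinate


def bounding_box_center(vertices: List[Point]) -> Point:
    """Return the center of the axis-aligned bounding box of the vertices.

    Assumption: the center coordinates 412 are taken as the bounding-box
    center; the source only says that the center is calculated.
    """
    if not vertices:
        raise ValueError("at least one vertex is required")
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)
```

For a rectangular element area the bounding-box center coincides with the centroid; for other polygons the two can differ.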

　次に、段落指定受付処理部１１１は、ステップ７０７の処理で指定を受け付けた要素の領域に存在する情報集合を抽出し、抽出した情報集合の表記形式を要素テーブル４１０の表記形式４１３に登録し、抽出した情報集合の属性を要素テーブル４１０の属性４１４に登録し、抽出した情報集合を単語辞書１２２に登録する（７１０）。 Next, the paragraph designation reception processing unit 111 extracts an information set existing in the element area for which the designation has been received in the process of step 707, and registers the notation format of the extracted information set in the notation format 413 of the element table 410. Then, the attribute of the extracted information set is registered in the attribute 414 of the element table 410, and the extracted information set is registered in the word dictionary 122 (710).

　次に、段落指定受付処理部１１１は、ステップ７０５の処理で指定を受け付けた段落の領域をＸ－Ｙカット法を用いて部分領域に分割し、部分領域数を算出する（７１１）。Ｘ－Ｙカット法は、所定の領域から例えば横方向に連続する情報集合を抽出し、抽出した横方向に連続する二つの情報集合の縦方向の距離に基づいて所定の領域を分割する方法である。なお、縦方向に連続する情報集合を抽出する場合、縦方向に連続する二つの情報集合の横方向の距離に基づいて分割する。ステップ７１１の処理では、段落の領域に分割可能な領域の数を部分領域数として算出している。ステップ７１１の処理の具体例については図１１で説明する。 Next, the paragraph designation reception processing unit 111 divides the paragraph area designated in step 705 into partial areas using the XY cut method, and calculates the number of partial areas (711). The XY cut method extracts from a predetermined area, for example, information sets that are continuous in the horizontal direction, and divides the predetermined area based on the vertical distance between two of the extracted horizontally continuous information sets. When information sets that are continuous in the vertical direction are extracted, the area is divided based on the horizontal distance between two vertically continuous information sets. In step 711, the number of areas into which the paragraph area can be divided is calculated as the number of partial areas. A specific example of the processing in step 711 will be described with reference to FIG. 11.
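
The horizontal pass of the XY cut method described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of FIG. 11: each information set is given as a bounding box, the y axis grows downward, and the function name and the `min_gap` threshold are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), y grows downward


def split_by_vertical_gaps(boxes: List[Box], min_gap: float) -> List[List[Box]]:
    """Group boxes into partial areas, cutting where the vertical gap is large.

    Boxes are visited from top to bottom. A new partial area starts whenever
    the gap between the lowest bottom edge seen so far and the next top edge
    is at least min_gap (an assumed threshold).
    """
    regions: List[List[Box]] = []
    current: List[Box] = []
    current_bottom: Optional[float] = None
    for box in sorted(boxes, key=lambda b: b[1]):
        top, bottom = box[1], box[3]
        if current and top - current_bottom >= min_gap:
            regions.append(current)
            current = []
            current_bottom = None
        current.append(box)
        current_bottom = bottom if current_bottom is None else max(current_bottom, bottom)
    if current:
        regions.append(current)
    return regions
```

Under these assumptions, the number of partial areas registered in step 712 would be `len(split_by_vertical_gaps(boxes, min_gap))`. A full XY cut would apply the same idea alternately along both axes.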

　次に、段落指定受付処理部１１１は、ステップ７１１の処理で算出した部分領域数を段落テーブル４００の部分領域数４０３に登録する（７１２）。 Next, the paragraph designation reception processing unit 111 registers the number of partial areas calculated in the process of step 711 in the number of partial areas 403 of the paragraph table 400 (712).

　次に、段落指定受付処理部１１１は、要素テーブル４１０へのポインタを段落テーブル４００の要素へのポインタ４０５に登録する（７１３）。 Next, the paragraph designation reception processing unit 111 registers the pointer to the element table 410 in the pointer 405 to the element of the paragraph table 400 (713).

　次に、段落指定受付処理部１１１は、入力された文書データから、指定を受け付けた段落の繰り返し部分を抽出する繰り返し部分抽出処理を実行し（７１４）、段落受付処理を終了する。繰り返し部分抽出処理の詳細は図８～図１０で説明する。 Next, the paragraph designation reception processing unit 111 executes a repeated portion extraction process that extracts, from the input document data, the repeated portions of the designated paragraph (714), and ends the paragraph designation reception process. Details of the repeated portion extraction process will be described with reference to FIGS. 8 to 10.

　図８は、本実施例の繰り返し部分抽出処理のフローチャートである。 FIG. 8 is a flowchart of the repeated partial extraction process of this embodiment.

　まず、段落指定受付処理部１１１は、入力された文書データの先頭ページに移動し（８０１）、繰り返しテーブル１２４を初期化する（８０２）。次に、段落指定受付処理部１１１は、入力された文書データから文字及び線分等の情報を抽出し（８０３）、入力された文書データの領域をＸ－Ｙカット法を用いて少なくとも一つの領域に分割する（８０４）。ステップ８０４の処理の具体例については図１２で説明する。 First, the paragraph designation reception processing unit 111 moves to the first page of the input document data (801), and initializes the repetition table 124 (802). Next, the paragraph designation reception processing unit 111 extracts information such as characters and line segments from the input document data (803), and divides the area of the input document data into at least one area using the XY cut method (804). A specific example of the processing in step 804 will be described with reference to FIG. 12.

　段落指定受付処理部１１１は、ブロック辞書１２１に登録された段落の領域と物理構造が類似する繰り返し候補部分を抽出する繰り返し候補部分抽出処理を実行する（８０５）。繰り返し候補部分抽出処理の詳細は図９で説明する。 The paragraph designation reception processing unit 111 executes a repetition candidate part extraction process that extracts repetition candidate parts whose physical structure is similar to that of the paragraph area registered in the block dictionary 121 (805). Details of the repetition candidate part extraction process will be described with reference to FIG. 9.

　次に、段落指定受付処理部１１１は、抽出された繰り返し候補部分に含まれる要素が、ブロック辞書１２１に登録された要素と一致する場合、当該繰り返し候補部分を繰り返し部分として、繰り返し部分に関する情報を繰り返しテーブル１２４に登録する要素一致判定処理を実行する（８０６）。要素一致判定処理の詳細は図１０で説明する。 Next, the paragraph designation reception processing unit 111 executes an element matching determination process (806). In this process, when the elements included in an extracted repetition candidate part match the elements registered in the block dictionary 121, the repetition candidate part is treated as a repeated portion and information on the repeated portion is registered in the repetition table 124. Details of the element matching determination process will be described with reference to FIG. 10.

　次に、段落指定受付処理部１１１は、抽出された繰り返し部分を表示する繰り返し部分表示画面１５００（図１５参照）を表示して（８０７）、繰り返しテーブル１２４に繰り返し部分が二つ以上登録されているか否かを判定する（８０８）。なお、繰り返し部分表示画面１５００の詳細は図１５で説明する。 Next, the paragraph designation reception processing unit 111 displays a repeated portion display screen 1500 (see FIG. 15) showing the extracted repeated portions (807), and determines whether or not two or more repeated portions are registered in the repetition table 124 (808). Details of the repeated portion display screen 1500 will be described with reference to FIG. 15.

　ステップ８０８の処理で、繰り返しテーブル１２４に繰り返し部分が二つ以上登録されていると判定された場合（８０８：Ｙｅｓ）、段落指定受付処理部１１１は、繰り返し部分に含まれる各要素の情報集合を抽出し、抽出した各要素の情報集合を出力ＤＢ１２３に登録し、抽出した各要素の情報集合を単語辞書１２２に登録し（８０９）、繰り返し部分抽出処理を終了する。 If it is determined in step 808 that two or more repeated portions are registered in the repetition table 124 (808: Yes), the paragraph designation reception processing unit 111 extracts the information set of each element included in the repeated portions, registers the extracted information sets in the output DB 123 and in the word dictionary 122 (809), and ends the repeated portion extraction process.

　一方、ステップ８０８の処理で、繰り返しテーブル１２４に繰り返し部分が二つ以上登録されていないと判定された場合（８０８：Ｎｏ）、段落指定受付処理部１１１は、繰り返し部分抽出処理を終了する。 On the other hand, when it is determined in the process of step 808 that two or more repeated parts are not registered in the repetition table 124 (808: No), the paragraph designation reception processing unit 111 ends the repeated part extraction process.

　図９は、本実施例の繰り返し候補部分抽出処理のフローチャートである。 FIG. 9 is a flowchart of the repetition candidate part extraction process of the present embodiment.

　まず、段落指定受付処理部１１１は、ステップ８０４の処理で分割された領域から処理対象となる一つの領域を選択する（９０１）。 First, the paragraph designation reception processing unit 111 selects one area to be processed from the areas divided in the process of Step 804 (901).

　次に、段落指定受付処理部１１１は、ステップ９０１の処理で選択した領域に存在する全ての情報集合を抽出し、抽出した情報集合の面積の合計値を当該領域の面積として算出する（９０２）。 Next, the paragraph designation reception processing unit 111 extracts all information sets existing in the area selected in the process of step 901, and calculates the total value of the areas of the extracted information sets as the area of the area (902). .

　次に、段落指定受付処理部１１１は、段落テーブル４００の部分領域数４０３に登録された値が２以上であるか否かを判定する（９０３）。 Next, the paragraph designation reception processing unit 111 determines whether or not the value registered in the partial area number 403 of the paragraph table 400 is 2 or more (903).

　ステップ９０３の処理で、段落テーブル４００の部分領域数４０３に登録された値が２以上でない、すなわち、段落テーブル４００の部分領域数４０３に登録された値が１であると判定された場合（９０３：Ｎｏ）、段落指定受付処理部１１１は、段落テーブル４００の左上座標４０１に登録された座標及び右下座標４０２に登録された座標に基づいてユーザから指定された段落の領域の面積を算出し、ステップ９０２の処理で算出した領域の面積が算出した段落の領域の面積から所定範囲内であるか否かを判定する（９０４）。 When it is determined in step 903 that the value registered in the partial area number 403 of the paragraph table 400 is not 2 or more, that is, the value registered in the partial area number 403 of the paragraph table 400 is 1 (903) : No), the paragraph designation reception processing unit 111 calculates the area of the paragraph area designated by the user based on the coordinates registered in the upper left coordinates 401 and the coordinates registered in the lower right coordinates 402 of the paragraph table 400. Then, it is determined whether or not the area area calculated in step 902 is within a predetermined range from the calculated area area of the paragraph (904).

　ステップ９０４の処理で、ステップ９０２の処理で算出した領域の面積が算出した段落の領域の面積から所定範囲内でないと判定された場合（９０４：Ｎｏ）、ステップ９０１の処理で選択した領域はユーザによって指定された段落の領域と物理構造が類似せず、ステップ９０１の処理で選択した領域は繰り返し候補部分とはならないため、段落指定受付処理部１１１は、ステップ９０１の処理に戻り、ステップ９０１～９１２の処理が未だ実行されていない領域を選択する。 If it is determined in step 904 that the area calculated in step 902 is not within the predetermined range of the calculated area of the paragraph (904: No), the area selected in step 901 is not similar in physical structure to the paragraph area designated by the user and therefore does not become a repetition candidate part. In this case, the paragraph designation reception processing unit 111 returns to step 901 and selects an area for which the processing of steps 901 to 912 has not yet been executed.

　一方、ステップ９０４の処理で、ステップ９０２の処理で算出した領域の面積が算出した段落の領域の面積から所定範囲内であると判定された場合（９０４：Ｙｅｓ）、段落指定受付処理部１１１は、ステップ９０１の処理で選択した領域を繰り返し候補部分として抽出する（９０５）。 On the other hand, if it is determined in step 904 that the area calculated in step 902 is within the predetermined range of the calculated area of the paragraph (904: Yes), the paragraph designation reception processing unit 111 extracts the area selected in step 901 as a repetition candidate part (905).

　次に、段落指定受付処理部１１１は、入力された画像データの分割された全ての領域に対して、ステップ９０１～９１２の処理が実行されたか否かを判定する（９０６）。 Next, the paragraph designation reception processing unit 111 determines whether or not the processing in steps 901 to 912 has been executed for all the divided areas of the input image data (906).

　ステップ９０６の処理で、入力された画像データの分割された全ての領域に対して、ステップ９０１～９１２の処理が実行されたと判定された場合（９０６：Ｙｅｓ）、段落指定受付処理部１１１は、繰り返し候補部分抽出処理を終了する。 If it is determined in step 906 that the processes in steps 901 to 912 have been performed on all the divided areas of the input image data (906: Yes), the paragraph designation reception processing unit 111 The repetition candidate part extraction process is terminated.

　一方、ステップ９０６の処理で、入力された画像データの分割された全ての領域に対して、ステップ９０１～９１２の処理が実行されていないと判定された場合（９０６：Ｎｏ）、段落指定受付処理部１１１は、ステップ９０１の処理に戻り、ステップ９０１～９１２の処理が未だ実行されていない領域を選択する。 On the other hand, if it is determined in step 906 that the processing of steps 901 to 912 has not been executed for all the divided areas of the input image data (906: No), the paragraph designation reception processing unit 111 returns to step 901 and selects an area for which the processing of steps 901 to 912 has not yet been executed.

　ステップ９０３の処理で、段落テーブル４００の部分領域数４０３に登録された値が２以上であると判定された場合、ユーザによって指定された段落は複数の領域を含むので、段落指定受付処理部１１１は、ステップ９０１の処理で選択した領域の面積と当該領域に隣接する領域の面積とを合計することによって、ユーザによって指定された段落に対応する領域の面積を算出する。具体的には、まず、段落指定受付処理部１１１は、ステップ９０１の処理で選択した領域に隣接する一つの領域を隣接領域として選択する（９０７）。例えば、ステップ９０１の処理では、文書データの上方に位置する領域から順に選択されるものとすると、ステップ９０７の処理では、ステップ９０１の処理で選択した領域の下方に隣接する領域が隣接領域として選択される。 If it is determined in step 903 that the value registered in the partial area number 403 of the paragraph table 400 is 2 or more, the paragraph designated by the user includes a plurality of areas. The paragraph designation reception processing unit 111 therefore calculates the area of the region corresponding to the paragraph designated by the user by summing the area of the region selected in step 901 and the areas of the regions adjacent to it. Specifically, the paragraph designation reception processing unit 111 first selects one region adjacent to the region selected in step 901 as an adjacent region (907). For example, if regions are selected in step 901 in order from the top of the document data, the region adjacent below the region selected in step 901 is selected as the adjacent region in step 907.

　次に、段落指定受付処理部１１１は、ステップ９０７の処理で選択した隣接領域に存在する全ての情報集合を抽出し、抽出した情報集合の面積の合計値を当該隣接領域の面積として算出する（９０８）。 Next, the paragraph designation reception processing unit 111 extracts all the information sets existing in the adjacent area selected in the process of step 907, and calculates the total value of the areas of the extracted information sets as the area of the adjacent area ( 908).

　次に、段落指定受付処理部１１１は、ステップ９０２の処理で算出した領域の面積とステップ９０８の処理で算出した隣接領域の面積との合計値を算出する（９０９）。 Next, the paragraph designation reception processing unit 111 calculates the total value of the area area calculated in step 902 and the adjacent area area calculated in step 908 (909).

　次に、段落指定受付処理部１１１は、段落テーブル４００の左上座標４０１に登録された座標及び右下座標４０２に登録された座標に基づいてユーザから指定された段落の領域の面積を算出し、ステップ９０９の処理で算出した合計値が算出した段落の領域の面積から所定範囲内であるか否かを判定する（９１０）。 Next, the paragraph designation reception processing unit 111 calculates the area of the paragraph area designated by the user based on the coordinates registered in the upper left coordinates 401 and the coordinates registered in the lower right coordinates 402 of the paragraph table 400, It is determined whether the total value calculated in step 909 is within a predetermined range from the calculated area of the paragraph (910).

　ステップ９１０の処理で、ステップ９０２の処理で算出した領域の面積が算出した段落の領域の面積から所定範囲内であると判定された場合（９１０：Ｙｅｓ）、段落指定受付処理部１１１は、ステップ９０５の処理に進み、ステップ９０１の処理で選択した領域及びステップ９０７の処理で選択した隣接領域を繰り返し候補部分として抽出する。 If it is determined in step 910 that the area calculated in step 902 is within the predetermined range of the calculated area of the paragraph (910: Yes), the paragraph designation reception processing unit 111 proceeds to step 905 and extracts the region selected in step 901 together with the adjacent region selected in step 907 as a repetition candidate part.

　一方、ステップ９１０の処理で、ステップ９０２の処理で算出した領域の面積が算出した段落の領域の面積から所定範囲内でないと判定された場合（９１０：Ｎｏ）、段落指定受付処理部１１１は、ステップ９０９の処理で算出した合計値が段落の領域の面積より小さいか否かを判定する（９１１）。 On the other hand, if it is determined in step 910 that the area calculated in step 902 is not within the predetermined range from the calculated area of the paragraph (910: No), the paragraph designation reception processing unit 111 It is determined whether the total value calculated in step 909 is smaller than the area of the paragraph area (911).

　ステップ９１１の処理で、ステップ９０９の処理で算出した合計値が段落の領域の面積より小さいと判定された場合（９１１：Ｙｅｓ）、段落指定受付処理部１１１は、ステップ９０７の処理で選択した隣接領域に隣接する領域を隣接領域として選択し（９１２）、ステップ９０８の処理に戻り、ステップ９０１の処理で選択した領域の面積とステップ９０７の処理で選択した領域の面積とステップ９１２の処理で選択した領域の面積との合計値及び段落の領域の面積を比較する。 If it is determined in step 911 that the total value calculated in step 909 is smaller than the area of the paragraph (911: Yes), the paragraph designation reception processing unit 111 selects the region adjacent to the adjacent region selected in step 907 as a new adjacent region (912), and returns to step 908. It then compares the area of the paragraph with the sum of the areas of the region selected in step 901, the region selected in step 907, and the region selected in step 912.

　一方、ステップ９１１の処理で、ステップ９０９の処理で算出した合計値が段落の領域の面積以上であると判定された場合（９１２：Ｎｏ）、段落指定受付処理部１１１は、ステップ９０６の処理に進み、入力された画像データの分割された全ての領域に対して、ステップ９０１～９１２の処理が実行されたか否かを判定する。 On the other hand, if it is determined in step 911 that the total value calculated in step 909 is equal to or larger than the area of the paragraph area (912: No), the paragraph designation reception processing unit 111 performs the process in step 906. Then, it is determined whether or not the processing in steps 901 to 912 has been executed for all divided areas of the input image data.
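
The flow of FIG. 9 can be summarized by the following sketch. It is an illustration under stated assumptions rather than the embodiment itself: region areas are passed in precomputed (steps 902 and 908), the regions are ordered from top to bottom, the "predetermined range" is modeled as an absolute tolerance, every region is tried in turn as a starting region, and the function and parameter names are hypothetical.

```python
from typing import List


def extract_repetition_candidates(
    region_areas: List[float],
    paragraph_area: float,
    partial_region_count: int,
    tolerance: float,
) -> List[List[int]]:
    """Return candidate parts as lists of region indices.

    tolerance is an assumed absolute bound standing in for the
    "predetermined range" of steps 904 and 910.
    """
    candidates: List[List[int]] = []
    for start, area in enumerate(region_areas):           # step 901
        if partial_region_count < 2:                       # step 903: No
            if abs(area - paragraph_area) <= tolerance:    # step 904
                candidates.append([start])                 # step 905
            continue
        total = area
        # Steps 907 and 912: add the regions below the selected one, one at a time.
        for nxt in range(start + 1, len(region_areas)):
            total += region_areas[nxt]                     # steps 908 and 909
            if abs(total - paragraph_area) <= tolerance:   # step 910
                candidates.append(list(range(start, nxt + 1)))
                break
            if total >= paragraph_area:                    # step 911: No
                break
    return candidates
```

For example, with region areas `[100, 40, 60, 30, 68, 300]`, a paragraph area of 100, two partial areas, and a tolerance of 5, the sketch returns the index groups `[1, 2]` and `[3, 4]`.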

　図１０は、本実施例の要素一致判定処理のフローチャートである。 FIG. 10 is a flowchart of the element matching determination process according to this embodiment.

　まず、段落指定受付処理部１１１は、繰り返し候補部分抽出処理で抽出された繰り返し候補部分から処理対象となる一つの繰り返し候補部分を選択する（１００１）。 First, the paragraph designation reception processing unit 111 selects one repetition candidate part to be processed from the repetition candidate parts extracted in the repetition candidate part extraction process (1001).

　次に、段落指定受付処理部１１１は、ステップ１００１の処理で選択した繰り返し候補部分に含まれる情報集合から処理対象となる一つの情報集合を選択する（１００２）。 Next, the paragraph designation reception processing unit 111 selects one information set to be processed from the information set included in the repetition candidate part selected in the process of Step 1001 (1002).

　次に、段落指定受付処理部１１１は、ステップ１００２の処理で選択した情報集合と比較するための要素テーブル４１０を選択する（１００３）。 Next, the paragraph designation reception processing unit 111 selects the element table 410 for comparison with the information set selected in the processing of Step 1002 (1003).

　次に、段落指定受付処理部１１１は、ステップ１００２の処理で選択した情報集合が、ステップ１００３の処理で選択した要素テーブル４１０の表記形式４１３に登録された表記形式及び属性４１４に登録された属性を満たすか否かを判定する（１００４）。 Next, the paragraph designation reception processing unit 111 determines whether or not the information set selected in step 1002 satisfies the notation format registered in the notation format 413 and the attribute registered in the attribute 414 of the element table 410 selected in step 1003 (1004).

　ステップ１００４の処理で、情報集合が要素テーブル４１０の表記形式及び属性を満たすと判定された場合（１００４：Ｙｅｓ）、段落指定受付処理部１１１は、ステップ１００２の処理で選択した情報集合の中心座標を算出する（１００５）。情報集合の中心座標は、当該情報集合が存在する領域内における座標であり、例えば、当該情報集合が存在する領域の中心座標を原点として算出される。 If it is determined in step 1004 that the information set satisfies the notation format and attribute of the element table 410 (1004: Yes), the paragraph designation reception processing unit 111 calculates the center coordinates of the information set selected in step 1002 (1005). The center coordinates of the information set are coordinates within the area in which the information set exists, and are calculated, for example, with the center coordinates of that area as the origin.

　次に、段落指定受付処理部１１１は、ステップ１００５の処理で算出した情報集合の中心座標と、ステップ１００３の処理で算出した要素テーブル４１０の中心座標４１２に登録された座標との間の距離を算出する（１００６）。 Next, the paragraph designation reception processing unit 111 calculates the distance between the center coordinates of the information set calculated in the process of step 1005 and the coordinates registered in the center coordinates 412 of the element table 410 calculated in the process of step 1003. Calculate (1006).

　次に、段落指定受付処理部１１１は、ステップ１００６の処理で算出した距離が所定値以下であるか否かを判定する（１００７）。 Next, the paragraph designation reception processing unit 111 determines whether or not the distance calculated in the process of Step 1006 is less than or equal to a predetermined value (1007).

　ステップ１００７の処理で、ステップ１００６の処理で算出した距離が所定値以下であると判定された場合（１００７：Ｙｅｓ）、段落指定受付処理部１１１は、ステップ１００２の処理で選択した情報集合とステップ１００３の処理で選択した要素テーブルとを対応付けて記憶する（１００８）。 If it is determined in step 1007 that the distance calculated in step 1006 is equal to or less than the predetermined value (1007: Yes), the paragraph designation reception processing unit 111 stores the information set selected in step 1002 and the element table selected in step 1003 in association with each other (1008).
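
Steps 1004 to 1007 amount to a format check followed by a distance check, which can be sketched as follows. The source does not specify how a notation format is represented, so a regular expression is used here purely as an assumption; the attribute check is omitted, and the class, function, and parameter names are hypothetical.

```python
import math
import re
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass
class RegisteredElement:
    """Hypothetical stand-in for one element table 410."""
    center: Point          # center coordinates 412
    format_pattern: str    # assumption: notation format 413 expressed as a regex


def matches_element(
    text: str, center: Point, element: RegisteredElement, max_distance: float
) -> bool:
    """Return True if the information set may be associated with the element."""
    # Step 1004: the text must satisfy the registered notation format.
    if re.fullmatch(element.format_pattern, text) is None:
        return False
    # Steps 1005 to 1007: the centers must be within the predetermined distance.
    return math.dist(center, element.center) <= max_distance
```

Both centers are assumed to be expressed in the same region-relative coordinate system, as suggested by the description of step 1005.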

　次に、段落指定受付処理部１１１は、ステップ１００１の処理で選択した繰り返し候補部分の全ての情報集合がステップ１００２の処理で選択されたか否かを判定する（１００９）。 Next, the paragraph designation reception processing unit 111 determines whether or not all information sets of the repetition candidate part selected in the process of Step 1001 have been selected in the process of Step 1002 (1009).

　ステップ１００９の処理で、ステップ１００１の処理で選択した繰り返し候補部分の全ての情報集合がステップ１００２の処理で選択されたと判定された場合（１００９：Ｙｅｓ）、段落指定受付処理部１１１は、情報集合が要素テーブルに一対一で対応しているか否かを判定する（１０１０）。例えば、ステップ１００１の処理で選択した繰り返し候補部分に第１情報集合、第２情報集合、及び第３情報集合が存在し、ブロック辞書１２１に第１要素テーブル、第２要素テーブル、及び第３要素テーブルが含まれる場合、ステップ１００８の処理で、第１情報集合と第１要素テーブルとが対応付けて記憶され、第２情報集合と第２要素テーブルとが対応付けて記憶され、第３情報集合と第３要素テーブルとが対応付けて記憶されていれば、ステップ１０１０の処理でＹｅｓと判定される。なお、一つの要素テーブルが複数の情報集合に重複して対応付けられている場合等は、情報集合が要素テーブルに一対一で対応しておらず、ステップ１０１０の処理でＮｏと判定される。 If it is determined in step 1009 that all the information sets of the repetition candidate part selected in step 1001 have been selected in step 1002 (1009: Yes), the paragraph designation reception processing unit 111 determines whether or not the information sets correspond one-to-one to the element tables (1010). For example, suppose that the repetition candidate part selected in step 1001 contains a first information set, a second information set, and a third information set, and that the block dictionary 121 includes a first element table, a second element table, and a third element table. If, in step 1008, the first information set has been stored in association with the first element table, the second information set with the second element table, and the third information set with the third element table, the determination in step 1010 is Yes. If, for example, one element table is associated with a plurality of information sets, the information sets do not correspond one-to-one to the element tables, and the determination in step 1010 is No.
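
The one-to-one determination of step 1010 can be sketched as follows. The associations stored in step 1008 are modeled as pairs of identifiers, and the sketch additionally assumes that every information set and every element table must appear in exactly one pair; the function name and identifiers are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def is_one_to_one(
    pairs: List[Tuple[Hashable, Hashable]],
    info_sets: Iterable[Hashable],
    element_tables: Iterable[Hashable],
) -> bool:
    """Return True if each information set and each element table is paired exactly once."""
    info_counts = Counter(info for info, _ in pairs)
    table_counts = Counter(table for _, table in pairs)
    # Counter returns 0 for missing keys, so unpaired items also fail the check.
    return all(info_counts[i] == 1 for i in info_sets) and all(
        table_counts[t] == 1 for t in element_tables
    )
```

With the three information sets and three element tables of the example above, the pairing (1, 1), (2, 2), (3, 3) passes, while a pairing in which two information sets share the first element table fails.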

If it is determined in step 1010 that the information sets correspond one-to-one to the element tables (1010: Yes), the paragraph designation reception processing unit 111 extracts the repetition candidate part selected in step 1001 as a repeated part and registers information on that repetition candidate part in the repetition table 124 (1011).

Specifically, the paragraph designation reception processing unit 111 adds 1 to the value registered in the number of repetitions 501 of the list table 500 included in the repetition table 124, and registers a pointer to the paragraph table 510 in the pointer 502 to the paragraph table. The paragraph designation reception processing unit 111 registers the upper-left coordinates of the repetition candidate part selected in step 1001 in the upper-left coordinates 401 of the paragraph table 510, and registers the lower-right coordinates of that repetition candidate part in the lower-right coordinates 402. It also registers, in the number of partial regions 403 of the paragraph table 510, the number of regions contained in the repetition candidate part extracted as a repeated part. In addition, it registers, in the number of elements 404 of the paragraph table 510, the number of information sets contained in the repetition candidate part extracted as a repeated part, and registers a pointer to the element table 520 in the pointer 405 to the element of the paragraph table 510.

The paragraph designation reception processing unit 111 registers, in the polygon coordinates 411 of the element table 520 included in the repetition table 124, a pointer to the coordinate table 530, in which the coordinates of each vertex of the region of an information set contained in the repetition candidate part extracted as a repeated part are registered. It also registers the center coordinates of that information set's region in the center coordinates 412 of the element table 520. In addition, it registers the notation format of the information set in the notation format 413 of the element table 520, and registers the attribute of the information set in the attribute 414 of the element table 520.
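The registration described for step 1011 can be pictured with the following data-structure sketch. The class and field names are hypothetical; the reference numerals in the comments indicate which field of the description each one is meant to mirror. Pointers are replaced by ordinary Python references, and the counts 404 and 501 are derived from list lengths rather than stored, which is a simplification and not the layout given in the description.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class ElementTable:                 # element table 520
    polygon: List[Point]            # 411 (vertices held in coordinate table 530)
    center: Point                   # 412
    notation_format: str            # 413
    attribute: str                  # 414


@dataclass
class ParagraphTable:               # paragraph table 510
    upper_left: Point               # 401
    lower_right: Point              # 402
    partial_region_count: int       # 403
    elements: List[ElementTable] = field(default_factory=list)  # 405

    @property
    def element_count(self) -> int:  # 404, derived here instead of stored
        return len(self.elements)


@dataclass
class ListTable:                    # list table 500
    paragraphs: List[ParagraphTable] = field(default_factory=list)  # 502

    @property
    def repeat_count(self) -> int:   # 501, derived here instead of stored
        return len(self.paragraphs)


def register_repetition(list_table: ListTable, paragraph: ParagraphTable) -> int:
    """Register one repeated part; returns the updated number of repetitions."""
    list_table.paragraphs.append(paragraph)
    return list_table.repeat_count
```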

After executing step 1011, the paragraph designation reception processing unit 111 determines whether all repetition candidate parts extracted by the repetition candidate part extraction process have been selected in step 1001 (1012).

If it is determined in step 1012 that all repetition candidate parts extracted by the repetition candidate part extraction process have been selected in step 1001 (1012: Yes), the paragraph designation reception processing unit 111 ends the element matching determination process. If it is determined in step 1012 that not all of the extracted repetition candidate parts have been selected in step 1001 (1012: No), the paragraph designation reception processing unit 111 returns to step 1001.

If it is determined in step 1004 that the information set does not satisfy the notation format and attribute of the element table 410 (1004: No), or if it is determined in step 1007 that the distance calculated in step 1006 is greater than the predetermined value (1007: No), the information set selected in step 1002 does not match the physical structure or logical structure of the element registered in the element table 410 selected in step 1003. The paragraph designation reception processing unit 111 therefore determines whether all element tables 410 have been selected in step 1003 (1013). If it is determined in step 1013 that not all element tables 410 have been selected in step 1003 (1013: No), the paragraph designation reception processing unit 111 returns to step 1003 and selects an element table 410 that has not yet been selected.

If it is determined in step 1013 that all element tables 410 have been selected in step 1003 (1013: Yes), the information set selected in step 1002 matches the physical structure or logical structure of none of the elements, so the repetition candidate part selected in step 1001 contains an information set that matches no element. The paragraph designation reception processing unit 111 therefore determines that the repetition candidate part selected in step 1001 is not a repeated part (1014) and proceeds to step 1012.

Likewise, if it is determined in step 1010 that the information sets do not correspond one-to-one to the element tables (1010: No), the paragraph designation reception processing unit 111 proceeds to step 1014 and determines that the repetition candidate part selected in step 1001 is not a repeated part.

FIG. 11 is an explanatory diagram of the process, in this embodiment, of dividing the paragraph region designated by the user using the X-Y cut method.

The paragraph region designated by the user in the document data shown in FIG. 2 is divided using the X-Y cut method. In FIG. 11, horizontally continuous information sets are extracted, and because the vertical distance between the two extracted horizontally continuous information sets is equal to or less than the threshold, the paragraph region is not divided. The number of partial regions of this paragraph region is therefore 1.
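The vertical-gap test described above can be sketched as follows. This covers only one cutting direction of an X-Y cut and is a simplified illustration; the function name `split_rows_by_gap`, the representation of each horizontally continuous run as a `(top, bottom)` pair, and the sample coordinates are assumptions.

```python
def split_rows_by_gap(rows, threshold):
    """Group rows into partial regions, cutting where the vertical gap is large.

    rows: list of (top, bottom) y-extents, one per horizontally continuous
          run of information sets (hypothetical representation).
    threshold: maximum vertical gap that still keeps two rows together.
    """
    if not rows:
        return []
    ordered = sorted(rows)
    regions = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur[0] - prev[1]          # distance from previous bottom to next top
        if gap > threshold:
            regions.append([cur])       # gap exceeds the threshold: cut here
        else:
            regions[-1].append(cur)     # gap within the threshold: same region
    return regions
```

For two rows whose gap is within the threshold, `len(split_rows_by_gap(rows, threshold))` is 1, which corresponds to the number of partial regions being 1 in FIG. 11.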

FIG. 12 is an explanatory diagram of the process, in this embodiment, of dividing the entire region of the document data using the X-Y cut method.

The entire region of the document data shown in FIG. 2 is divided using the X-Y cut method. In FIG. 12, the entire region of the document data is divided into seven regions, the first region through the seventh region.

The first region contains the character string "Document 1 Sport Book". The second region contains the character string "There are three types of this product." The third region contains three figures and a character string for each figure. The fourth region contains the character string "A list of products is shown below". The fifth region contains character strings relating to the handbag. The sixth region contains character strings relating to the boots. The seventh region contains character strings relating to the gloves.

Next, the details of the document processing are described.

First, the extraction region candidate extraction process shown in FIG. 6 is described. In this process, the document processing unit 112 extracts, from the regions of the input document data, regions similar to the physical structure of a paragraph registered in the paragraph table 400 included in the block dictionary 121 as extraction region candidates. The extraction region candidate extraction process can be implemented with the same flowchart as the repetition candidate part extraction process shown in FIG. 9. It differs from the repetition candidate part extraction process in that, in step 905, the document processing unit 112 extracts the region selected in step 901 as an extraction region candidate.

Next, the details of the matching degree calculation process are described with reference to FIG. 13. FIG. 13 is a flowchart of the matching degree calculation process of this embodiment.

First, the document processing unit 112 selects one extraction region candidate to be processed from the extraction region candidates extracted by the extraction region candidate extraction process (1301).

Next, the document processing unit 112 selects one paragraph table 400 to be processed from the paragraph tables 400 registered in the block dictionary 121 (1302). For example, when the user designates a plurality of paragraph regions in the paragraph designation reception process, a plurality of paragraph tables 400 are registered in the block dictionary 121.

Next, the document processing unit 112 selects one information set to be processed from the information sets contained in the extraction region candidate selected in step 1301 (1303).

Next, the document processing unit 112 selects an element table 410 to be compared with the information set selected in step 1303 from the element tables 410 associated with the paragraph table 400 selected in step 1302 (1304).

Next, the document processing unit 112 executes the center coordinate distance matching degree calculation process (1305). This process calculates a center coordinate distance matching degree based on the distance between the center coordinates of the information set selected in step 1303 and the coordinates registered in the center coordinates 412 of the element table 410 selected in step 1304. The smaller the distance between the center coordinates of the information set and the center coordinates of the element, the higher the center coordinate distance matching degree.
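One way to obtain a score that rises as the distance shrinks is sketched below. The description states only the monotonic relationship; the specific formula `1 / (1 + d / scale)`, the function name `center_distance_match`, and the `scale` parameter are assumptions chosen for illustration.

```python
import math


def center_distance_match(info_center, element_center, scale=100.0):
    """Score in (0, 1]; 1.0 when the two centers coincide.

    info_center, element_center: (x, y) tuples.
    scale: hypothetical distance at which the score falls to 0.5.
    """
    d = math.hypot(info_center[0] - element_center[0],
                   info_center[1] - element_center[1])
    return 1.0 / (1.0 + d / scale)
```

Any other strictly decreasing function of the distance would satisfy the description equally well.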

Next, the document processing unit 112 executes the word matching degree calculation process (1306). This process calculates the degree of matching between the characters contained in the information set selected in step 1303 and the characters registered in the word dictionary 122 associated with the element table 410 selected in step 1304. The more characters in the information set that are the same as characters registered in the word dictionary 122, the higher the word matching degree.
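A character-level reading of this process can be sketched as follows. The description says only that more shared characters give a higher score; normalizing by the length of the information set, ignoring whitespace, and the function name `word_match` are assumptions. The sample strings are the Japanese product names from the source document.

```python
def word_match(text, registered_words):
    """Fraction of characters in `text` that also occur in the dictionary entries.

    text: character string of the information set.
    registered_words: strings registered in the word dictionary for one element.
    """
    registered_chars = set("".join(registered_words))
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    shared = sum(1 for c in chars if c in registered_chars)
    return shared / len(chars)


# Hypothetical dictionary contents for the product-name element.
product_names = ["ハンドバッグ", "長靴", "手袋"]
registered_score = word_match("長靴", product_names)
unregistered_score = word_match("折りたたみ傘", product_names)
```

Here `registered_score` is higher than `unregistered_score`, because none of the characters of the unregistered product name occur in the dictionary entries.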

Next, the document processing unit 112 executes the notation format and attribute matching degree calculation process (1307). This process calculates a notation format and attribute matching degree based on whether the information set selected in step 1303 satisfies the notation format registered in the notation format 413 of the element table 410 selected in step 1304, and whether that information set satisfies the attribute registered in the attribute 414 of the same element table 410. The notation format and attribute matching degree is highest when the information set satisfies both the notation format and the attribute, next highest when it satisfies only one of them, and lowest when it satisfies neither.
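The three-level ordering can be sketched as below. The numeric levels 1.0, 0.5, and 0.0, the function names, and the use of a regular expression to test a code-number notation format are assumptions; the description fixes only the ordering of the three cases.

```python
import re

# Hypothetical notation format for a code number such as "055-3223".
CODE_PATTERN = re.compile(r"\d{3}-\d{4}")


def matches_code_format(text):
    """True when the whole string has the assumed NNN-NNNN form."""
    return CODE_PATTERN.fullmatch(text) is not None


def format_attribute_match(satisfies_format, satisfies_attribute):
    """1.0 if both conditions hold, 0.5 if exactly one holds, 0.0 if neither."""
    hits = int(bool(satisfies_format)) + int(bool(satisfies_attribute))
    return hits / 2.0
```

For example, `format_attribute_match(matches_code_format("331-0120"), True)` gives the highest level, since the string has the assumed form and the attribute is taken to be satisfied.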

Next, the document processing unit 112 calculates a total matching degree based on the center coordinate distance matching degree calculated in step 1305, the word matching degree calculated in step 1306, and the notation format and attribute matching degree calculated in step 1307 (1308). Specifically, the document processing unit 112 calculates the total matching degree by weighting these three matching degrees. For example, the weight coefficient of the center coordinate distance matching degree is set highest, that of the notation format and attribute matching degree is set next highest, and that of the word matching degree is set lowest. With this weighting, when a character string that has never been extracted before is encountered, its word matching degree is low because the string is not registered in the word dictionary 122. Because the word matching degree has the lowest weight coefficient, however, the string can still be extracted as an element if the center coordinate distance matching degree or the notation format and attribute matching degree is high.
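A minimal sketch of step 1308, assuming the weighting is a plain weighted sum. The weight values 0.5, 0.3, and 0.2 are hypothetical and chosen only to respect the ordering given above (distance highest, notation format and attribute next, word lowest); the function name `total_match` is likewise an assumption.

```python
def total_match(distance_match, word_match_score, format_attr_match,
                weights=(0.5, 0.2, 0.3)):
    """Weighted sum of the three matching degrees.

    weights: (distance, word, format/attribute), hypothetical values with
             distance > format/attribute > word.
    """
    w_dist, w_word, w_fmt = weights
    return (w_dist * distance_match
            + w_word * word_match_score
            + w_fmt * format_attr_match)
```

With these weights, an unregistered string that sits at the expected position and has the expected format scores `total_match(1.0, 0.0, 1.0)`, which is about 0.8, so a missing dictionary entry alone does not prevent extraction.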

Next, the document processing unit 112 determines whether all element tables 410 associated with the paragraph table 400 selected in step 1302 have been selected in step 1304 (1309).

If it is determined in step 1309 that not all element tables 410 associated with the paragraph table 400 selected in step 1302 have been selected in step 1304 (1309: No), the document processing unit 112 returns to step 1304 and selects a new element table 410.

If it is determined in step 1309 that all element tables 410 associated with the paragraph table 400 selected in step 1302 have been selected in step 1304 (1309: Yes), the document processing unit 112 determines whether all information sets present in the extraction region candidate selected in step 1301 have been selected in step 1303 (1310).

If it is determined in step 1310 that not all information sets present in the extraction region candidate selected in step 1301 have been selected in step 1303 (1310: No), the document processing unit 112 returns to step 1303 and selects another information set present in the extraction region candidate selected in step 1301.

If it is determined in step 1310 that all information sets present in the extraction region candidate selected in step 1301 have been selected in step 1303 (1310: Yes), the total matching degree has been calculated between every information set present in that extraction region candidate and every element. In this case, the document processing unit 112 identifies, for each information set, the element with which its total matching degree is maximum, and determines whether the elements identified for the information sets overlap (1311).

If it is determined in step 1311 that the elements identified for the information sets do not overlap (1311: No), the document processing unit 112 determines whether the maximum total matching degree of every information set is equal to or greater than a predetermined value (1312).

If it is determined in step 1312 that the maximum total matching degree of every information set is equal to or greater than the predetermined value (1312: Yes), the document processing unit 112 extracts each information set as the element with which its total matching degree is maximum (1313), and registers each extracted information set in the word dictionary 122 and the output DB 123 in association with that element (1314). The document processing unit 112 then determines whether all extraction region candidates extracted by the extraction region candidate extraction process have been selected in step 1301 (1315).
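Steps 1311 through 1313 can be sketched as a single function. The nested-dictionary layout of the total matching degrees, the function name `assign_elements`, and returning `None` for the two failure branches are assumptions made for illustration.

```python
def assign_elements(scores, threshold):
    """Pick the best element for each information set, or return None.

    scores: {info_set_id: {element_id: total_matching_degree}}
            (hypothetical layout of the results of step 1308).
    threshold: the predetermined value used in step 1312.
    """
    # For each information set, the element with the maximum total score.
    best = {s: max(by_elem, key=by_elem.get) for s, by_elem in scores.items()}

    # Step 1311: two information sets must not claim the same element.
    if len(set(best.values())) != len(best):
        return None

    # Step 1312: every maximum score must reach the predetermined value.
    if any(scores[s][e] < threshold for s, e in best.items()):
        return None

    # Step 1313: each information set is extracted as its best element.
    return best
```

A `None` result corresponds to the branches in which another paragraph table 400 is tried or the extraction region candidate is discarded.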

If it is determined in step 1315 that not all extraction region candidates extracted by the extraction region candidate extraction process have been selected in step 1301 (1315: No), the document processing unit 112 returns to step 1301 and newly selects one extraction region candidate to be processed from the extracted extraction region candidates.

If it is determined in step 1315 that all extraction region candidates extracted by the extraction region candidate extraction process have been selected in step 1301 (1315: Yes), the document processing unit 112 ends the matching degree calculation process.

If it is determined in step 1311 that the elements with which the information sets have their maximum total matching degree overlap (1311: Yes), or if it is determined in step 1312 that there is an information set whose maximum total matching degree is smaller than the predetermined value (1312: No), the information set selected in step 1303 matches none of the element tables 410 of the paragraph table 400 selected in step 1302. The document processing unit 112 therefore determines whether all paragraph tables 400 have been selected in step 1302 (1316).

If it is determined in step 1316 that not all paragraph tables 400 have been selected in step 1302 (1316: No), the document processing unit 112 selects a paragraph table 400 that has not yet been selected (1317), returns to step 1304, and selects an element table 410 associated with the paragraph table selected in step 1317.

If it is determined in step 1316 that all paragraph tables 400 have been selected in step 1302 (1316: Yes), there is no paragraph table 400 associated with an element table 410 that matches the information set selected in step 1303, and the extraction region candidate containing that information set is not similar to any paragraph table 400. The document processing unit 112 therefore proceeds to step 1315 and determines whether all extraction region candidates extracted by the extraction region candidate extraction process have been selected in step 1301.

FIG. 14 is an explanatory diagram of the process of dividing the entire region of the document data using the X-Y cut method in the document processing of this embodiment.

In step 614 shown in FIG. 6, the entire region of the document data shown in FIG. 3 is divided using the X-Y cut method. In FIG. 14, the entire region of the document data is divided into six regions, the first region through the sixth region.

The first region contains the character string "Document 2 Sport Book". The second region contains the character string "There are three types of this product. A list of products is shown below." The third region contains character strings relating to the handbag. The fourth region contains character strings relating to the boots. The fifth region contains character strings relating to the gloves. The sixth region contains character strings relating to the folding umbrella.

FIG. 15 is an explanatory diagram of the repeated part display screen 1500 of this embodiment.

The repeated part display screen 1500 displays the document data on which the paragraph designation reception process has been executed. In that document data, the paragraph region designated by the user (1501) and the element regions designated by the user (1502) are displayed enclosed in rectangles. In addition, in the document data displayed on the repeated part display screen 1500, the region of a repeated part extracted by the paragraph designation reception process (1511) is displayed enclosed in a rectangle, and the regions of the information sets extracted for the elements contained in that repeated part (1512) are also displayed enclosed in rectangles.

The repeated part display screen 1500 can also additionally accept the designation of the region of a repeated part that was not extracted (1521) and the designation of the regions containing the information sets that should be extracted as elements of that repeated part (1522).

In FIG. 15, the region relating to the handbag is designated as a paragraph, and the region relating to the boots is extracted as a repeated part. The region relating to the gloves should also have been extracted as a repeated part but was not. The user has therefore designated the region relating to the gloves as a repeated part, and has also designated the information sets that should be extracted as elements of that region.

When the repeated part display screen 1500 accepts the designation of the region of a repeated part and of the regions containing the information sets that should be extracted as elements of that region, the paragraph designation reception processing unit 111 registers these regions in the repetition table 124. Specifically, the paragraph designation reception processing unit 111 adds 1 to the value registered in the number of repetitions 501 of the list table 500 of the repetition table 124, and registers, in the pointer 502 to the paragraph table, a pointer to the paragraph table 510 in which the region of the designated repeated part is registered. It registers the upper-left coordinates of the region of the designated repeated part in the upper-left coordinates 401 of the paragraph table 510, and the lower-right coordinates of that region in the lower-right coordinates 402. It also divides the region of the designated repeated part using the X-Y cut method and registers the number of resulting regions in the number of partial regions 403. In addition, it registers, in the number of elements 404 of the paragraph table 510, the number of information sets designated as those that should be extracted as elements, and registers, in the pointer 405 to the element of the paragraph table 510, a pointer to the element table 520 in which the designated information sets are registered. Information indicating that the repeated part was extracted by accepting a designation from the user may also be registered in the paragraph table 510.

Next, the paragraph designation reception processing unit 111 registers, in the polygon coordinates 411 of the element table 520, a pointer to the coordinate table 530 in which the coordinates of each vertex of the region of the designated information set are registered. It also registers the center coordinates of the region of the designated information set in the center coordinates 412 of the element table 520. In addition, it registers the notation format of the designated information set in the notation format 413 of the element table 520, and registers the attribute of that information set in the attribute 414 of the element table 520. Information indicating that the region of the information set was extracted by accepting a designation from the user may also be registered in the element table 520.

As a result, the user can check whether the regions that should be extracted as repeated parts have been extracted, and if such a region has not been extracted, the user can additionally designate it as a repeated part.

The repeated part display screen 1500 may include a word dictionary display screen that displays the information sets registered in the word dictionary 122. On the word dictionary display screen, information sets registered through designation by the user and information sets registered through extraction as repeated parts may be displayed so that they can be distinguished from each other.

FIG. 16 is an explanatory diagram of the processing result display screen 1600 of this embodiment.

The processing result display screen 1600 displays the document data on which document processing has been executed and, within that document data, the extraction regions extracted by the document processing and the regions of the information sets extracted as elements.

On the processing result display screen 1600 shown in FIG. 16, a first extraction region 1610, a second extraction region 1620, a third extraction region 1630, and a fourth extraction region 1640 are each displayed enclosed in a rectangle. In each extraction region, the regions 1611 to 1613 of the information sets extracted as the elements designated by the user are displayed enclosed in rectangles.

For example, suppose that, as shown in FIG. 2, the paragraph designation reception processing has registered “055-3223”, “015-0001”, and “015-0149” in element 1 of the word dictionary 122 and “handbag”, “boots”, and “gloves” in element 2. In that case, “331-0120”, which is located at the position of element 1 in the fourth extraction region 1640, is not registered in element 1 of the word dictionary 122, and “folding umbrella”, which is located at the position of element 2, is not registered in element 2 of the word dictionary 122. The total matching degree of the fourth extraction region 1640 is therefore lower than those of the first extraction region 1610 to the third extraction region 1630.

The document processing unit 112 may display with dotted lines the rectangle enclosing the fourth extraction region 1640, whose total matching degree is lower than those of the first extraction region 1610 to the third extraction region 1630, and the rectangles enclosing the information set regions 1611 to 1613 included in the fourth extraction region 1640, thereby notifying the user that the total matching degree of the fourth extraction region 1640 is lower than those of the other extraction regions. Changing the display mode of the rectangles enclosing the extraction regions and information sets according to the total matching degree in this way lets the user check the total matching degree of each extraction region at a glance. The change in display mode is not limited to dotted lines; for example, the color or the line thickness may be changed according to the total matching degree.
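One possible way to derive the display mode from the total matching degree is sketched below. The function name `border_styles`, the fixed threshold of 0.6, and the example scores are assumptions for illustration; the description only states that the display mode is changed according to the total matching degree.

```python
from typing import Dict


def border_styles(total_matching: Dict[str, float], threshold: float = 0.6) -> Dict[str, str]:
    """Map each extraction region to a border style.

    Regions whose total matching degree falls below the (assumed) threshold
    are drawn with a dotted border; the others keep a solid border.
    """
    return {
        region: ("solid" if score >= threshold else "dotted")
        for region, score in total_matching.items()
    }


# Hypothetical scores for the four extraction regions of FIG. 16.
scores = {"1610": 0.9, "1620": 0.9, "1630": 0.85, "1640": 0.4}
styles = border_styles(scores)
```

The same mapping could return a color or a line thickness instead of a line style, in line with the alternatives mentioned above.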

Note that, on the processing result display screen 1600, when a region corresponding to a paragraph that should be extracted has not been extracted, designation of the region corresponding to that paragraph, and of the regions of the information sets to be extracted within it, may be accepted from the user. When the document processing unit 112 registers an information set extracted from such a region in the word dictionary 122 and the output DB 123, it may also register, in the word dictionary 122 and the output DB 123, information indicating that the information set was extracted by accepting a designation from the user.

Note that the processing result display screen 1600 may include a word dictionary display screen that displays the information sets registered in the word dictionary 122. On the word dictionary display screen, information sets registered by user designation and information sets registered as a result of being extracted as a repeated part may be displayed so that they can be distinguished from each other.

In the present embodiment, the X-Y cut method has been described as the method of dividing the region of the document data into at least one region, but other methods may be used. For example, morphological analysis may be performed with the text present in the region of the document data regarded as a single sentence, and the region of the document data may be divided into at least one region based on the result of the morphological analysis.
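For readers unfamiliar with the X-Y cut method, a simplified sketch is given below. It assumes the page is available as a binary grid (1 for ink, 0 for blank) and cuts at every fully blank row or column; practical implementations usually apply a minimum gap width, which is omitted here. The function names are hypothetical and this is not the implementation of the embodiment.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, bottom, left, right), end-exclusive


def _runs(profile: List[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs of consecutive non-blank positions."""
    runs, start = [], None
    for i, filled in enumerate(profile):
        if filled and start is None:
            start = i
        elif not filled and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def xy_cut(grid: List[List[int]], top: int, bottom: int, left: int, right: int) -> List[Box]:
    """Recursively split a region at blank rows, then at blank columns."""
    rows = [any(grid[y][left:right]) for y in range(top, bottom)]
    row_runs = _runs(rows)
    if not row_runs:
        return []  # the region is entirely blank
    if row_runs != [(0, bottom - top)]:
        boxes: List[Box] = []
        for start, end in row_runs:  # horizontal cuts
            boxes += xy_cut(grid, top + start, top + end, left, right)
        return boxes
    cols = [any(grid[y][x] for y in range(top, bottom)) for x in range(left, right)]
    col_runs = _runs(cols)
    if col_runs != [(0, right - left)]:
        boxes = []
        for start, end in col_runs:  # vertical cuts
            boxes += xy_cut(grid, top, bottom, left + start, left + end)
        return boxes
    return [(top, bottom, left, right)]  # no blank row or column left to cut at
```

Each recursive call works on a strictly smaller region, so the recursion terminates once a region contains no blank row or column.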

In the present embodiment, the storage area of the document processing system stores a learning dictionary (the block dictionary 121 and the word dictionary 122) in which are registered the physical structure of an element (the center coordinates 412 of the element table 410), an element being a region that is included in a paragraph to be extracted from document data and in which information to be extracted exists, and the information set existing in the region of that element (the word dictionary 122). The processor acquires an information set existing in at least one region of the input document data (FIG. 13, 1002); calculates a physical structure matching degree indicating the degree of matching between the region in which the acquired information set exists and the physical structure of the element registered in the learning dictionary (FIG. 13, 1305); calculates an information set matching degree indicating the degree of matching between the acquired information set and the information set registered in the learning dictionary (FIG. 13, 1306); determines, based on the calculated physical structure matching degree and information set matching degree, whether the region is a paragraph to be extracted (FIG. 13, 1311 and 1312); when the region is a paragraph to be extracted, registers the acquired information set in the learning dictionary as an information set existing in the region of the element corresponding to the information set for which the physical structure matching degree and the information set matching degree were calculated (FIG. 13, 1314); and stores, in the storage area, output information in which the acquired information set is associated with the element of the information set for which the physical structure matching degree and the information set matching degree were calculated (FIG. 13, 1314).

As a result, information sets are extracted from the document data based on the physical structure matching degree and the information set matching degree of each element, and the information sets of the elements are accumulated in the learning dictionary each time the extraction processing is executed on document data. Information sets can therefore be extracted for each element without defining in advance, for each document, an information set such as a keyword that serves as the key for extraction.
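The accumulation effect can be illustrated with a small sketch. The dictionary layout (element name mapped to a set of strings), the function names, and the binary 0/1 information set matching degree are assumptions for illustration; the registered strings are taken from the example given for FIG. 2.

```python
from typing import Dict, Set

# Hypothetical in-memory form of the word dictionary, seeded with the FIG. 2 example.
word_dictionary: Dict[str, Set[str]] = {
    "element1": {"055-3223", "015-0001", "015-0149"},
    "element2": {"handbag", "boots", "gloves"},
}


def info_set_match(dictionary: Dict[str, Set[str]], element: str, text: str) -> float:
    """1.0 if the text is already registered for the element, else 0.0 (assumed scoring)."""
    return 1.0 if text in dictionary.get(element, set()) else 0.0


def register(dictionary: Dict[str, Set[str]], element: str, text: str) -> None:
    """Register an extracted information set so that later documents can match it."""
    dictionary.setdefault(element, set()).add(text)
```

With this layout, “331-0120” scores 0.0 for element 1 the first time it is seen; once the region containing it has been judged a paragraph to be extracted and `register` has been called, the same string scores 1.0 in subsequent documents.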

The document processing system also calculates a total matching degree that is a weighted sum of the calculated physical structure matching degree and the calculated information set matching degree (FIG. 13, 1308), and determines whether the region is a paragraph to be extracted based on the calculated total matching degree (FIG. 13, 1312). Consequently, even if one of the matching degrees is low, the region can be determined to be a paragraph to be extracted as long as the other matching degree is high.
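A minimal sketch of the weighted sum and the resulting determination follows. The weights of 0.5 each and the threshold of 0.5 are placeholder values chosen for illustration, not values taken from the embodiment, and the function names are hypothetical.

```python
def total_match(physical: float, info_set: float,
                w_physical: float = 0.5, w_info_set: float = 0.5) -> float:
    """Weighted sum of the physical structure and information set matching degrees."""
    return w_physical * physical + w_info_set * info_set


def is_target_paragraph(physical: float, info_set: float, threshold: float = 0.5) -> bool:
    """Judge the region to be a paragraph to be extracted when the total reaches the threshold."""
    return total_match(physical, info_set) >= threshold
```

Under these placeholder values, a region with a low physical structure matching degree of 0.1 but an information set matching degree of 1.0 reaches a total of 0.55 and is still judged a paragraph to be extracted, which is the behavior described above.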

The document processing system also outputs screen data for displaying a processing result display screen (FIG. 16) that shows, in the input document data, the regions determined to be paragraphs to be extracted and the regions of the information sets acquired from those regions, and changes the display mode of these regions according to the total matching degree. By checking the processing result display screen, the user can immediately see the regions and information set regions extracted from the document data, together with their total matching degrees.

The processor can also accept, via the processing result display screen, designation of the region of a paragraph to be extracted and of the regions of its elements. Thus, when a paragraph or an information set that should be extracted has not been extracted, the user can designate these regions so that the information set is reliably extracted.

The physical structure of an element registered in the learning dictionary includes the position of the element within the paragraph, and at least one of the notation format and the attribute of the information set existing in the region of the element is registered in the learning dictionary. The processor calculates the position, within the region, of the region in which the acquired information set exists; calculates the distance between that position and the position of the element within the paragraph registered in the learning dictionary; calculates the physical structure matching degree based on the calculated distance (FIG. 13, 1305); calculates a notation format and attribute matching degree based on whether at least one of the notation format and the attribute of the acquired information set matches at least one of the notation format and the attribute of the information set existing in the region of the element registered in the learning dictionary (FIG. 13, 1307); and calculates a total matching degree that is a weighted sum of the calculated physical structure matching degree, the calculated information set matching degree, and the calculated notation format and attribute matching degree (FIG. 13, 1308).

As a result, the total matching degree can be calculated from the physical structure matching degree, which is based on the distance between the position of the region in which the information set exists and the position of the element within the paragraph registered in the learning dictionary, the information set matching degree, and the notation format and attribute matching degree, which improves extraction accuracy.
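The three-term calculation can be sketched as follows. The mapping from distance to matching degree (`1 / (1 + distance / scale)`), the scoring of the notation format and attribute as the fraction of the two that match, the weights, and all function names are assumptions for illustration; the description states only that the matching degrees are based on the distance and on whether the notation format and attribute match.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def physical_structure_match(position: Point, registered_position: Point,
                             scale: float = 50.0) -> float:
    """Map the distance between the two positions into (0, 1]; 1.0 means identical positions."""
    distance = math.dist(position, registered_position)
    return 1.0 / (1.0 + distance / scale)


def format_attribute_match(fmt: str, attr: str,
                           registered_fmt: str, registered_attr: str) -> float:
    """Fraction of the two properties (notation format, attribute) that match."""
    hits = int(fmt == registered_fmt) + int(attr == registered_attr)
    return hits / 2.0


def total_match_3(physical: float, info_set: float, fmt_attr: float,
                  weights: Tuple[float, float, float] = (0.4, 0.4, 0.2)) -> float:
    """Weighted sum of the three matching degrees."""
    w_physical, w_info_set, w_fmt_attr = weights
    return w_physical * physical + w_info_set * info_set + w_fmt_attr * fmt_attr
```

With a scheme of this kind, a string that is not yet registered in the word dictionary but sits at the registered position and has the registered notation format still receives a nonzero total matching degree.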

The physical structure of the paragraph to be extracted from document data is also registered in the learning dictionary (FIG. 4, paragraph table 400). The processor selects, from the at least one region of the input document data, a region similar to the physical structure of the paragraph registered in the learning dictionary (FIG. 6, 615), and acquires the information set existing in the selected region. Because information sets are extracted only from regions whose physical structure is similar to that of the paragraph to be extracted, the processing load of the document processing system can be reduced.
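The pre-selection step can be pictured with the sketch below. It assumes, purely for illustration, that the physical structure of a paragraph is summarized by the width and height of its region and that "similar" means both lie within a relative tolerance; the similarity criterion actually used is not stated in this passage, and the function names are hypothetical.

```python
from typing import List, Tuple

Size = Tuple[float, float]  # (width, height) of a region


def is_similar(region: Size, paragraph: Size, tolerance: float = 0.2) -> bool:
    """True when width and height both lie within the relative tolerance (assumed criterion)."""
    region_w, region_h = region
    para_w, para_h = paragraph
    return (abs(region_w - para_w) <= tolerance * para_w
            and abs(region_h - para_h) <= tolerance * para_h)


def select_candidates(regions: List[Size], paragraph: Size,
                      tolerance: float = 0.2) -> List[Size]:
    """Keep only regions similar to the registered paragraph; the rest are skipped."""
    return [r for r in regions if is_similar(r, paragraph, tolerance)]
```

Only the regions returned by `select_candidates` would then go through the matching degree calculations, which is where the reduction in processing load comes from.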

When learning dictionary generation document data for generating the learning dictionary is input (FIG. 7, 702), the processor accepts from the user designation of the region of a paragraph to be extracted within the input learning dictionary generation document data (FIG. 7, 705); accepts from the user designation of the regions of elements within the designated paragraph region (FIG. 7, 707); registers the designated paragraph region in the learning dictionary as the physical structure of the paragraph to be extracted (FIG. 7, 706); registers the designated element regions in the learning dictionary as the physical structures of the elements (FIG. 7, 708 and 709); and registers the information sets existing in the designated element regions in the learning dictionary as the information sets of the elements (FIG. 7, 710).

Thus, the user can generate the learning dictionary simply by designating the region of one paragraph to be extracted and the regions of the elements within that paragraph.

The processor also selects, from the at least one region of the input learning dictionary generation document data, a region similar to the physical structure of the paragraph registered in the learning dictionary (FIG. 10, 1001); acquires the information set existing in the selected region (FIG. 10, 1002); compares the region in which the acquired information set exists with the physical structure of the element registered in the learning dictionary to determine whether the selected region is a repeated part of the designated paragraph region (FIG. 10, 1005 to 1007 and 1011); and, when the selected region is a repeated part, registers the information set acquired from the selected region in the learning dictionary in association with the element used for the comparison, and stores, in the storage area, output information in which that information set is associated with the element used for the comparison (FIG. 8, 809).

Thus, simply by designating the region of one paragraph to be extracted and the regions of the elements within that paragraph, the user can have the repeated parts of the paragraph region extracted and the information sets of those repeated parts registered in the learning dictionary. Because more information sets are registered in the learning dictionary, extraction accuracy can be improved.

The processor outputs screen data for displaying a repeated part display screen that shows, in the input learning dictionary generation document data, the designated paragraph region, the designated element regions, the regions extracted as repeated parts, and the regions of the information sets acquired from the regions extracted as repeated parts (FIG. 8, 807). Via the repeated part display screen, the processor can accept designation of a region that should be extracted as a repeated part and of the element regions from which information sets are to be extracted within that region.

Thus, when a repeated part that should be extracted, or the region of an information set that should be extracted from it, has not been extracted, the user can designate these regions so that the information set is reliably extracted.

Note that the present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments have been described in detail to explain the present invention clearly, and the invention is not necessarily limited to configurations having all of the described elements. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations can be added, deleted, or substituted.

Each of the configurations, functions, processing units, processing means, and the like described above may be realized partly or wholly in hardware, for example by designing them as integrated circuits.

The configurations, functions, and the like described above may also be realized in software, with a processor interpreting and executing programs that implement the respective functions.

Information such as the programs, tables, and files that implement each function can be stored in a memory, in a recording device such as a hard disk or an SSD (Solid State Drive), or on a recording medium such as an IC (Integrated Circuit) card, an SD card, or a DVD (Digital Versatile Disc).

The control lines and information lines shown are those considered necessary for the explanation, and not all the control lines and information lines of a product are necessarily shown. In practice, almost all of the components may be considered to be connected to one another.

Claims

A document processing system that extracts an information set from input document data, comprising:
a processor and a storage area,
wherein the storage area stores a learning dictionary in which are registered a physical structure of an element, the element being a region that is included in a paragraph to be extracted from the document data and in which information to be extracted exists, and an information set existing in the region of the element, and
wherein the processor:
acquires an information set existing in at least one region of the input document data;
calculates a physical structure matching degree indicating a degree of matching between the region in which the acquired information set exists and the physical structure of the element registered in the learning dictionary;
calculates an information set matching degree indicating a degree of matching between the acquired information set and the information set registered in the learning dictionary;
determines, based on the calculated physical structure matching degree and information set matching degree, whether the region is the paragraph to be extracted;
when the region is the paragraph to be extracted, registers the acquired information set in the learning dictionary as an information set existing in the region of the element corresponding to the information set for which the physical structure matching degree and the information set matching degree were calculated; and
stores, in the storage area, output information in which the acquired information set is associated with the element of the information set for which the physical structure matching degree and the information set matching degree were calculated.

The document processing system according to claim 1,
wherein the processor:
calculates a total matching degree that is a weighted sum of the calculated physical structure matching degree and the calculated information set matching degree; and
determines whether the region is the paragraph to be extracted based on the calculated total matching degree.

The document processing system according to claim 2,
wherein the processor:
outputs screen data for displaying a processing result display screen that shows, in the input document data, the region determined to be the paragraph to be extracted and the region of the information set acquired from that region; and
changes the display mode of the region determined to be the paragraph to be extracted and of the region of the information set acquired from that region according to the total matching degree.

The document processing system according to claim 3,
The document processing system, wherein the processor is capable of accepting designation of the paragraph area to be extracted and the element area via the processing result display screen.

The document processing system according to claim 2,
wherein the physical structure of the element registered in the learning dictionary includes the position of the element within the paragraph,
wherein at least one of a notation format and an attribute of the information set existing in the region of the element is registered in the learning dictionary, and
wherein the processor:
calculates the position, within the region, of the region in which the acquired information set exists;
calculates the distance between the position of the region in which the information set exists and the position of the element within the paragraph registered in the learning dictionary;
calculates the physical structure matching degree based on the calculated distance;
calculates a notation format and attribute matching degree based on whether at least one of the notation format and the attribute of the acquired information set matches at least one of the notation format and the attribute of the information set existing in the region of the element registered in the learning dictionary; and
calculates a total matching degree that is a weighted sum of the calculated physical structure matching degree, the calculated information set matching degree, and the calculated notation format and attribute matching degree.

The document processing system according to claim 1,
wherein a physical structure of the paragraph to be extracted from the document data is registered in the learning dictionary, and
wherein the processor:
selects, from the at least one region of the input document data, a region similar to the physical structure of the paragraph registered in the learning dictionary; and
acquires the information set existing in the selected region.

The document processing system according to claim 6,
wherein the processor:
receives, as input, learning dictionary generation document data for generating the learning dictionary;
accepts from a user designation of the region of the paragraph to be extracted within the input learning dictionary generation document data;
accepts from the user designation of the region of the element within the designated region of the paragraph to be extracted;
registers the designated region of the paragraph to be extracted in the learning dictionary as the physical structure of the paragraph to be extracted;
registers the designated region of the element in the learning dictionary as the physical structure of the element; and
registers the information set existing in the designated region of the element in the learning dictionary as the information set of the element.

The document processing system according to claim 7,
wherein the processor:
selects, from at least one region of the input learning dictionary generation document data, a region similar to the physical structure of the paragraph registered in the learning dictionary;
acquires the information set existing in the selected region;
compares the region in which the acquired information set exists with the physical structure of the element registered in the learning dictionary to determine whether the selected region is a repeated part of the designated region of the paragraph to be extracted;
when the selected region is the repeated part, registers the information set acquired from the selected region in the learning dictionary in association with the element used for the comparison; and
stores, in the storage area, output information in which the information set acquired from the selected region is associated with the element used for the comparison.

The document processing system according to claim 8,
wherein the processor:
outputs screen data for displaying a repeated part display screen that shows, in the input learning dictionary generation document data, the designated region of the paragraph, the designated region of the element, the region extracted as the repeated part, and the region of the information set acquired from the region extracted as the repeated part; and
is capable of accepting, via the repeated part display screen, designation of a region that should be extracted as the repeated part and designation of the region of an element from which an information set is to be extracted within that region.

A document processing method for extracting an information set from input document data using a computer having a processor and a storage area,
wherein the storage area stores a learning dictionary in which are registered a physical structure of an element, the element being a region that is included in a paragraph to be extracted from the document data and in which information to be extracted exists, and an information set existing in the region of the element,
the document processing method comprising:
acquiring, by the processor, an information set existing in at least one region of the input document data;
calculating, by the processor, a physical structure matching degree indicating a degree of matching between the region in which the acquired information set exists and the physical structure of the element registered in the learning dictionary;
calculating, by the processor, an information set matching degree indicating a degree of matching between the acquired information set and the information set registered in the learning dictionary;
determining, by the processor, whether the divided region is the paragraph to be extracted based on the calculated physical structure matching degree and information set matching degree;
when the divided region is the paragraph to be extracted, registering, by the processor, the acquired information set in the learning dictionary as an information set existing in the region of the element corresponding to the information set for which the physical structure matching degree and the information set matching degree were calculated; and
storing, by the processor, in the storage area, output information in which the acquired information set is associated with the element of the information set for which the physical structure matching degree and the information set matching degree were calculated.