JP2005208981A

JP2005208981A - Featured value extraction device, method for extracting featured value, and document filing device

Info

Publication number: JP2005208981A
Application number: JP2004015511A
Authority: JP
Inventors: Kagenori Nagao; 景則長尾; Hitoshi Okamoto; 仁岡本; Shinichi Yada; 伸一矢田
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2004-01-23
Filing date: 2004-01-23
Publication date: 2005-08-04

Abstract

PROBLEM TO BE SOLVED: To solve the problem that, when pixel values are accumulated as they are in the projection direction, the ground color of the paper, the background color of the original, image components inserted into the document, ruled line components, and the like are also accumulated, so that the character components in the projected waveform cannot be clearly expressed and feature values are not correctly extracted.

SOLUTION: In a document filing device that extracts feature values such as character size and line spacing on a character image by using a projected waveform and manages document data for each document on the basis of the extracted feature values, whether each pixel on a projection line lies in a character area is estimated when the projected waveform is formed, and whether its pixel value is accumulated is controlled in accordance with the estimated result. This removes the influence of disturbances such as the ground color of the paper, the background color of the original, image components inserted into the document, and ruled line components. Feature values are extracted by using the projected waveform generated with the influence of disturbances removed, and the break between documents is determined from the extracted feature values.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a feature amount extraction device, a feature amount extraction method, and a document filing device. More particularly, it relates to a feature amount extraction device and method that extract feature amounts such as document layout, character size, and line spacing by using a projection waveform of a document image, and to a document filing device that uses the feature amount extraction device to determine the break positions between documents and manages document data for each document.

In recent years, when a number of documents, each consisting of one or more paper originals, are to be digitized efficiently, it has become common to read the originals continuously with a scanner device that has an automatic paper feed function. In this case, to manage the read original image data (document data) for each document, the break between one document and the next must be detected by some method.

To detect the break between documents, two approaches have conventionally been used. In one, a character recognition area set in advance is cut out from the image data read by the scanner device and subjected to character recognition, and the document break is determined from the recognition result (see, for example, Patent Document 1). In the other, a plurality of documents are read in a batch by a scanner device having an automatic paper feed function, feature amounts of the read original images are calculated, and the break between documents is determined from the calculated feature amounts (see, for example, Patent Document 2).

Both of these prior arts can determine document breaks automatically, without inserting a separator sheet (for example, a blank sheet) between documents in advance and without modifying the original that marks the break, so the burden on the user is greatly reduced. In particular, the prior art of Patent Document 2 places no restriction on the format of the target originals and can handle a wider variety of documents. By contrast, the prior art of Patent Document 1 can handle only originals of a specific format.

The prior art of Patent Document 2 cites, as examples of feature amounts on which the break determination is based, the hues used in the image data, the screen ruling, the layout of the original, and the typesetting direction of the characters. However, many general office documents are text-only originals containing no images, so feature amounts such as image hue and screen ruling often cannot be used to determine the break position. In addition, since horizontal writing is the norm, the typesetting direction is also often of no help in determining the document break position.

On the other hand, feature amounts such as document layout, character size, and line spacing generally differ from document to document, even among office documents. Using these feature amounts to determine the document break position is therefore an effective approach. A common way to extract feature amounts such as document layout, character size, and line spacing from a document image is to form projection waveforms of the document image in the vertical and horizontal directions and to use those waveforms.

Specifically, in one method the document image is binarized and projected in the vertical and horizontal directions, character areas and blank areas are determined by thresholding the projection waveforms, and the character size, line spacing, and the like are detected from the result (see, for example, Patent Document 3). In another method, the document image is binarized and projected in the horizontal direction to cut out the lines, each separated line image is projected in the vertical direction to separate the individual characters, and each character is then projected in the horizontal direction again to correct the character separation, whereby the character size is detected (see, for example, Patent Document 4).

Patent Document 1: Japanese Patent Laid-Open No. 10-21380
Patent Document 2: Japanese Patent Laid-Open No. 2002-24258
Patent Document 3: Japanese Patent Publication No. 7-111738
Patent Document 4: Japanese Patent Laid-Open No. 5-89283

As described above, in both of the prior arts of Patent Documents 3 and 4, which use a projection waveform, the read document image is first binarized as shown in FIG. 12, a projection waveform of the binarized image is obtained, and feature amounts are extracted from that waveform. The processing result is therefore strongly affected by the binarization threshold. However, the appropriate binarization threshold depends heavily on the ground color of the paper, the background color of the original, the presence or absence of images inserted in the document, and the color and density of the characters, so it is difficult to obtain stable processing results.

As shown in FIG. 13, taking the projection of the multi-valued image without binarization avoids the binarization level problem described above. However, when the ground color of the paper or the background color of the original is dark, when an image is inserted in the document, when the character density is low, or when a line is short, the difference between the projection of the character portion and the projection of the background portion becomes unclear. Furthermore, for a document image containing a large amount of low-density noise, binarization classifies the noise as white level and thereby removes its influence, whereas a multi-valued projection accumulates the noise components as well, which also makes processing difficult.

In other words, in the prior art that uses a projection waveform to detect feature amounts such as character size and line spacing on a character image, accumulating the pixel values as they are in the projection direction also accumulates the ground color of the paper, the background color of the original, image components inserted in the document, ruled line components, and the like, so the character components in the projection waveform become indistinct. When the character components are indistinct and it is unclear which part of the projection waveform corresponds to the character portion, feature amounts such as character size and line spacing cannot be detected correctly from the projection waveform.

The present invention has been made in view of the above problems. Its object is to provide a feature amount extraction device and a feature amount extraction method that can correctly extract feature amounts such as character size and line spacing without being affected by disturbances such as the ground color of the paper, the background color of the original, image components inserted in the document, and ruled line components, as well as a document filing device using the feature amount extraction device.

To achieve the above object, in the present invention, when a projection waveform is formed, it is estimated whether each pixel on a projection line lies in a character area, and the projection waveform is formed by accumulating pixel values only for the pixels estimated to lie in the character area. A feature amount is then extracted by comparing the formed projection waveform with a predetermined threshold, and the break between documents is determined on the basis of the extracted feature amount.

When feature amounts such as character size and line spacing on a character image are extracted using a projection waveform, whether a pixel value is accumulated is controlled according to the estimate of whether that pixel on the projection line lies in a character area. This eliminates the influence of disturbances such as the ground color of the paper, the background color of the original, image components inserted in the document, and ruled line components. The projection waveform generated with the influence of disturbances eliminated is then used to extract feature amounts such as character size and line spacing, and the break between documents is determined from the extracted feature amounts.

According to the present invention, when a projection waveform is formed, it is estimated whether each pixel on the projection line lies in a character area, and whether the pixel value is accumulated is controlled according to the estimation result. Because this eliminates the influence of disturbances such as the ground color of the paper, the background color of the original, image components inserted in the document, and ruled line components, feature amounts such as character size and line spacing can be extracted correctly, and the break between documents can be determined correctly.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing a configuration example of a document filing device according to an embodiment of the present invention. As is apparent from FIG. 1, the document filing device according to this embodiment includes a document input unit 11, a feature amount extraction unit 12, a similarity evaluation unit 13, a document storage unit 14, a document delimiter unit 15, and a document output unit 16, and these components are connected to one another via a bus line 17. In the document filing device having this configuration, the feature amount extraction unit 12 is the most characteristic part of the present invention.

The document input unit 11 acquires document data from an input document and inputs it as document data to be registered in the document filing device. Examples of the input document data include multi-valued image data obtained by scanning printed matter and multi-valued image data captured by a digital camera. Correspondingly, the document input unit 11 is, for example, a scanner device equipped with an ADF (Auto Document Feeder) and its control means, or a memory reader device that successively retrieves images stored in the memory (card) of a digital camera and its control means.

The feature amount extraction unit 12 extracts, from the document data (multi-valued image data) input by the document input unit 11, a quantity indicating a feature unique to that document data (hereinafter referred to as a "feature amount"). As shown in FIG. 2, the feature amount extraction unit 12 of this example forms projection waveforms by accumulating the pixel values of each piece of document data in the horizontal and vertical directions, and extracts from these waveforms, for example, n values each of the height h and width w of the characters that represent the document data.

As stated in the description of the prior art, if the pixel values of the document image are accumulated as they are, the projection waveform is affected by the ground color of the paper, the background color of the original, images in the document, the character density, the line length, and so on (see FIG. 13). To extract the features relating to the character height h and width w stably from the projection waveform, the influence of components other than characters should be eliminated as far as possible. Therefore, when forming a projection waveform, the feature amount extraction unit 12 of this example estimates, for each pixel on the projection line, whether that pixel lies in a character area, and forms the projection waveform by accumulating the pixel values only of the pixels estimated to lie in a character area.

FIG. 3 shows an example of the configuration of part of a document image. The document image shown in FIG. 3 includes a solid-filled portion 32 and a halftone image portion 33 in addition to the region of the character portion 31. Therefore, the projection waveform p(y) obtained by simply accumulating pixel values includes the influence of the paper ground color, the solid-filled portion 32, and the halftone image portion 33. The projection waveform p(y) can be expressed by the following equation (1), where f(x, y) represents the pixel value at position (x, y).

$$p(y) = \sum_{x} f(x, y) \tag{1}$$
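As an illustration of equation (1), the following is a minimal sketch in Python. It assumes the image is held as a list of rows of integer pixel values in which larger values stand for darker pixels; that representation, the function name `projection`, and the sample data are assumptions made for illustration and do not come from the specification.

```python
def projection(image):
    """Return p(y) = sum over x of f(x, y) for an image given as a list of rows."""
    return [sum(row) for row in image]


# Hypothetical 3-row image; each inner list is one projection line (fixed y).
image = [
    [0, 0, 0, 0],  # blank line
    [9, 1, 9, 1],  # line with strong dark/light alternation
    [2, 2, 2, 2],  # uniformly tinted line (e.g. a background color)
]
p = projection(image)
```

Note that the uniformly tinted line still contributes to p(y), which is the disturbance the following paragraphs set out to remove.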

Here, properties specific to the character portion 31 are used to eliminate the influence of the solid-filled portion 32 and the halftone image portion 33 from the projection waveform p(y). The feature amount extraction unit 12 of this example determines the region of the character portion 31 by focusing on the dispersion of the pixel values on the projection line, and uses the variance in particular as the measure of dispersion.

FIG. 4 shows the pixel values f(x, a) on the projection line given by the straight line y = a. Computing, for each position x, the variance v(x) of the pixel values f(x, a) within the interval [x − A/2, x + A/2] gives the result shown in FIG. 5. Here, A is a predetermined interval width.
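The windowed variance v(x) described here might be computed as in the sketch below. The specification does not say how the interval is handled at the ends of the line or which variance estimator is used; clipping the window at the line boundaries and using the population variance are assumptions, as is the function name `local_variance`.

```python
def local_variance(line, A):
    """Variance of the pixel values in [x - A/2, x + A/2] for every x on one line.

    The window is clipped at the ends of the line (an assumption; the
    specification does not describe boundary handling).
    """
    half = A // 2
    variances = []
    for x in range(len(line)):
        window = line[max(0, x - half): x + half + 1]
        mean = sum(window) / len(window)
        variances.append(sum((value - mean) ** 2 for value in window) / len(window))
    return variances


flat = local_variance([5, 5, 5, 5], 2)             # solid-filled region
alternating = local_variance([0, 10, 0, 10, 0], 2)  # halftone-like region
```

A uniform run of pixels gives zero variance everywhere, while rapidly alternating values give a large variance at every position.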

As can be seen from FIG. 5, the variance of the solid-filled portion 32 is small except at both ends of the region, the variance of the halftone image portion 33 is large, and the variance of the character portion 31 lies between those of the solid-filled portion 32 and the halftone image portion 33. Accordingly, if only the pixel values in regions having a medium variance are selectively accumulated from the distribution of pixel values on the projection line, the influence of the solid-filled portion 32 and the halftone image portion 33 is eliminated and a projection waveform p1(y) that better reflects the character components is obtained, as shown in FIG. 6.

The projection waveform p1(y) obtained in this way can be expressed by the following equation (2), where v(x) is the variance computed on the projection line at y.

$$p_{1}(y) = \sum_{x} \mathrm{Th}\bigl(v(x)\bigr) \cdot f(x, y) \tag{2}$$

In equation (2), Th(v) is a threshold function.

The threshold function Th(v) selects only the pixels estimated to lie in a character area, as in the following equation (3). In equation (3), vl and vh are the thresholds corresponding to the range of variance values in the region of the character portion 31, as shown in FIG. 5.

$$\mathrm{Th}(v) = \begin{cases} 1 & (v_{l} \le v \le v_{h}) \\ 0 & (v < v_{l} \ \text{or} \ v_{h} < v) \end{cases} \tag{3}$$
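Equations (2) and (3) can be put together as in the following sketch, which accumulates a pixel value only when the local variance at that pixel falls between the two thresholds. The window handling at the line ends, the population variance, the function names, and the sample image and threshold values are all assumptions chosen for illustration.

```python
def local_variance(line, A):
    """Population variance in a window of half-width A // 2 (clipped at the ends)."""
    half = A // 2
    variances = []
    for x in range(len(line)):
        window = line[max(0, x - half): x + half + 1]
        mean = sum(window) / len(window)
        variances.append(sum((value - mean) ** 2 for value in window) / len(window))
    return variances


def th(v, v_l, v_h):
    """Threshold function of equation (3): 1 inside [v_l, v_h], otherwise 0."""
    return 1 if v_l <= v <= v_h else 0


def selective_projection(image, A, v_l, v_h):
    """Projection waveform p1(y) of equation (2) for an image given as a list of rows."""
    waveform = []
    for row in image:
        variances = local_variance(row, A)
        waveform.append(sum(th(v, v_l, v_h) * f for v, f in zip(variances, row)))
    return waveform


# Hypothetical rows: solid fill, halftone-like alternation, and a character-like stroke.
image = [
    [8, 8, 8, 8, 8, 8],
    [0, 10, 0, 10, 0, 10],
    [0, 0, 6, 6, 0, 0],
]
p1 = selective_projection(image, A=2, v_l=1, v_h=20)
```

With these sample values the solid row (variance 0) and the alternating row (variance above 20) contribute nothing, and only the pixels around the stroke in the third row are accumulated.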

FIG. 7 is a block diagram showing an example of a specific configuration of the projection waveform forming unit in the feature amount extraction unit 12. The projection waveform forming unit of this example includes a dispersion detection unit 21, a threshold processing unit 22, and an accumulation processing unit 23, and is configured to process the input image data line by line.

The dispersion detection unit 21 detects the dispersion of the image data input line by line. The threshold processing unit 22 compares the detection result of the dispersion detection unit 21 with the thresholds vl and vh (see FIG. 5), and outputs logic "1" if the detected dispersion lies between the thresholds vl and vh, and logic "0" otherwise. The accumulation processing unit 23 accumulates the pixel values for which the output of the threshold processing unit 22 is logic "1", and outputs the result as the projection waveform p1(y).

The projection waveform forming unit of this example has been described using a binary accumulation scheme, in which the pixel values are accumulated only for pixels estimated to lie in a character area and are not accumulated for the other pixels. It is also possible to weight each pixel value according to how likely the pixel is to lie in a character area and then accumulate the weighted values.

By comparing the projection waveform p1(y) obtained in this way with a suitable threshold ty as shown in FIG. 6, the regions where the waveform is larger than the threshold ty can be determined to be black regions and the regions where it is smaller to be white regions. The same procedure is applied to the projection lines in the vertical direction: a projection waveform is taken in the vertical direction and the white and black regions are determined from it.

Then, from the determination of white and black regions, a binary data sequence is output as the binarization result of the projection waveform data, as shown in FIG. 8, and the lengths of all runs of consecutive black regions are obtained. For example, the n values centered on the peak of the frequency of occurrence of these lengths are taken as the feature amounts of the horizontal projection of the document data.
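The thresholding and run-length steps described in the two preceding paragraphs might look like the following sketch. The specification does not define precisely how the "n values centered on the peak" are chosen; here they are taken to be the n distinct run lengths closest to the most frequent run length, which is an assumption, as are the function names and the sample waveform. Values equal to the threshold are treated as white, which the specification leaves open.

```python
from collections import Counter


def binarize(waveform, t):
    """Mark positions where the projection exceeds threshold t as black (1)."""
    return [1 if value > t else 0 for value in waveform]


def black_run_lengths(bits):
    """Lengths of all runs of consecutive black (1) entries."""
    runs, count = [], 0
    for bit in bits:
        if bit:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs


def peak_centered_values(runs, n):
    """The n distinct run lengths nearest to the most frequent one (assumed reading)."""
    if not runs:
        return []
    peak = Counter(runs).most_common(1)[0][0]
    nearest = sorted(set(runs), key=lambda r: (abs(r - peak), r))[:n]
    return sorted(nearest)


# Hypothetical projection waveform p1(y) with four text lines of heights 3, 3, 2, 4.
waveform = [0, 5, 5, 5, 0, 0, 6, 6, 6, 0, 7, 7, 0, 9, 9, 9, 9, 0]
runs = black_run_lengths(binarize(waveform, t=1))
features = peak_centered_values(runs, n=2)
```

The white runs could be collected in the same way to obtain line or character spacing.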

The feature amounts are not limited to the lengths of the black regions. The portions determined to be white regions can be regarded as line spacing or character spacing, and these values can also be included in the feature amounts. Furthermore, since line spacing is generally larger than character spacing, the direction in which the white intervals are larger may be taken as the typesetting direction of the characters and added to the feature amounts.

Returning to FIG. 1, the document storage unit 14 stores the input document data in association with the n kinds of feature amounts extracted by the feature amount extraction unit 12, and is realized by a mass storage device such as a hard disk drive or a DVD (Digital Versatile Disc)-RAM/±RW/±R drive.

The similarity evaluation unit 13 operates on the feature amounts that have been extracted by the feature amount extraction unit 12 and stored in the document storage unit 14 in association with the document data. If a plurality of feature amounts are stored, it compares them with one another to obtain the similarity between them. If the feature amounts are expressed as vectors (hereinafter referred to as "feature vectors"), the similarity here is, for example, a degree evaluated on the basis of the Euclidean distance between the feature vectors associated with the respective pieces of document data.

Specifically, as shown in FIG. 9(A), the similarity is evaluated as high when the Euclidean distance between the feature vectors is smaller than a predetermined threshold, and, as shown in FIG. 9(B), the similarity is evaluated as low when the Euclidean distance between the feature vectors is equal to or greater than that threshold. The definition of the distance between feature vectors is not, however, limited to the Euclidean distance.
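The similarity test described above amounts to a distance comparison, sketched below. The function names, the feature vectors, and the threshold value are hypothetical; only the rule "smaller than the threshold means high similarity, equal or greater means low" comes from the text.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_similar(a, b, threshold):
    """High similarity when the distance is strictly below the threshold."""
    return euclidean_distance(a, b) < threshold


# Hypothetical (height, width) feature vectors for three pages.
page_a, page_b, page_c = (12, 14), (12, 15), (12, 16)
```

Another distance, such as the city-block distance, could be substituted in `euclidean_distance` without changing the comparison rule.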

The document delimiter unit 15 has the similarity evaluation unit 13 obtain the similarity between the pieces of document data in a series, and inserts breaks into the series on the basis of the obtained similarity. Methods of storing data per document (each document consisting of one or more pages of document data) include, for example, creating a file folder for each document and storing the corresponding document data under file names numbered sequentially in input order, and using an image file format that can hold multiple pages, such as multi-page TIFF (Tagged Image File Format).
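One possible way to turn similarities into breaks is sketched below. The specification does not state which pages are compared with which; comparing each page only with the page immediately before it is an assumption, and the function name, feature vectors, and threshold are hypothetical.

```python
import math


def split_into_documents(page_features, threshold):
    """Group page indices into documents, starting a new document whenever a
    page is not similar to the page before it (distance >= threshold)."""
    documents = []
    for index, features in enumerate(page_features):
        if documents and math.dist(page_features[index - 1], features) < threshold:
            documents[-1].append(index)
        else:
            documents.append([index])
    return documents


# Hypothetical (height, width) feature vectors for five scanned pages.
pages = [(12, 14), (12, 15), (20, 30), (20, 29), (8, 8)]
groups = split_into_documents(pages, threshold=3)
```

Each resulting group of page indices could then be written to its own folder or to one multi-page image file, as the paragraph above describes.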

The document output unit 16 outputs the document data whose output has been instructed in a predetermined format. It is realized by, for example, a CRT (Cathode Ray Tube) and its control means, a printer device and its control means, a read/write device for a magnetic disk, memory card, or the like and its control means, or a data transfer device that exchanges data via a network or the like. That is, the document output unit 16 outputs as its result, for example, a document printed on paper, image data displayed on the CRT, or a file formatted in HTML (Hyper Text Markup Language) or the like.

Next, the procedure for delimiting document data in the document filing device of this embodiment configured as above will be described with reference to the flowchart of FIG. 10.

The user sets, for example, a paper original of one or more pages on the ADF. The paper original may be a single document (of one or more pages) or a plurality of documents (each of one or more pages), and the user does not need to be aware of the document breaks when setting it. The paper original set on the ADF is fed to the scanner device one page at a time by the ADF. The scanner device then functions as the document input unit 11 in FIG. 1. That is, the document input unit 11 inputs to the document filing device as many pieces of document data as there are pages in the paper original set on the ADF.

When document data is input from the document input unit 11 (step S11), it is determined whether the feature amounts of the input document data have already been extracted and stored in the document storage unit 14 in association with that document data (step S12). If the feature amounts have not yet been extracted, the feature amount extraction unit 12 extracts them from the input document data (step S13), and the process then proceeds to step S14. If the feature amounts have already been extracted, the process proceeds directly to step S14. That is, feature amounts are not extracted again for document data whose feature amounts have already been extracted and stored in the document storage unit 14.

The above processing is performed on all of the input document data (step S14). Next, the similarity evaluation unit 13 evaluates the similarity between the feature amounts associated with the document data stored in the document storage unit 14 and the feature amounts associated with the input document data (step S15). Then, on the basis of the evaluation result, the breaks between documents are fixed for the document data, and the data is separated into documents and stored in the document storage unit 14 (step S16).
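The flow of steps S11 to S16 can be summarized in the sketch below. How the device recognizes that a page's feature amounts are already stored is not described in the specification; identifying pages by a `page_id` key in a dictionary is an assumption, and `file_pages`, `extract`, and `group` are hypothetical stand-ins for the feature amount extraction unit 12 and for the similarity evaluation and delimiting of units 13 and 15.

```python
def file_pages(pages, store, extract, group):
    """Run steps S11 to S16 over a batch of (page_id, image) pairs.

    store   -- dict mapping page_id to previously extracted feature amounts
    extract -- callable(image) returning feature amounts (step S13)
    group   -- callable(list of feature amounts) returning lists of page indices
    """
    for page_id, image in pages:             # S11, repeated for all pages (S14)
        if page_id not in store:             # S12: feature amounts already stored?
            store[page_id] = extract(image)  # S13: extract only when missing
    features = [store[page_id] for page_id, _ in pages]
    return group(features)                   # S15 and S16


# Hypothetical stand-ins used only to exercise the control flow.
extracted = []


def extract(image):
    extracted.append(image)
    return (len(image),)


def group(features):
    return [list(range(len(features)))]


store = {"page-1": (3,)}
result = file_pages([("page-1", "abc"), ("page-2", "abcd")], store, extract, group)
```

In this run only the second page is passed to `extract`, because the first page's feature amounts are already in the store.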

The above embodiment has been described using the example in which the feature amount extraction unit 12 processes the input multi-valued document data as it is and determines the character area using the variance as the measure of dispersion of the pixel values. The present invention is not, however, limited to this.

That is, instead of processing the input multi-valued document data as it is, the feature amount extraction unit 12 may binarize the data in advance, converting it into image data consisting only of white pixels and black pixels. Then, instead of using variance as the degree of dispersion of the pixel values, it may determine character regions using, as the degree of dispersion, the number of black pixels that follow each black pixel on the projection line, that is, the run length. In this case, in FIG. 3, the run length is large in the solid-fill portion 32, small in the halftone image portion 33, and intermediate between the two in the character portion 31.
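The run-length variant can be illustrated with a minimal sketch. It assumes that dark pixels have low values, that binarization uses a fixed cut-off `binarize_at`, and that each black pixel is assigned the length of the whole run it belongs to (a simplification of "the number of black pixels that follow"). The function names and the bounds `run_lo` and `run_hi` are hypothetical.

```python
def run_length_per_pixel(line, binarize_at=128):
    """Return, for each pixel on a projection line, the length of the black
    run containing it (0 for white pixels)."""
    black = [value < binarize_at for value in line]  # assumed: dark = low value
    runs = [0] * len(black)
    start = None
    # A trailing False acts as a sentinel so that a run reaching the end of
    # the line is closed like any other run.
    for i, is_black in enumerate(black + [False]):
        if is_black and start is None:
            start = i
        elif not is_black and start is not None:
            for j in range(start, i):
                runs[j] = i - start
            start = None
    return runs


def character_mask(runs, run_lo, run_hi):
    """Mark pixels whose run length is intermediate: longer than halftone
    dots (short runs) and shorter than solid fills (long runs)."""
    return [run_lo <= r <= run_hi for r in runs]
```

For example, `run_length_per_pixel([255, 0, 0, 255, 0, 0, 0, 0, 0, 255, 0])` gives `[0, 2, 2, 0, 5, 5, 5, 5, 5, 0, 1]`, and the mask then keeps only runs whose length lies between the two bounds.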

The embodiment above was also described using the threshold function Th(v) for character region determination, but a weight function w(v) can be used in place of the threshold function Th(v). This is described concretely below.

The threshold function Th(v) in equation (2) takes only two values, "0" or "1". In other words, the threshold function Th(v) sorts every pixel on the projection line into pixels that lie in a character region and pixels that do not. The criteria for deciding whether a pixel lies in a character region are the two thresholds vl and vh in FIG. 5. The shape of the resulting projection waveform p1(y) is therefore strongly affected by these two thresholds. However, because the variance v(x) of the pixel values in a character region varies greatly with the font and the character size, it can be difficult to set the two thresholds vl and vh appropriately.

Therefore, by using a weight function w(v) such as that shown in FIG. 11 in place of the threshold function Th(v) in equation (2), the two thresholds vl and vh can be set easily. As is clear from FIG. 11, the weight given by w(v) changes gradually in the neighborhood of the two thresholds vl and vh. Consequently, even if the threshold vl is changed slightly to vl + α, for example, the value w(vl + α) of the weight function does not change greatly. The shape of the resulting projection waveform is thus not strongly affected by the two thresholds vl and vh.
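The difference between the two functions can be made concrete with a small sketch. The trapezoidal shape with linear ramps of half-width `ramp` around each threshold is an assumption made for illustration; the source only states that the weight changes gradually near vl and vh. The function and parameter names are hypothetical.

```python
def threshold_function(v, v_lo, v_hi):
    """Hard selection in the style of Th(v): 1 inside [v_lo, v_hi], else 0."""
    return 1.0 if v_lo <= v <= v_hi else 0.0


def weight_function(v, v_lo, v_hi, ramp):
    """Soft selection in the style of w(v), assumed here to be a trapezoid.

    The weight rises linearly from 0 to 1 over [v_lo - ramp, v_lo + ramp]
    and falls linearly from 1 to 0 over [v_hi - ramp, v_hi + ramp].
    Requires ramp > 0 and v_hi - v_lo >= 2 * ramp.
    """
    rising = (v - (v_lo - ramp)) / (2 * ramp)
    falling = ((v_hi + ramp) - v) / (2 * ramp)
    return max(0.0, min(1.0, rising, falling))
```

With `v_lo=100`, `v_hi=400`, and `ramp=50`, moving the variance from 100 to 110 changes the weight only from 0.5 to 0.6, whereas the hard threshold jumps from 0 to 1 between 99 and 100.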

The projection waveform p2(y) obtained using the weight function w(v) can be expressed by the following equation (4).
p2(y) = Σx { w(v(x)) · f(x, y) }   … (4)
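Equation (4) can be sketched as follows. The sketch assumes that the image is a list of rows, that each row is one horizontal projection line, that v(x) is the variance of the pixel values in a window of `half_window` pixels on either side of x, and that the weight function is passed in as a callable. All names are hypothetical.

```python
def local_variance(line, half_window):
    """v(x): variance of the pixel values in a window centered on x.
    The window is truncated at the ends of the line."""
    n = len(line)
    variances = []
    for x in range(n):
        window = line[max(0, x - half_window):min(n, x + half_window + 1)]
        mean = sum(window) / len(window)
        variances.append(sum((p - mean) ** 2 for p in window) / len(window))
    return variances


def weighted_projection(image, weight, half_window=2):
    """p2(y) = sum over x of w(v(x)) * f(x, y), computed for every row y."""
    waveform = []
    for row in image:
        variances = local_variance(row, half_window)
        waveform.append(sum(weight(v) * f for v, f in zip(variances, row)))
    return waveform
```

A uniform row contributes nothing when the weight is zero at zero variance, which is the intended suppression of paper color and solid fills.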

In the embodiment above, the feature amount extraction unit 12 extracts image feature amounts using the degree of dispersion of the pixel values as the property specific to character portions, but the frequency characteristics of the pixel values can also be used as that property.

As can be seen from FIG. 4, the frequency is low in the solid-fill portion 32, high in the halftone image portion 33, and intermediate between the two in the character portion 31. Therefore, if only the regions having a medium frequency in the pixel value distribution on the projection line are selectively accumulated, the influence of the solid-fill portion 32 and the halftone image portion 33 can be eliminated, and a projection waveform that better reflects the character components can be obtained.

To extract image feature amounts using the frequency characteristics of the pixel values as the property specific to character portions, it suffices to apply, to the pixel values on the projection line, a band-pass filter that passes only the frequency band corresponding to the character portion 31. The projection waveform p3(y) obtained through this band-pass filter can be expressed by the following equation (5).
p3(y) = Σx { BPF(f(x, y)) }   … (5)
In equation (5), BPF(x) is a band-pass filter whose passband is the frequency band corresponding to the character portion 31.
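A minimal sketch of the band-pass variant is given below. The source does not specify the filter design, so the sketch assumes a simple difference of two moving averages: the narrow average attenuates fast halftone variation, and subtracting the wide average removes slow variation such as solid fills. Equation (5) sums the filter output directly; because the output of such a filter oscillates around zero, the sketch accumulates its absolute value instead, which is a further assumption. All names and window sizes are hypothetical.

```python
def moving_average(line, half_window):
    """Mean of the pixel values in a window centered on each x
    (the window is truncated at the ends of the line)."""
    n = len(line)
    averaged = []
    for x in range(n):
        window = line[max(0, x - half_window):min(n, x + half_window + 1)]
        averaged.append(sum(window) / len(window))
    return averaged


def band_pass(line, narrow, wide):
    """Assumed stand-in for BPF: narrow moving average minus wide one."""
    fine = moving_average(line, narrow)
    coarse = moving_average(line, wide)
    return [a - b for a, b in zip(fine, coarse)]


def band_pass_projection(image, narrow=1, wide=4):
    """p3(y) for every row y, accumulating the magnitude of the response."""
    return [sum(abs(r) for r in band_pass(row, narrow, wide)) for row in image]
```

With these window sizes, a uniform row gives zero, and a row alternating every three pixels gives a larger value than a row alternating every pixel. The fast pattern is attenuated rather than removed entirely, so a real filter would need a passband tuned to the character stroke widths.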

Brief description of the drawings

FIG. 1 is a block diagram showing a configuration example of a document filing device according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram of feature amount extraction in the feature amount extraction unit.
FIG. 3 is a diagram showing an example of the composition of a part of a document image.
FIG. 4 is a diagram showing the pixel values f(x, a) on the projection line given by the straight line y = a.
FIG. 5 is a diagram showing the variance v(x) of the pixel values obtained at an arbitrary position x for the pixel values f(x, a).
FIG. 6 is a diagram showing the projection waveform p1(y) obtained when only the regions having a medium variance value in the distribution of pixel values on the projection line are selectively accumulated.
FIG. 7 is a block diagram showing an example of a specific configuration of the projection waveform forming unit in the feature amount extraction unit.
FIG. 8 is a diagram showing an example in which the lengths of all runs of consecutive black regions are obtained from the white-region and black-region determination results, and the n values centered on the peak of the frequency of occurrence are used as the feature amounts of the horizontal projection.
FIG. 9 is a conceptual diagram of similarity evaluation.
FIG. 10 is a flowchart showing the procedure for delimiting document data.
FIG. 11 is a characteristic diagram of the weight function w(v) used for character region determination.
FIG. 12 is a diagram (part 1) used to explain the problems of the prior art.
FIG. 13 is a diagram (part 2) used to explain the problems of the prior art.

Explanation of symbols

11: document input unit; 12: feature amount extraction unit; 13: similarity evaluation unit; 14: document storage unit; 15: document delimiting unit; 16: document output unit; 21: dispersion degree detection unit; 22: threshold processing unit; 23: accumulation processing unit

Claims

1. A feature amount extraction device comprising:
projection waveform forming means for estimating whether each pixel on a projection line lies in a character region and accumulating pixel values only for the pixels estimated to lie in the character region, thereby forming a projection waveform; and
extraction means for extracting a feature amount by comparing the projection waveform formed by the projection waveform forming means with a predetermined threshold.

2. The feature amount extraction device according to claim 1, wherein the projection waveform forming means estimates whether each pixel lies in the character region based on the degree of dispersion of the pixel values on the projection line.

3. The feature amount extraction device according to claim 2, wherein the projection waveform forming means obtains the variance of the pixel values within a predetermined section on the projection line and selectively accumulates only the pixel values of regions having a predetermined variance value.

4. The feature amount extraction device according to claim 2, wherein the projection waveform forming means binarizes input multi-valued image data into image data consisting only of white pixels and black pixels, and uses, as the degree of dispersion, the number of black pixels that follow each black pixel on the projection line.

5. The feature amount extraction device according to claim 1, wherein the projection waveform forming means estimates whether each pixel lies in the character region based on the frequency characteristics of the pixel values on the projection line.

6. A feature amount extraction method comprising:
a first step of estimating whether each pixel on a projection line lies in a character region and accumulating pixel values only for the pixels estimated to lie in the character region, thereby forming a projection waveform; and
a second step of extracting a feature amount by comparing the projection waveform formed in the first step with a predetermined threshold.

7. A document filing device comprising:
projection waveform forming means for estimating whether each pixel on a projection line lies in a character region and accumulating pixel values only for the pixels estimated to lie in the character region, thereby forming a projection waveform;
extraction means for extracting a feature amount by comparing the projection waveform formed by the projection waveform forming means with a predetermined threshold; and
determination means for determining document-unit boundaries based on the feature amount extracted by the extraction means.