CN100550040C - Optical character recognition method and equipment and character recognition method and equipment - Google Patents
- Publication number: CN100550040C
- Related identifiers: CNB2005100228818A, CN200510022881A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention relates to an optical character recognition method and device. In one aspect, the invention provides an optical font recognition method and device that: divide the words of an input text image into word pairs; recognize the longer word and the shorter word of each word pair separately; adjust the font information of the words according to the font information of adjacent words and according to a coarse adjustment step that adjusts the font information of a line based on the distribution of font information in that line; and additionally recognize the font size of the words in the line. The invention also provides a font recognition method and device based on a classification process, and a method and device for identifying interval types and calculating the X-height by combining a projection method with a connected-component method.
Description
Technical field
The present invention relates to an optical character recognition method and device, and more specifically to a method and device for recognizing the font type of characters in an optical character recognition system.
Background art
Optical character recognition (OCR) systems are widely used. Font information such as typeface, slope, weight and font size has been used in conventional OCR systems to improve recognition performance; font information also benefits document structure analysis and information recovery.
Two approaches to font recognition are currently available:
- Extracting global features from text entities (words, lines, paragraphs). This approach suits a priori font recognition, in which the font is identified without any knowledge of the letter classes.
- Extracting local features from individual letters. This approach can benefit substantially from knowledge of the letter classes.
Specifically, US 6,337,924 and US 6,496,600 disclose a posteriori font recognition. The character class (code) is known first, so local features can be extracted to determine the font of a given character. These features are based on letter details, such as the shape of the serifs (arched, square, triangular, etc.) and the representation of particular letters (such as the differing forms of g and of a); alternatively, the character image can be compared directly with character image templates of different fonts. This is, however, an entirely different kind of font recognition, one that can use knowledge of the letter classes. Approaches of this type benefit substantially from that knowledge and can therefore achieve higher accuracy, but they do not help in predicting font information for omni-font OCR.
CN 1271140A (Chinese application No. 99105851.8) and "Optical Font Recognition Using Typographical Features" by Abdelwahab Zramdini and Rolf Ingold (IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 8, August 1998) disclose "English font recognition" based on text blocks or lines. Global features are extracted to determine the font information of an entire text block or line. These features can generally be determined by non-typographers (for example text density, size, letter orientation and spacing, serifs). However, different words in one line may be set in different fonts (typeface, size, weight, slope, etc.), for example names and emphasized words. This approach generally extracts features from a large piece of text (a text block, or at least one line) and cannot recognize the font information of a single word.
"Optical font recognition from projection profiles" (Electronic Publishing, Vol. 6(3), 249-260, September 1993) and "Script identification in printed bilingual documents" by D. Dhanya, A. G. Ramakrishnan and Peeta Basa Pati (Sadhana, Vol. 27, Part 1, February 2002, pp. 73-82, printed in India) disclose word-based "English font recognition". Font features are extracted from each word image, so the distinct font information of each word in a text line can be recognized. However, such English-word OFR (optical font recognition) processes generally do not use X-height or interval type information in typeface recognition. In an ideal English-word OFR, the word X-height is very useful information: the word image can be normalized according to its X-height and font features extracted from the normalized image, which ensures that the font features of different words are extracted at the same scale and avoids the influence of font size. The word interval type is also needed for typeface recognition. English words of different interval types generally have different font features, because different combinations of characters (strokes) occupy the different intervals. Word X-height and interval type are also used for font size recognition. Other font recognition processes generally lack a mature scheme for computing the X-height and word interval type accurately; they usually obtain line information by a simple projection histogram method.
In addition, the font may change at particular points in a document for emphasis or to catch the reader's attention. This is done by choosing another typeface, changing the style (weight, slope, etc.), or choosing a different size of the same typeface. In the earlier font recognition methods described above, font features are extracted either from text blocks or lines, or from individual words. For a text line or block set in a single font, the first approach obtains font information easily, but it cannot identify the font information of a single word, even though different words in a line may have different fonts (typeface, size, weight, slope, etc.). The second approach can identify the distinct font information of each word in a text line, but it cannot exploit contextual font information, even though the fonts of neighboring words are usually not entirely unrelated. Before any OFR processing is performed, the size of the region to be processed must be considered carefully. If the region is too small, it may not contain enough information for classification; if it is too large, too many different font types may be mixed within it.
Summary of the invention
In view of the above defects of the prior art, a novel font recognition method is needed that eliminates them and accurately identifies the font information (typeface, font size, serif, weight, slope and spacing) of each word in an English text line.
A novel font recognition method is also needed that recognizes far more typefaces than ordinary OCR software with document layout recovery (such as OmniPage, FineReader, etc.).
According to a first aspect, the invention provides an optical font recognition method comprising the following steps:
a dividing step of dividing the words of an input text line image into word pairs;
a font recognition step of recognizing the font information of the longer word in each word pair;
a font identification step of identifying the font information of the shorter word in each word pair, based on the font information of the longer word in the word pair containing the shorter word and the font information of the longer word in the word pair adjacent to the shorter word.
Preferably, the method further comprises a fine adjustment step of adjusting the font information of a word according to the font information of adjacent words.
The optical font recognition method further comprises a recognition step of recognizing the font size of the words in the line.
This optical font recognition method applies a word-pair mechanism to the OFR processing of line images. The mechanism builds on an English-word font classification method while taking into account the characteristics of font distribution in real English text. It uses a two-stage result adjustment technique based on contextual font information, which achieves higher accuracy and more uniform output within a text line.
Preferably, the font identification step of identifying the font information of the shorter word in each word pair comprises:
comparing the font information of the longer word in the word pair adjacent to the shorter word to be identified with the font information of the longer word in the word pair containing that shorter word; and
if the two are the same, determining that the shorter word has the same font information as the longer word in its own word pair and the longer word in the adjacent word pair.
Preferably, if the font information of the longer word in the adjacent word pair differs from that of the longer word in the word pair containing the shorter word to be identified, the font information of the shorter word is itself recognized.
Preferably, the font identification step further comprises recognizing the font information of the shorter word if the shorter word of a word pair is the first or last word in the line.
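The word-pair procedure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: pairing consecutive words, measuring word length by character count, and the names `make_pairs`, `assign_fonts` and `recognize` are all introduced here for illustration; `recognize` stands in for whatever per-word font classifier is used.

```python
from typing import Callable, Dict, List, Optional, Tuple


def make_pairs(words: List[str]) -> List[Tuple[int, Optional[int]]]:
    """Pair consecutive words; each pair is (longer_index, shorter_index).

    Assumption: pairs are formed from consecutive words and length is the
    character count. A leftover last word forms a pair with no shorter word.
    """
    pairs: List[Tuple[int, Optional[int]]] = []
    for a in range(0, len(words) - 1, 2):
        b = a + 1
        pairs.append((a, b) if len(words[a]) >= len(words[b]) else (b, a))
    if len(words) % 2 == 1:
        pairs.append((len(words) - 1, None))
    return pairs


def assign_fonts(words: List[str], recognize: Callable[[str], str]) -> List[Optional[str]]:
    """Recognize longer words; infer shorter words from context when possible."""
    pairs = make_pairs(words)
    fonts: List[Optional[str]] = [None] * len(words)
    pair_of: Dict[int, int] = {}
    for p, (longer, shorter) in enumerate(pairs):
        fonts[longer] = recognize(words[longer])
        pair_of[longer] = p
        if shorter is not None:
            pair_of[shorter] = p
    for longer, shorter in pairs:
        if shorter is None:
            continue
        # Neighbor of the shorter word on the side away from its own partner.
        outside = 2 * shorter - longer
        if not 0 <= outside < len(words):
            # Shorter word is first or last in the line: recognize it itself.
            fonts[shorter] = recognize(words[shorter])
            continue
        neighbor_longer = pairs[pair_of[outside]][0]
        if fonts[neighbor_longer] == fonts[longer]:
            fonts[shorter] = fonts[longer]  # both contexts agree
        else:
            fonts[shorter] = recognize(words[shorter])
    return fonts
```

In a line set in a single font, only the longer word of each pair is passed to the classifier, which is where the saving in recognition work would come from.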
Preferably, the fine adjustment step of adjusting the font information comprises: determining whether the first candidate font of each word in the line is available; if the first candidate font of the word is determined not to be available, determining whether the second candidate font of the word is available; and if the second candidate font is determined to be available, swapping it with the first candidate font of the word.
Preferably, the fine adjustment step further comprises: if the second candidate font of the word is determined not to be available, determining whether the third candidate font of the word is available; and if the third candidate font is determined to be available, swapping it with the first candidate font of the word.
Preferably, if the third candidate font of the word is determined not to be available, it is judged whether all three candidate fonts of the word are reliable; and if all three candidate fonts are determined to be unreliable, the first candidate font of the word is set to the font of an adjacent word.
Preferably, the fine adjustment step includes the following preprocessing: comparing the font information of each word in the line with a dictionary containing the font information of a number of known fonts, and obtaining three candidate fonts for the word, in order of similarity, together with three corresponding distance values. The condition for judging all three candidate fonts of the word unreliable includes all three distance values being greater than a predetermined threshold.
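The candidate-swapping logic of the fine adjustment step can be sketched as below. This is a minimal sketch, not the patented implementation: the function name `fine_adjust`, the `is_available` callback (standing in for the availability conditions, which depend on the word's position in the line) and the way the neighbor's font is passed in are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (font name, distance to the dictionary entry)


def fine_adjust(
    candidates: List[Candidate],
    is_available: Callable[[int, Candidate], bool],
    neighbor_font: str,
    threshold: float,
) -> List[Candidate]:
    """Return the candidate list with the chosen font moved to rank 0.

    `candidates` is ordered by similarity (best first), at most three entries.
    `is_available(rank, candidate)` is a hypothetical predicate for the
    availability conditions of the candidate at that rank.
    """
    ranked = list(candidates)
    if is_available(0, ranked[0]):
        return ranked
    # Try the second, then the third candidate, and swap it into first place.
    for rank in (1, 2):
        if rank < len(ranked) and is_available(rank, ranked[rank]):
            ranked[0], ranked[rank] = ranked[rank], ranked[0]
            return ranked
    # No candidate is available: if every distance exceeds the threshold,
    # all candidates are unreliable and the neighbor's font is used instead.
    if all(distance > threshold for _, distance in ranked):
        ranked[0] = (neighbor_font, ranked[0][1])
    return ranked
```

If no candidate is available but at least one distance is within the threshold, the sketch leaves the original ranking untouched, since the text does not specify that case.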
Preferably, the fine adjustment step includes the following preprocessing: comparing the font information of each word in the line with a dictionary containing the font information of a number of known fonts; obtaining at least a first and a second candidate font for the word, in order of similarity, with at least two corresponding distance values; and counting the distribution of detailed typefaces in the line according to the first candidate fonts of the words.
Preferably, for the first or last word in the line, the conditions under which the first candidate font of the word is determined to be available include: the detailed typefaces in the line are roughly consistent, or the first candidate font of the word matches the main typeface of the line.
Preferably, for the first or last word in the line, the conditions under which the second candidate font of the word is determined to be available include: the second candidate font of the word matches the main typeface of the line and the corresponding distance value is less than a predetermined threshold.
Preferably, the candidate fonts include a first, a second and a third candidate font, and for the first or last word in the line, the conditions under which the third candidate font of the word is determined to be available include: the third candidate font of the word matches the main typeface of the line and the corresponding distance value is less than a predetermined threshold.
Preferably, for each word other than the first and last words in the line, the conditions under which the first candidate font of the word is determined to be available include: the detailed typefaces in the line are roughly consistent and the first candidate font of the current word matches the main typeface of the line, or the first candidate font of the current word differs from the first candidate fonts of the two adjacent words in the line, or the detailed typefaces of the two adjacent words are the same.
Preferably, for each word other than the first and last words in the line, the conditions under which the second candidate font of the word is determined to be available include: the second candidate font of the word matches the two adjacent words and the corresponding distance value is less than a predetermined threshold.
Preferably, the candidate fonts include a first, a second and a third candidate font, and for each word other than the first or last word in the line, the conditions under which the third candidate font of the word is determined to be available include: the third candidate font of the word matches the main typeface of the two adjacent words and the corresponding distance value is less than a predetermined threshold.
Preferably, the method further comprises a coarse adjustment step of adjusting the font information of the line according to the distribution of font information in the line, which comprises: counting the distribution in the line of the detailed typefaces and of at least one of serif and spacing; determining whether said at least one of serif and spacing is uniform and whether the detailed typefaces in the line are roughly consistent; and, if the distribution is determined to satisfy these conditions, setting the first candidate font of every word in the line to the main detailed typeface and using each word's first candidate font as the recognized font result.
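A minimal sketch of the coarse adjustment step follows. The text does not quantify "roughly consistent", so the `min_share` parameter (the fraction of words that must share the main typeface) is a hypothetical choice made here, as are the function name `coarse_adjust` and the dictionary keys; only the serif attribute is checked for uniformity in this sketch.

```python
from collections import Counter
from typing import Dict, List


def coarse_adjust(words: List[Dict], min_share: float = 0.8) -> List[Dict]:
    """Unify the first-candidate typeface of a line when the line looks uniform.

    Each word is a dict with hypothetical keys "typeface" (first candidate,
    detailed typeface) and "serif" (True/False). `min_share` is an assumed
    stand-in for the "roughly consistent" criterion.
    """
    if not words:
        return words
    serif_values = {w["serif"] for w in words}
    counts = Counter(w["typeface"] for w in words)
    main_face, main_count = counts.most_common(1)[0]
    uniform_serif = len(serif_values) == 1
    roughly_consistent = main_count / len(words) >= min_share
    if uniform_serif and roughly_consistent:
        # Overwrite every word's first candidate with the main typeface.
        return [dict(w, typeface=main_face) for w in words]
    return words
```

When the conditions are not met the line is returned unchanged, leaving the decision to the per-word fine adjustment.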
Preferably, the recognition step of recognizing the font size of the words in the line comprises: judging whether an input word X-height is available for the word; if it is, querying an a priori font size dictionary containing an "image resolution / typeface / font size / X-height" table with the known recognized typeface of the word and the input image resolution, and obtaining a list of X-heights for the different font sizes; matching the X-height of the word against the X-heights in the list; and taking the corresponding font size as the recognized font size.
Preferably, the recognition step of recognizing the font size of the words in the line comprises: querying an a priori font size dictionary containing an "image resolution / typeface / font size / word height" table with the known recognized typeface of the word and the input image resolution, and obtaining a list of word heights for the different font sizes; matching the height of the word against the word heights in the list; and taking the corresponding font size as the recognized font size.
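The table lookup described above might look like the following sketch. The table layout, the nearest-value reading of "matching", and the name `recognize_font_size` are assumptions made for illustration; real tables would be measured from rendered fonts.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table layout: (image resolution in dpi, typeface) maps to
# {font size in points: height in pixels}.
SizeTable = Dict[Tuple[int, str], Dict[int, int]]


def recognize_font_size(
    x_height_table: SizeTable,
    word_height_table: SizeTable,
    resolution: int,
    typeface: str,
    word_height: int,
    x_height: Optional[int] = None,
) -> int:
    """Pick the font size whose tabulated height is closest to the measurement.

    The X-height table is used when an X-height is available; otherwise the
    word-height table is used. "Closest" is an assumed matching rule.
    """
    if x_height is not None:
        heights, measured = x_height_table[(resolution, typeface)], x_height
    else:
        heights, measured = word_height_table[(resolution, typeface)], word_height
    return min(heights, key=lambda size: abs(heights[size] - measured))
```

Keying the table by resolution and typeface mirrors the "image resolution / typeface / font size / height" structure named in the text, so the same pixel height can map to different point sizes for different typefaces.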
According to a second aspect, the invention provides a font recognition method comprising:
a normalization step of normalizing a word of the input image to a predetermined height;
a feature extraction step of extracting features from the normalized word;
a judging step of judging the interval type of the word;
a classification step; and
a recognition step of recognizing the font information of the word based on the result of the classification step.
This font recognition method normalizes the height of the word image to a given height, so that the features of word images with the same interval type are extracted at the same scale; this matters because, in typeface recognition, the dictionaries are organized by word interval type. The word interval type is considered first, to select the corresponding interval-type dictionary for typeface classification. When the resulting distance is greater than a given threshold, or the word interval type is unknown, a multi-dictionary recognition process with an a priori ordering is used. This embodiment therefore makes full use of the interval type information, and the whole OCR process becomes more efficient.
Preferably, the input image is a word image from a document image that has already undergone line segmentation and word segmentation.
Preferably, the judging step includes a step of judging whether the interval type of the word is known from an external source.
Preferably, the classification step further comprises comparing the features extracted from the normalized word with the features of the candidate typefaces in the dictionary for the judged word interval type, and obtaining a distance value from the comparison; and the method further comprises, before the recognition step, a verification step of checking the distance value against a predetermined threshold. A Bayesian classifier is preferably used for the classification step.
Preferably, the dictionary is selected from at least a dictionary of X-height words, a dictionary of all-capital words, a dictionary of ascender and descender words, and a dictionary of full-height words.
Preferably, each interval-type dictionary holds at least 40 candidate typefaces.
Preferably, if the interval type of the word cannot be obtained in the adjusting step, or if the distance value is smaller than the predetermined threshold in the verification step, the method further comprises a fine classification step using multiple dictionaries.
Preferably, the fine classification step comprises: comparing the features extracted from the normalized word with the features of the candidate typefaces in the dictionary of at least one word interval type other than the judged one; obtaining a distance value from each such comparison; and checking the distance value against another predetermined threshold, so that the font information of the word is recognized based on the result of the verification.
Preferably, the dictionaries of the at least one word interval type to be compared are taken in the following order: the dictionary of X-height words, the dictionary of all-capital words, the dictionary of ascender and descender words, and the dictionary of full-height words.
Preferably, the font information of the word to be recognized includes at least typeface, serif, weight, slope and spacing.
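The dictionary selection and multi-dictionary fallback can be sketched as follows. Several things here are assumptions and not taken from the text: a nearest-prototype rule with Euclidean distance is swapped in for the Bayesian classifier, the dictionary keys and the names `nearest` and `classify` are hypothetical, and a distance above the threshold is treated as a failed verification, as in the descriptive paragraph on this method.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Dictionary = Dict[str, Sequence[float]]  # typeface -> prototype feature vector

# A priori order in which the dictionaries are tried during fine classification.
DICTIONARY_ORDER = ["x_height", "all_caps", "ascender_descender", "full_height"]


def nearest(features: Sequence[float], dictionary: Dictionary) -> Tuple[str, float]:
    """Return (typeface, distance) of the closest prototype in one dictionary."""
    return min(
        ((face, math.dist(features, proto)) for face, proto in dictionary.items()),
        key=lambda pair: pair[1],
    )


def classify(
    features: Sequence[float],
    dictionaries: Dict[str, Dictionary],
    interval_type: Optional[str] = None,
    threshold: float = 1.0,
) -> Optional[Tuple[str, float]]:
    """Try the dictionary of the known interval type, then fall back in order."""
    best = None
    if interval_type in dictionaries:
        best = nearest(features, dictionaries[interval_type])
        if best[1] <= threshold:
            return best  # verification passed
    # Fine classification: interval type unknown or verification failed.
    for name in DICTIONARY_ORDER:
        if name == interval_type or name not in dictionaries:
            continue
        result = nearest(features, dictionaries[name])
        if result[1] <= threshold:
            return result
        if best is None or result[1] < best[1]:
            best = result
    return best  # closest match seen, even though no verification passed
```

Returning the closest match when every verification fails is a design choice of this sketch; a caller could equally treat that case as "font unknown".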
Alternatively, the invention also provides another font recognition method comprising:
a normalization step of normalizing a word of the input image to a predetermined height;
a feature extraction step of extracting features from the normalized word;
a fine classification step using multiple dictionaries; and
a recognition step of recognizing the font information of the word based on the result of the fine classification step.
Similarly, the input image is a word image from a document image that has already undergone line segmentation and word segmentation; a Gabor filter is used to perform the feature extraction step and a Bayesian classifier is used for the fine classification step. In addition, the fine classification step comprises: comparing the features extracted from the normalized word with the features of the candidate typefaces in the dictionary of at least one word interval type; obtaining a distance value from each such comparison; and checking the distance value against a predetermined threshold, so that the font information of the word is recognized based on the result of the verification. The dictionaries of the at least one word interval type to be compared are taken in the following order: the dictionary of X-height words, the dictionary of all-capital words, the dictionary of ascender and descender words, and the dictionary of full-height words.
Preferably, the font information of the word to be recognized includes at least typeface, serif, weight, slope and spacing.
According to a third aspect, the invention provides a method of identifying the interval type of an input text line, comprising:
a line information calculation step of calculating the line information of the input text line using a projection method;
a word information calculation step of calculating the word information of a selected word in the text line using a projection method;
a reliability judging step of judging whether the word information is reliable;
an interval type marking step of marking the interval type of the word according to the line information and the word information;
wherein, if the calculated word information is judged unreliable in the reliability judging step, the word baseline and upper line of the selected word are calculated using a connected-component method.
This method of identifying the interval type of an input text line combines the two approaches to improve the accuracy of the X-height calculation. Because the projection method is much faster than the connected-component method, the projection method is used first to compute the word upper line, and the connected-component method is used only when reliable line information for the word cannot be obtained by projection. Strict checks and verifications are performed several times so that the computed word upper line becomes progressively more accurate. Finally, a reliable word X-height value and interval type are obtained.
Preferably, the line information calculation step comprises performing vertical projection over the entire character region to obtain the baseline and upper line of the text line. The line information calculation step further comprises checking the reliability of the obtained line baseline: if the height of the lower interval, between the line baseline obtained in the line information calculation step and the line bottom line, is not less than the height of the middle interval, between the line upper line obtained in that step and the line baseline, the vertical projection method is performed again on the interval below the rough baseline to obtain a new line baseline.
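The projection-based estimate and its reliability check can be sketched as below. The text does not say how the upper line and baseline are read off the projection profile; this sketch assumes the common heuristic of taking the largest rise and the largest drop in the per-row ink count, and the names `row_profile`, `upper_and_base_line` and `checked_lines` are hypothetical. The image must have at least two rows.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image, 1 = ink, rows listed top to bottom


def row_profile(image: Image) -> List[int]:
    """Ink count of every pixel row (the projection onto the vertical axis)."""
    return [sum(row) for row in image]


def upper_and_base_line(image: Image) -> Tuple[int, int]:
    """Estimate (upper line row, baseline row) from the projection profile.

    Assumed heuristic: the upper line is the row after the largest rise in
    the profile, the baseline is the row before the largest drop.
    """
    profile = row_profile(image)
    steps = [profile[i + 1] - profile[i] for i in range(len(profile) - 1)]
    upper = max(range(len(steps)), key=lambda k: steps[k]) + 1
    base = min(range(len(steps)), key=lambda k: steps[k])
    return upper, base


def checked_lines(image: Image) -> Tuple[int, int]:
    """Re-estimate the baseline when the lower interval looks too tall."""
    upper, base = upper_and_base_line(image)
    bottom = len(image) - 1
    below = image[base + 1:]
    # Lower interval not smaller than the middle interval: the rough baseline
    # is suspect, so project again on the rows below it.
    if bottom - base >= base - upper and len(below) >= 2:
        _, lower_base = upper_and_base_line(below)
        base = base + 1 + lower_base
    return upper, base
```

With the two lines in hand, the X-height in pixels follows as `base - upper + 1`.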
优选地,词信息计算步骤包括:使用垂直投影直方图以计算在上半词区间中的词上线;以及基于在词的高度和行基线的y坐标之间的关系计算词基线的步骤,以及如果行基线接近于词的底部则直接确定词基线等于词的底部。Preferably, the word information calculation step includes: using a vertical projection histogram to calculate the word upper line in the upper half of the word interval; and the step of calculating the word baseline based on the relationship between the height of the word and the y coordinate of the row baseline, and if A row baseline close to the bottom of the word directly determines that the word baseline is equal to the bottom of the word.
优选地,可靠性判断步骤通过确定在行信息和词信息之间的关系判断词线是否可靠。Preferably, the reliability judging step judges whether the word line is reliable by determining the relationship between the line information and the word information.
Preferably, the zone type labeling step includes preliminarily labeling the zone type of the selected word using classification conditions based on the line information and the word information. The classification conditions include at least one feature selected from the following group:
z1/(z1+z2); z3/(z3+z2); nzw; th; |ul_w-ul_l|/(z1+z2);
Here, z1 is the height of the upper zone of the word; z2 is the height of the middle zone of the word; z3 is the height of the lower zone of the word; ul_w is the y-coordinate of the word upper line (in the word coordinate system); ul_l is the y-coordinate of the line upper line (in the word coordinate system); nzw is the ratio of w1 to w2; w1 is the width of the non-zero interval of the horizontal projection histogram in the upper zone of the word; w2 is the width of the word; th is the ratio of max1 to max2; max1 is the maximum of Pv[i] for i near the word top line; max2 is the maximum of Pv[i] for i near the word upper line. Preferably, the feature nzw is used in combination with the feature th.
The zone types with which the selected word is labeled may include: X-height words; ascender words or full-height words; descender words or full-height words; words taller than the X-height whose upper line is close to the line upper line; words taller than the X-height; and unknown words.
Preferably, the method of identifying the zone type of an input text line may further include a step of judging whether the selected word in the text line is a short word and calculating the word information of each word judged to be short by applying the projection method in a designated area of the word zone. The method may therefore further include a step of verifying the word upper line calculated in the word information calculating step; if the reliability judging step finds the calculated word information reliable, the step of judging whether a word in the text line is a short word follows this verification step.
Preferably, the connected-component method includes at least one of the following steps:
deleting connected components whose area is smaller than a predetermined small area, since such components are regarded as noise;
deleting connected components below the word midline; and
deleting connected components above the word midline obtained in the word information calculating step.
Preferably, the zone type labeling step further includes a step of judging, according to the labeled zone type, whether to correct the word line information.
Based on the zone type obtained by the above method, the X-height of any word in the input text line is easily calculated.
In addition, the present invention further provides an optical font recognition device, comprising:
dividing means for dividing the words of an input text line image into word pairs;
font recognizing means for recognizing the font information of the longer word in each word pair;
font discriminating means for determining the font information of the shorter word in each word pair, based on the font information of the longer word in the word pair containing the shorter word and the font information of the longer word in the word pair adjacent to the shorter word;
fine adjusting means for adjusting the font information of a word according to the font information of its adjacent words;
coarse adjusting means for adjusting the font information of a line according to the distribution of font information in the line; and
recognizing means for recognizing the font size of the words in the line.
Correspondingly, the present invention further provides a font recognition device, comprising:
normalizing means for normalizing the words of an input image to a predetermined height;
feature extracting means for extracting features from the normalized words;
judging means for judging the zone type of a word;
classifying means; and
recognizing means for recognizing the font information of a word based on the result obtained by the classifying means.
Alternatively, the above font recognition device may be modified to comprise: normalizing means for normalizing the input image to a predetermined height; feature extracting means for extracting features from the normalized words; fine classifying means that directly uses a plurality of dictionaries; and recognizing means for recognizing the font information of a word based on the result obtained by the fine classifying means.
Accordingly, the present invention further provides a device for identifying the zone type of an input text line, comprising:
line information calculating means for calculating the line information of the input text line using the projection method;
first word information calculating means for calculating the word information of a selected word in the text line using the projection method;
reliability judging means for judging whether the word information is reliable;
second word information calculating means for calculating, by the connected-component method, the word baseline and upper line of a selected word whose information the reliability judging means has not judged reliable; and
zone type labeling means for labeling the zone type of the word according to the line information and the word information.
Accordingly, the present invention further provides a device for calculating the X-height of a word in an input text line using the zone type obtained by the above device for identifying the zone type of an input text line.
The present invention also provides a program for performing the steps of optical font recognition, the steps comprising:
a dividing step of dividing the words of an input text line image into word pairs;
a font recognizing step of recognizing the font information of the longer word in each word pair;
a font discriminating step of determining the font information of the shorter word in each word pair, based on the font information of the longer word in the word pair containing the shorter word and the font information of the longer word in the word pair adjacent to the shorter word;
a fine adjusting step of adjusting the font information of a word according to the font information of its adjacent words;
a coarse adjusting step of adjusting the font information of a line according to the distribution of font information in the line; and
a recognizing step of recognizing the font size of the words in the line.
The present invention also provides a program for performing the steps of font recognition, the steps comprising:
a normalizing step of normalizing the words of an input image to a predetermined height;
a feature extracting step of extracting features from the normalized words;
a judging step of judging the zone type of a word;
a classifying step; and
a recognizing step of recognizing the font information of a word based on the result obtained by the classifying step.
Alternatively, the present invention also provides a program for performing the steps of font recognition, the steps comprising:
a normalizing step of normalizing the input image to a predetermined height;
a feature extracting step of extracting features from the normalized words;
a fine classifying step that directly uses a plurality of dictionaries; and
a recognizing step of recognizing the font information of a word based on the result obtained by the fine classifying step.
The present invention also provides a program for performing the following steps of identifying the zone type of an input text line:
a line information calculating step of calculating the line information of the input text line using the projection method;
a first word information calculating step of calculating the word information of a selected word in the text line using the projection method;
a reliability judging step of judging whether the word information is reliable; and
a zone type labeling step of labeling the zone type of the word according to the line information and the word information;
wherein, if the reliability judging step judges the calculated word information to be unreliable, the word baseline and upper line of the selected word are calculated using the connected-component method.
The present invention also provides a program for performing the following steps of calculating the X-height of a word in an input text line:
a line information calculating step of calculating the line information of the input text line using the projection method;
a first word information calculating step of calculating the word information of a selected word in the text line using the projection method;
a reliability judging step of judging whether the word information is reliable;
if the reliability judging step judges the calculated word information to be unreliable, calculating the word baseline and upper line of the selected word using connected components;
a zone type labeling step of labeling the zone type of the word according to the line information and the word information; and
calculating the X-height of the word using the obtained zone type.
In another aspect, the present invention also provides a storage medium storing a program for performing the following steps of optical font recognition:
a dividing step of dividing the words of an input text line image into word pairs;
a font recognizing step of recognizing the font information of the longer word in each word pair;
a font discriminating step of determining the font information of the shorter word in each word pair, based on the font information of the longer word in the word pair containing the shorter word and the font information of the longer word in the word pair adjacent to the shorter word;
a fine adjusting step of adjusting the font information of a word according to the font information of its adjacent words; and
a coarse adjusting step of adjusting the font information of a line according to the distribution of font information in the line.
In another aspect, the present invention also provides a storage medium storing a program for performing the following steps of identifying the zone type of an input text line:
a line information calculating step of calculating the line information of the input text line using the projection method;
a first word information calculating step of calculating the word information of a selected word in the text line using the projection method;
a reliability judging step of judging whether the word information is reliable; and
a zone type labeling step of labeling the zone type of the word according to the line information and the word information;
wherein, if the reliability judging step judges the calculated word information to be unreliable, the word baseline and upper line of the selected word are calculated using the connected-component method.
In another aspect, the present invention also provides a storage medium storing a program for performing the following steps of calculating the X-height of a word in an input text line:
a line information calculating step of calculating the line information of the input text line using the projection method;
a first word information calculating step of calculating the word information of a selected word in the text line using the projection method;
a reliability judging step of judging whether the word information is reliable;
if the reliability judging step judges the calculated word information to be unreliable, calculating the word baseline and upper line of the selected word using connected components;
a zone type labeling step of labeling the zone type of the word according to the line information and the word information; and
calculating the X-height of the word using the obtained zone type.
Other features and advantages of the present invention will become apparent from the following description taken in conjunction with the accompanying drawings.
Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1 shows the overall flowchart of X-height calculation and zone type identification according to the first embodiment.
Fig. 2 shows the details of the vertical projection method used for X-height calculation.
Figs. 3 and 4 show the line information results of the projection method; the two features nzw and th are used, respectively, in labeling the detailed word zone type.
Fig. 5 shows the upper line of the short word "The" obtained with the "normal projection method" and with "projection in the short-word designated area", respectively.
Fig. 6 shows a text line containing the short word "He", which belongs to short word type 2.
Fig. 7 shows the short word "He" processed using a vertical projection histogram.
Fig. 8 shows the expanded flowchart of step 600 in Fig. 1.
Fig. 9 shows an application of the zone type identified according to the first embodiment of the present invention.
Fig. 10 shows an application of the X-height value of English words calculated according to the first embodiment of the present invention.
Fig. 11 shows the main flowchart of the second embodiment of the present invention.
Fig. 12 shows the expanded flowchart of step 400 in Fig. 11.
Fig. 13 shows the expanded flowchart of step 600 in Fig. 11.
Fig. 14 shows a flowchart of an application of the second embodiment of the present invention.
Fig. 15 shows the main flowchart of the third embodiment of the present invention.
Fig. 16 shows the detailed main flowchart of step 400 in Fig. 15 according to the third embodiment of the present invention.
Fig. 17 shows the detailed flowchart of step 500 in Fig. 15 according to the third embodiment of the present invention.
Fig. 18 shows the detailed flowchart of step 600 in Fig. 15 according to the third embodiment of the present invention.
Fig. 19 shows the detailed flowchart of step 700 in Fig. 15 according to the third embodiment of the present invention.
Fig. 20 shows a flowchart of an application of the third embodiment of the present invention.
Fig. 21 shows the flowchart of optical character recognition according to the fourth embodiment of the present invention.
Fig. 22 shows a device implementing the optical font recognition method according to the third embodiment of the present invention.
Fig. 23 shows a device implementing the font recognition method according to the second embodiment of the present invention.
Fig. 24 shows a device implementing the font recognition method according to the second embodiment of the present invention.
Fig. 25 shows a device implementing the method of identifying the zone type of an input text line according to the first embodiment of the present invention.
Detailed Description
The optical character recognition method and device according to the present invention are described below with reference to the accompanying drawings.
The terms used in the present invention are explained as follows:
· OFR: optical font recognition.
· OCR: optical character recognition.
· ICR: individual character recognition (OCR of a single character).
· X-height: the height of the part of a letter that is as tall as the lowercase letter x of the same typeface.
· Ascender: the part of a lowercase letter that rises above the X-height.
· Descender: the part of a lowercase letter that extends below the baseline.
· Typeface: a variation of style within a type family, such as roman, italic, bold, ultra, condensed, expanded, outline, contour, and so on.
· Weight: the weight of a character is determined by the thickness of its strokes relative to its overall height. Most typefaces are designed in two weights, normal and bold.
· Slope: the slope indicates the orientation of the main strokes of a letter. A font may be roman or italic.
· Serif: a short cross-stroke at the ends of the ascenders and descenders of certain typefaces.
· Spacing: spacing refers to the horizontal space required by each character of a typeface. Because characters differ in width, variable spacing adjusts the gap between characters. Monospacing gives every character the same space regardless of its width.
[Embodiment 1]
Embodiment 1 accurately calculates the X-height value of English words and reliably identifies their zone type (ascender, descender, full height, or X-height).
This embodiment improves both the projection method and the connected-component (CC) method and combines them to improve the accuracy of the X-height calculation.
It is known that the vertical projection profile disclosed in "Optical font recognition from projection profiles" (Electronic Publishing, Vol. 6(3), 249-260, September 1993) can only handle words or text lines that have an ideal vertical projection profile, and that its performance degrades on short words and when distinguishing X-height characters from uppercase characters. The method may fail when the text line is skewed. Moreover, it is difficult to identify the zone type of English words from the vertical projection profile alone, because words with different zone types sometimes have projection profiles of similar shape and are therefore hard to classify.
However, because the projection method is much faster than the connected-component method, the projection method is used to calculate the word upper line quickly, and the connected-component method is used when the projection method cannot yield reliable line information for the word. Strict checks and verifications are performed several times so that the calculated word upper line becomes increasingly accurate. Finally, a reliable word X-height value and zone type are obtained.
Fig. 1 shows the overall flowchart of X-height calculation and zone type identification according to the first embodiment.
In step 100, the line baseline is calculated flexibly using the projection method so as to cope with skew. Fig. 2 shows the details of the vertical projection method used for X-height calculation. The height of the top line (Htop), the height of the bottom line (Hbottom), the height of the baseline (Hbase), and the height of the upper line (Hupper) of the line are calculated as follows:
Htop = max{ i : Pv[i] > 0 };
Hbottom = min{ i : Pv[i] > 0 };
Hbase = the i that maximizes Pv[i] - Pv[i-1];
Hupper = the i that minimizes Pv[i] - Pv[i-1].
First, vertical projection is performed over the entire text region to obtain a rough baseline and upper line of the text line.
Then, the validity of the line baseline is checked. If the height of the lower zone, located between the line baseline and the line bottom line, is less than the height of the middle zone, located between the line upper line and the line baseline, the line information is probably correct. Otherwise, the vertical projection method is performed again in the zone below the rough baseline to obtain a more reliable line baseline, so as to avoid the influence of line skew.
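The four formulas and the baseline validity check of step 100 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that Pv is supplied as a list in which Pv[i] is the projection value of row i and that i = 0 is the lowest row, so that the top line has the largest index, as the formulas imply. The function names `line_info` and `baseline_plausible` are hypothetical.

```python
def line_info(pv):
    """Return (Htop, Hbottom, Hbase, Hupper) from a projection profile.

    Assumption: pv[i] is the projection value of row i, with i = 0 at
    the bottom of the line box (y grows upward).
    """
    inked = [i for i, value in enumerate(pv) if value > 0]
    if len(pv) < 2 or not inked:
        raise ValueError("profile needs at least two rows and some ink")
    h_top = max(inked)          # Htop = max{ i : Pv[i] > 0 }
    h_bottom = min(inked)       # Hbottom = min{ i : Pv[i] > 0 }
    # First differences Pv[i] - Pv[i-1] for i = 1 .. len(pv) - 1.
    diffs = {i: pv[i] - pv[i - 1] for i in range(1, len(pv))}
    h_base = max(diffs, key=diffs.get)   # largest rise
    h_upper = min(diffs, key=diffs.get)  # largest drop
    return h_top, h_bottom, h_base, h_upper


def baseline_plausible(h_bottom, h_base, h_upper):
    """Validity check: the lower zone must be shorter than the middle zone."""
    lower_zone = h_base - h_bottom
    middle_zone = h_upper - h_base
    return lower_zone < middle_zone


profile = [0, 1, 2, 10, 11, 10, 3, 2, 0]
h_top, h_bottom, h_base, h_upper = line_info(profile)
print(h_top, h_bottom, h_base, h_upper)
print(baseline_plausible(h_bottom, h_base, h_upper))
```

When `baseline_plausible` returns False, the text calls for repeating the projection in the zone below the rough baseline; that second pass is not shown here.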
In step 200, a vertical projection histogram is used to calculate the word upper line in the upper half of the word zone.
In step 300, this embodiment calculates the word baseline using different methods depending on the line baseline result. The details are shown in Table I.
Table I
Here,
Hw: the height of the word
Y(baseline): the y-coordinate of the line baseline
If 0.57*Hw-1 < Y(baseline) < Hw-2, which means the line baseline lies in the lower half of the word zone, a vertical projection histogram is used in the word zone around the line baseline (within ±20% of the baseline) to calculate the word baseline;
if |Y(baseline)-(Hw-1)| ≤ 1, which means the line baseline is close to the word bottom, the word baseline is set equal to the word bottom;
otherwise, a vertical projection histogram is used in the word zone spanning 55% to 100% of the word height to calculate the word baseline.
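The three-way choice in step 300 can be written as a small dispatch function. The sketch below only selects which rule applies; the projection itself is not shown. The function name and the returned labels are hypothetical, and the coordinates are assumed to be word coordinates in which the word bottom lies at Hw - 1, as the second condition suggests.

```python
def word_baseline_strategy(hw, y_baseline):
    """Pick the word-baseline rule of step 300.

    hw         -- word height Hw
    y_baseline -- y-coordinate of the line baseline, Y(baseline)
    """
    if 0.57 * hw - 1 < y_baseline < hw - 2:
        # Line baseline lies in the lower half of the word zone:
        # project within +/-20% around the line baseline.
        return "project_around_line_baseline"
    if abs(y_baseline - (hw - 1)) <= 1:
        # Line baseline is close to the word bottom.
        return "use_word_bottom"
    # Fall back to projecting in the 55%..100% band of the word height.
    return "project_in_lower_band"


print(word_baseline_strategy(40, 30))
print(word_baseline_strategy(40, 39))
print(word_baseline_strategy(40, 10))
```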
In step 350, it is judged whether the line information is definitely wrong. The criteria used are defined as follows.
The line information is definitely wrong when any one of the following three criteria is not met:
· the word upper line is not very close to the top of the line;
· the height of the line zone below the word baseline does not exceed the height of the line zone between the line upper line and the word baseline;
· the word baseline lies in the zone between 57% and 100% of the word height.
If the line information is determined to be wrong in step 350, the process goes to step 600, where the connected-component (CC) method is used to calculate the word baseline and upper line (described in detail below). If the line information is determined not to be wrong in step 350, step 380 labels the detailed zone type according to the line and word information described above. The details are shown in Table II.
Table II and the rest of this specification use the following six zone types: zone type No. 1 denotes X-height words; zone type No. 2 denotes ascender words or full-height words; zone type No. 3 denotes descender words or full-height words; zone type No. 4 denotes words taller than the X-height whose upper line is close to the line upper line; zone type No. 5 denotes words taller than the X-height; and zone type No. 6 denotes unknown words.
Table II:
Here,
z1: the height of the upper zone of the word, between the line top line and the line upper line;
z2: the height of the middle zone of the word;
z3: the height of the lower zone of the word;
ul_w: the y-coordinate of the word upper line (in the word coordinate system);
ul_l: the y-coordinate of the line upper line (in the word coordinate system);
nzw: the ratio of w1 to w2;
w1: the width of the non-zero interval of the horizontal projection histogram in the upper zone of the word;
w2: the width of the word;
th: the ratio of max1 to max2;
max1: the maximum of Pv[i] for i near the word top line;
max2: the maximum of Pv[i] for i near the word upper line.
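The five ratio features built from these quantities can be computed as in the sketch below. It assumes the raw measurements have already been taken; the function name `zone_features` and the dictionary keys are hypothetical, and the input numbers are made up for illustration. The classification conditions of Table II themselves are not reproduced.

```python
def zone_features(z1, z2, z3, ul_w, ul_l, w1, w2, max1, max2):
    """Compute the ratio features used when labeling the zone type."""
    if z1 + z2 == 0 or z3 + z2 == 0 or w2 == 0 or max2 == 0:
        raise ValueError("degenerate measurements: a denominator is zero")
    return {
        "upper_ratio": z1 / (z1 + z2),                    # z1/(z1+z2)
        "lower_ratio": z3 / (z3 + z2),                    # z3/(z3+z2)
        "nzw": w1 / w2,                                   # w1/w2
        "th": max1 / max2,                                # max1/max2
        "upper_line_gap": abs(ul_w - ul_l) / (z1 + z2),   # |ul_w-ul_l|/(z1+z2)
    }


features = zone_features(z1=10, z2=20, z3=10, ul_w=12, ul_l=10,
                         w1=16, w2=100, max1=29, max2=100)
for name, value in features.items():
    print(name, round(value, 3))
```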
Here, this embodiment uses two features, nzw and th, in labeling the detailed word zone type. These two features are intended for checking the line information produced by the projection method and are not valid for checking the results of the connected-component method. When nzw and th are each used as features in the classification conditions, the detailed zone type can be obtained satisfactorily.
Fig. 3 shows the line information results of the projection method. The upper thin line in each of the words "were" and "labor" is the calculated word top line, and the lower thin line crossing each of the words "were" and "labor" is the word upper line. Clearly, the result for the word "were" is wrong, while the result for the word "labor" is correct. How the two features nzw and th work is explained below.
In Fig. 4, a horizontal projection histogram is applied to the upper zone of the word to obtain w1, whether the word is "were" or "labor". nzw is then easily obtained. The nzw value of the word "were" is 0.73, while that of the other word is 0.16. These results differ markedly, so this embodiment sets a threshold to judge whether the line information produced by the projection method is correct.
On the right side of Fig. 3, a vertical projection histogram is applied, word by word, to obtain the maxima of the histogram near the word top line and near the word upper line. The th value of the word "were" is found to be 0.90, while that of the other word is 0.29. These results also differ markedly, so it is easy to set a threshold to judge whether the result is correct.
In summary, to avoid the errors that arise when judgment by a single feature is too strict, the two features are combined. If both nzw and th exceed their thresholds, the line information produced by the projection method must be wrong; otherwise it is reliable.
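The combined rule can be sketched as below. The source does not state the threshold values, so the default of 0.5 for both features is purely an assumption chosen to separate the two example words; the function name is also hypothetical.

```python
def projection_result_is_wrong(nzw, th, nzw_threshold=0.5, th_threshold=0.5):
    """Return True only when BOTH features exceed their thresholds.

    The threshold defaults are illustrative assumptions, not values
    taken from the patent text.
    """
    return nzw > nzw_threshold and th > th_threshold


# Feature values quoted in the text for the two example words.
print(projection_result_is_wrong(nzw=0.73, th=0.90))  # word "were"
print(projection_result_is_wrong(nzw=0.16, th=0.29))  # word "labor"
```

Requiring both features to exceed their thresholds means that a single outlying feature does not by itself reject a projection result, which matches the stated aim of avoiding an overly strict single-feature test.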
In step 400 of Fig. 1, the word upper line is verified; the additional conditions used as criteria are shown in Table III.
Table III
Here,
R: aspect ratio = word width/(z1+z2)
zz1: the height of the upper zone of the line
zz2: the height of the middle zone of the line
If the word upper line can be verified in step 400, the process goes to step 800 and determines whether the final information is correct. If the word upper line cannot be verified in step 400, the process goes to step 450, where the following criteria are used to judge whether the word is a "short word".
Table IV
Here,
max1: the maximum of Pv[i] for i in the upper third of the middle and upper zones;
max2: the maximum of Pv[i] for i in the middle third of the middle and upper zones;
ul_l: the y-coordinate of the line upper line (in the word coordinate system);
R: aspect ratio = word width/(z1+z2)
zz1: the height of the upper zone of the line
zz2: the height of the middle zone of the line
nzw and th: the same as the criteria used in step 380.
If the above additional conditions are satisfied, the word is a "short word".
If the word is determined not to be a "short word", the process goes to step 600, where the connected-component (CC) method is used to calculate the word baseline and upper line (described in detail below). If the word is determined to be a "short word", step 500 calculates the word upper line of the short word by the projection method in a designated area (such as the area on the right of Fig. 5), and step 700 then labels the detailed zone type again according to that word upper line.
Table V
Fig. 5 shows the upper line of the word "The". The upper thin line on the word is the result of the normal projection method, while the lower thin line crossing the word is the result of "projection in the short-word designated area". It is easy to see that the former is wrong and the latter is correct.
As shown in Fig. 5, max1 is smaller than max2, so the short word "The" belongs to short word type 1. The corresponding processing method from Table V is then applied: a vertical projection histogram in the specific word zone, shown in white, around the lower thin line crossing the word. The correct word upper line can then be obtained.
As shown in Fig. 7, the top line of the short word "He" is close to the line top line, and the line information of the whole line satisfies the specified conditions, so the word belongs to short word type 2.
The corresponding processing method is then applied: a vertical projection histogram in the word zone near the line upper line, shown as the white specific zone projected around the upper part of Fig. 7. Finally, the correct word upper line is obtained, shown as the upper thin line directly on top of the letter "e" in Fig. 7, whereas the normal projection method yields a wrong word upper line.
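Step 600 is explained below. It computes the word baseline and upper line with the connected component (CC) method and is executed when step 350 determines that the line information is wrong, or when step 450 determines that the word is not a "short word", that is, when reliable line information for the word cannot be obtained by the projection method. The CC method is widely used in the art. This embodiment provides a new CC method in which connected components with an inappropriate area or position are regarded as noise or punctuation and are eliminated, so that a more accurate word baseline and upper line are obtained.
Fig. 8 is an expanded flowchart of step 600 in Fig. 1.
In step 610 of Fig. 8, after the connected component (CC) information of the word has been labelled, the connected components whose area is smaller than SmallArea are deleted. Here SmallArea is chosen as 0.015 times the word area; components smaller than SmallArea are regarded as noise and should be deleted. The ratio of SmallArea to the word area is chosen empirically or according to practical requirements.
In step 620 of Fig. 8, the connected components below the word midline are deleted. Step 620 is designed to remove "lower punctuation" such as commas and full stops.
MTop (the mean of the tops) is then computed from the remaining connected components, and in step 630 of Fig. 8 the connected components above the word midline are deleted. Step 630 is designed to remove "upper punctuation" such as quotation marks.
MBot (the mean of the bottoms) is then computed from the remaining connected components, and the baseline is obtained as the average bottom value of the remaining components whose bottom is not larger than MBot.
In addition, based on the MTop obtained, the upper line is obtained as the average top value of the remaining components whose top is not smaller than MTop.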
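The connected component procedure of steps 610 to 630 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the `CC` record, the function name, the image-style coordinate system (y grows downward), the reading of "below/above the midline" as "entirely below/above", and the fallback for an empty selection are all hypothetical. The word midline is passed in because the text does not say here how it is computed.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class CC:
    top: float     # y of the component's top edge (assumed: y grows downward)
    bottom: float  # y of the component's bottom edge
    area: float    # pixel area of the component

def cc_baseline_and_upper_line(ccs, word_area, mid_y, small_ratio=0.015):
    """Return (baseline_y, upper_line_y), or None if nothing usable remains."""
    small_area = small_ratio * word_area
    # Step 610: drop components smaller than SmallArea (treated as noise).
    ccs = [c for c in ccs if c.area >= small_area]
    # Step 620: drop components lying entirely below the midline (commas, full stops).
    ccs = [c for c in ccs if c.top <= mid_y]
    if not ccs:
        return None
    # MTop is taken before step 630, following the order given in the text.
    m_top = mean(c.top for c in ccs)
    # Step 630: drop components lying entirely above the midline (quotation marks).
    ccs = [c for c in ccs if c.bottom >= mid_y]
    if not ccs:
        return None
    m_bot = mean(c.bottom for c in ccs)
    # Baseline: average bottom of components whose bottom is not larger than MBot.
    baseline = mean(c.bottom for c in ccs if c.bottom <= m_bot)
    # Upper line: average top of components whose top is not smaller than MTop.
    tops = [c.top for c in ccs if c.top >= m_top] or [c.top for c in ccs]  # assumed fallback
    return baseline, mean(tops)
```

With image coordinates, the "bottom not larger than MBot" filter discards descenders and the "top not smaller than MTop" filter discards ascenders, which is one plausible reading of why the two averages land on the baseline and the upper line.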
After the word upper line has been computed in step 500 or step 600 of Fig. 1, the detailed word zone type is labelled in step 700 of Fig. 1 according to the draft word upper line; this label is used for the final check of the line information and for the final word zone type labelling.
In step 800 of Fig. 1, whether the final line information is correct is judged from the detailed zone type:
· When the zone type is 3 (and cannot at the same time be 2 to 5), the word upper line must satisfy the following criterion:
((z1-1)/(z2+1)<0.15) and (z3/(z2+1)<0.83).
Otherwise, the line information of the word is wrong.
· When the zone type is 2 or 5, the word upper line must satisfy the following criterion:
((z1-1)/(z2+1)<0.83) and (z3/(z2+1)<0.83).
Otherwise, the line information of the word is wrong, and the zone type and X-height are unknown.
Here,
z1: the height of the upper zone of the word;
z2: the height of the middle zone of the word;
z3: the height of the lower zone of the word;
Zone type: the types listed in Table II.
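The two ratio tests of step 800 can be written down directly. The sketch below is illustrative only; the function name is hypothetical, and returning `True` for zone types other than 2, 3 and 5 is an assumption, since the text states criteria only for those three types.

```python
def final_line_info_ok(zone_type, z1, z2, z3):
    """Check the step 800 criteria on the upper (z1), middle (z2) and lower (z3) zone heights."""
    upper_ratio = (z1 - 1) / (z2 + 1)
    lower_ratio = z3 / (z2 + 1)
    if zone_type == 3:
        return upper_ratio < 0.15 and lower_ratio < 0.83
    if zone_type in (2, 5):
        return upper_ratio < 0.83 and lower_ratio < 0.83
    return True  # assumption: no criterion is given for the other zone types
```

For example, with z2 = 20 a type 3 word tolerates an upper zone of at most about 4 pixels, whereas a type 2 or type 5 word may have an upper zone of up to about 18 pixels.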
In step 900 of Fig. 1, the final word zone type is labelled according to Table VI.
Table VI
Here,
Zone type: zone type No. 1 is the zone type of X-height words; zone type No. 2 is the zone type of ascender or full-height words; zone type No. 3 is the zone type of descender or full-height words; zone type No. 4 is the zone type of X-height-plus words (words taller than the X-height) whose upper line is close to the upper line of the text line; zone type No. 5 is the zone type of X-height-plus words; and zone type No. 6 denotes unknown words.
Application of Embodiment 1
The computed X-height value and the identified zone type of an English word (ascender, descender, full-height or X-height) are very useful information in both OCR and OFR processing.
For example, the identified zone type can be used to partition the dictionary for English OCR by zone type. This improves the speed of the ICR engine by reducing pattern-matching time, because each dictionary contains fewer candidate characters. It also benefits accuracy, because a character is never misrecognized as a candidate character of a different zone type. Fig. 9 shows one application of the zone type identified according to this embodiment: the English character image is tested for the X-height, ascender, descender and full-height zone types in turn and recognized with the corresponding dictionary; otherwise it is recognized with the dictionaries of all zone types. The resulting character code and confidence value are thereby obtained.
Zone type information can also be used in OCR post-processing, where it helps correct confusion between upper-case characters and the corresponding lower-case characters.
In English OCR character segmentation, zone type information can be used to judge whether a "cut path" (one of the possible character segmentation results) is wrong.
As for the computed X-height value of an English word, it is very useful for normalizing English words in both OCR and OFR processing. English words of different sizes can be normalized to a specified height according to their X-height value. Features of the same scale (for both OCR and OFR) can then be extracted from the normalized image, which avoids the influence of the font size.
The X-height value is also important for recognizing the font size of English words. The word or line height cannot be used directly for font size recognition, because the characters in a word or line do not necessarily occupy all three zones. Fig. 10 shows one application of the English word X-height value obtained according to this embodiment. The word X-height, the image resolution and the recognized typeface are used to obtain a list of X-heights of the different font sizes at the given image resolution and typeface; the word X-height is then matched against this list, which yields the font size.
This embodiment also evaluates the accuracy of the zone type identification method on printed text lines. The evaluation samples are printed text lines in 3 different fonts; over a total of 3302 words, the accuracy of zone type identification is 99.67%.
In short, this embodiment provides a method for identifying the zone types of an input text line and a method for computing the X-height of the words in an input text line, both of which comprise: a line information calculation step of computing the line information of the input text line by the projection method; a word information calculation step of computing the word information of a selected word in the text line by the projection method; a reliability judgment step of judging whether the word information is reliable; and a zone type labelling step of labelling the zone type of the word according to the line information and the word information; wherein, if the reliability judgment step judges the computed word information to be unreliable, the word baseline and upper line of the selected word are computed with the connected component method.
This embodiment combines the projection method and the connected component (CC) method in line information extraction. Both methods are specially designed to give better results, and the effects of skew, noise and the like are fully considered and limited. Very strict checks and verifications are performed several times so that the computed word upper line becomes more and more accurate. Specifically, the line baseline and the projection method are used together in the word baseline calculation to obtain an accurate word baseline (see step 300); the projection histogram in a designated area is used flexibly to compute the upper line of short words (see the criteria in step 450, step 500 and Figs. 5, 6 and 7); in the new CC method, connected components with an inappropriate area or position are regarded as noise or punctuation and eliminated to obtain a more accurate word baseline and upper line; and the CC method is used when reliable line information for a word cannot be obtained by the projection method (see step 600 and Fig. 8). In addition, the detailed word zone type is labelled according to the draft word upper line for use in the final check of the line information and in the final word zone type labelling (see steps 380, 700, 900 and Figs. 3 and 4).
This embodiment can therefore compute the X-height value of an English word accurately and identify the zone type of an English word (ascender, descender, full-height or X-height) reliably.
[Embodiment 2]
The second embodiment concerns a priori font recognition at the word level. It uses zone type information to divide English words into four types, each with its own dictionary. It can identify the font information of English words (typeface, serif, weight, slope and pitch) with higher accuracy and speed.
It supports the recognition of at least 10 typefaces, far more than currently popular OCR software with document layout recovery, such as Omnipage and FineReader.
Fig. 11 is the main flowchart of Embodiment 2.
In step 100 of Fig. 11, the height of each word image is normalized to WordHeight (here WordHeight = 35) by bilinear interpolation, so that words of any size can be processed. This is better than normalizing by X-height, because the X-height value sometimes cannot be obtained very accurately, which would affect the subsequent feature extraction and classification.
Step 200 of Fig. 11 extracts the font features. In principle any global texture feature can be used in this step. Here complex 2D isotropic Gabor filters, or other global features, are used to extract font features from the normalized word image. 2D isotropic Gabor filters are well known in the art, and the use of such Gabor filters to extract features of texture images is disclosed in CN1271140A; details are described below.
Specifically, 12 isotropic Gabor filters with 6 angles (0, 30, 60, 90, 120 and 150 degrees) and two frequency values (0.14 and 0.11) are used. For the sake of speed and accuracy, this embodiment splits each complex Gabor-filtered image into 2 images, the "real part" and the "imaginary part". The mean and the deviation are then computed from each part separately. In total 48 features are extracted, forming a 48-dimensional feature vector (12×2×2).
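To make the 12×2×2 count concrete, the sketch below builds a 48-dimensional vector from 12 complex Gabor responses in plain Python. It is only an illustration under assumptions: the Gaussian width `sigma`, the kernel half-size `half`, the zero padding at the borders, the use of the population standard deviation as the "deviation", and all function names are hypothetical; only the six angles and the two frequencies come from the text.

```python
import cmath
import math
from statistics import mean, pstdev

ANGLES_DEG = (0, 30, 60, 90, 120, 150)  # from the text
FREQUENCIES = (0.14, 0.11)              # from the text

def gabor_kernel(freq, theta, sigma=3.0, half=4):
    """Complex Gabor kernel with an isotropic Gaussian envelope (sigma and half are assumed)."""
    kernel = {}
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            envelope = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            phase = 2.0 * math.pi * freq * (dx * math.cos(theta) + dy * math.sin(theta))
            kernel[(dy, dx)] = envelope * cmath.exp(1j * phase)
    return kernel

def filter_response(image, kernel):
    """Complex filter response at every pixel, treating pixels outside the image as zero."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        for x in range(w):
            acc = 0j
            for (dy, dx), k in kernel.items():
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    acc += image[yy][xx] * k
            out.append(acc)
    return out

def font_features(image):
    """48 features: mean and deviation of the real and imaginary parts of 12 responses."""
    feats = []
    for freq in FREQUENCIES:
        for angle in ANGLES_DEG:
            resp = filter_response(image, gabor_kernel(freq, math.radians(angle)))
            for part in ([r.real for r in resp], [r.imag for r in resp]):
                feats += [mean(part), pstdev(part)]
    return feats
```

The direct double loop is slow on a full 35-pixel-high word image; a practical system would more likely use FFT-based filtering, but the feature layout would be the same.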
Step 300 of Fig. 11 determines the zone type of the word.
According to the zones occupied by the letters of a word, words can be classified into four zone types (X-height words, ascender words, descender words and full-height words).
English words of different zone types usually have different texture features, because different combinations of characters (strokes) occupy the different zones; it is therefore proposed to use separate dictionaries containing words of the different zone types. The inventors have found by experiment that descender words and ascender words can be merged into one type, because the upper zone and the lower zone have roughly the same height. The inventors have also found that all-capital words have texture features different from those of other ascender words, so this embodiment separates them from the ascender words.
In the end there are four dictionaries: one for X-height words, one for ascender and descender words, one for full-height words and one for all-capital words.
All-capital words do not usually occur in English text, and it is very hard to judge whether a word is all capitals without knowing the details of each of its letters. In the usual case, therefore, this processing can sometimes obtain only the ascender/descender, full-height or X-height zone type. The zone type of an English word cannot always be obtained, especially when the word is very short.
Fig. 12 is an expanded flowchart of step 400 of Fig. 11.
If the zone type of the word is known, only one recognition pass with the known zone type is required. Each dictionary of a given zone type contains at least 40 detailed candidate typefaces (10 typefaces × 2 weights × 2 slopes). The feature vector extracted from the normalized word image is matched in turn against the feature vector of each detailed candidate typeface in the dictionary of the given word zone type. A distance value is also computed, which represents the mathematical distance between the feature vector extracted from the normalized word image and the feature vector of a detailed candidate typeface in a dictionary of the given word zone type.
A Bayes classifier is used in this classification step; other classifiers, such as a minimum distance classifier, are equally applicable. These classifiers, including the Bayes classifier, are well known in the art and are not described in detail here.
In step 500 of Fig. 11, the result of step 400 is verified. D0 is the distance value from the Bayes classifier. TH0 (= -480) is an empirical value obtained by experiment. If the resulting distance value D0 is smaller than TH0, the result is reliable. Otherwise the word zone type is deemed possibly incorrect, and step 600 is executed to recognize with multiple dictionaries in priority order.
Fig. 13 is an expanded flowchart of step 600 of Fig. 11.
The step shown in Fig. 13 handles the cases in which the word zone type cannot be obtained or may be incorrect, namely when step 300 determines that the zone type is unknown and when the result of step 400 is not verified in step 500. If the tops of all letters in a word are at the same height, it is hard to distinguish the top line from the upper line. Similarly, if the bottoms of all letters in a word are at the same height, it is hard to distinguish the baseline from the bottom line. In general, therefore, the zone type of X-height and all-ascender words is harder to obtain than that of ascender and descender words, and the zone type of full-height words is the easiest to obtain. The priority order of dictionary selection in this step is accordingly as shown in Fig. 13.
The distance values D1, D2, D3 and D4 all come from the Bayes classifier; each represents the mathematical distance between the feature vector extracted from the normalized word image and the feature vectors of the detailed candidate typefaces in the standard dictionary of one of the four word zone types. TH1 (= -500) is an empirical value obtained by experiment. If Di (i = 1, 2, 3, 4) is smaller than TH1, the result recognized with the corresponding dictionary is very reliable; recognition with the other dictionaries is then skipped and the result is output as the final result. Otherwise, the distance values D1, D2, D3 and D4 are compared with one another to find the minimum distance, and the corresponding typeface is the final detailed typeface (selected from the 40).
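The verification and fallback logic of steps 400 to 600 can be sketched as below. This is a hedged illustration, not the patent's code: the function name, the `classify` callable (which stands in for the Bayes classifier and is assumed to return a typeface and a distance), the dictionary keys, and the `priority` argument are all hypothetical. The actual priority order is the one given in Fig. 13, which is not reproduced in the text, so it is left as a parameter.

```python
TH0 = -480  # empirical threshold from the text (single-dictionary verification)
TH1 = -500  # empirical threshold from the text (early exit in multi-dictionary mode)

def recognize_typeface(features, dictionaries, classify, zone_type=None, priority=()):
    """Return (typeface, distance) for one word.

    classify(features, dictionary) -> (typeface, distance); smaller distance is better.
    """
    # Steps 400 and 500: one pass with the known zone type, accepted if D0 < TH0.
    if zone_type is not None:
        face, d0 = classify(features, dictionaries[zone_type])
        if d0 < TH0:
            return face, d0
    # Step 600: try the dictionaries in priority order.
    best = None
    for name in priority:
        face, d = classify(features, dictionaries[name])
        if d < TH1:
            return face, d          # very reliable: stop without trying the rest
        if best is None or d < best[1]:
            best = (face, d)
    return best                     # otherwise the minimum-distance result
```

The early exit on `TH1` is what saves time: when a high-priority dictionary already gives a very small distance, the remaining dictionaries are never consulted.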
Application of Embodiment 2
The method described above can be used for page layout analysis and recovery. It can identify the font information of every word in an English document image.
It can also be used in an omnifont OCR system. The font information of English words can be predicted and used to select the OCR dictionary for the specific font, which improves OCR accuracy.
The above flow is summarized in Fig. 14: block selection, line segmentation and word segmentation are applied to the English document image to obtain word images, word zone type identification is performed, and the processing defined by this embodiment is then carried out. The font information obtained for each word can be used for layout recovery, single-font OCR and so on.
Embodiment 2 was compared with popular commercial OCR software that offers font recognition (layout recovery). The accuracy of Embodiment 2 is much higher than theirs.
The typeface recognition accuracies are listed in Table VII.
Benchmark 1: printed text lines in which all words have a uniform font (820 lines, 93344 words in total).
Benchmark 2: printed text lines with 2 different fonts (1560 lines, 18009 words in total).
Benchmark 3: printed text lines with 3 different fonts (288 lines, 3637 words in total).
In short, this embodiment provides a font recognition method comprising: a normalization step of normalizing a word of the input image to a predetermined height; a feature extraction step of extracting features from the normalized word; a judgment step of judging the zone type of the word; a classification step; and a recognition step of recognizing the font information of the word based on the result of the classification step.
This font recognition method normalizes the height of the word image to a given height, so that the features of word images of the same zone type are extracted at the same scale; this matters because, in typeface recognition, the dictionaries are partitioned by word zone type (see step 100 in Fig. 11). The word zone type is considered first in order to select the corresponding zone type dictionary for typeface classification (see Fig. 12). When the resulting distance is larger than a given threshold or the word zone type is unknown, multi-dictionary recognition in an a priori priority order is used (see Fig. 13). This embodiment thus makes full use of the zone type information, and the overall OCR processing becomes more efficient.
[Embodiment 3]
Embodiment 3 uses a word-pair mechanism in the OFR processing of a line image. The mechanism is based on a font classification method for English words and also takes into account how fonts are distributed in real English text. It uses a two-stage result adjustment technique based on contextual font information, which gives higher accuracy and a more uniform output within a text line. An accurate font size recognition method is also used in this embodiment.
It supports the recognition of 10 typefaces, far more than ordinary OCR software with document layout recovery (such as Omnipage and FineReader). It achieves much higher accuracy than other software, for both typeface and font size.
Fig. 15 is the main flowchart of Embodiment 3.
In step 100 of Fig. 15, the X-height values and zone types of all words in the line are computed first. Preferably the X-height values and zone types obtained by the method of Embodiment 1 above are used; alternatively, prior-art methods may be used, such as the character height distribution histogram method disclosed in US5883974 or the vertical projection profile method disclosed, for example, in "Optical font recognition from projection profiles" (Electronic Publishing, Vol. 6(3), 249-260, September 1993).
In step 200 of Fig. 15, all words in the line are divided into word pairs, starting from the first word of the line.
Experiments have shown that the recognition results of longer words (words with a larger width) are more reliable than those of shorter words.
Each word pair consists of two adjacent words (if the number of words in the line is odd, the last word pair contains only one word). The fonts of the longer word and of the shorter word are identified by different methods, as described below.
In step 300 of Fig. 15, the longer word of each word pair can be recognized by any of various font classification methods for English words. These include the method disclosed in Embodiment 2 above, the word-based recognition method for English words disclosed in "Script identification in printed bilingual documents" by D Dhanya, A G Ramakrishnan and Peeta Basa Pati (Sadhana Vol. 27, Part 1, February 2002, pp. 73-82, printed in India), and the a posteriori font recognition methods disclosed in US6,337,924 and US6,496,600.
In step 400 of Fig. 15, the font of the shorter word in each word pair is identified.
Two facts must be considered first in this step.
1) In real text lines, if the two longer words around a shorter word (to its left and right) have the same detailed typeface, the shorter word very probably has that detailed typeface too.
2) As mentioned for step 200, in font classification of English words the font recognition result of a longer word (one with a larger width) is more reliable than that of a shorter word.
With these facts in mind, the flowchart shown in Fig. 16 is used to carry out step 400 of Fig. 15.
In step 410 of Fig. 16, it is first judged whether the word to be recognized lies at the edge of the text line. If it does not, there must be two longer words around the word to be recognized.
In steps 420 to 460 of Fig. 16, the detailed typefaces of the two longer words around the word to be recognized are obtained and compared with each other. (That is, typeface 1 is set from the longer word of the same word pair, and typeface 2 is set from the longer word of the previous or next word pair.)
Step 420 determines whether the shorter word of the current word pair is the first word of the pair. If the result of step 420 is yes, the detailed typeface of the longer word in the previous word pair is set as "typeface 2"; otherwise, in step 440, the detailed typeface of the longer word in the next word pair is set as "typeface 2".
In step 450, the detailed typeface of the longer word in the current word pair is set as "typeface 1".
Step 460 determines whether "typeface 1" equals "typeface 2". If the detailed typefaces of the two longer words are the same, the shorter word of the current word pair need not be recognized; its detailed typeface is taken to be the same "typeface 1" (step 480). Otherwise, the shorter word of the current word pair must be recognized by font classification of English words (step 470). These methods were introduced in the description of step 300 and are not described again here.
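The word-pair flow of steps 200 to 400 can be sketched as follows. It is a minimal illustration under assumptions, not the flowchart of Fig. 16 itself: the function name and the `classify` callable (standing in for any word-level font classifier and returning a detailed typeface for a word index) are hypothetical, and ties in width are resolved in favour of the first word of the pair, which the text does not specify.

```python
def identify_line_typefaces(widths, classify):
    """Return one detailed typeface per word of a line, given the word widths."""
    n = len(widths)
    # Step 200: pair adjacent words from the start of the line (last pair may be single).
    pairs = [list(range(i, min(i + 2, n))) for i in range(0, n, 2)]
    longer = [max(p, key=lambda i: widths[i]) for p in pairs]
    result = [None] * n
    # Step 300: classify the longer word of every pair.
    for i in longer:
        result[i] = classify(i)
    # Step 400: handle the shorter word of every two-word pair.
    for k, p in enumerate(pairs):
        if len(p) == 1:
            continue
        short = p[0] if longer[k] == p[1] else p[1]
        if short in (0, n - 1):                  # step 410: word at the line edge
            result[short] = classify(short)
            continue
        # Steps 420-450: typeface 1 from the same pair, typeface 2 from the previous
        # pair if the short word comes first in its pair, else from the next pair.
        neighbour = longer[k - 1] if short == p[0] else longer[k + 1]
        if result[longer[k]] == result[neighbour]:   # step 460
            result[short] = result[longer[k]]        # step 480: inherit, no recognition
        else:
            result[short] = classify(short)          # step 470
    return result
```

In a line set in a single font, only about half of the words (the longer ones, plus any short word at a line edge) are actually passed to the classifier, which is where the speed gain described in the text comes from.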
With this flowchart, font identification of the shorter word in each word pair achieves higher accuracy, higher speed and a more uniform output.
Once the font result of every word has been obtained, step 500 of Fig. 15 adjusts the font results based on the font information of adjacent words.
Fig. 17 is an expanded flowchart of step 500 of Fig. 15.
By comparing the font result of each word with a dictionary containing the standard patterns of multiple fonts, three candidate fonts are extracted for each word in order of similarity, the first candidate font being the one with the greatest matching similarity to the word's font result; the corresponding distance values are obtained at the same time. Three candidate fonts are preferred here on the basis of practice, but two candidate fonts or more than three may also be used in this embodiment according to practical requirements.
In step 510 of Fig. 17, the distribution of detailed typefaces in the line is counted from the first candidate fonts of the words.
The remaining steps are then looped over each word in the line.
When the current word is the first or last word of the line, the processing of this embodiment judges whether the detailed typeface of the line is roughly uniform (step 520 of Fig. 17). The criterion is that more than 80% of the words in the line have the same detailed typeface in any case, or, when the line contains no more than 6 words, that more than 60% of the words in the line have the same detailed typeface. If the detailed typeface of the line is not roughly uniform, the processing does not adjust the current word and continues the loop with the other words of the line.
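The uniformity criterion of step 520 is a simple share test on the first candidate fonts. The following sketch shows one way to express it; the function name is hypothetical and an empty line is treated as non-uniform by assumption.

```python
from collections import Counter

def line_is_roughly_uniform(first_candidates):
    """Step 520 criterion: >80% share, or >60% share when the line has at most 6 words."""
    n = len(first_candidates)
    if n == 0:
        return False  # assumption: an empty line is not considered uniform
    top_count = Counter(first_candidates).most_common(1)[0][1]
    share = top_count / n
    return share > 0.8 or (n <= 6 and share > 0.6)
```

For instance, a 5-word line with four words in one typeface has a share of exactly 80%, which fails the first test but passes the relaxed test for short lines.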
If step 520 finds the detailed typeface of the line to be roughly uniform, processing goes to step 521 of Fig. 17, which judges whether the first candidate font of the current word agrees with the dominant typeface of the line. If it does, the loop continues with the other words of the line.
If step 521 judges that the first candidate font of the current word does not agree with the dominant typeface of the line, processing goes to step 522 of Fig. 17, which judges whether the second candidate font of the current word agrees with the dominant typeface of the line and whether the corresponding distance also satisfies the specified condition. If both hold, the current second candidate font is swapped with the first candidate font; otherwise processing goes to step 523 of Fig. 17, which judges whether the current third candidate font agrees with the dominant typeface of the line and whether the corresponding distance also satisfies the specified condition.
If step 523 judges that the current third candidate font agrees with the dominant typeface of the line and the corresponding distance also satisfies the specified condition, the current third candidate font is swapped with the first candidate font; otherwise processing goes to step 530 of Fig. 17, which judges whether the candidate fonts are reliable. The candidate fonts are deemed unreliable when the distances of all candidate fonts are large; in that case recognition is considered not to have succeeded, and the first candidate font is set to the font of the adjacent word (step 540 of Fig. 17).
When the current word is not the first or last word of the line, the processing of this embodiment executes step 550 of Fig. 17, which judges whether the detailed typeface of the line is roughly uniform and whether the first candidate font of the current word agrees with the dominant typeface of the line. The criterion is that more than 80% of the words in the line have the same detailed typeface.
Step 551 of Fig. 17 further judges whether the two adjacent words have the same detailed typeface and whether the first candidate font differs from it.
Steps 552 and 553 of Fig. 17 further judge whether the second and third candidate fonts agree with the two adjacent words and whether the distances of the candidate fonts satisfy the specified condition.
If step 552 judges that the second candidate font of the current word agrees with the two adjacent words and the corresponding distance also satisfies the specified condition, the current second candidate font is swapped with the first candidate font; otherwise processing goes to step 553, which examines the current third candidate font. If step 553 judges that the third candidate font of the current word agrees with the two adjacent words and the corresponding distance also satisfies the specified condition, the current third candidate font is swapped with the first candidate font; otherwise processing goes to step 560 of Fig. 17, which judges whether the candidate fonts are reliable. The candidate fonts are deemed unreliable when the distances of all candidate fonts are large; in that case recognition is considered not to have succeeded, and the first candidate font is set to the font of the adjacent words (step 570 of Fig. 17).
Step 600 of Fig. 15 further adjusts the font results based on the font distribution information of the line.
Fig. 18 is an expanded flowchart of step 600 of Fig. 15.
In step 610 of Fig. 18, the distributions of detailed typeface, serif and pitch in the line are counted from the adjusted first candidate fonts of the words.
Step 620 of Fig. 18 judges whether the serif and the pitch are completely uniform in the line and whether the detailed typeface is roughly uniform.
The conditions used are as follows:
the first candidate fonts of all words in the line have the same serif;
the first candidate fonts of all words in the line have the same pitch;
more than 75% but not 100% of the words in the line have the same detailed typeface.
When all three conditions are satisfied, the first candidate fonts in the line are set to the dominant detailed typeface, and the first candidate font of each word is used as the recognized font result. When any one of the three conditions is not satisfied, the result of step 620 is no; the processing of this embodiment then performs no adjustment and uses the current first candidate font of each word as the recognized font result.
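The three conditions of step 620 and the resulting adjustment can be sketched as below. The representation of a word's first candidate font as a dictionary with `face`, `serif` and `pitch` keys, and the function name, are assumptions made only for this illustration.

```python
from collections import Counter

def adjust_by_line_distribution(words):
    """Apply the step 620 rule to a line of first candidate fonts.

    Each word is a dict with the (assumed) keys 'face', 'serif' and 'pitch'.
    """
    if not words:
        return words
    same_serif = len({w["serif"] for w in words}) == 1
    same_pitch = len({w["pitch"] for w in words}) == 1
    face, count = Counter(w["face"] for w in words).most_common(1)[0]
    share = count / len(words)
    # More than 75% but not 100% of the words share the dominant detailed typeface.
    if same_serif and same_pitch and 0.75 < share < 1.0:
        return [dict(w, face=face) for w in words]
    return words  # no adjustment
```

Requiring identical serif and pitch before overriding a minority typeface keeps the rule conservative: a word that differs in those coarse properties is more likely to be a genuine font change than a misclassification.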
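In step 700 of Fig. 15, the font size is computed from the image resolution, the recognized typeface and the word X-height (or the word zone type).
Fig. 19 is an expanded flowchart of step 700 of Fig. 15.
If the X-height of the word is available (step 100 of Fig. 15), the left branch is executed.
In step 720 of Fig. 19, the "image resolution / typeface / font size / X-height" table in the font size dictionary is looked up with the input image resolution, which is known at input time, and the recognized typeface, giving a list of the X-heights of the different font sizes.
In step 730 of Fig. 19, the list is searched for the X-height nearest to that of the input word. The X-height in the list with the smallest |x-xi|/xi is the nearest one, and the corresponding font size is the recognized size. Here "x" is the input X-height value and "xi" is the i-th X-height value in the list.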
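The nearest-match rule of step 730 amounts to minimizing the relative error |x-xi|/xi over the list. A minimal sketch follows; the function name is hypothetical and the table values in the usage note are made-up numbers for illustration, not data from the font size dictionary.

```python
def nearest_font_size(x, x_height_by_size):
    """Return the font size whose listed X-height xi minimizes |x - xi| / xi."""
    return min(
        x_height_by_size,
        key=lambda size: abs(x - x_height_by_size[size]) / x_height_by_size[size],
    )
```

With a hypothetical table `{8: 12.0, 10: 15.0, 12: 18.0}` and a measured X-height of 16.4, the relative errors are about 0.093 for size 10 and 0.089 for size 12, so size 12 is chosen even though 15.0 is closer in absolute terms. Dividing by xi makes the tolerance grow with the font size.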
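Otherwise, if the x-height of the word is unavailable or inaccurate, the right branch is executed.
Data 740 is the word zone type (that of the selected dictionary) obtained from the typeface classification.
Steps 750 and 760 of FIG. 19 are similar to steps 720 and 730 of FIG. 19; the only difference is that they are based on the word height rather than the x-height. Step 750 of FIG. 19 queries the a priori font size dictionary, which contains an "image resolution / typeface / font size / word height" table, with the input image resolution known at input time and the recognized typeface, and obtains a list of word heights for the different font sizes. Step 760 of FIG. 19 searches that list for the word height nearest to the word height of the input word
and takes the corresponding font size as the recognized size.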
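The choice between the two branches of FIG. 19 can be sketched as a small dispatcher. The function and parameter names are hypothetical; the sketch also assumes that the caller has already selected the word-height table matching the word zone type (data 740), and that the word-height branch uses the same relative distance as step 730, which the text does not state explicitly.

```python
from typing import Mapping, Optional

Table = Mapping[int, float]  # font size -> tabulated height in pixels


def _nearest(value: float, table: Table) -> int:
    """Font size whose tabulated height h minimises |value - h| / h."""
    return min(table, key=lambda size: abs(value - table[size]) / table[size])


def recognize_font_size(x_height: Optional[float], word_height: float,
                        xheight_table: Table, wordheight_table: Table) -> int:
    """Left branch when a usable x-height exists, right branch otherwise."""
    if x_height is not None and x_height > 0:
        return _nearest(x_height, xheight_table)     # steps 720 and 730
    return _nearest(word_height, wordheight_table)   # steps 750 and 760
```

Passing `None` as the x-height models the case where step 100 could not produce a reliable value, so the word height alone decides the size.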
Application of Embodiment 3
The method introduced in Embodiment 3 can be used for page layout analysis and recovery. It can identify the font information of English text lines with higher accuracy and speed.
It can also be used in an omnifont OCR system. The font information can be predicted and used to select the OCR dictionary for the specified font, which improves OCR accuracy.
The flow can be simplified as shown in FIG. 20: block selection and line segmentation are performed on an English document image to obtain line images, and the processing defined in this embodiment is then applied. The font information obtained can be used for layout recovery, single-font OCR and so on.
The applicant compared this embodiment with popular commercial OCR software that offers font recognition (layout recovery). In both typeface and font size, the accuracy of this embodiment is far higher than theirs. The applicant also compared this embodiment with Embodiment 2 above, in which font classification of single English words alone replaces steps 200-600. This embodiment improves both accuracy and speed.
Table VIII shows the comparison of typeface recognition accuracy:
Table IX shows the comparison of font size recognition accuracy:
On the "speed benchmark", this embodiment is 28.9% faster than Embodiment 2 above.
Benchmark 1: printed text lines in which all words have a uniform font (8020 lines, 93344 words in total)
Benchmark 2: printed text lines with 2 different fonts (1560 lines, 18009 words in total)
Benchmark 3: printed text lines with 3 different fonts (288 lines, 3637 words in total)
Speed benchmark: printed text lines selected from Benchmarks 1-3 (201 lines, 2659 words in total)
In short, this embodiment introduces a method of optical font recognition comprising: a division step of dividing the words of an input text image into word pairs line by line; a font recognition step of recognizing the font information of the longer word in each word pair; a font identification step of identifying the font information of the shorter word in each word pair, based on the font information of the longer word in the word pair containing the shorter word and the font information of the longer word in the word pair adjacent to the shorter word; a fine adjustment step of adjusting the font information of a word according to the font information of its adjacent words; a coarse adjustment step of adjusting the font information of a line according to the distribution of font information in the line; and a recognition step of recognizing the font size of the words of the line.
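The word-pair idea can be sketched as follows. Several details are assumptions made only for illustration and are not confirmed by this passage: pairs are formed from consecutive words, word length is measured by pixel width, a leftover word is classified on its own, and a shorter word is classified directly when the two neighbouring longer words disagree. `classify` stands for any hypothetical single-word font classifier.

```python
def make_pairs(widths):
    """Group word indices of one line into consecutive pairs of (longer, shorter)."""
    pairs = []
    for i in range(0, len(widths) - 1, 2):
        pairs.append((i, i + 1) if widths[i] >= widths[i + 1] else (i + 1, i))
    if len(widths) % 2:  # odd word left over: it has no partner
        pairs.append((len(widths) - 1, None))
    return pairs


def fonts_for_line(widths, classify):
    """Classify the longer words fully and derive the shorter words from them."""
    pairs = make_pairs(widths)
    fonts = [None] * len(widths)
    for longer, _ in pairs:
        fonts[longer] = classify(longer)
    for k, (longer, shorter) in enumerate(pairs):
        if shorter is None:
            continue
        # The adjacent pair is the one lying on the shorter word's side.
        nk = k + 1 if shorter > longer else k - 1
        if 0 <= nk < len(pairs) and fonts[pairs[nk][0]] == fonts[longer]:
            fonts[shorter] = fonts[longer]      # both longer neighbours agree
        else:
            fonts[shorter] = classify(shorter)  # assumed fallback on disagreement
    return fonts
```

Under these assumptions only about half of the words go through the full classifier when a line is typeset uniformly, which is consistent with the speed gain the text reports over word-by-word classification.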
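This embodiment introduces a word-pair mechanism into line font recognition; the mechanism takes into account the characteristics of word-level font recognition and the way fonts are distributed in real English text (see steps 200 and 300 and FIG. 16). This embodiment also introduces post-processing that uses a two-level result adjustment technique (based respectively on the font information of adjacent words in the line and on the font distribution information of the line), which achieves higher accuracy and more uniform output within a text line (see FIG. 17 and FIG. 18). This embodiment further uses a precise size recognition technique that combines the "zone type information" and the "x-height value" (see FIG. 19). As a result, this embodiment achieves higher accuracy and more uniform output within a text line. The method introduced in this embodiment supports the recognition of 10 typefaces, far more than popular OCR software with document layout recovery such as Omnipage and FineReader, and achieves much higher accuracy than that software in both typeface and font size.
[Embodiment 4]
As described above, the first, second and third embodiments may be used alone or in combination. FIG. 21 is a flowchart for optical character recognition according to a fourth embodiment of the present invention, in which the three embodiments above can be applied.
Embodiment 1 can be used for zone type identification and x-height calculation of word images; in step 100, Embodiment 1 is used to calculate the word x-height and zone type. Both are very useful information for typeface classification and font recognition.
Steps 200, 300 and 400 involve the word image normalization, font feature extraction and typeface classification already described in Embodiment 2. After typeface classification the detailed typeface (taking weight and slope together) is available, and it can also be used, together with the updated word zone type, for font size recognition, for example in step 600. At the same time, the font information of the English word (typeface, font size, weight, slope, spacing and serif) can be obtained accurately, for example in step 500.
Moreover, Embodiments 1-4 are not limited to English words; given their processing principles, they can be applied directly to other text written in Roman letters.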
In the foregoing description, the present invention has been described in preferred embodiments as a method or a software program. It is readily understood that the present invention is preferably used on any well-known computer system, such as a personal computer. The computer system is therefore not discussed in detail. It should also be noted that an image may be input to the computer system directly (for example by a digital camera) or digitized before being input to the computer system (for example by scanning).
Furthermore, as used herein, a computer-readable storage medium storing a computer program for performing the above method may include, for example, magnetic storage media such as a magnetic disk (for example a floppy disk) or magnetic tape; optical storage media such as an optical disc, optical tape or machine-readable bar code; solid-state electronic storage devices such as random access memory (RAM); or any other physical device or medium used to store a computer program.
Furthermore, those of ordinary skill in the art will readily recognize that an equivalent of the software described above can also be designed in hardware.
With reference to Embodiment 3, FIG. 22 shows an apparatus configured to carry out the optical font recognition process described above. In brief, the apparatus comprises:
division means 100 that divides the words of an input text image into word pairs line by line;
font recognition means 101 that recognizes the font information of the longer word in each word pair;
font identification means 102 that identifies the font information of the shorter word in each word pair, based on the font information of the longer word in the word pair containing the shorter word and the font information of the longer word in the word pair adjacent to the shorter word;
fine adjustment means 103 that adjusts the font information of a word according to the font information of its adjacent words; and
coarse adjustment means 104 that adjusts the font information of a line according to the distribution of font information in the line; and
recognition means 105 that recognizes the font size of the words of the line.
With reference to Embodiment 2, FIG. 23 shows an apparatus configured to carry out the font recognition process described above. In brief, the apparatus comprises:
normalization means 200 that normalizes a word of the input image to a predetermined height;
feature extraction means 201 that extracts features from the normalized word;
judgment means 202 that judges the zone type of the word;
classification means 203; and
recognition means 204 that recognizes the font information of the word based on the result obtained by the classification means.
The classification means 203 further includes comparison means that compares the extracted features of the normalized word with the features of the candidate typefaces in the dictionary of the judged word zone type, and obtaining means that obtains a distance value from the comparison means. The apparatus implementing the font recognition method may therefore further include verification means that verifies the obtained distance value against a predetermined threshold. If the verification means determines that the distance value is less than the predetermined threshold, fine classification means that uses a plurality of dictionaries is employed, the fine classification means being configured to:
compare the extracted features of the normalized word with the features of the candidate typefaces in the dictionary of at least one word zone type other than the judged word zone type,
obtain a distance value from each of the at least one comparison, and
verify the distance value against another predetermined threshold, so that the font information of the word is recognized based on the result of the verification.
In addition, if the judgment means 202 cannot successfully judge the zone type of the word, the fine classification means is used.
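The cooperation of classification means 203, the verification means and the fine classification means can be sketched as below. This is a sketch under several assumptions that the text leaves open: the distance is taken to be Euclidean, the first verification is passed in as a hypothetical predicate `needs_fine` so that the direction of the threshold test is not fixed here, and the second verification is read as accepting fine candidates whose distance does not exceed `fine_threshold`.

```python
import math


def nearest(features, dictionary):
    """Return (typeface, distance) of the closest template in one dictionary."""
    return min(((name, math.dist(features, template))
                for name, template in dictionary.items()),
               key=lambda item: item[1])


def classify(features, dictionaries, zone_type, needs_fine, fine_threshold):
    """Classify with the judged zone type's dictionary, falling back to the others."""
    best = None
    if zone_type in dictionaries:
        best = nearest(features, dictionaries[zone_type])
        if not needs_fine(best[1]):
            return best[0]
    # Fine classification: search the dictionaries of the other zone types.
    # When zone_type is None (the judgment failed), every dictionary is searched.
    candidates = [c for c in (nearest(features, d)
                              for z, d in dictionaries.items() if z != zone_type)
                  if c[1] <= fine_threshold]
    if best is not None:
        candidates.append(best)
    return min(candidates, key=lambda c: c[1])[0] if candidates else None
```

Passing `zone_type=None` reproduces the case in which judgment means 202 fails and the fine classification means is used from the start.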
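In view of the above, FIG. 23 shows a variant apparatus implementing the font recognition method, comprising: normalization means 300 that normalizes the input image to a predetermined height; feature extraction means 301 that extracts features from the normalized word; fine classification means 302 that uses the plurality of dictionaries directly; and recognition means 303 that recognizes the font information of the word based on the result obtained by the fine classification means.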
With reference to Embodiment 1, FIG. 25 shows an apparatus configured to carry out all of the processes described above for identifying the zone type of an input text line. In brief, the apparatus comprises:
line information calculation means 400 that calculates the line information of the input text line using the projection method;
first word information calculation means 401 that calculates the word information of a selected word in the text line using the projection method;
reliability judgment means 402 that judges whether the word information is reliable;
second word information calculation means 403 that uses the connected component method to calculate the word baseline and upper line of a selected word that the reliability judgment means has not judged to be reliable; and
zone type marking means 404 that marks the zone type of the word according to the line information and the word information.
The apparatus for identifying the zone type of an input text line may further include judgment means that judges whether a selected word in the text line is a short word, in which case the first word information calculation means 401 calculates the word information of a word judged to be short separately, by applying the projection method within a designated region of the word zone.
In addition, the apparatus for identifying the zone type of an input text line can obviously be modified into an apparatus that calculates the x-height of a word in the input text line by using the zone type obtained by that apparatus.
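The projection method that means 400 and 401 rely on can be illustrated with a horizontal projection profile. The sketch below is not the patent's procedure: the rule that rows with at least half the peak ink count form the band between the upper line and the baseline is a common heuristic assumed here for illustration, and the `ratio` parameter is hypothetical.

```python
def horizontal_projection(image):
    """Count the ink pixels in each row of a binary image (rows of 0/1 values)."""
    return [sum(row) for row in image]


def text_band(image, ratio=0.5):
    """Return (top, bottom) indices of the rows whose ink count >= ratio * peak."""
    profile = horizontal_projection(image)
    peak = max(profile, default=0)
    if peak == 0:
        return None  # blank image: no band can be located
    dense = [r for r, count in enumerate(profile) if count >= ratio * peak]
    return dense[0], dense[-1]


# Tiny word image: rows 0-1 hold an ascender, rows 2-4 the main body, row 5 a descender.
image = [
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0],
]
top, bottom = text_band(image)
print(top, bottom, bottom - top + 1)  # 2 4 3
```

In this toy image the dense band spans rows 2 to 4, so the estimated x-height is 3 pixels, while the sparse rows above and below indicate an ascender and a descender.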
The present invention has been described with reference to specific embodiments. It should be understood that the present invention is not limited to the above description, and that those of ordinary skill in the art can make various changes and modifications to it without departing from its spirit and scope.
Claims (25)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNB2005100228818A CN100550040C (en) | 2005-12-09 | 2005-12-09 | Optical character recognition method and equipment and character recognition method and equipment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN1979529A CN1979529A (en) | 2007-06-13 |
| CN100550040C true CN100550040C (en) | 2009-10-14 |
Family
ID=38130684
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNB2005100228818A Expired - Fee Related CN100550040C (en) | 2005-12-09 | 2005-12-09 | Optical character recognition method and equipment and character recognition method and equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN100550040C (en) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN100535930C (en) * | 2007-10-23 | 2009-09-02 | 北京大学 | Complex structure file image inclination quick detection method |
| US8401293B2 (en) * | 2010-05-03 | 2013-03-19 | Microsoft Corporation | Word recognition of text undergoing an OCR process |
| CN107305446B (en) * | 2016-04-25 | 2020-08-14 | 北京字节跳动网络技术有限公司 | Method and device for obtaining keywords in pressure-sensitive area |
| CN109447055B (en) * | 2018-10-17 | 2022-05-03 | 中电万维信息技术有限责任公司 | OCR (optical character recognition) -based character similarity recognition method |
| US10984279B2 (en) | 2019-06-13 | 2021-04-20 | Wipro Limited | System and method for machine translation of text |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH10134149A (en) | 1996-10-30 | 1998-05-22 | Ricoh Co Ltd | Font identification device |
| DE19953610A1 (en) * | 1999-02-26 | 2000-09-07 | Hewlett Packard Co | Font detection device for optical character recognition system selects system font from table whose width best corresponds to width of font in image |
| US6272238B1 (en) * | 1992-12-28 | 2001-08-07 | Canon Kabushiki Kaisha | Character recognizing method and apparatus |
| CN1460244A (en) * | 2001-02-01 | 2003-12-03 | 松下电器产业株式会社 | Sentence recognition device, sentence recognition method, program and medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | C14 | Grant of patent or utility model | |
| | GR01 | Patent grant | |
| | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20091014 Termination date: 20161209 |