Method for recognizing postal code digit strings
Technical field
The present invention relates to a method for recognizing postal code digit strings.
Background art
After decades of development, optical character recognition has gradually become practical, yet users still expect recognition systems to perform better. To improve the recognition rate and the confidence of the result, designers increasingly combine multiple information sources, multiple feature extraction methods, and multiple recognition methods to build high-performance recognition systems.
A simple existing way to combine multiple classifiers for postal code digit strings is voting, for example the majority vote rule or the unanimity rule. These voting rules ignore the characteristics of the individual classifiers and follow the principle of "one classifier, one vote". In practice, the classifiers use different features, rest on different principles and methods, or are trained on different samples, so their recognition performance differs and is to some degree complementary; that is, each classifier recognizes each class with a different ability.
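The two voting rules named above can be sketched as follows. This is a minimal illustration, not code from the source; the function names and the `REJECT` marker are hypothetical, and "majority" is assumed to mean more than half of the classifiers.

```python
from collections import Counter

REJECT = None  # hypothetical marker meaning "no decision"

def majority_vote(labels):
    """Return the label chosen by more than half of the classifiers, else REJECT."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else REJECT

def unanimous_vote(labels):
    """Return the label only if every classifier agrees, else REJECT."""
    return labels[0] if len(set(labels)) == 1 else REJECT
```

Both rules give every classifier the same weight regardless of how reliable it is on a given class, which is the limitation the paragraph above describes.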
A general multiple-classifier combination concerns the combination of recognition results for single characters, and its goal is to optimize single-character recognition. Its principle is shown in Fig. 1: after an input sample $X_n$ is recognized by K classifiers, K recognition results $S_n^{(k)}$ ($k = 1, 2, \ldots, K$) are obtained, and a combination decision over these results yields the final recognition result $C_n$. This combination does not consider the context of the character string. Instead, the combined recognition sequence $(C_1 \ldots C_n \ldots C_N)$ of the characters in the string is sent to a dictionary library, which checks whether the recognized string is valid, as shown in Fig. 2.
In some practical applications, the goal is to optimize recognition of the character string as a whole and not merely of each single character, because optimal single-character recognition does not necessarily give optimal recognition of the whole string. In postal code recognition, for example, an automatic mail sorting machine can use the result only if all six digits are recognized correctly, so recognition must be optimized over the whole postal code digit string.
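A short calculation illustrates why per-character accuracy and whole-string accuracy differ. The sketch assumes that character errors are independent, which the source does not state, and the accuracy value used in the note below is hypothetical.

```python
def string_accuracy(char_accuracy, length=6):
    """Probability that all `length` characters are correct,
    assuming independent errors (an assumption of this sketch)."""
    return char_accuracy ** length
```

Under this assumption a hypothetical per-character accuracy of 0.98 gives `string_accuracy(0.98)` of roughly 0.886, so more than one six-digit code in ten would contain at least one error.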
Summary of the invention
The object of the present invention is to provide a method for recognizing postal code digit strings that combines multiple classifiers on the basis of a knowledge base.
To achieve this object, the present invention adopts the following technical scheme.
A method for recognizing postal code digit strings comprises the following steps:
(1) The image $X = (x_1 \ldots x_n \ldots x_N)$ of a postal code character string of N characters is input to each of K independent single-character recognition classifiers $e_k$, where N and K are positive integers greater than 1. For Chinese postal code digit strings, N = 6.
(2) Each single-character recognition classifier $e_k$ recognizes the input character image $x_n$ as one of the postal code classes $\{c_1 \ldots c_m \ldots c_M\}$, or rejects it, the rejection being denoted $c_{(M+1)}$, where M is a positive integer greater than 1. The classes $\{c_1 \ldots c_m \ldots c_M\}$ are the digits 0 to 9, so M = 10.
(3) Calculate the probability $P(x \in C_m \mid e_k(x) = m')$ that the input pattern is $c_m$ when the recognition result is $m'$;
(4) From $P(x \in C_m \mid e_k(x) = m')$, calculate the probability $p(D|X)$ that the recognition result of X is $D = (d_1, d_2, \ldots, d_N)$, where D is a valid postal code in the postal code dictionary library Ω;
(5) Determine the recognition result of the input pattern from the probability $p(D|X)$.
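Steps (1) and (2) can be sketched as collecting a K-by-N table of labels. This is an illustrative sketch only: the name `collect_labels`, the callable classifier interface, and the 0-based indexing (digit m is class index m, and index M stands for the rejection label $c_{(M+1)}$) are assumptions, not part of the source.

```python
M = 10       # digit classes 0..9
REJECT = M   # index used in this sketch for the rejection label c_(M+1)

def collect_labels(images, classifiers):
    """Return a K x N table with labels[k][n] = e_k(x_n).

    images      : sequence of N character images (any objects)
    classifiers : sequence of K callables mapping an image to 0..M
    """
    labels = []
    for classify in classifiers:
        row = []
        for image in images:
            label = classify(image)
            if not 0 <= label <= REJECT:
                raise ValueError(f"label {label} outside 0..{REJECT}")
            row.append(label)
        labels.append(row)
    return labels
```

Keeping the raw labels of every classifier, rather than merging them per character at once, is what allows the later steps to score whole candidate strings against the dictionary library.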
In a preferred form of the present invention, the probability $P(x \in C_m \mid e_k(x) = m')$ in step (3) that the input pattern is $c_m$ when the recognition result is $m'$ can be computed as follows:
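Sample statistics are gathered on the recognition results of the single-character recognition classifier $e_k$ to form the confusion matrix of that classifier:
$$CM_k = \begin{pmatrix} n_{11}^{(k)} & \cdots & n_{1M}^{(k)} & n_{1(M+1)}^{(k)} \\ \vdots & & \vdots & \vdots \\ n_{M1}^{(k)} & \cdots & n_{MM}^{(k)} & n_{M(M+1)}^{(k)} \end{pmatrix}$$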
where $n_{mm'}^{(k)}$ is the number of samples of class $C_m$ that the single-character recognition classifier $e_k$ recognizes as class $C_{m'}$. Its meaning is:
(a) when $m = m'$, the number of samples of class $C_m$ that $e_k$ recognizes correctly;
(b) when $m' = M+1$, the number of samples of class $C_m$ that $e_k$ rejects;
(c) when $m \neq m'$ and $m' \neq M+1$, the number of samples of class $C_m$ that $e_k$ misrecognizes as class $C_{m'}$.
The total number of samples for which the recognition result of the single-character recognition classifier $e_k$ is $m' = e_k(x)$ is:
$$\sum_{m=1}^{M} n_{mm'}^{(k)}$$
Given that the recognition result of $e_k$ is $m'$, the probability that the sample comes from class $C_m$ is:
$$P(x \in C_m \mid e_k(x) = m') = \frac{n_{mm'}^{(k)}}{\sum_{j=1}^{M} n_{jm'}^{(k)}}$$
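The statistics described above can be sketched in a few lines. The function names, the 0-based class indices (column `num_classes` is the rejection column), and the uniform fallback for an output that never occurs in the training samples are assumptions of this sketch, not part of the source.

```python
M = 10  # digit classes 0..9; column M of the matrix counts rejections

def confusion_matrix(true_labels, predicted_labels, num_classes=M):
    """counts[m][m2] = number of class-m samples the classifier labelled m2.

    Column num_classes holds the rejected samples of each class.
    """
    counts = [[0] * (num_classes + 1) for _ in range(num_classes)]
    for m, m2 in zip(true_labels, predicted_labels):
        counts[m][m2] += 1
    return counts

def posterior_table(counts):
    """table[m2][m] = P(x in C_m | e_k(x) = m2), i.e. each column of the
    confusion matrix divided by its column total."""
    num_classes = len(counts)
    table = []
    for m2 in range(num_classes + 1):
        column_total = sum(counts[m][m2] for m in range(num_classes))
        if column_total == 0:
            # Output m2 never occurred in training: fall back to a uniform
            # distribution (an assumption of this sketch).
            table.append([1.0 / num_classes] * num_classes)
        else:
            table.append([counts[m][m2] / column_total
                          for m in range(num_classes)])
    return table
```

Note that the rejection column also gets a distribution, so a rejected character still contributes whatever the training samples say about which classes the classifier tends to reject.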
In another preferred form of the present invention, the probability $p(D|X)$ in step (4) that the recognition result of X is $D = (d_1, d_2, \ldots, d_N)$ is calculated from $P(x \in C_m \mid e_k(x) = m')$ as follows:
Assume that the samples used to generate the confusion matrix $CM_k$ are sufficiently numerous and reflect the spatial distribution of the recognition results. $CM_k$ then serves as prior knowledge for classifier combination; that is, $P(x \in C_m \mid e_k(x) = m')$ is used as the score when voting, and the probability that $x \in C_m$ is expressed as:
$$s^{(k)}(x \in C_m) = P(x \in C_m \mid e_k(x) = m'), \quad m = 1, 2, \ldots, M$$
Let $f(D)$ denote the frequency with which postal code D occurs. The score $S(D|X)$ of X with respect to D is then calculated as follows:
Finally, the probability that X belongs to D is $p(D|X) = e^{f(D)} \cdot S(D|X)$.
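The scoring of candidate postal codes can be sketched as below. Two points are assumptions of this sketch and are not confirmed by the text: the score $S(D|X)$ is taken to be the mean of the voting scores $s^{(k)}(x_n \in C_{d_n})$ over all N positions and K classifiers, and the final formula is read as $p(D|X) = e^{f(D)} \cdot S(D|X)$. The function names and data layout are hypothetical.

```python
import math

def string_score(candidate, labels, posteriors):
    """S(D|X) under this sketch's assumption: mean vote over all
    positions and classifiers.

    candidate  : tuple of N digits (d_1 .. d_N)
    labels     : labels[k][n] = output of classifier k for character n
    posteriors : posteriors[k][m2][m] = P(x in C_m | e_k(x) = m2)
    """
    total = 0.0
    for k, row in enumerate(labels):
        for n, m2 in enumerate(row):
            total += posteriors[k][m2][candidate[n]]
    return total / (len(labels) * len(candidate))

def string_probabilities(labels, posteriors, dictionary):
    """Map every valid code D in the dictionary to e^{f(D)} * S(D|X).

    dictionary : dict mapping each valid code (tuple of digits) to its
                 occurrence frequency f(D)
    """
    return {
        code: math.exp(freq) * string_score(code, labels, posteriors)
        for code, freq in dictionary.items()
    }
```

Because only codes present in the dictionary are scored, a string that no valid postal code matches can never be output, whatever the individual classifiers say.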
In a further preferred form of the present invention, in step (5) the recognition result of the input pattern is determined from the probability $p(D|X)$ as follows:
If there exists D in Ω such that $p(D|X)$ is the maximum among the recognition results and $p(D|X) > \alpha$, then X = D, that is, the recognition result is D. Here α is a threshold that sets the trade-off between rejection and misrecognition (α = 0.5).
If there exists D in Ω such that $p(D|X)$ is the maximum among the recognition results, and there exists D' in Ω such that $p(D'|X)$ is second only to the maximum $p(D|X)$, and $p(D|X) - p(D'|X) > \beta$, where β is a constant (β = 0.2), then X = D, that is, the recognition result is D.
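The two decision rules can be sketched as follows, using the stated values α = 0.5 and β = 0.2. How the two rules combine is not spelled out in the text; this sketch assumes that a candidate is accepted when either rule holds and rejected otherwise. The function name and the use of `None` for rejection are hypothetical.

```python
ALPHA = 0.5  # threshold between rejection and misrecognition (from the text)
BETA = 0.2   # required margin over the runner-up (from the text)

def decide(probabilities, alpha=ALPHA, beta=BETA):
    """Return the accepted postal code, or None to reject.

    probabilities : dict mapping each valid code D to p(D|X)
    """
    if not probabilities:
        return None
    ranked = sorted(probabilities.items(),
                    key=lambda item: item[1], reverse=True)
    best_code, best_p = ranked[0]
    # Rule 1: the best candidate exceeds the absolute threshold.
    if best_p > alpha:
        return best_code
    # Rule 2: the best candidate clearly beats the second best.
    if len(ranked) > 1 and best_p - ranked[1][1] > beta:
        return best_code
    return None
```

For instance, `decide({"200062": 0.45, "200063": 0.10})` fails rule 1 but passes rule 2 because the margin 0.35 exceeds β; the postal codes in this call are made-up values.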
In the method of the present invention, the voting rule draws on the characteristics of each classifier and so exploits the strengths of each. Prior knowledge of each classifier's recognition performance is obtained from statistics over a large number of samples and is used as the basis for voting, so that the combined recognition result reaches a high recognition rate and high confidence. The accuracy of postal code digit string recognition is thereby improved.
Description of drawings
The present invention is further described below with reference to the drawings and an embodiment.
Fig. 1 is a block diagram of single-character recognition by multiple-classifier combination in the prior art.
Fig. 2 is a block diagram of verification of the recognition result by a dictionary library in the prior art.
Fig. 3 is a functional block diagram of the method of the present invention.
Embodiment
As shown in Fig. 3, the sequence to be recognized $X = (x_1 \ldots x_n \ldots x_N)$ is first recognized by the single-character recognition classifiers $e_k$. A decision is then made using the dictionary library and the occurrence probabilities of its entries, which finally yields the recognition result sequence $(d_1, d_2, \ldots, d_N)$.
A method for recognizing postal code digit strings comprises the following steps:
(1) The image $X = (x_1 \ldots x_n \ldots x_N)$ of a postal code character string of N characters is input simultaneously to K independent single-character recognition classifiers. For Chinese postal code digit strings, N = 6.
(2) Each single-character recognition classifier $e_k$ recognizes the input character image $x_n$ and produces a recognition result: the classifier either assigns the input pattern to one of the classes $\{c_1 \ldots c_m \ldots c_M\}$ or rejects it. For postal code digits, M = 10, that is, the recognition result may be any one of {0, 1, ..., 9}.
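(3) When the recognition result is $m'$, the probability that the input pattern is $c_m$ is expressed as follows.
First, a large number of samples are used to gather statistics on the recognition behavior of classifier $e_k$, which gives the confusion matrix of that classifier:
$$CM_k = \begin{pmatrix} n_{11}^{(k)} & \cdots & n_{1M}^{(k)} & n_{1(M+1)}^{(k)} \\ \vdots & & \vdots & \vdots \\ n_{M1}^{(k)} & \cdots & n_{MM}^{(k)} & n_{M(M+1)}^{(k)} \end{pmatrix}$$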
where $n_{mm'}^{(k)}$ is the number of samples of class $C_m$ that classifier $e_k$ recognizes as class $C_{m'}$. Its meaning is:
(a) if $m = m'$, the number of samples of class $C_m$ that $e_k$ recognizes correctly;
(b) if $m' = M+1$, the number of samples of class $C_m$ that $e_k$ rejects;
(c) if $m \neq m'$ and $m' \neq M+1$, the number of samples of class $C_m$ that $e_k$ misrecognizes as class $C_{m'}$.
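For classifier $e_k$, the total number of samples whose recognition result is $m' = e_k(x)$ is:
$$\sum_{m=1}^{M} n_{mm'}^{(k)}$$
Given that the recognition result of classifier $e_k$ is $m'$, the probability that the sample comes from class $C_m$ can be expressed as the conditional probability:
$$P(x \in C_m \mid e_k(x) = m') = \frac{n_{mm'}^{(k)}}{\sum_{j=1}^{M} n_{jm'}^{(k)}}$$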
If the samples used to generate the confusion matrix $CM_k$ are sufficiently numerous and reflect the distribution of the pattern space, this confusion matrix reflects the recognition behavior of classifier $e_k$. $CM_k$ is then used as prior knowledge for classifier combination; that is, $P(x \in C_m \mid e_k(x) = m')$ is used as the score when voting, and the probability that $x \in C_m$ is expressed as:
$$s^{(k)}(x \in C_m) = P(x \in C_m \mid e_k(x) = m'), \quad m = 1, 2, \ldots, M$$
(4) Calculate the probability that X belongs to a given postal code character string $D = (d_1, d_2, \ldots, d_N)$:
Let $D = (d_1, d_2, \ldots, d_N)$ be a valid postal code in the postal code dictionary library Ω, and let $f(D)$ denote the frequency with which postal code D occurs in the particular application.
The score $S(D|X)$ of X with respect to D is calculated as follows:
Finally, the likelihood that X belongs to D is expressed as:
$$p(D|X) = e^{f(D)} \cdot S(D|X)$$
(5) The following rules determine the optimal recognition result for the input pattern.
Rule 1:
If there exists D in Ω such that $p(D|X)$ is the maximum over all valid postal codes in Ω and $p(D|X) > \alpha$, then X = D.
Here α is a threshold that sets the trade-off between rejection and misrecognition (α = 0.5).
Rule 2:
If there exists D in Ω such that $p(D|X)$ is the maximum over all valid postal codes in Ω, and there exists D' in Ω such that $p(D'|X)$ is second only to $p(D|X)$, and $p(D|X) - p(D'|X) > \beta$, then X = D.
Here β is a constant (β = 0.2).
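As a worked illustration of step (3), suppose that among the training samples which classifier $e_k$ labelled as the digit 7, 950 were truly the digit 7, 40 were truly the digit 1, and 10 were truly the digit 9. These counts are hypothetical and chosen only to show the arithmetic; the subscripts below denote the digit of each class.

```latex
\begin{aligned}
\sum_{m} n^{(k)}_{m7} &= 950 + 40 + 10 = 1000,\\
P(x \in C_7 \mid e_k(x) = 7) &= 950 / 1000 = 0.95,\\
P(x \in C_1 \mid e_k(x) = 7) &= 40 / 1000 = 0.04,\\
P(x \in C_9 \mid e_k(x) = 7) &= 10 / 1000 = 0.01.
\end{aligned}
```

With these values, an output of 7 from $e_k$ casts a vote of 0.95 for a candidate postal code with 7 in that position, and a small but nonzero vote of 0.04 for a candidate with 1 there, instead of the all-or-nothing vote of a plain majority rule.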