
CN116645675A - Character recognition method, device, equipment and medium - Google Patents

Character recognition method, device, equipment and medium

Info

Publication number
CN116645675A
CN116645675A (application number CN202310627818.5A)
Authority
CN
China
Prior art keywords
recognized
image
decoder
loss function
function value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310627818.5A
Other languages
Chinese (zh)
Inventor
卢健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310627818.5A priority Critical patent/CN116645675A/en
Publication of CN116645675A publication Critical patent/CN116645675A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/15 Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • G06V 30/153 Segmentation of character regions using recognition of characters or words
    • G06V 30/18 Extraction of features or characteristics of the image
    • G06V 30/19 Recognition using electronic means
    • G06V 30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V 30/19173 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Character Discrimination (AREA)

Abstract

The present disclosure provides a character recognition method, which relates to the field of artificial intelligence. The method includes: acquiring an image to be recognized, where the image to be recognized includes N irregularly arranged characters to be recognized, irregular arrangement meaning that the characters are not arranged along a straight line, and N is an integer greater than or equal to 2; inputting the image to be recognized into a character recognition model; and obtaining the recognition results of the N characters to be recognized output by the character recognition model. The character recognition model includes an encoder and a decoder and is configured to be pre-trained by the following operations: a first loss function value is obtained based on an encoding result of the encoder, a second loss function value is obtained based on a decoding result of the decoder, and the encoder and the decoder are updated based on the first loss function value and the second loss function value. The present disclosure also provides a character recognition apparatus, device, medium, and program product.

Description

Character Recognition Method, Device, Equipment and Medium

Technical Field

The present disclosure relates to the field of artificial intelligence, and more particularly to a character recognition method, apparatus, device, medium, and program product.

Background

Conventional OCR (optical character recognition) models are mainly effective at recognizing text arranged along straight lines, for example horizontally or vertically. They are far less effective for irregularly arranged characters such as curved text, multi-line text, or symbols, as found in seals and formulas.

Among related techniques for recognizing irregularly arranged characters, some first transform the specially arranged text into a horizontal arrangement and then recognize it with a conventional OCR model. Such schemes involve cumbersome steps and suffer from unsatisfactory accuracy and robustness.

Summary

In view of the above problems, the present disclosure provides a character recognition method, apparatus, device, medium, and program product.

One aspect of the embodiments of the present disclosure provides a character recognition method, including: acquiring an image to be recognized, where the image to be recognized includes N irregularly arranged characters to be recognized, irregular arrangement including not being arranged along a straight line, and N is an integer greater than or equal to 2; inputting the image to be recognized into a character recognition model; and obtaining the recognition results of the N characters to be recognized output by the character recognition model. The character recognition model includes an encoder and a decoder and is configured to be pre-trained by the following operations: obtaining a first loss function value based on the encoding result of the encoder, obtaining a second loss function value based on the decoding result of the decoder, and updating the encoder and the decoder based on the first loss function value and the second loss function value.

According to an embodiment of the present disclosure, acquiring the image to be recognized includes: determining a target image; determining a first target area from the target image using an object detection model, the first target area including the N characters to be recognized; and segmenting the first target area from the target image to obtain the image to be recognized.

According to an embodiment of the present disclosure, obtaining the recognition results of the N characters to be recognized output by the character recognition model includes: processing the image to be recognized with the encoder to obtain a first image feature output by the encoder; processing the first image feature with a classification function to obtain an intermediate recognition result for the N characters to be recognized, the classification function being used to classify the first image feature; and inputting the first image feature and a first sequence text into the decoder to obtain the recognition results of the N characters to be recognized output by the decoder, where the first sequence text is obtained by shifting each character recognition result in the intermediate recognition result to the right by M character positions, M being an integer greater than or equal to 1.

According to an embodiment of the present disclosure, the decoder includes a cross-attention layer and a classification layer, and inputting the first image feature and the first sequence text into the decoder to obtain the recognition results of the N characters to be recognized output by the decoder includes: processing the first image feature and the first sequence text with the cross-attention layer to obtain a first attention feature, where the cross-attention layer is configured to process data based on a cross-attention mechanism; and processing the first attention feature with the classification layer to obtain the recognition results of the N characters to be recognized, the classification layer being used to classify the first attention feature.

According to an embodiment of the present disclosure, obtaining the recognition results of the N characters to be recognized output by the character recognition model includes: processing the image to be recognized with the encoder to obtain a first image feature output by the encoder; and inputting the first image feature into the decoder to obtain the recognition results of the N characters to be recognized output by the decoder.

According to an embodiment of the present disclosure, the encoding result includes a second image feature, and obtaining the first loss function value based on the encoding result of the encoder includes: inputting a training image into the encoder to obtain the second image feature output by the encoder, where the training image includes N irregularly arranged characters to be recognized; obtaining a first prediction tensor based on the second image feature, the first prediction tensor including character prediction information for the N characters to be recognized; and obtaining the first loss function value between the first prediction tensor and annotated text, where the annotated text includes a label for each character to be recognized.

According to an embodiment of the present disclosure, the decoding result includes a second prediction tensor, and obtaining the second loss function value based on the decoding result of the decoder includes: inputting the second image feature into the decoder to obtain the second prediction tensor output by the decoder, the second prediction tensor including character prediction information for the N characters to be recognized; and obtaining the second loss function value between the second prediction tensor and the annotated text.

According to an embodiment of the present disclosure, updating the encoder and the decoder based on the first loss function value and the second loss function value includes: obtaining a first weighted value from the first loss function value and its first weight; obtaining a second weighted value from the second loss function value and its second weight; obtaining a combined loss function value from the first weighted value and the second weighted value; and updating the encoder and the decoder according to the combined loss function value to obtain a trained character recognition model.

According to an embodiment of the present disclosure, the decoder includes a cross-attention layer and a classification layer, and inputting the second image feature into the decoder to obtain the second prediction tensor output by the decoder includes: inputting the second image feature and a second sequence text into the decoder, where the second sequence text is obtained by shifting the label of each character to be recognized in the label text to the right by M character positions, M being an integer greater than or equal to 1; processing the second image feature and the second sequence text with the cross-attention layer to obtain a second attention feature, where the cross-attention layer is configured to process data based on a cross-attention mechanism; and processing the second attention feature with the classification layer to obtain the second prediction tensor.

According to an embodiment of the present disclosure, obtaining the first prediction tensor based on the second image feature includes: raising the dimension of the second image feature to obtain a third image feature; and inputting the third image feature into a classification function to obtain the first prediction tensor.

Another aspect of the embodiments of the present disclosure provides a character recognition apparatus, including: an image acquisition module configured to acquire an image to be recognized, where the image to be recognized includes N irregularly arranged characters to be recognized, irregular arrangement including not being arranged along a straight line, and N is an integer greater than or equal to 2; a third input module configured to input the image to be recognized into a character recognition model; and a recognition result module configured to obtain the recognition results of the N characters to be recognized output by the character recognition model. The character recognition model includes an encoder and a decoder and is configured to be pre-trained by the following operations:

obtaining a first loss function value based on the encoding result of the encoder, obtaining a second loss function value based on the decoding result of the decoder, and updating the encoder and the decoder based on the first loss function value and the second loss function value.

Another aspect of the embodiments of the present disclosure provides an electronic device, including: one or more processors; and a storage device for storing one or more programs, where, when the one or more programs are executed by the one or more processors, the one or more processors are caused to perform the method described above.

Another aspect of the embodiments of the present disclosure further provides a computer-readable storage medium having executable instructions stored thereon which, when executed by a processor, cause the processor to perform the method described above.

Another aspect of the embodiments of the present disclosure further provides a computer program product including a computer program which, when executed by a processor, implements the method described above.

One or more of the above embodiments have the following beneficial effects. During pre-training, the encoding result of the encoder is not only fed into the decoder; a first loss function value related to the encoding result is also computed and participates in training together with the second loss function value obtained from the decoding result of the decoder, so that more information is available for learning, training of the character recognition model is accelerated, and its recognition performance is improved. In a character recognition scenario, the image to be recognized can be recognized directly with the trained character recognition model, giving high recognition efficiency and accuracy, good robustness, and a simple and effective process in practical applications.

Brief Description of the Drawings

The above and other objects, features, and advantages of the present disclosure will become clearer from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:

FIG. 1 schematically shows an application scenario of the character recognition model training method and the character recognition method according to an embodiment of the present disclosure;

FIG. 2 schematically shows a flowchart of a method for training a character recognition model according to an embodiment of the present disclosure;

FIG. 3A and FIG. 3B schematically show examples of training images according to an embodiment of the present disclosure;

FIG. 4 schematically shows a training architecture of a character recognition model according to an embodiment of the present disclosure;

FIG. 5 schematically shows a flowchart of obtaining a training image according to an embodiment of the present disclosure;

FIG. 6 schematically shows an example of a sample image according to an embodiment of the present disclosure;

FIG. 7 schematically shows a flowchart of obtaining a second prediction tensor according to an embodiment of the present disclosure;

FIG. 8 schematically shows a flowchart of obtaining a combined loss function value according to an embodiment of the present disclosure;

FIG. 9 schematically shows a flowchart of a character recognition method according to an embodiment of the present disclosure;

FIG. 10 schematically shows a flowchart of obtaining an image to be recognized according to an embodiment of the present disclosure;

FIG. 11 schematically shows a flowchart of obtaining recognition results of N characters to be recognized according to an embodiment of the present disclosure;

FIG. 12 schematically shows a flowchart of obtaining recognition results of N characters to be recognized according to another embodiment of the present disclosure;

FIG. 13 schematically shows a flowchart of an end-to-end special character recognition method based on YOLO and TrOCR according to an embodiment of the present disclosure;

FIG. 14 schematically shows a structural block diagram of a training apparatus for a character recognition model according to an embodiment of the present disclosure;

FIG. 15 schematically shows a structural block diagram of a character recognition apparatus according to an embodiment of the present disclosure; and

FIG. 16 schematically shows a block diagram of an electronic device suitable for implementing the character recognition model training method or the character recognition method according to an embodiment of the present disclosure.

It should be noted that, for clarity, the sizes of whole/partial structures or whole/partial areas in the drawings used to describe the embodiments of the present disclosure may be enlarged or reduced; that is, the drawings are not drawn to actual scale.

Detailed Description

Embodiments of the present disclosure are described below with reference to the accompanying drawings. It should be understood that these descriptions are merely exemplary and are not intended to limit the scope of the present disclosure. In the following detailed description, for ease of explanation, many specific details are set forth to provide a thorough understanding of the embodiments of the present disclosure. It will be evident, however, that one or more embodiments may be practiced without these specific details. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The terms "include", "comprise", and the like used herein indicate the presence of the stated features, steps, operations, and/or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein should be interpreted as having a meaning consistent with the context of this specification and should not be interpreted in an idealized or overly rigid manner.

Where an expression such as "at least one of A, B, and C" is used, it should generally be interpreted in the way those skilled in the art normally understand it (for example, "a system having at least one of A, B, and C" includes, but is not limited to, a system having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C).

To facilitate understanding of the technical solutions of the present disclosure, terms used in some embodiments of the present disclosure are explained as follows:

YOLO: a one-stage object detection algorithm characterized by high speed and high accuracy.

TrOCR: an OCR recognition method based on a Vision Transformer encoder and a Transformer decoder, characterized by high accuracy and a wide range of application scenarios.

Tensor: a multi-dimensional array, a generalization of vectors and matrices.

Cross-attention mechanism (Cross Attention): a variant of the multi-head attention mechanism that can be used in sequence-to-sequence models, allowing the model to attend to different parts of its inputs while processing them.

Classification function: a function used to map input data into different classification spaces and obtain a classification result, such as the softmax function, the sigmoid function, or the CTC (Connectionist Temporal Classification) function.

FIG. 1 schematically shows an application scenario of the character recognition model training method and the character recognition method according to an embodiment of the present disclosure. It should be noted that FIG. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments of the present disclosure cannot be used in other devices, systems, environments, or scenarios.

As shown in FIG. 1, an application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links or fiber optic cables.

Users may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients, and social platform software (examples only).

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers.

The server 105 may be a server providing various services, for example a background management server (example only) that supports websites browsed by users with the terminal devices 101, 102, 103. The background management server may analyze and otherwise process received data such as user requests, and feed the processing results (for example, web pages, information, or data obtained or generated according to the user requests) back to the terminal devices.

The server 105 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud computing, network services, and middleware services.

In some embodiments, the training set may be prepared on the terminal devices 101, 102, 103 and the server 105 may be run to carry out model training. The terminal devices 101, 102, 103 may capture pictures containing characters to be recognized through a camera, or obtain such pictures by downloading them from the network, send a recognition request to the server 105, and receive the recognition result returned by the server 105 after running the character recognition model. In other embodiments, the character recognition model may be deployed on the terminal devices 101, 102, 103, and the training and recognition processes are performed locally on the terminal devices 101, 102, 103.

It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there may be any number of terminal devices, networks, and servers according to implementation needs.

Based on the scenario described in FIG. 1, the training method of the character recognition model according to the embodiments of the present disclosure is described in detail below with reference to FIG. 2 to FIG. 8.

FIG. 2 schematically shows a flowchart of a method for training a character recognition model according to an embodiment of the present disclosure. FIG. 3A and FIG. 3B schematically show examples of training images according to an embodiment of the present disclosure. FIG. 4 schematically shows a training architecture of a character recognition model according to an embodiment of the present disclosure. The character recognition model includes an encoder and a decoder.

As shown in FIG. 2, the training method of the character recognition model of this embodiment includes the following operations.

In operation S210, a training image is input into the encoder of the character recognition model to obtain a second image feature output by the encoder; the encoding result of the encoder includes the second image feature. The training image includes N irregularly arranged characters to be recognized, where irregular arrangement includes not being arranged along a straight line and N is an integer greater than or equal to 2.

In the embodiments of the present disclosure, a large number of training images need to be collected before the model is trained, and they may include different types of text images. As shown in FIG. 3A and FIG. 3B, the training images may include seal images and formula images, and may also include other irregularly arranged text or symbols. Straight-line arrangement includes writing from left to right or from top to bottom, where all characters lie roughly on one straight line and that line almost coincides with the horizontal or vertical direction.

Referring to FIG. 3A, the wide variety of seal shapes affects the direction in which the seal text is arranged, and part of the seal text may be curved text. Curved text means that the characters are mostly not on the same straight line; when the center points of the characters are connected, they roughly form a curve. In addition, because of the scenarios in which seals are used, overlapping text may occur, i.e., the irregularly arranged text shown at the bottom of FIG. 3A overlaps with regularly arranged text.

Continuing to refer to FIG. 3A, each seal image includes at least one seal, which may be any of various types, such as a special seal for contracts or a special seal for invoices. A seal may have a regular shape, such as a circular, elliptical, or polygonal (for example, rectangular) seal, or an irregular shape.

Referring to FIG. 3B, the upper part shows a handwritten formula image and the lower part shows its label text in the LaTeX language. The training image may also be a printed formula image. An image such as the one in FIG. 3B, in which a multi-line arrangement conveys a group of information to the reader, is also within the scope of irregular arrangement in the present disclosure; in other words, not being arranged along a straight line includes formula characters arranged in multiple lines.

The training images may be text images uploaded by users or downloaded from the network, or text images collected directly by an electronic device through a camera; the source of the training images is not limited here.

In operation S210, the encoder is used to extract the second image feature of the training image. Referring to the TrOCR architecture shown in FIG. 4, the encoder may be a Vision Transformer model. The Vision Transformer is a classification model that combines computer vision (CV) and NLP: the original image is divided into patches, flattened into a sequence, and fed into the encoding part, and the encoded output is then processed to complete the classification task. In particular, the present disclosure does not intend to limit the encoder in operation S210 to a Vision Transformer; it may be flexibly replaced with other image feature extraction models, such as Swin Transformer or CSWin Transformer.
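
As a rough illustration of this kind of patch-based encoder, the following PyTorch sketch shows how an image can be split into patches, flattened into a sequence, and encoded into a Batch_size × L × C feature. It is a minimal stand-in, not the TrOCR or Vision Transformer implementation; all sizes and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Minimal ViT-style encoder: split the image into patches, then run a Transformer encoder."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=256, depth=4, heads=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                    # images: (Batch_size, 3, H, W)
        x = self.patch_embed(images)              # (Batch_size, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)          # (Batch_size, L, dim): sequence of patch features
        return self.encoder(x + self.pos_embed)   # second image feature: Batch_size x L x C

features = PatchEncoder()(torch.randn(2, 3, 224, 224))
print(features.shape)   # torch.Size([2, 196, 256])
```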

In operation S220, a first prediction tensor is obtained based on the second image feature; the first prediction tensor includes character prediction information for the N characters to be recognized.

For example, the prediction may be made directly from the second image feature: the first prediction tensor may be obtained using a classification function, a neural-network-based classification model, or a classification model based on another machine learning algorithm. The character prediction information represents information from which the prediction result can be looked up in the dictionary.

Specifically, the number of training images is denoted Batch_size, which is an integer greater than or equal to 1. Training images equal in number to Batch_size are input into the encoder, and the resulting image feature has the shape Batch_size × L × C, where L is the length of the characters to be recognized and C is the dimension of the tensor.

In some embodiments, the second image feature is subjected to dimension-raising processing to obtain a third image feature, and the third image feature is input into the classification function to obtain the first prediction tensor. For example, a linear transformation (only an example) is applied to the Batch_size × L × C second image feature to raise its dimension, yielding a Batch_size × L × D feature, where D is the dimension of the dictionary. The Batch_size × L × D third image feature is then input into the classification function to obtain the first prediction tensor. After the dimension of the second image feature is raised, the third image feature has the same dimension as the dictionary, which facilitates classification.
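
A minimal sketch of this dimension-raising and classification step, assuming C = 256, L = 196, and a dictionary of size D = 6000 (all values are illustrative) and using log-softmax as an example classification function:

```python
import torch
import torch.nn as nn

dict_size = 6000                          # D: size of the character dictionary (assumed)
feat = torch.randn(4, 196, 256)           # second image feature: Batch_size x L x C

to_vocab = nn.Linear(256, dict_size)      # dimension-raising linear transformation: C -> D
logits = to_vocab(feat)                   # third image feature: Batch_size x L x D
probs = logits.log_softmax(dim=-1)        # classification function output, i.e. the first prediction tensor
print(probs.shape)                        # torch.Size([4, 196, 6000])
```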

In operation S230, the second image feature is input into the decoder to obtain a second prediction tensor output by the decoder; the decoding result of the decoder includes the second prediction tensor, and the second prediction tensor includes character prediction information for the N characters to be recognized.

Referring to FIG. 4, the decoder in operation S230 may be a Transformer decoder, i.e., the decoder of a natural language processing model used for natural language processing tasks. It makes predictions from the second image feature to obtain the second prediction tensor, whose character prediction information likewise represents information from which the prediction result can be looked up in the dictionary.

In operation S240, a first loss function value between the first prediction tensor and annotated text and a second loss function value between the second prediction tensor and the annotated text are obtained, where the annotated text includes a label for each character to be recognized.

The annotated text is the label text of the characters to be recognized in the training image. For the seal image shown in FIG. 3A, the corresponding annotated text is "XX Co., Ltd.", and the ideal result of the first prediction tensor or the second prediction tensor is the recognition of "XX Co., Ltd.". The dictionary contains the mappable characters for this translation, i.e., all characters the model can translate. Based on the dictionary, each character can be mapped to an integer (token); for example, "XX Co., Ltd." is converted into the token sequence "[1919, 25, 103, 429, 17, 151]". When the first loss function value or the second loss function value is calculated, the token sequence of the text is used for the numerical computation of the loss.
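
For illustration only, a toy vocabulary and mapping might look as follows. The real dictionary, its token ids, and any special tokens are defined by the model's vocabulary, not by this sketch.

```python
# Toy vocabulary; the real dictionary maps every translatable character to an integer token id.
vocab = {"<pad>": 0, "<eos>": 1, "工": 2, "商": 3, "银": 4, "行": 5}

def text_to_tokens(text, vocab):
    """Map annotated text to the token sequence used when computing loss values."""
    return [vocab[ch] for ch in text]

print(text_to_tokens("工商银行", vocab))   # -> [2, 3, 4, 5]
```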

In operation S250, a combined loss function value is obtained from the first loss function value and the second loss function value.

In operation S260, the encoder and the decoder are updated according to the combined loss function value to obtain a trained character recognition model.

Based on the combined loss function value, an optimizer such as Adam or AdamW may be used to train the model until all training images have been consumed or the combined loss function value reaches a target convergence condition.
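
A schematic AdamW training loop is shown below. The encoder, decoder, data loader, and the function computing the combined loss function value of operation S250 are assumed to be provided by the caller; the hyperparameters are illustrative, not values prescribed by this disclosure.

```python
import torch

def train(encoder, decoder, combined_loss, loader, epochs=10, lr=1e-4):
    """Jointly update the encoder and decoder with AdamW (a sketch; components are assumed)."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.AdamW(params, lr=lr, weight_decay=0.01)
    for _ in range(epochs):
        for images, labels in loader:   # labels: token sequences of the annotated text
            opt.zero_grad()
            loss = combined_loss(encoder, decoder, images, labels)   # weighted mix of the two loss values
            loss.backward()
            opt.step()
```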

By obtaining the first prediction tensor and the second prediction tensor through operations S220 and S230 as described above, the character prediction result can be output directly. Because the model can extract text features directly, without rotating the text in the seal image, problems caused by the arbitrary rotation of a seal, and inaccurate recognition results caused by changes in how the seal text is laid out on the paper, can be avoided.

According to the embodiments of the present disclosure, more information can be obtained from the outputs of the encoder and the decoder for learning during training, which speeds up training of the character recognition model and improves the recognition effect. Recognition can be performed directly on the extracted image features, and in practical applications the trained character recognition model offers high recognition efficiency and accuracy, good robustness, and a simple and effective process.

It should be noted that some steps of the above method may be performed individually or in combination, in parallel or in sequence; they are not limited to the specific order of operations shown in the figure. For example, the first loss function value may be calculated in operation S240 as soon as the first prediction tensor has been calculated in operation S220, rather than only after operation S230.

In the related art, there are one-stage end-to-end scene text recognition methods such as ABCNet and PGNet. Such algorithms integrate text detection, transformation, and recognition, but their recognition accuracy is not very high, and the annotation cost is considerable because the position of each individual character must be annotated. If a seal in a picture contains 20 characters, roughly 40 coordinate points must be annotated, which is quite expensive. The present disclosure proposes a two-stage approach: in one stage, training images are obtained as shown in FIG. 5; in the other stage, the model is trained as shown in FIG. 2.

FIG. 5 schematically shows a flowchart of obtaining a training image according to an embodiment of the present disclosure. As shown in FIG. 5, obtaining a training image in this embodiment includes the following operations.

In operation S510, a sample image is determined.

For example, the sample image may be any image that includes a seal or a formula, such as an invoice image, a receipt image, or a document image; the present disclosure is not limited thereto.

FIG. 6 schematically shows an example of a sample image according to an embodiment of the present disclosure. Some information in FIG. 6 has been masked with texture images to avoid adverse effects. As shown in FIG. 6, the receipt includes a circular seal, which is a special seal for receipts and contains the relevant seal text.

For example, the sample image may be the original image directly captured by an image acquisition device, or an image obtained after preprocessing the original image. For instance, to avoid the influence of the data quality, data imbalance, and the like of the input image on seal image recognition, the input image may be preprocessed before the sample image is processed. Preprocessing can remove irrelevant or noisy information from the input image so that it can be processed better, and may include, for example, scaling, cropping, gamma correction, image enhancement, or noise-reduction filtering.
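
A possible OpenCV-based preprocessing sketch is given below. The target size, gamma value, and denoising parameters are assumptions chosen for illustration, not values prescribed by this disclosure.

```python
import cv2
import numpy as np

def preprocess(img_bgr, size=(384, 384), gamma=1.2):
    """Resize, apply gamma correction, and lightly denoise an input image (a sketch)."""
    img = cv2.resize(img_bgr, size, interpolation=cv2.INTER_LINEAR)
    lut = ((np.arange(256) / 255.0) ** (1.0 / gamma) * 255).astype(np.uint8)
    img = cv2.LUT(img, lut)                                          # gamma correction
    img = cv2.fastNlMeansDenoisingColored(img, None, 5, 5, 7, 21)    # noise-reduction filtering
    return img
```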

In operation S520, a second target area is determined from the sample image using an object detection model, the second target area including the N characters to be recognized.

In operation S530, the second target area is segmented from the sample image to obtain the training image.

According to the embodiments of the present disclosure, a two-stage end-to-end OCR recognition method is proposed that works well for text and symbols in any arrangement. The method only requires annotating the overall coordinates and the character content of the text or symbols in a picture; there is no need to annotate coordinates for every individual character of the special arrangement, which makes it simple and effective.

Furthermore, referring to FIG. 6, the sample image may contain other extraneous text, such as the amount, the customer signature, or the receipt title, and may contain multiple seals. Determining the second target area feeds accurate data to the character recognition model and improves character recognition accuracy.

In some embodiments, the classification function includes a Connectionist Temporal Classification function, and obtaining the first loss function value between the first prediction tensor and the annotated text in operation S240 includes: inputting the first prediction tensor and the annotated text into a first loss function to obtain the first loss function value, where the first loss function is constructed based on the Connectionist Temporal Classification function and is used to compare the difference between the first prediction tensor and the annotated text. Referring to FIG. 4, the first loss function may be CTC loss.
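
As one possible realization, PyTorch's built-in CTC loss can be applied to the first prediction tensor; the shapes, dictionary size, and target lengths below are illustrative assumptions.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

log_probs = torch.randn(196, 4, 6000).log_softmax(-1)   # first prediction tensor in (L, Batch_size, D) order, as nn.CTCLoss expects
targets = torch.randint(1, 6000, (4, 12))               # token sequences of the annotated text
input_lengths = torch.full((4,), 196, dtype=torch.long)
target_lengths = torch.full((4,), 12, dtype=torch.long)

loss1 = ctc(log_probs, targets, input_lengths, target_lengths)   # first loss function value
print(loss1)
```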

According to the embodiments of the present disclosure, the first loss function value characterizes the difference between the first prediction tensor and the annotated text, and it works together with the difference between the second prediction tensor obtained by the decoder and the annotated text. Both participate in training the character recognition model, so that more image information and character information can be extracted for learning, improving the convergence speed and recognition accuracy of the model.

FIG. 7 schematically shows a flowchart of obtaining the second prediction tensor according to an embodiment of the present disclosure. As shown in FIG. 7, this embodiment is one implementation of operation S230, and obtaining the second prediction tensor includes the following operations.

In operation S710, the second image feature and a second sequence text are input into the decoder of the character recognition model, where the second sequence text is obtained by shifting the label of each character to be recognized in the label text to the right by M character positions, M being an integer greater than or equal to 1. The decoder includes a cross-attention layer and a classification layer. In some embodiments, M may equal 1.

In operation S720, the second image feature and the second sequence text are processed with the cross-attention layer to obtain a second attention feature, where the cross-attention layer is configured to process data based on a cross-attention mechanism.

In operation S730, the second attention feature is processed with the classification layer to obtain the second prediction tensor.

According to the embodiments of the present disclosure, the second prediction tensor is obtained by the decoder from the second image feature and the second sequence text via the cross-attention layer and the classification layer. Because the second sequence text is obtained by shifting the label of each character to be recognized in the label text to the right by M character positions, the decoder can fully exploit context information while processing the data, and the output recognition result is more accurate.
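
A simplified PyTorch sketch of such a decoder branch is shown below. A generic Transformer decoder stands in for the cross-attention layer described above, and the dimensions, depth, and dictionary size are illustrative assumptions rather than the disclosed architecture.

```python
import torch
import torch.nn as nn

class CharDecoder(nn.Module):
    """Token embedding, a Transformer decoder whose cross-attention attends to the image
    features, and a classification layer over the dictionary (a sketch)."""
    def __init__(self, dict_size=6000, dim=256, heads=8, depth=4):
        super().__init__()
        self.embed = nn.Embedding(dict_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.classify = nn.Linear(dim, dict_size)

    def forward(self, image_feat, shifted_tokens):
        # shifted_tokens: the label text shifted right by M = 1 position (teacher forcing)
        tgt = self.embed(shifted_tokens)                                   # (B, T, dim)
        causal = torch.triu(torch.ones(tgt.size(1), tgt.size(1), dtype=torch.bool), diagonal=1)
        att = self.decoder(tgt, image_feat, tgt_mask=causal)               # second attention feature
        return self.classify(att)                                          # second prediction tensor (B, T, D)

out = CharDecoder()(torch.randn(2, 196, 256), torch.randint(0, 6000, (2, 12)))
print(out.shape)   # torch.Size([2, 12, 6000])
```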

In some embodiments, obtaining the second loss function value between the second prediction tensor and the annotated text in operation S240 includes: inputting the second prediction tensor and the annotated text into a second loss function to obtain the second loss function value, where the second loss function is used to compare the difference between the second prediction tensor and the annotated text.

Referring to FIG. 4, the second loss function may be a classification loss function, specifically a loss function such as CrossEntropy (cross-entropy loss) or KLDivLoss (Kullback-Leibler divergence loss).

FIG. 8 schematically shows a flowchart of obtaining the combined loss function value according to an embodiment of the present disclosure. As shown in FIG. 8, this embodiment is one implementation of operation S250, and obtaining the combined loss function value includes the following operations.

In operation S810, a first weighted value is obtained from the first loss function value and its first weight.

In operation S820, a second weighted value is obtained from the second loss function value and its second weight.

In operation S830, the combined loss function value is obtained from the first weighted value and the second weighted value.

Referring to FIG. 4, the third loss function, i.e., the combined loss, may be a weighted mixed loss function Loss, as given in Equation 1.

Loss = α·CTCLoss(y, y′₁) + (1 - α)·ClassfyLoss(y, y′₂)    (Equation 1)

where α is the first weight and takes a value between 0 and 1, and 1 - α is the second weight. CTCLoss(y, y′₁) is the first loss function value, ClassfyLoss(y, y′₂) is the second loss function value, ClassfyLoss is the second loss function, y is the sequence vector of the annotated text, y′₁ is the first prediction tensor, and y′₂ is the second prediction tensor.

The first loss function value and the second loss function value reflect the different optimization situations of the encoder and the decoder, and their weights balance the relative importance of the encoder and the decoder. The value of α may be assigned a fixed value manually, or assigned an initial value and then dynamically updated during training.

For example, a gradient descent algorithm may be used to back-propagate updates to the encoder parameters, the decoder parameters, and the value of α. The larger the weight of a loss function value, the more important the corresponding task (encoder or decoder) and the larger its contribution to the gradient, and vice versa. Automatically adjusting the weights during training further influences how the parameters in the encoder and the decoder are updated: a task with a larger weight yields a larger gradient for its loss function, so the model parameters move faster toward the optimal solution of that task, whereas a task with a smaller weight yields a smaller gradient, so the model parameters move more slowly toward the optimal solution of that task.
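
A minimal sketch of Equation 1 with a manually assigned weight is shown below; treating α as a trainable parameter that is updated together with the encoder and decoder, as described above, would be a further refinement not shown here.

```python
import torch

def mixed_loss(ctc_loss_value, classify_loss_value, alpha=0.7):
    """Equation 1: Loss = alpha * CTCLoss + (1 - alpha) * ClassfyLoss."""
    return alpha * ctc_loss_value + (1.0 - alpha) * classify_loss_value

print(mixed_loss(torch.tensor(2.3), torch.tensor(1.1)))   # tensor(1.9400)
```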

根据本公开的实施例,在训练阶段能够考虑到第二图像特征、第一预测张量和第二预测张量等多个影响识别准确率的因素,在实际识别时能够实现优化识别准确率的效果。According to the embodiments of the present disclosure, in the training phase, multiple factors that affect the recognition accuracy, such as the second image feature, the first prediction tensor, and the second prediction tensor, can be considered, and the recognition accuracy can be optimized during actual recognition. Effect.

在训练阶段结束后,将训练完成的字符识别模型部署到实际应用场景中。在上述图2~图8的基础上,以下结合图9~图12进一步说明字符识别方法。After the training phase is over, deploy the trained character recognition model to the actual application scenario. On the basis of the foregoing FIGS. 2 to 8 , the character recognition method will be further described below in conjunction with FIGS. 9 to 12 .

图9示意性示出了根据本公开实施例的字符识别方法的流程图。如图9所示,该实施例的字符识别方法包括:Fig. 9 schematically shows a flowchart of a character recognition method according to an embodiment of the present disclosure. As shown in Figure 9, the character recognition method of this embodiment comprises:

在操作S910,在获取待识别图像后,将待识别图像输入字符识别模型,其中,待识别图像包括不规则排列的N个待识别字符,不规则排列包括未按照直线排列,N为大于或等于2的整数。In operation S910, after the image to be recognized is obtained, the image to be recognized is input into the character recognition model, wherein the image to be recognized includes N characters to be recognized that are irregularly arranged, and the irregular arrangement includes not being arranged in a straight line, and N is greater than or equal to Integer of 2.

示例性地,待识别图像可以如图3A~图3B所示。待识别图像可以是来自用户上传或从网络下载的文本图像,也可以来自电子设备直接通过摄像头采集的文本图像,在此不限制待识别图像的来源方式。Exemplarily, the image to be recognized may be as shown in FIG. 3A to FIG. 3B . The image to be recognized may be a text image uploaded by a user or downloaded from the network, or a text image directly collected by an electronic device through a camera, and the source of the image to be recognized is not limited here.

字符识别模型被配置为通过如下操作预先训练得到:基于编码器的编码结果得到第一损失函数值,基于解码器的解码结果得到第二损失函数值,基于第一损失函数值和第二损失函数值更新编码器和解码器。具体地,可以参照如图2~图8描述的任一个实施例所描述的训练方法。The character recognition model is configured to be pre-trained by the following operations: obtain the first loss function value based on the encoding result of the encoder, obtain the second loss function value based on the decoding result of the decoder, and obtain the second loss function value based on the first loss function value and the second loss function The value updates the encoder and decoder. Specifically, you can refer to the training method described in any one of the embodiments described in FIGS. 2 to 8 .

在操作S920,获得字符识别模型输出的N个待识别字符的识别结果。In operation S920, the recognition results of N characters to be recognized outputted by the character recognition model are obtained.

若待识别图像为印章图像,识别结果可以包括印章的文字,如图3A所示。若待识别图像为公式图像,识别结果可以是公式符号或LaTeX语言,如图3B下侧所示。If the image to be recognized is a seal image, the recognition result may include the text of the seal, as shown in FIG. 3A. If the image to be recognized is a formula image, the recognition result may be formula symbols or LaTeX code, as shown in the lower part of FIG. 3B.

根据本公开的实施例,可以在不对图像中的文字进行旋转等操作的情况下直接提取文字的特征,相当于直接从待识别图像中“翻译”出识别结果。能够高效识别各类印章、公式、多行文本等特殊字符图片,具有准确率高、鲁棒性好,简单有效的特点。According to the embodiments of the present disclosure, the features of the text can be directly extracted without performing operations such as rotation on the text in the image, which is equivalent to directly "translating" the recognition result from the image to be recognized. It can efficiently identify pictures of special characters such as various seals, formulas, and multi-line texts. It has the characteristics of high accuracy, good robustness, simplicity and effectiveness.

图10示意性示出了根据本公开实施例的获得待识别图像的流程图。如图10所示,该实施例获得待识别图像包括:Fig. 10 schematically shows a flow chart of obtaining an image to be recognized according to an embodiment of the present disclosure. As shown in Fig. 10, obtaining the image to be recognized in this embodiment includes:

在操作S1010,确定目标图像。目标图像可以如图6所示,但本公开不限于此,还可以是文档图像、发票图像等。In operation S1010, a target image is determined. The target image may be as shown in FIG. 6 , but the disclosure is not limited thereto, and may also be a document image, an invoice image, and the like.

在操作S1020,利用目标检测模型从目标图像中确定第一目标区域,第一目标区域包括N个待识别字符。In operation S1020, a first target area is determined from the target image by using the target detection model, the first target area includes N characters to be recognized.

在操作S1030,将第一目标区域从目标图像分割,得到待识别图像。In operation S1030, the first target area is segmented from the target image to obtain the image to be recognized.

根据本公开的实施例,使用一种两阶段的端到端OCR识别方法,通过确定第一目标区域可以给字符识别模型输入准确的数据,提高字符识别准确性,对任意排列文字符号均有很好的效果。According to the embodiments of the present disclosure, a two-stage end-to-end OCR recognition method is used. By determining the first target area, accurate data can be input to the character recognition model and the accuracy of character recognition can be improved; the method works well for characters and symbols in any arrangement.
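
对应操作S1010~S1030的一个极简示意(检测框的格式为假设,此处假定目标检测模型返回像素坐标[x1, y1, x2, y2]):A minimal sketch corresponding to operations S1010 to S1030 (the format of the detection box is an assumption; here the object detection model is assumed to return pixel coordinates [x1, y1, x2, y2]):

```python
from PIL import Image

def crop_to_image_to_recognize(target_image_path: str, box: tuple) -> Image.Image:
    """将目标检测模型给出的第一目标区域从目标图像中分割, 得到待识别图像。"""
    target_image = Image.open(target_image_path).convert("RGB")
    x1, y1, x2, y2 = box
    # PIL 的 crop 接受 (left, upper, right, lower) 像素坐标
    return target_image.crop((x1, y1, x2, y2))
```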

图11示意性示出了根据本公开实施例的获得N个待识别字符的识别结果的流程图。如图11所示,该实施例是操作S920的其中一个实施例,获得N个待识别字符的识别结果包括:Fig. 11 schematically shows a flow chart of obtaining the recognition results of N characters to be recognized according to an embodiment of the present disclosure. As shown in Fig. 11, this embodiment is one of the embodiments of operation S920, and obtaining the recognition results of the N characters to be recognized includes:

在操作S1110,利用字符识别模型的编码器处理待识别图像,获得编码器输出的第一图像特征。In operation S1110, the image to be recognized is processed by the encoder of the character recognition model, and a first image feature output by the encoder is obtained.

在操作S1120,利用分类函数处理第一图像特征,得到N个待识别字符的中间识别结果,分类函数用于对第一图像特征进行分类。In operation S1120, the first image feature is processed by using a classification function to obtain intermediate recognition results of N characters to be recognized, and the classification function is used to classify the first image feature.

在操作S1130,将第一图像特征和第一序列文本输入字符识别模型的解码器,获得解码器输出的N个待识别字符的识别结果,第一序列文本通过将中间识别结果中每个字符识别结果右移M个字符位置获得,M为大于或等于1的整数。In operation S1130, the first image feature and the first sequence text are input into the decoder of the character recognition model to obtain the recognition results of the N characters to be recognized output by the decoder, where the first sequence text is obtained by shifting each character recognition result in the intermediate recognition results to the right by M character positions, and M is an integer greater than or equal to 1.

参照图4,中间识别结果可以对应于第一预测张量,与训练阶段以标签文本得到右移的文本不同的是,将中间识别结果作为右移的文本,与第一图像特征一同输入解码器。解码器被配置为基于第一图像特征对中间识别结果重新排序,并输出最终的识别结果,提高识别准确率。Referring to FIG. 4, the intermediate recognition results may correspond to the first prediction tensor. Unlike the training stage, where the right-shifted text is obtained from the label text, here the intermediate recognition results are used as the right-shifted text and are input into the decoder together with the first image feature. The decoder is configured to reorder the intermediate recognition results based on the first image feature and output the final recognition results, thereby improving recognition accuracy.
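
图11流程的一个示意(张量形状与辅助模块均为假设;为简洁起见省略CTC的去空白与合并重复等后处理):A sketch of the flow in FIG. 11 (tensor shapes and helper modules are assumptions; CTC post-processing such as blank removal and merging of repeats is omitted for brevity):

```python
import torch

@torch.no_grad()
def recognize_with_rescore(encoder, ctc_head, decoder, image, bos_id: int, shift_m: int = 1):
    feats = encoder(image)                      # 第一图像特征, 假设形状为 (B, T, D)
    ctc_logits = ctc_head(feats)                # 分类函数输出, (B, T, V)
    intermediate = ctc_logits.argmax(dim=-1)    # 中间识别结果
    # 将中间识别结果整体右移 M 个字符位置, 左侧以起始标记填充, 作为第一序列文本
    shifted = torch.full_like(intermediate, bos_id)
    shifted[:, shift_m:] = intermediate[:, :-shift_m]
    dec_logits = decoder(feats, shifted)        # 解码器基于第一图像特征对中间结果重打分/重排序
    return dec_logits.argmax(dim=-1)            # N 个待识别字符的识别结果
```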

在一些实施例中,利用交叉注意力层处理第一图像特征和第一序列文本,得到第一注意力特征,其中,交叉注意力层被配置为基于交叉注意力机制处理数据。利用分类层处理第一注意力特征,得到N个待识别字符的识别结果。In some embodiments, the first image feature and the first sequence of text are processed by using a cross-attention layer to obtain the first attention feature, wherein the cross-attention layer is configured to process data based on a cross-attention mechanism. The classification layer is used to process the first attention feature, and the recognition results of N characters to be recognized are obtained.
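
交叉注意力层与分类层的一个最小示意(维度、词表大小等超参数均为示例):A minimal sketch of the cross-attention layer and the classification layer (hyper-parameters such as dimensions and vocabulary size are illustrative):

```python
import torch.nn as nn

class CrossAttentionHead(nn.Module):
    def __init__(self, vocab_size: int = 6000, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, first_image_feature, first_sequence_ids):
        query = self.embed(first_sequence_ids)                      # 第一序列文本作为 query
        attn_out, _ = self.cross_attn(query, first_image_feature,
                                      first_image_feature)          # 第一注意力特征
        return self.classifier(attn_out)                            # N 个待识别字符的类别得分
```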

图12示意性示出了根据本公开另一实施例的获得N个待识别字符的识别结果的流程图。如图12所示,该实施例是操作S920的其中一个实施例,获得N个待识别字符的识别结果包括:Fig. 12 schematically shows a flow chart of obtaining the recognition results of N characters to be recognized according to another embodiment of the present disclosure. As shown in Fig. 12, this embodiment is one of the embodiments of operation S920, and obtaining the recognition results of the N characters to be recognized includes:

在操作S1210,利用字符识别模型的编码器处理待识别图像,获得编码器输出的第一图像特征。In operation S1210, the image to be recognized is processed by the encoder of the character recognition model, and a first image feature output by the encoder is obtained.

在操作S1220,将第一图像特征输入字符识别模型的解码器,获得解码器输出的N个待识别字符的识别结果。In operation S1220, the first image feature is input into the decoder of the character recognition model, and the recognition results of N characters to be recognized outputted by the decoder are obtained.

与图11所示的实施例不同的是,图12所示的实施例不获得中间识别结果,仅将第一图像特征输入至解码器,识别速度较快,且具有足够的准确性。可以将[CLS]作为图4中右移的文本,与第一图像特征一同输入解码器。[CLS]为特殊编码标记,表示文本序列的开始。可以理解,[CLS]仅为示例,可以灵活设置该标记。Different from the embodiment shown in FIG. 11, the embodiment shown in FIG. 12 does not obtain intermediate recognition results and only inputs the first image feature into the decoder, so the recognition speed is faster while the accuracy remains sufficient. [CLS] can be input into the decoder as the right-shifted text in FIG. 4 together with the first image feature. [CLS] is a special encoding token that marks the start of a text sequence. It can be understood that [CLS] is only an example, and the token can be set flexibly.
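
图12流程的一个示意(以[CLS]作为起始标记进行贪心自回归解码;标记id、最大长度等均为示例):A sketch of the flow in FIG. 12 (greedy autoregressive decoding starting from the [CLS] token; token ids and the maximum length are illustrative):

```python
import torch

@torch.no_grad()
def recognize_direct(encoder, decoder, image, cls_id: int, eos_id: int, max_len: int = 64):
    feats = encoder(image)                                         # 第一图像特征
    ys = torch.full((feats.size(0), 1), cls_id,
                    dtype=torch.long, device=feats.device)         # 以 [CLS] 开始的序列
    for _ in range(max_len):
        logits = decoder(feats, ys)                                # (B, L, V)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # 取每步得分最高的字符
        ys = torch.cat([ys, next_id], dim=1)
        if (next_id == eos_id).all():                              # 全部样本输出结束符则提前停止
            break
    return ys[:, 1:]                                               # 去掉起始标记后的识别结果
```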

参照图11和图12两种不同的获取识别结果的方式,可以根据上述式1的第一权重“α”和第二权重“1-α”考虑采用图11或图12的方式。如果α较大,且追求模型预测的精度,可以先使用CTC函数实现CTC束搜索,再将中间识别结果输入到解码器,以Attention Rescore的方式进行解码,即图11的方式。如果α较小,且追求模型预测的速度,可以仅将第一图像特征输入到解码器,以Attention束搜索的方式进行解码,即图12的方式。Referring to the two different ways of obtaining recognition results in FIG. 11 and FIG. 12, the way of FIG. 11 or FIG. 12 can be chosen according to the first weight "α" and the second weight "1-α" in the above Formula 1. If α is large and prediction accuracy is pursued, the CTC function can be used first to perform a CTC beam search, and the intermediate recognition results are then input into the decoder and decoded in an Attention Rescore manner, that is, the way of FIG. 11. If α is small and prediction speed is pursued, only the first image feature can be input into the decoder and decoded by Attention beam search, that is, the way of FIG. 12.
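
根据第一权重α选择解码方式的一个简单示意(阈值0.5仅为示例):A simple sketch of choosing the decoding mode according to the first weight α (the threshold 0.5 is only an example):

```python
def choose_decode_mode(alpha: float, threshold: float = 0.5) -> str:
    # α 较大且追求精度: CTC束搜索 + Attention Rescore(图11);
    # α 较小且追求速度: 仅第一图像特征 + Attention束搜索(图12)。
    return "ctc_beam_search_with_attention_rescore" if alpha >= threshold else "attention_beam_search"
```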

图13示意性示出了根据本公开实施例的基于YOLO和TrOCR的端到端特殊字符识别方法的流程图。联合参照图1~图12及上述任一实施例,图13所示的端到端特殊字符识别方法包括:Fig. 13 schematically shows a flowchart of an end-to-end special character recognition method based on YOLO and TrOCR according to an embodiment of the present disclosure. With joint reference to Figures 1 to 12 and any of the above-mentioned embodiments, the end-to-end special character recognition method shown in Figure 13 includes:

在操作S1310,使用YOLO-V7训练目标检测模型,实现字符结构体的检测,检测并切出仅包含整个字符结构体的图片。字符结构体为包括待识别字符的目标区域,可以表征为目标区域框。In operation S1310, use YOLO-V7 to train the target detection model, realize the detection of the character structure, detect and cut out the picture containing only the entire character structure. The character structure is a target area including characters to be recognized, which can be represented as a target area box.

在操作S1320,基于TrOCR构建一个字符识别模型,其Encoder部分可以为任意的Vision Transformer模型或图像特征提取模型,如Swin Transformer、CSwin Transformer等,其Decoder部分可以为Transformer Decoder算法。In operation S1320, build a character recognition model based on TrOCR, its Encoder part can be any Vision Transformer model or image feature extraction model, such as Swin Transformer, CSwin Transformer etc., its Decoder part can be Transformer Decoder algorithm.
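
基于操作S1320的一种结构示意(并非唯一实现;假设编码器输出形状为(B, T, enc_dim)的序列特征,位置编码等细节从略):A structural sketch based on operation S1320 (not the only possible implementation; the encoder is assumed to output sequence features of shape (B, T, enc_dim), and details such as positional encoding are omitted):

```python
import torch
import torch.nn as nn

class TrOCRLikeModel(nn.Module):
    def __init__(self, encoder: nn.Module, enc_dim: int, vocab_size: int = 6000,
                 d_model: int = 512, n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        self.encoder = encoder                                   # 任意图像特征提取模型
        self.proj = nn.Linear(enc_dim, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)    # Transformer Decoder
        self.ctc_head = nn.Linear(d_model, vocab_size)           # 编码器分支 -> 第一预测张量
        self.cls_head = nn.Linear(d_model, vocab_size)           # 解码器分支 -> 第二预测张量

    def forward(self, images, shifted_text_ids):
        memory = self.proj(self.encoder(images))                 # 第二图像特征 (B, T, d_model)
        ctc_logits = self.ctc_head(memory)
        tgt = self.embed(shifted_text_ids)
        L = tgt.size(1)
        causal_mask = torch.triu(torch.full((L, L), float("-inf"),
                                            device=tgt.device), diagonal=1)
        dec = self.decoder(tgt, memory, tgt_mask=causal_mask)    # 含交叉注意力的自回归解码
        return ctc_logits, self.cls_head(dec)
```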

在操作S1330,构建基于CTC损失函数和分类损失函数的加权混合损失函数,作为第三损失函数。In operation S1330, a weighted hybrid loss function based on the CTC loss function and the classification loss function is constructed as a third loss function.
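
加权混合损失(即上文式1)的一种可能形式如下,其中α为第一权重,记号仅为示意:One possible form of the weighted hybrid loss (i.e., Formula 1 above) is as follows, where α is the first weight and the notation is only illustrative:

```latex
\mathcal{L}_{\mathrm{total}} = \alpha\,\mathcal{L}_{\mathrm{CTC}} + (1-\alpha)\,\mathcal{L}_{\mathrm{cls}},
\qquad \alpha \in [0, 1]
```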

在操作S1340,对构建的TrOCR字符识别模型进行模型训练。在模型中输入字符结构体图片、往右移动一位的字符序列id和标签数据(字符序列id),并使用S1330所构造的加权混合损失函数,使用Adam或AdamW等优化器进行模型训练。模型训练过程中可以对图片数据做一系列的数据增强处理,如高斯模糊、随机掩码、平移图片、调整图片色彩、饱和度和亮度等等。In operation S1340, the constructed TrOCR character recognition model is trained. The character structure picture, the character sequence ids shifted one position to the right, and the label data (character sequence ids) are input into the model, and the weighted hybrid loss function constructed in S1330 is used together with an optimizer such as Adam or AdamW for model training. During training, a series of data augmentation operations can be applied to the image data, such as Gaussian blur, random masking, image translation, and adjustment of image color, saturation, and brightness.
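
训练过程的一个示意(数据增强与优化器取值均为示例;criterion沿用前文示意的加权混合损失,数据加载字段的组织方式为假设):A sketch of the training procedure (augmentation and optimizer settings are illustrative; criterion reuses the weighted hybrid loss sketched above, and the organization of the data-loader fields is an assumption):

```python
import torch
from torchvision import transforms

# 数据增强组合, 通常在 Dataset 中对输入图片调用(此处仅示意)
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, saturation=0.3, hue=0.1),  # 调整色彩/饱和度/亮度
    transforms.GaussianBlur(kernel_size=3),                           # 高斯模糊
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05)),       # 平移图片
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),                                 # 随机掩码
])

def train_one_epoch(model, criterion, loader, optimizer):
    model.train()
    for images, shifted_ids, label_ids, input_lens, target_lens in loader:
        ctc_logits, dec_logits = model(images, shifted_ids)
        log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)        # CTCLoss 需要 (T, B, V)
        loss = criterion(log_probs, label_ids, input_lens, target_lens, dec_logits, label_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# 将 criterion 中的 α 一并交给优化器, 即可在训练中随梯度动态更新:
# optimizer = torch.optim.AdamW(list(model.parameters()) + list(criterion.parameters()), lr=1e-4)
```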

在训练阶段,为了加快模型的收敛速度,可以在模型的编码器部分加载相应的用于图像分类的预训练模型参数,在模型的解码器部分加载相应的用于自然语言处理的预训练模型参数,如ERNIE3.0、BERT等。另外,为了获得更好模型泛化性,节约人工标注成本,也可以采用计算机程序合成大量仿真标注样本,并使用少量人工标注的真实数据进行调优。In the training phase, in order to speed up the convergence of the model, the corresponding pre-trained model parameters for image classification can be loaded into the encoder part of the model, and the corresponding pre-trained model parameters for natural language processing, such as ERNIE 3.0 or BERT, can be loaded into the decoder part of the model. In addition, to obtain better model generalization and save manual labeling costs, a computer program can also be used to synthesize a large number of simulated labeled samples, and a small amount of manually labeled real data can be used for fine-tuning.
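
加载预训练参数的一个示意(使用transformers库的组合接口;模型名称仅为示例,以实际可用的预训练权重为准):A sketch of loading pre-trained parameters (using the composition interface of the transformers library; the model names are only examples and depend on the pre-trained weights actually available):

```python
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-tiny-patch4-window7-224",  # 编码器: 图像分类预训练参数
    "bert-base-chinese",                       # 解码器: 自然语言处理预训练参数
)
```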

在操作S1350,对训练好的TrOCR模型,输入未知字符序列的图片,使用Attention束搜索的方式进行解码,预测出图片上的字符序列,进行验证。In operation S1350, the trained TrOCR model is input with a picture of an unknown character sequence, decoded using an Attention beam search method, and the character sequence on the picture is predicted for verification.
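
验证阶段束搜索解码的一个示意(此处以transformers的TrOCR接口为例,模型名称与文件路径均为示例):A sketch of beam-search decoding for verification (the TrOCR interface of transformers is used here as an example; the model name and file path are illustrative):

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("char_slice.png").convert("RGB")                       # 未知字符序列的图片
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, num_beams=5, max_length=64)  # 束搜索解码
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```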

在操作S1360,将目标检测模型和字符识别模型进行串联构成端到端识别模型。在端到端应用时,输入整张文档图片到端到端识别模型,利用目标检测模型获得字符结构体的切片得到待识别图像,然后将切片输入字符识别模型进行解码,预测出切片上的字符序列。In operation S1360, the object detection model and the character recognition model are connected in series to form an end-to-end recognition model. In an end-to-end application, the entire document picture is input into the end-to-end recognition model, the target detection model is used to obtain slices of the character structures as the images to be recognized, and then the slices are input into the character recognition model for decoding to predict the character sequences on the slices.
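
端到端串联的一个示意(detector与recognizer的接口为假设:detector返回字符结构体的像素框列表,recognizer对切片解码出字符序列):A sketch of the end-to-end pipeline (the interfaces of detector and recognizer are assumptions: detector returns a list of pixel boxes of character structures, and recognizer decodes the character sequence from each slice):

```python
from PIL import Image

def end_to_end_recognize(document_path: str, detector, recognizer) -> list:
    page = Image.open(document_path).convert("RGB")
    results = []
    for (x1, y1, x2, y2) in detector(page):        # 目标检测模型输出字符结构体的目标区域框
        slice_image = page.crop((x1, y1, x2, y2))  # 切片得到待识别图像
        results.append(recognizer(slice_image))    # 字符识别模型解码出切片上的字符序列
    return results
```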

根据本公开的实施例,提出了一种端到端的字符识别方案,可以有效识别不规则排列的文字、符号等,具有准确率高,鲁棒性好的特点。可有效用于各类印章如圆章、椭圆章、方章等的抬头识别,公式识别,多行文字识别等。According to the embodiments of the present disclosure, an end-to-end character recognition scheme is proposed, which can effectively recognize irregularly arranged characters, symbols, etc., and has the characteristics of high accuracy and good robustness. It can be effectively used for header recognition, formula recognition, and multi-line text recognition of various seals such as round stamps, oval stamps, and square stamps.

可以理解,本实施例中使用的目标检测模型,可以是任意的目标检测模型。本发明中TrOCR算法中使用的Encoder可以是任意的图像特征提取模型,Decoder部分可以是任意的基于Transformer Encoder/Decoder的自然语言模型。本发明中使用的混合损失函数中的分类损失函数也可以是任意的分类损失函数。It can be understood that the target detection model used in this embodiment may be any object detection model. The Encoder used in the TrOCR algorithm in the present invention may be any image feature extraction model, and the Decoder part may be any natural language model based on a Transformer Encoder/Decoder. The classification loss function in the hybrid loss function used in the present invention may also be any classification loss function.

基于上述字符识别模型的训练方法和字符识别方法,本公开还提供了一种字符识别模型的训练装置和一种字符识别装置。以下将结合图14和图15对该装置进行详细描述。Based on the above character recognition model training method and character recognition method, the present disclosure also provides a character recognition model training device and a character recognition device. The device will be described in detail below with reference to FIGS. 14 and 15 .

图14示意性示出了根据本公开实施例的字符识别模型的训练装置的结构框图。Fig. 14 schematically shows a structural block diagram of a training device for a character recognition model according to an embodiment of the present disclosure.

如图14所示,该实施例的字符识别模型的训练装置1400包括第一输入模块1410、第一预测模块1420、第二输入模块1430、损失计算模块1440、综合损失模块1450和模型更新模块1460。As shown in Figure 14, the training device 1400 of the character recognition model of this embodiment includes a first input module 1410, a first prediction module 1420, a second input module 1430, a loss calculation module 1440, a comprehensive loss module 1450 and a model update module 1460 .

第一输入模块1410可以执行操作S210,用于将训练图像输入字符识别模型的编码器,获得编码器输出的第二图像特征,其中,训练图像包括不规则排列的N个待识别字符,不规则排列包括未按照直线排列,N为大于或等于2的整数。The first input module 1410 may perform operation S210 and is used to input a training image into the encoder of the character recognition model to obtain the second image feature output by the encoder, where the training image includes N irregularly arranged characters to be recognized, the irregular arrangement includes not being arranged in a straight line, and N is an integer greater than or equal to 2.

第一预测模块1420可以执行操作S220,用于基于第二图像特征得到第一预测张量,第一预测张量包括对N个待识别字符的字符预测信息。The first prediction module 1420 may perform operation S220 for obtaining a first prediction tensor based on the second image feature, the first prediction tensor including character prediction information for the N characters to be recognized.

第二输入模块1430可以执行操作S230,用于将第二图像特征输入字符识别模型的解码器,获得解码器输出的第二预测张量,第二预测张量包括对N个待识别字符的字符预测信息。The second input module 1430 may perform operation S230 and is used to input the second image feature into the decoder of the character recognition model to obtain the second prediction tensor output by the decoder, where the second prediction tensor includes character prediction information for the N characters to be recognized.

在一些实施例中,第二输入模块1430还可以执行操作S710~操作S730,在此不再赘述。In some embodiments, the second input module 1430 may also perform operation S710 to operation S730, which will not be repeated here.

损失计算模块1440可以执行操作S240,用于获得第一预测张量与标注文本之间的第一损失函数值,以及第二预测张量与标注文本之间的第二损失函数值,标注文本包括每个待识别字符的标签。The loss calculation module 1440 may perform operation S240 and is used to obtain a first loss function value between the first prediction tensor and the labeled text, and a second loss function value between the second prediction tensor and the labeled text, where the labeled text includes a label for each character to be recognized.

综合损失模块1450可以执行操作S250,用于根据第一损失函数值和第二损失函数值得到综合损失函数值。The comprehensive loss module 1450 may perform operation S250 for obtaining a comprehensive loss function value according to the first loss function value and the second loss function value.

在一些实施例中,第二输入模块1430还可以执行操作S810~操作S830,在此不再赘述。In some embodiments, the second input module 1430 may also perform operation S810 to operation S830, which will not be repeated here.

模型更新模块1460可以执行操作S260,用于根据综合损失函数值更新编码器和解码器,得到经训练的字符识别模型。The model update module 1460 may perform operation S260 for updating the encoder and the decoder according to the integrated loss function value to obtain a trained character recognition model.

在一些实施例中,训练装置1400还可以包括第一目标检测模块,该模块用于执行操作S510~操作S530,在此不再赘述。In some embodiments, the training device 1400 may further include a first object detection module, which is configured to perform operation S510 to operation S530, which will not be repeated here.

尤其说明,训练装置1400包括分别用于执行如上图2~图8描述的任意一个实施例的各个步骤的模块。In particular, the training device 1400 includes modules for executing the steps of any one of the embodiments described above in FIG. 2 to FIG. 8 .

图15示意性示出了根据本公开实施例的字符识别装置的结构框图。Fig. 15 schematically shows a structural block diagram of a character recognition device according to an embodiment of the present disclosure.

如图15所示,该实施例的字符识别装置1500包括第三输入模块1510和识别结果模块1520。在一些实施例中,字符识别装置1500还可以包括图像获取模块,用于获取待识别图像,未在图15中示出。As shown in FIG. 15 , the character recognition device 1500 of this embodiment includes a third input module 1510 and a recognition result module 1520 . In some embodiments, the character recognition apparatus 1500 may further include an image acquisition module, which is used to acquire an image to be recognized, which is not shown in FIG. 15 .

第三输入模块1510可以执行操作S910,用于将待识别图像输入字符识别模型,其中,字符识别模型根据如上任一项实施例描述的训练方法得到,待识别图像包括不规则排列的N个待识别字符,不规则排列包括未按照直线排列,N为大于或等于2的整数。The third input module 1510 may perform operation S910 and is used to input the image to be recognized into the character recognition model, where the character recognition model is obtained according to the training method described in any one of the above embodiments, the image to be recognized includes N irregularly arranged characters to be recognized, the irregular arrangement includes not being arranged in a straight line, and N is an integer greater than or equal to 2.

识别结果模块1520可以执行操作S920,用于获得字符识别模型输出的N个待识别字符的识别结果。The recognition result module 1520 may perform operation S920 and is used to obtain the recognition results of the N characters to be recognized output by the character recognition model.

字符识别模型被配置为通过如下操作预先训练得到:基于编码器的编码结果得到第一损失函数值,基于解码器的解码结果得到第二损失函数值,基于第一损失函数值和第二损失函数值更新编码器和解码器。具体地,可以参照如图2~图8阐述的任一个实施例所描述的训练方法。The character recognition model is configured to be pre-trained through the following operations: a first loss function value is obtained based on the encoding result of the encoder, a second loss function value is obtained based on the decoding result of the decoder, and the encoder and the decoder are updated based on the first loss function value and the second loss function value. Specifically, reference may be made to the training method described in any one of the embodiments illustrated in FIGS. 2 to 8.

在一些实施例中,识别结果模块1520还可以执行操作S1110~操作S1130,或操作S1210~操作S1220,在此不再赘述。In some embodiments, the recognition result module 1520 may also perform operation S1110 to operation S1130, or operation S1210 to operation S1220, which will not be repeated here.

在一些实施例中,字符识别装置1500还可以包括第二目标检测模块,该模块用于执行操作S1010~操作S1030,在此不再赘述。In some embodiments, the character recognition apparatus 1500 may further include a second object detection module, which is configured to perform operation S1010 to operation S1030, which will not be repeated here.

尤其说明,字符识别装置1500包括分别用于执行如上图9~图12描述的任意一个实施例的各个步骤的模块。In particular, the character recognition device 1500 includes modules for executing the steps of any one of the embodiments described above in FIG. 9 to FIG. 12 .

需要说明的是,装置部分实施例中各模块/单元/子单元等的实施方式、解决的技术问题、实现的功能、以及达到的技术效果分别与方法部分实施例中各对应的步骤的实施方式、解决的技术问题、实现的功能、以及达到的技术效果相同或类似,在此不再赘述。It should be noted that the implementations of the modules/units/sub-units in the apparatus embodiments, the technical problems solved, the functions realized, and the technical effects achieved are respectively the same as or similar to the implementations, technical problems solved, functions realized, and technical effects achieved of the corresponding steps in the method embodiments, and will not be repeated here.

根据本公开的实施例,训练装置1400或字符识别装置1500中的任意多个模块可以合并在一个模块中实现,或者其中的任意一个模块可以被拆分成多个模块。或者,这些模块中的一个或多个模块的至少部分功能可以与其他模块的至少部分功能相结合,并在一个模块中实现。According to an embodiment of the present disclosure, any number of modules in the training device 1400 or the character recognition device 1500 can be implemented in one module, or any one module can be split into multiple modules. Alternatively, at least part of the functions of one or more of these modules may be combined with at least part of the functions of other modules and implemented in one module.

根据本公开的实施例,训练装置1400或字符识别装置1500中的至少一个可以至少被部分地实现为硬件电路,例如现场可编程门阵列(FPGA)、可编程逻辑阵列(PLA)、片上系统、基板上的系统、封装上的系统、专用集成电路(ASIC),或可以通过对电路进行集成或封装的任何其他的合理方式等硬件或固件来实现,或以软件、硬件以及固件三种实现方式中任意一种或以其中任意几种的适当组合来实现。或者,训练装置1400或字符识别装置1500中的至少一个可以至少被部分地实现为计算机程序模块,当该计算机程序模块被运行时,可以执行相应的功能。According to an embodiment of the present disclosure, at least one of the training device 1400 or the character recognition device 1500 may be at least partially implemented as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on a chip, a system on a substrate, a system on a package, an application-specific integrated circuit (ASIC), or as hardware or firmware in any other reasonable manner of integrating or packaging circuits, or may be implemented by any one of, or an appropriate combination of, the three implementation manners of software, hardware, and firmware. Alternatively, at least one of the training device 1400 or the character recognition device 1500 may be at least partially implemented as a computer program module, and when the computer program module is executed, the corresponding functions may be performed.

图16示意性示出了根据本公开实施例的适于实现字符识别模型的训练方法或字符识别方法的电子设备的方框图。Fig. 16 schematically shows a block diagram of an electronic device suitable for implementing a character recognition model training method or a character recognition method according to an embodiment of the present disclosure.

如图16所示,根据本公开实施例的电子设备1600包括处理器1601,其可以根据存储在只读存储器(ROM)1602中的程序或者从存储部分1608加载到随机访问存储器(RAM)1603中的程序而执行各种适当的动作和处理。处理器1601例如可以包括通用微处理器(例如CPU)、指令集处理器和/或相关芯片组和/或专用微处理器(例如,专用集成电路(ASIC))等等。处理器1601还可以包括用于缓存用途的板载存储器。处理器1601可以包括用于执行根据本公开实施例的方法流程的不同动作的单一处理单元或者是多个处理单元。As shown in FIG. 16, the electronic device 1600 according to an embodiment of the present disclosure includes a processor 1601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1602 or a program loaded from a storage section 1608 into a random access memory (RAM) 1603. The processor 1601 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or related chipsets, and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), and the like. The processor 1601 may also include on-board memory for caching purposes. The processor 1601 may include a single processing unit or multiple processing units for performing different actions of the method flow according to the embodiments of the present disclosure.

在RAM 1603中,存储有电子设备1600操作所需的各种程序和数据。处理器1601、ROM 1602以及RAM 1603通过总线1604彼此相连。处理器1601通过执行ROM 1602和/或RAM1603中的程序来执行根据本公开实施例的方法流程的各种操作。需要注意,程序也可以存储在除ROM 1602和RAM 1603以外的一个或多个存储器中。处理器1601也可以通过执行存储在一个或多个存储器中的程序来执行根据本公开实施例的方法流程的各种操作。In the RAM 1603, various programs and data necessary for the operation of the electronic device 1600 are stored. The processor 1601 , ROM 1602 , and RAM 1603 are connected to each other through a bus 1604 . The processor 1601 executes various operations according to the method flow of the embodiment of the present disclosure by executing programs in the ROM 1602 and/or RAM 1603 . It is to be noted that the programs may also be stored in one or more memories other than the ROM 1602 and the RAM 1603 . The processor 1601 may also perform various operations according to the method flow of the embodiments of the present disclosure by executing programs stored in one or more memories.

根据本公开的实施例,电子设备1600还可以包括输入/输出(I/O)接口1605,输入/输出(I/O)接口1605也连接至总线1604。电子设备1600还可以包括连接至I/O接口1605的以下部件中的一项或多项:包括键盘、鼠标等的输入部分1606。包括诸如阴极射线管(CRT)、液晶显示器(LCD)等以及扬声器等的输出部分1607。包括硬盘等的存储部分1608。以及包括诸如LAN卡、调制解调器等的网络接口卡的通信部分1609。通信部分1609经由诸如因特网的网络执行通信处理。驱动器1610也根据需要连接至I/O接口1605。可拆卸介质1611,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器1610上,以便于从其上读出的计算机程序根据需要被安装入存储部分1608。According to an embodiment of the present disclosure, the electronic device 1600 may further include an input/output (I/O) interface 1605 which is also connected to the bus 1604 . The electronic device 1600 may also include one or more of the following components connected to the I/O interface 1605: an input part 1606 including a keyboard, a mouse, and the like. An output section 1607 such as a cathode ray tube (CRT), a liquid crystal display (LCD), etc., a speaker, etc. is included. A storage section 1608 including a hard disk and the like. And a communication section 1609 including a network interface card such as a LAN card, a modem, and the like. The communication section 1609 performs communication processing via a network such as the Internet. A drive 1610 is also connected to the I/O interface 1605 as needed. A removable medium 1611, such as a magnetic disk, optical disk, magneto-optical disk, semiconductor memory, etc., is mounted on the drive 1610 as necessary so that a computer program read therefrom is installed into the storage section 1608 as necessary.

本公开还提供了一种计算机可读存储介质,该计算机可读存储介质可以是上述实施例中描述的设备/装置/系统中所包含的。也可以是单独存在,而未装配入该设备/装置/系统中。上述计算机可读存储介质承载有一个或者多个程序,当上述一个或者多个程序被执行时,实现根据本公开实施例的方法。The present disclosure also provides a computer-readable storage medium, which may be included in the device/apparatus/system described in the above embodiments. It can also exist independently without being assembled into the equipment/device/system. The above-mentioned computer-readable storage medium carries one or more programs, and when the above-mentioned one or more programs are executed, the method according to the embodiment of the present disclosure is realized.

根据本公开的实施例,计算机可读存储介质可以是非易失性的计算机可读存储介质,例如可以包括但不限于:便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。例如,根据本公开的实施例,计算机可读存储介质可以包括上文描述的ROM 1602和/或RAM 1603和/或ROM 1602和RAM 1603以外的一个或多个存储器。According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, such as may include but not limited to: portable computer disk, hard disk, random access memory (RAM), read-only memory (ROM) , erasable programmable read-only memory (EPROM or flash memory), portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. For example, according to an embodiment of the present disclosure, the computer-readable storage medium may include one or more memories other than the ROM 1602 and/or the RAM 1603 and/or the ROM 1602 and the RAM 1603 described above.

本公开的实施例还包括一种计算机程序产品,其包括计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。当计算机程序产品在计算机系统中运行时,该程序代码用于使计算机系统实现本公开实施例所提供的方法。Embodiments of the present disclosure also include a computer program product, which includes a computer program including program codes for executing the methods shown in the flowcharts. When the computer program product runs in the computer system, the program code is used to make the computer system realize the method provided by the embodiments of the present disclosure.

在该计算机程序被处理器1601执行时执行本公开实施例的系统/装置中限定的上述功能。根据本公开的实施例,上文描述的系统、装置、模块、单元等可以通过计算机程序模块来实现。When the computer program is executed by the processor 1601, the above-mentioned functions defined in the system/apparatus of the embodiment of the present disclosure are performed. According to the embodiments of the present disclosure, the above-described systems, devices, modules, units, etc. may be implemented by computer program modules.

在一种实施例中,该计算机程序可以依托于光存储器件、磁存储器件等有形存储介质。在另一种实施例中,该计算机程序也可以在网络介质上以信号的形式进行传输、分发,并通过通信部分1609被下载和安装,和/或从可拆卸介质1611被安装。该计算机程序包含的程序代码可以用任何适当的网络介质传输,包括但不限于:无线、有线等等,或者上述的任意合适的组合。In one embodiment, the computer program may rely on tangible storage media such as optical storage devices and magnetic storage devices. In another embodiment, the computer program can also be transmitted and distributed in the form of a signal on network media, downloaded and installed through the communication part 1609, and/or installed from the removable media 1611. The program code contained in the computer program can be transmitted by any appropriate network medium, including but not limited to: wireless, wired, etc., or any appropriate combination of the above.

在这样的实施例中,该计算机程序可以通过通信部分1609从网络上被下载和安装,和/或从可拆卸介质1611被安装。在该计算机程序被处理器1601执行时,执行本公开实施例的系统中限定的上述功能。根据本公开的实施例,上文描述的系统、设备、装置、模块、单元等可以通过计算机程序模块来实现。In such an embodiment, the computer program may be downloaded and installed from a network via communication portion 1609 and/or installed from removable media 1611 . When the computer program is executed by the processor 1601, the above-mentioned functions defined in the system of the embodiment of the present disclosure are executed. According to the embodiments of the present disclosure, the above-described systems, devices, devices, modules, units, etc. may be implemented by computer program modules.

根据本公开的实施例,可以以一种或多种程序设计语言的任意组合来编写用于执行本公开实施例提供的计算机程序的程序代码,具体地,可以利用高级过程和/或面向对象的编程语言、和/或汇编/机器语言来实施这些计算程序。程序设计语言包括但不限于诸如Java,C++,python,“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。According to the embodiments of the present disclosure, the program code for executing the computer program provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; specifically, these computing programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, languages such as Java, C++, Python, the "C" language, or similar programming languages. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case involving a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, connected through the Internet using an Internet service provider).

附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,上述模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图或流程图中的每个方框、以及框图或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block in the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or can be implemented by a combination of dedicated hardware and computer instructions.

本领域技术人员可以理解,本公开的各个实施例和/或权利要求中记载的特征可以进行多种组合或/或结合,即使这样的组合或结合没有明确记载于本公开中。特别地,在不脱离本公开精神和教导的情况下,本公开的各个实施例和/或权利要求中记载的特征可以进行多种组合和/或结合。所有这些组合和/或结合均落入本公开的范围。Those skilled in the art can understand that various combinations and/or combinations of the features described in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not explicitly recorded in the present disclosure. In particular, without departing from the spirit and teaching of the present disclosure, the various embodiments of the present disclosure and/or the features described in the claims can be combined and/or combined in various ways. All such combinations and/or combinations fall within the scope of the present disclosure.

以上对本公开的实施例进行了描述。但是,这些实施例仅仅是为了说明的目的,而并非为了限制本公开的范围。尽管在以上分别描述了各实施例,但是这并不意味着各个实施例中的措施不能有利地结合使用。本公开的范围由所附权利要求及其等同物限定。不脱离本公开的范围,本领域技术人员可以做出多种替代和修改,这些替代和修改都应落在本公开的范围之内。The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the various embodiments have been described separately above, this does not mean that the measures in the various embodiments cannot be advantageously used in combination. The scope of the present disclosure is defined by the appended claims and their equivalents. Various substitutions and modifications can be made by those skilled in the art without departing from the scope of the present disclosure, and these substitutions and modifications should all fall within the scope of the present disclosure.

Claims (14)

1.一种字符识别方法,包括:1. A character recognition method, comprising: 获取待识别图像,其中,所述待识别图像包括不规则排列的N个待识别字符,所述不规则排列包括未按照直线排列,N为大于或等于2的整数;Acquiring an image to be recognized, wherein the image to be recognized includes N characters to be recognized that are irregularly arranged, the irregular arrangement includes not being arranged in a straight line, and N is an integer greater than or equal to 2; 将所述待识别图像输入字符识别模型;Inputting the image to be recognized into the character recognition model; 获得所述字符识别模型输出的所述N个待识别字符的识别结果;obtaining the recognition results of the N characters to be recognized output by the character recognition model; 其中,所述字符识别模型包括编码器和解码器,所述字符识别模型被配置为通过如下操作预先训练得到:Wherein, the character recognition model includes an encoder and a decoder, and the character recognition model is configured to be pre-trained through the following operations: 基于所述编码器的编码结果得到第一损失函数值,基于所述解码器的解码结果得到第二损失函数值,基于所述第一损失函数值和所述第二损失函数值更新所述编码器和所述解码器。Obtaining a first loss function value based on an encoding result of the encoder, obtaining a second loss function value based on a decoding result of the decoder, updating the encoding based on the first loss function value and the second loss function value device and the decoder. 2.根据权利要求1所述的方法,其中,所述获取待识别图像包括:2. The method according to claim 1, wherein said obtaining the image to be recognized comprises: 确定目标图像;Determine the target image; 利用目标检测模型从所述目标图像中确定第一目标区域,所述第一目标区域包括所述N个待识别字符;Using a target detection model to determine a first target area from the target image, the first target area includes the N characters to be recognized; 将所述第一目标区域从所述目标图像分割,得到所述待识别图像。Segmenting the first target area from the target image to obtain the image to be recognized. 3.根据权利要求1所述的方法,其中,所述获得所述字符识别模型输出的所述N个待识别字符的识别结果包括:3. The method according to claim 1, wherein said obtaining the recognition results of said N characters to be recognized outputted by said character recognition model comprises: 利用所述编码器处理所述待识别图像,获得所述编码器输出的第一图像特征;using the encoder to process the image to be recognized to obtain a first image feature output by the encoder; 利用分类函数处理所述第一图像特征,得到所述N个待识别字符的中间识别结果,所述分类函数用于对所述第一图像特征进行分类;Processing the first image feature by using a classification function to obtain an intermediate recognition result of the N characters to be recognized, the classification function being used to classify the first image feature; 将所述第一图像特征和第一序列文本输入所述解码器,获得所述解码器输出的所述N个待识别字符的识别结果,所述第一序列文本通过将所述中间识别结果中每个字符识别结果右移M个字符位置获得,M为大于或等于1的整数。The first image feature and the first sequence of text are input into the decoder to obtain the recognition results of the N characters to be recognized output by the decoder, and the first sequence of text is obtained by adding the intermediate recognition results to Each character recognition result is obtained by shifting M character positions to the right, where M is an integer greater than or equal to 1. 4.根据权利要求3所述的方法,其中,所述解码器包括交叉注意力层和分类层,所述将所述第一图像特征和第一序列文本输入所述解码器,获得所述解码器输出的所述N个待识别字符的识别结果包括:4. 
The method according to claim 3, wherein the decoder comprises a cross-attention layer and a classification layer, the first image feature and the first sequence of text are input into the decoder to obtain the decoded The recognition results of the N characters to be recognized that the device outputs include: 利用所述交叉注意力层处理所述第一图像特征和所述第一序列文本,得到第一注意力特征,其中,所述交叉注意力层被配置为基于交叉注意力机制处理数据;Using the cross-attention layer to process the first image feature and the first sequence of text to obtain a first attention feature, wherein the cross-attention layer is configured to process data based on a cross-attention mechanism; 利用所述分类层处理所述第一注意力特征,得到所述N个待识别字符的识别结果,所述分类层用于对所述第一注意力特征进行分类。The classification layer is used to process the first attention feature to obtain recognition results of the N characters to be recognized, and the classification layer is used to classify the first attention feature. 5.根据权利要求1所述的方法,其中,所述获得所述字符识别模型输出的所述N个待识别字符的识别结果包括:5. The method according to claim 1, wherein said obtaining the recognition results of said N characters to be recognized outputted by said character recognition model comprises: 利用所述编码器处理所述待识别图像,获得所述编码器输出的第一图像特征;using the encoder to process the image to be recognized to obtain a first image feature output by the encoder; 将所述第一图像特征输入所述解码器,获得所述解码器输出的所述N个待识别字符的识别结果。The first image feature is input into the decoder, and the recognition results of the N characters to be recognized outputted by the decoder are obtained. 6.根据权利要求1所述的方法,其中,所述编码结果包括第二图像特征,所述基于所述编码器的编码结果得到第一损失函数值包括:6. The method according to claim 1, wherein the encoding result comprises a second image feature, and the obtaining the first loss function value based on the encoding result of the encoder comprises: 将训练图像输入所述编码器,获得所述编码器输出的第二图像特征,其中,所述训练图像包括不规则排列的N个待识别字符;Inputting a training image into the encoder to obtain a second image feature output by the encoder, wherein the training image includes irregularly arranged N characters to be recognized; 基于所述第二图像特征得到第一预测张量,所述第一预测张量包括对所述N个待识别字符的字符预测信息;Obtaining a first prediction tensor based on the second image feature, the first prediction tensor including character prediction information for the N characters to be recognized; 获得所述第一预测张量与标注文本之间的第一损失函数值,所述标注文本包括每个待识别字符的标签。Obtain a first loss function value between the first prediction tensor and annotated text, where the annotated text includes a label of each character to be recognized. 7.根据权利要求6所述的方法,其中,所述解码结果包括第二预测张量,所述基于所述解码器的解码结果得到第二损失函数值包括:7. The method according to claim 6, wherein the decoding result comprises a second prediction tensor, and the obtaining a second loss function value based on the decoding result of the decoder comprises: 将所述第二图像特征输入所述解码器,获得所述解码器输出的所述第二预测张量,所述第二预测张量包括对所述N个待识别字符的字符预测信息;Inputting the second image feature into the decoder to obtain the second prediction tensor output by the decoder, the second prediction tensor including character prediction information for the N characters to be recognized; 获得所述第二预测张量与所述标注文本之间的第二损失函数值。A second loss function value between the second prediction tensor and the labeled text is obtained. 8.根据权利要求7所述的方法,其中,所述基于所述第一损失函数值和所述第二损失函数值更新所述编码器和所述解码器包括:8. 
The method of claim 7, wherein said updating said encoder and said decoder based on said first loss function value and said second loss function value comprises: 根据所述第一损失函数值及其第一权重,得到第一加权值;Obtaining a first weighted value according to the first loss function value and its first weight; 根据所述第二损失函数值及其第二权重,得到第二加权值;Obtaining a second weighted value according to the second loss function value and its second weight; 根据所述第一加权值和所述第二加权值得到综合损失函数值;obtaining a comprehensive loss function value according to the first weighted value and the second weighted value; 根据所述综合损失函数值更新所述编码器和所述解码器,得到经训练的字符识别模型。updating the encoder and the decoder according to the comprehensive loss function value to obtain a trained character recognition model. 9.根据权利要求7所述的方法,其中,所述解码器包括交叉注意力层和分类层,所述将所述第二图像特征输入所述解码器,获得所述解码器输出的所述第二预测张量包括:9. The method according to claim 7, wherein said decoder comprises a cross-attention layer and a classification layer, said inputting said second image feature into said decoder, obtaining said decoder outputted The second prediction tensor consists of: 将所述第二图像特征和第二序列文本输入所述解码器,所述第二序列文本通过将所述标签文本中每个待识别字符的标签右移M个字符位置获得,M为大于或等于1的整数;The second image feature and the second sequence of text are input into the decoder, the second sequence of text is obtained by moving the label of each character to be recognized in the label text to the right by M character positions, and M is greater than or an integer equal to 1; 利用所述交叉注意力层处理所述第二图像特征和所述第二序列文本,得到第二注意力特征,其中,所述交叉注意力层被配置为基于交叉注意力机制处理数据;Using the cross-attention layer to process the second image feature and the second sequence of text to obtain a second attention feature, wherein the cross-attention layer is configured to process data based on a cross-attention mechanism; 利用所述分类层处理所述第二注意力特征,得到所述第二预测张量。The classification layer is used to process the second attention feature to obtain the second prediction tensor. 10.根据权利要求6所述的方法,其中,所述基于所述第二图像特征得到第一预测张量包括:10. The method according to claim 6, wherein said obtaining a first prediction tensor based on said second image feature comprises: 对所述第二图像特征进行升维处理,得到第三图像特征;performing dimension-up processing on the second image feature to obtain a third image feature; 将所述第三图像特征输入至分类函数,得到所述第一预测张量。The third image feature is input to a classification function to obtain the first prediction tensor. 11.一种字符识别装置,包括:11. 
A character recognition device, comprising: 图像获取模块,用于获取待识别图像,其中,所述待识别图像包括不规则排列的N个待识别字符,所述不规则排列包括未按照直线排列,N为大于或等于2的整数;An image acquisition module, configured to acquire an image to be recognized, wherein the image to be recognized includes N characters to be recognized that are irregularly arranged, the irregular arrangement includes not being arranged in a straight line, and N is an integer greater than or equal to 2; 第三输入模块,用于将所述待识别图像输入字符识别模型;The third input module is used to input the image to be recognized into the character recognition model; 识别结果模块,用于获得所述字符识别模型输出的所述N个待识别字符的识别结果;A recognition result module, configured to obtain the recognition results of the N characters to be recognized output by the character recognition model; 其中,所述字符识别模型包括编码器和解码器,所述字符识别模型被配置为通过如下操作预先训练得到:Wherein, the character recognition model includes an encoder and a decoder, and the character recognition model is configured to be pre-trained through the following operations: 基于所述编码器的编码结果得到第一损失函数值,基于所述解码器的解码结果得到第二损失函数值,基于所述第一损失函数值和所述第二损失函数值更新所述编码器和所述解码器。Obtaining a first loss function value based on an encoding result of the encoder, obtaining a second loss function value based on a decoding result of the decoder, updating the encoding based on the first loss function value and the second loss function value device and the decoder. 12.一种电子设备,包括:12. An electronic device comprising: 一个或多个处理器;one or more processors; 存储装置,用于存储一个或多个程序,storage means for storing one or more programs, 其中,当所述一个或多个程序被所述一个或多个处理器执行时,使得所述一个或多个处理器执行根据权利要求1~10中任一项所述的方法。Wherein, when the one or more programs are executed by the one or more processors, the one or more processors are made to execute the method according to any one of claims 1-10. 13.一种计算机可读存储介质,其上存储有可执行指令,该指令被处理器执行时使处理器执行根据权利要求1~10中任一项所述的方法。13. A computer-readable storage medium, on which executable instructions are stored, and the instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1-10. 14.一种计算机程序产品,包括计算机程序,所述计算机程序被处理器执行时实现根据权利要求1~10中任一项所述的方法。14. A computer program product comprising a computer program, the computer program implementing the method according to any one of claims 1-10 when executed by a processor.
CN202310627818.5A 2023-05-30 2023-05-30 Character recognition method, device, equipment and medium Pending CN116645675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310627818.5A CN116645675A (en) 2023-05-30 2023-05-30 Character recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310627818.5A CN116645675A (en) 2023-05-30 2023-05-30 Character recognition method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN116645675A true CN116645675A (en) 2023-08-25

Family

ID=87622609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310627818.5A Pending CN116645675A (en) 2023-05-30 2023-05-30 Character recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116645675A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401374A (en) * 2020-03-06 2020-07-10 湖南快乐阳光互动娱乐传媒有限公司 Model training method, character recognition method and device based on multitasking
CN114037990A (en) * 2021-11-04 2022-02-11 北京有竹居网络技术有限公司 Character recognition method, device, equipment, medium and product
WO2023078070A1 (en) * 2021-11-04 2023-05-11 北京有竹居网络技术有限公司 Character recognition method and apparatus, device, medium, and product
CN114973136A (en) * 2022-05-31 2022-08-30 河南工业大学 Scene image recognition method under extreme conditions
CN114973229A (en) * 2022-05-31 2022-08-30 深圳市星桐科技有限公司 Text recognition model training method, text recognition device, text recognition equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ISRAEL CAMPIOTTI ET AL.: "Optical Character Recognition with Transformers and CTC", 《DOCENG’22》, 18 November 2022 (2022-11-18), pages 1 - 4 *
李丹: "基于特征融合和多尺度注意力机制的自然场景自适应不规则文本检测与识别", 《万方学位论文》, 5 May 2023 (2023-05-05), pages 1 - 80 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118674912A (en) * 2024-06-03 2024-09-20 航天物联网技术有限公司 Weak and small target identification and positioning method for airborne wide-field investigation load
CN118711195A (en) * 2024-06-28 2024-09-27 广州市富卓电子科技有限公司 Image data processing method and system for container identification
CN120726647A (en) * 2025-08-15 2025-09-30 东北师范大学 Automatic detection and recognition system and method for Manchu ancient book characters based on ViT

Similar Documents

Publication Publication Date Title
CN115035538B (en) Training method of text recognition model, and text recognition method and device
CN114596566B (en) Text recognition method and related device
WO2023116507A1 (en) Target detection model training method and apparatus, and target detection method and apparatus
US10176409B2 (en) Method and apparatus for image character recognition model generation, and vertically-oriented character image recognition
CN116645675A (en) Character recognition method, device, equipment and medium
US20250384560A1 (en) Model construction method and apparatus, image segmentation method and apparatus, device and medium
CN114072857A (en) Identifying key-value pairs in a document
CN114820871B (en) Font generation method, model training method, device, equipment and medium
CN116982089A (en) Methods and systems for image semantic enhancement
CN112749695A (en) Text recognition method and device
CN116468970A (en) Model training method, image processing method, device, equipment and medium
WO2023159819A1 (en) Visual processing and model training methods, device, storage medium and program product
US20230206522A1 (en) Training method for handwritten text image generation mode, electronic device and storage medium
CN113420727B (en) Training method and device for table detection model, and table detection method and device
CN114724133B (en) Text detection and model training method, device, equipment and storage medium
CN114863437B (en) Text recognition method and device, electronic equipment and storage medium
CN115861462A (en) Training method and device for image generation model, electronic equipment and storage medium
CN115862040A (en) Text error correction method and device, computer equipment and readable storage medium
WO2023134143A1 (en) Image sample generation method and apparatus, text recognition method and apparatus, device, and medium
CN112084954A (en) Video target detection method and device, electronic equipment and storage medium
CN115544210A (en) Model training and event extraction method based on event extraction of continuous learning
CN115376140A (en) Image processing method, apparatus, device and medium
CN114969332A (en) Method and device for training text audit model
CN118097798A (en) Multi-mode face tampering video detection method and detector training method
CN117475034A (en) Poster generation method, device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination