CN119181106A - Form content extraction method and system - Google Patents
Form content extraction method and system
- Publication number
- CN119181106A (application CN202411417968.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- attention
- representation
- position information
- refining module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/412—Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0499—Feedforward networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/19007—Matching; Proximity measures
- G06V30/19093—Proximity measures, i.e. similarity or distance measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/416—Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Computer Graphics (AREA)
- Geometry (AREA)
- Image Analysis (AREA)
Abstract
The present invention provides a form content extraction method and system. The scheme comprises: obtaining a form image from which content is to be extracted; extracting text and its corresponding position information from the obtained form image; performing label recognition on the text in the form using a pre-trained text label recognition model, based on the extracted text and its corresponding position information; and extracting the form content based on the obtained text and text labels.
Description
Technical Field
The invention belongs to the technical field of document content extraction, and particularly relates to a form content extraction method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Traditional document content extraction usually relies on human intervention. smartFIX was designed to process a variety of documents, from fixed-format forms to unstructured letters in arbitrary formats. Schuster et al. automatically extracted company names from financial news using a rule-based filtering method, and template-based matching has been widely used to extract one or more targets. However, these approaches require a large number of hand-designed rules, or templates built on position coordinates, so their migration cost is high.
To address these problems, Chiu et al. proposed an innovative hybrid model combining BiLSTM and CNN that automatically detects and extracts word- and character-level features, dispensing with cumbersome feature engineering and deep lexical knowledge while simplifying the pipeline and meeting high performance requirements. Huang et al. proposed a BiLSTM-CRF model for sequence labeling tasks, which combines BiLSTM with a CRF (conditional random field) so that it exploits both past and future input features and sentence-level tag information, making it more powerful and flexible for natural language processing tasks. The post-OCR parsing scheme extracts text paragraphs and their coordinates through OCR, serializes them, applies BIO markup, and finally merges the groups to generate parsing results. Jiang et al. proposed directly adding text-block coordinate embedding features into the BiLSTM-CRF model. However, these methods rely only on the text itself and its corresponding position information to extract content; for multimodal documents such as forms, where the internal text items have association relationships, such methods cannot effectively capture the relationships between texts, so the extracted content information is incomplete.
Disclosure of Invention
The embodiments of the invention provide a form content extraction method and system, which solve the problem that the prior art extracts content relying only on the text and its corresponding position information while ignoring the association relationships between text entities in a form, so that the extracted content information is incomplete.
According to a first aspect of an embodiment of the present invention, there is provided a form content extraction method, including:
acquiring a form image from which content is to be extracted;
extracting text and corresponding text position information based on the obtained form image;
based on the extracted text and the corresponding text position information, utilizing a pre-trained text label recognition model to realize label recognition of the text in the form;
The text label recognition model specifically performs the following processing: constructing a text embedding representation, a position information embedding representation and a layout embedding representation based on the extracted text and its corresponding position information; obtaining a multimodal feature representation from these three embedding representations using a multimodal encoder that incorporates an attention refining module; and obtaining the text label recognition result from the multimodal feature representation using a decoder that incorporates the attention refining module. The attention refining module constructs an initial similarity matrix from the attention scores of the input data obtained by the attention mechanism, constructs a fine matrix based on the initial similarity matrix, and obtains its output from the initial similarity matrix and the fine matrix combined with the content vector obtained by the attention mechanism;
and based on the obtained text and the text label, extracting the form content.
The text embedding representation, position information embedding representation and layout embedding representation are constructed as follows: the extracted text is tokenized and each word is converted into an index in a vocabulary; the index serves as the one-dimensional position information embedding representation; the word vector of each token is added to the one-dimensional position information embedding representation to obtain the text embedding representation; and the layout embedding representation of the text is constructed from the width, height and boundary coordinate values of the text box from which the text was extracted.
The multimodal encoder incorporating the attention refining module comprises a multi-head attention refining module and a two-layer feedforward neural network. The multi-head attention refining module is composed of multiple attention refining modules; the text embedding, position information embedding and layout embedding representations serve as the input of each attention refining module, the outputs of all attention refining modules are concatenated and linearly transformed to obtain concatenated features, and the concatenated features then pass through a residual connection, normalization and a fully connected network in turn to obtain the multimodal feature representation.
Further, the decoder incorporating the attention refining module specifically includes a first decoder and a second decoder: based on the obtained multimodal feature representation, the category label of each text is obtained through the first decoder, and based on the obtained category labels, the relationships between texts are obtained through the second decoder.
Further, the attention refining module computes the fine matrix as:
R = reshape(norm(max(0, conv(reshape(α_t))) W_R))
where A denotes the attention score matrix, t ∈ [0, L) denotes a position, α_t denotes the attention score at position t, R denotes the fine matrix obtained after batch normalization and dimension adjustment, W_R ∈ ℝ^(d_c×n) is a trainable parameter matrix, d_c is the intermediate dimension of the convolution layer, and n denotes the number of attention heads.
Furthermore, the text and its corresponding position information are extracted by an optical character recognition method.
According to a second aspect of an embodiment of the present invention, there is provided a form content extraction system including:
a data acquisition unit for acquiring a form image from which content is to be extracted;
A basic information extraction unit for extracting text and corresponding text position information based on the obtained form image;
The text label recognition unit is used for performing label recognition on the text in the form using a pre-trained text label recognition model, based on the extracted text and its corresponding position information. The text label recognition model specifically performs the following processing: constructing a text embedding representation, a position information embedding representation and a layout embedding representation based on the extracted text and its corresponding position information; obtaining a multimodal feature representation from these three embedding representations using a multimodal encoder that incorporates an attention refining module; and obtaining the text label recognition result from the multimodal feature representation using a decoder that incorporates the attention refining module. The attention refining module constructs an initial similarity matrix from the attention scores of the input data obtained by the attention mechanism, constructs a fine matrix based on the initial similarity matrix, and obtains its output from the initial similarity matrix and the fine matrix combined with the content vector obtained by the attention mechanism;
And the form content extraction unit is used for extracting the form content based on the obtained text and the text label.
According to a third aspect of the embodiments of the present invention, there is provided an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the form content extraction method when executing the program.
According to a fourth aspect of embodiments of the present invention, there is provided a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the form content extraction method.
According to a fifth aspect of the embodiments of the present invention, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the form content extraction method.
The one or more of the above technical solutions have the following beneficial effects:
(1) The invention provides a form content extraction method and system. In the form content extraction process, the scheme introduces an embedding representation of the overall layout of the form document and combines it with the text and its corresponding position information; label recognition of the text in the form is performed on the basis of multimodal features, the association relationships between texts in the form are obtained through label recognition, and the combination of each text and its label serves as the form content extraction result, which effectively enriches the completeness of form content extraction.
(2) The invention provides an attention refining module, through which the attention scores are reweighted and refined, enabling more accurate text recognition.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is an overall flowchart of the form content extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a form image text extraction and tag recognition result according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the partial and overall structure of the attention refining module according to an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
As shown in fig. 1, an embodiment of the present invention provides a form content extraction method, which specifically includes the following processing procedures:
Step 1, acquiring a form image from which content is to be extracted;
In implementations, a digital camera may be employed to capture the form image.
Step 2, extracting text and corresponding text position information based on the obtained form image;
In a specific implementation, the text and its corresponding position information are extracted by an optical character recognition method. Specifically, the form image is input into a DBNet++ network to obtain the positions of all text regions; the m-th position is denoted POS_m, and the result is shown in Fig. 2(a). Let IMG_m denote the image region corresponding to POS_m. IMG_m is then input into a CRNN network to obtain the text at POS_m, denoted TXT_m; the result is shown in Fig. 2(b).
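As a minimal sketch of this two-stage pipeline (text detection with DBNet++, then recognition with CRNN), the following assumes hypothetical `dbnet_detect` and `crnn_recognize` wrappers around pre-trained models, since the patent does not name a concrete implementation:

```python
from typing import Callable, List, Tuple
import numpy as np

# Two-stage OCR step (Step 2). dbnet_detect and crnn_recognize are
# hypothetical wrappers around pre-trained DBNet++ and CRNN models.
def extract_text_and_positions(
        form_image: np.ndarray,
        dbnet_detect: Callable[[np.ndarray], List[Tuple[int, int, int, int]]],
        crnn_recognize: Callable[[np.ndarray], str],
) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    results = []
    # DBNet++ returns one box POS_m = (x0, y0, x1, y1) per detected text region.
    for pos in dbnet_detect(form_image):
        x0, y0, x1, y1 = pos
        img_m = form_image[y0:y1, x0:x1]   # IMG_m: crop corresponding to POS_m
        txt_m = crnn_recognize(img_m)      # TXT_m: text recognized at POS_m
        results.append((txt_m, pos))
    return results
```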
Step 3, based on the extracted text and the corresponding text position information, utilizing a pre-trained text label recognition model to realize label recognition of the text in the form;
The text label recognition model specifically performs the following processing: constructing a text embedding representation, a position information embedding representation and a layout embedding representation based on the extracted text and its corresponding position information; obtaining a multimodal feature representation from these three embedding representations using a multimodal encoder that incorporates an attention refining module; and obtaining the text label recognition result from the multimodal feature representation using a decoder that incorporates the attention refining module. The attention refining module constructs an initial similarity matrix from the attention scores of the input data obtained by the attention mechanism, constructs a fine matrix based on the initial similarity matrix, and obtains its output from the initial similarity matrix and the fine matrix combined with the content vector obtained by the attention mechanism;
The text embedding representation, position information embedding representation and layout embedding representation are constructed as follows: the extracted text is tokenized and each word is converted into an index in a vocabulary; the index serves as the one-dimensional position information embedding representation; the word vector of each token is added to the one-dimensional position information embedding representation to obtain the text embedding representation; and the layout embedding representation of the text is constructed from the width, height and boundary coordinate values of the text box from which the text was extracted.
In a specific implementation, the multimodal encoder incorporating the attention refining module comprises a multi-head attention refining module and a two-layer feedforward neural network. The multi-head attention refining module is composed of multiple attention refining modules; the text embedding, position information embedding and layout embedding representations serve as the input of each attention refining module, the outputs of all attention refining modules are concatenated and linearly transformed to obtain concatenated features, and the concatenated features then pass through a residual connection, normalization and a fully connected network in turn to obtain the multimodal feature representation.
In a specific implementation, the decoder incorporating the attention refining module specifically includes a first decoder and a second decoder: based on the obtained multimodal feature representation, the category label of each text is obtained through the first decoder, and based on the obtained category labels, the relationships between texts are obtained through the second decoder.
For easy understanding, the text label recognition model in this embodiment is described in detail below with reference to the accompanying drawings:
The text label recognition model includes three parts: a multimodal information embedding layer, an encoder incorporating the attention refining module (processing result shown in Fig. 2(c)), and a decoder incorporating the attention refining module (processing result shown in Fig. 2(d)). Specifically:
Multimodal information embedding layer:
The multimodal information embedding layer consists of trainable embedding matrices, each row of which corresponds to the embedding vector of an input element. By mapping input data to the corresponding rows of the embedding matrix, an embedded representation of the input is obtained. The multimodal information embedding part includes text embedding, layout embedding and one-dimensional position embedding. In the text embedding part, the text obtained by OCR from the laboratory sheet is first tokenized, and each word is converted into an index in the vocabulary. These indices can be regarded as one-dimensional position vectors representing the position of the text in the sequence. The word vector and the one-dimensional position vector are then added to obtain the text vector fed into the encoder. The i-th text embedding T_i is defined as:
T_i = TokenEmb(t_i) + PosEmb1D(i), 0 ≤ i < L (1)
where t_i denotes the i-th word vector, L denotes the sequence length, TokenEmb() denotes the embedding of a word vector, and PosEmb1D() denotes the one-dimensional position embedding.
The layout vector is a vector representation of spatial layout information. The entire laboratory sheet image is regarded as a coordinate system with the origin at the upper left, and the coordinate range covered by each word in the image is represented as 2D position information using a bounding box parallel to the coordinate axes. The width and height of the bounding box are taken from the width and height of the box in the OCR result. After tokenization, different tokens are obtained, and the tokens of one text share a text-level bounding box. All coordinates are normalized to the range [0, 1000]. Two embedding layers are then used to encode the position features of the x-axis and the y-axis, respectively. The layout embedding layer describes each bounding box by six key parameters: the width and height, together with the four coordinate values defining its left, right, upper and lower boundaries, which jointly determine the exact location of the box on the two-dimensional plane. For example, for a text box "average hemoglobin concentration" in a test sheet, the height h and width w of the bounding box are first obtained; tokenization then yields tokens such as "average", "hemoglobin" and "concentration", which share the whole text box. The text box is encoded by the four boundary coordinates x0 (left), x1 (right), y0 (top) and y1 (bottom), which are normalized; these four boundaries together with the height and width determine the specific position of the text box. The six parameters are converted into corresponding vector representations and integrated into a comprehensive layout vector by a concatenation operation. For the i-th text bounding box, the layout vector P_i fed into the encoder is:
P_i = Concat(PosEmb2D_x(x0, x1, w), PosEmb2D_y(y0, y1, h)), 0 ≤ i < L (2)
where (x0, y0) denotes the coordinates of the upper-left corner, (x1, y1) the coordinates of the lower-right corner, w the width of the bounding box, h its height, and PosEmb2D_x() and PosEmb2D_y() are the two-dimensional embeddings of the position coordinates.
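The embedding construction of Eqs. (1)-(2) can be sketched as follows; the vocabulary size, embedding dimensions and per-coordinate embedding width are illustrative assumptions, not values given in the patent:

```python
import torch
import torch.nn as nn

# Sketch of the multimodal embedding layer (Eqs. (1)-(2)). Vocabulary size,
# d_model and the per-coordinate embedding width are illustrative assumptions.
class MultimodalEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, max_len=512, coord_bins=1001):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)       # TokenEmb
        self.pos_emb_1d = nn.Embedding(max_len, d_model)         # PosEmb1D
        # Separate x-axis and y-axis tables; coordinates are normalized to
        # [0, 1000] before lookup, hence 1001 bins.
        self.pos_emb_x = nn.Embedding(coord_bins, d_model // 6)  # x0, x1, w
        self.pos_emb_y = nn.Embedding(coord_bins, d_model // 6)  # y0, y1, h

    def forward(self, token_ids: torch.LongTensor, boxes: torch.LongTensor):
        # token_ids: (L,) vocabulary indices; boxes: (L, 6) = (x0, x1, w, y0, y1, h)
        positions = torch.arange(token_ids.size(0))
        t = self.token_emb(token_ids) + self.pos_emb_1d(positions)           # Eq. (1)
        x0, x1, w, y0, y1, h = boxes.unbind(dim=-1)
        p = torch.cat([self.pos_emb_x(x0), self.pos_emb_x(x1), self.pos_emb_x(w),
                       self.pos_emb_y(y0), self.pos_emb_y(y1), self.pos_emb_y(h)],
                      dim=-1)                                                # Eq. (2)
        return t, p
```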
Encoder incorporating the attention refining module:
The multimodal encoder with attention refining is built on the Transformer. Its core components are a multi-head attention refining module and a two-layer feedforward neural network; residual connections are added to both sub-layers, followed by layer normalization. This section focuses on the multi-head attention refining module. The improved refining attention is shown in Fig. 3(a): the multi-head attention refining module is composed of multiple self-attention refining layers, each created by adding a refining module to the traditional attention mechanism.
The self-attention mechanism starts from the input X = {x_1, x_2, ..., x_L} ∈ ℝ^(L×d_model), which is multiplied by three different matrices to obtain Q (the queries used to match other units), K (the keys matched by other units) and V (the values of the information to be extracted). The three are computed as:
Q = XW_Q, K = XW_K, V = XW_V (3)
where W_Q, W_K ∈ ℝ^(d_model×d_k) and W_V ∈ ℝ^(d_model×d_v) are trainable parameter matrices, L is the length of the input sequence, d_model is the model dimension, and d_k is the depth of each attention head.
The attention score α is then calculated as:
α = QK^T / √d_k (4)
where the scaling factor √d_k is introduced to avoid vanishing gradients, and α is the similarity matrix between the input units.
This embodiment weights and refines the attention scores through the attention refining module, so that text can be recognized more accurately. Let t ∈ [0, L), where t denotes a position, α_t denotes the attention score at position t, and R denotes the fine matrix obtained after batch normalization and dimension adjustment. The per-position scores form the attention score matrix
A = [α_0; α_1; ...; α_(L-1)] (5)
and
R = reshape(norm(max(0, conv(reshape(α_t))) W_R)) (6)
where the convolution kernel size of the convolution layer is 5, W_R ∈ ℝ^(d_c×n) is a trainable parameter matrix, d_c is the intermediate dimension of the convolution layer, and n denotes the number of attention heads. Finally, the fine term R is subtracted from the originally generated similarity matrix α to obtain the final similarity matrix B:
B=α-R (7)
After softmax, B is used to weight V to obtain the output of an attention head. The i-th attention head H_i is expressed as:
H_i = softmax(B)V_i (8)
The final output of the multi-head attention refining module is obtained by concatenating the n attention heads and applying a linear transformation:
H = concat(H_1, H_2, ..., H_n)W_n (9)
where W_n ∈ ℝ^(nd_k×d_model) is a trainable parameter matrix.
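A sketch of the multi-head attention refining module (Eqs. (3)-(9)) follows. The exact tensor layout of the reshape/conv in Eq. (6) is not fully specified in the text, so the sketch adopts one plausible layout in which each query row of the attention map is treated as a length-L signal with n head-channels; the class name and hyperparameters are assumptions:

```python
import math
import torch
import torch.nn as nn

# Plausible sketch of the multi-head attention refining module, Eqs. (3)-(9).
class RefinedMultiHeadAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_conv=32, kernel_size=5):
        super().__init__()
        assert d_model % n_heads == 0
        self.n, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)        # W_Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)        # W_K
        self.w_v = nn.Linear(d_model, d_model, bias=False)        # W_V
        self.conv = nn.Conv1d(n_heads, d_conv, kernel_size, padding=kernel_size // 2)
        self.w_r = nn.Parameter(torch.empty(d_conv, n_heads))     # W_R in Eq. (6)
        nn.init.xavier_uniform_(self.w_r)
        self.norm = nn.BatchNorm1d(n_heads)                       # batch normalization
        self.w_n = nn.Linear(d_model, d_model, bias=False)        # W_n in Eq. (9)

    def forward(self, x):
        B, L, _ = x.shape
        # Eq. (3): project the input, then split into n heads of depth d_k.
        q, k, v = (w(x).view(B, L, self.n, self.d_k).transpose(1, 2)
                   for w in (self.w_q, self.w_k, self.w_v))
        alpha = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)     # Eq. (4): (B, n, L, L)
        # Eq. (6): refine the scores with a convolution along the key axis.
        a = alpha.permute(0, 2, 1, 3).reshape(B * L, self.n, L)
        r = torch.relu(self.conv(a))                              # max(0, conv(...))
        r = r.transpose(1, 2) @ self.w_r                          # (B*L, L, n)
        r = self.norm(r.transpose(1, 2))                          # norm(...)
        r = r.view(B, L, self.n, L).permute(0, 2, 1, 3)           # back to (B, n, L, L)
        b = alpha - r                                             # Eq. (7)
        heads = torch.softmax(b, dim=-1) @ v                      # Eq. (8)
        h = heads.transpose(1, 2).reshape(B, L, self.n * self.d_k)
        return self.w_n(h)                                        # Eq. (9)
```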
The output features of the multi-head attention refining module pass through a residual connection and a normalization operation, and the result is fed into a two-layer fully connected network to further extract features and enhance the expressive capacity of the model. The specific calculation is:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (10)
where the input x has dimension d_model and the output dimension of the second linear layer equals that of x; W_1 and W_2 denote the weight parameters of the first and second layers, and b_1 and b_2 their respective bias parameters.
The multimodal encoder is formed by stacking N encoding layers, each consisting of a multi-head attention refining module and a two-layer feedforward neural network.
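Continuing the sketch above (reusing the hypothetical RefinedMultiHeadAttention class), one encoding layer with residual connections, layer normalization and the FFN of Eq. (10), plus the stacked encoder, might look like this; N = 12 and d_ff = 3072 are assumed values:

```python
import torch.nn as nn

# One encoder layer; reuses RefinedMultiHeadAttention from the previous sketch.
class RefinedEncoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = RefinedMultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Eq. (10): FFN(x) = max(0, xW1 + b1)W2 + b2
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        x = self.norm1(x + self.attn(x))    # residual connection + normalization
        return self.norm2(x + self.ffn(x))  # further feature extraction via FFN

# The multimodal encoder stacks N such layers (N = 12 assumed).
encoder = nn.Sequential(*[RefinedEncoderLayer() for _ in range(12)])
```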
Decoder incorporating the attention refining module:
The task of the decoder is to generate specific outputs from the multimodal information produced by the encoder. The decoder comprises two sub-decoders, which handle the SER (semantic entity recognition) task and the RE (relation extraction) task, respectively. The SER task is essentially a classification task: at each time step, a predefined entity class label is assigned to the input sequence, i.e., the model must classify entities such as "Header", "Result" and "Reference" in a given laboratory sheet image. The SER decoder first classifies the encoder output with a fully connected layer; the predefined label set for laboratory sheet images comprises 11 entity types, each represented with BIO labels. The predictions are then combined with the OCR result: for each text, the labels of all its tokens are counted, and the most frequently predicted label is selected as the label of that text. Model performance is evaluated, and network training is supervised, by the cross-entropy loss between each predicted entity class and the ground-truth label. The RE task extracts relationships between entities from the predicted entity classes. In this embodiment, the RE task builds all possible relation pairs over the entities of a given laboratory sheet image, feeds every candidate pair into a Biaffine Attention classifier, and predicts valid relation pairs as "1" and invalid ones as "0". Through the RE decoder, the laboratory sheet image is parsed into predefined key-value pairs, such as the issuing department and the clinical diagnosis results.
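The per-text label aggregation described for the SER decoder (a majority vote over the token-level BIO predictions of each text segment) can be sketched as below; the helper name and the stripping of B-/I- prefixes are assumptions for illustration:

```python
from collections import Counter
from typing import Dict, List

# Majority-vote aggregation of token-level BIO predictions into one label per
# OCR text segment; aggregate_segment_labels is a hypothetical helper name.
def aggregate_segment_labels(token_labels: List[str],
                             token_to_segment: List[int]) -> Dict[int, str]:
    votes: Dict[int, Counter] = {}
    for label, seg in zip(token_labels, token_to_segment):
        entity = label.split("-")[-1]        # assumed: strip the B-/I- prefix
        votes.setdefault(seg, Counter())[entity] += 1
    # The most frequently predicted entity type becomes the segment's label.
    return {seg: c.most_common(1)[0][0] for seg, c in votes.items()}
```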
Step 4, extracting the form content based on the obtained text and text labels.
In the above scheme, during form content extraction, the embedding representation of the overall layout of the form document is introduced and combined with the text and its corresponding position information; label recognition of the text in the form is performed on the basis of multimodal features, the association relationships between texts in the form are obtained through label recognition, and the combination of each text and its label serves as the form content extraction result, effectively enriching the completeness of form content extraction.
In one or more further embodiments, corresponding to the above form content extraction method, an embodiment of the present invention provides a form content extraction system, including:
a data acquisition unit for acquiring a form image from which content is to be extracted;
A basic information extraction unit for extracting text and corresponding text position information based on the obtained form image;
The text label recognition unit is used for performing label recognition on the text in the form using a pre-trained text label recognition model, based on the extracted text and its corresponding position information. The text label recognition model specifically performs the following processing: constructing a text embedding representation, a position information embedding representation and a layout embedding representation based on the extracted text and its corresponding position information; obtaining a multimodal feature representation from these three embedding representations using a multimodal encoder that incorporates an attention refining module; and obtaining the text label recognition result from the multimodal feature representation using a decoder that incorporates the attention refining module. The attention refining module constructs an initial similarity matrix from the attention scores of the input data obtained by the attention mechanism, constructs a fine matrix based on the initial similarity matrix, and obtains its output from the initial similarity matrix and the fine matrix combined with the content vector obtained by the attention mechanism;
And the form content extraction unit is used for extracting the form content based on the obtained text and the text label.
It can be understood that the system of this embodiment corresponds to the method of the foregoing embodiment; its technical details have been described in detail in the first embodiment and are not repeated here.
In further embodiments, there is also provided:
An electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor; when the instructions are executed by the processor, the method described in the above embodiments is performed. For brevity, details are not repeated here.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.
The memory may include read-only memory and random access memory, and provides instructions and data to the processor; a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information on the device type.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of the above embodiments.
The method of the above embodiment may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, a detailed description is omitted here.
A computer program product comprising a computer program which when executed by a processor implements the form content extraction method.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether such functionality is implemented in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A form content extraction method, characterized by comprising:
acquiring a form image from which content is to be extracted;
extracting text and corresponding text position information based on the obtained form image;
based on the extracted text and the corresponding text position information, utilizing a pre-trained text label recognition model to realize label recognition of the text in the form;
The text label recognition model specifically performs the following processing: constructing a text embedding representation, a position information embedding representation and a layout embedding representation based on the extracted text and its corresponding position information; obtaining a multimodal feature representation from these three embedding representations using a multimodal encoder that incorporates an attention refining module; and obtaining the text label recognition result from the multimodal feature representation using a decoder that incorporates the attention refining module. The attention refining module constructs an initial similarity matrix from the attention scores of the input data obtained by the attention mechanism, constructs a fine matrix based on the initial similarity matrix, and obtains its output from the initial similarity matrix and the fine matrix combined with the content vector obtained by the attention mechanism;
and based on the obtained text and the text label, extracting the form content.
2. The form content extraction method according to claim 1, wherein the text embedding representation, position information embedding representation and layout embedding representation are constructed as follows: the extracted text is tokenized and each word is converted into an index in a vocabulary; the index serves as the one-dimensional position information embedding representation; the word vector of each token is added to the one-dimensional position information embedding representation to obtain the text embedding representation; and the layout embedding representation of the text is constructed from the width, height and boundary coordinate values of the text box from which the text was extracted.
3. The form content extraction method according to claim 1, wherein the multimodal encoder incorporating the attention refining module comprises a multi-head attention refining module and a two-layer feedforward neural network, the multi-head attention refining module being composed of multiple attention refining modules; the text embedding, position information embedding and layout embedding representations serve as the input of each attention refining module, the outputs of all attention refining modules are concatenated and linearly transformed to obtain concatenated features, and the concatenated features pass through a residual connection, normalization and a fully connected network in turn to obtain the multimodal feature representation.
4. The form content extraction method according to claim 1, wherein the decoder incorporating the attention refining module comprises a first decoder and a second decoder: based on the obtained multimodal feature representation, the category label of each text is obtained through the first decoder, and based on the obtained category labels, the relationships between texts are obtained through the second decoder.
5. The form content extraction method according to claim 3, wherein the attention refining module computes the fine matrix as:
R = reshape(norm(max(0, conv(reshape(α_t))) W_R))
where A denotes the attention score matrix, t ∈ [0, L) denotes a position, α_t denotes the attention score at position t, R denotes the fine matrix obtained after batch normalization and dimension adjustment, W_R ∈ ℝ^(d_c×n) is a trainable parameter matrix, d_c is the intermediate dimension of the convolution layer, and n denotes the number of attention heads.
6. The form content extraction method according to claim 1, wherein the text and its corresponding position information are extracted by an optical character recognition method.
7. A form content extraction system, comprising:
a data acquisition unit for acquiring a form image from which content is to be extracted;
A basic information extraction unit for extracting text and corresponding text position information based on the obtained form image;
The text label recognition unit is used for performing label recognition on the text in the form using a pre-trained text label recognition model, based on the extracted text and its corresponding position information. The text label recognition model specifically performs the following processing: constructing a text embedding representation, a position information embedding representation and a layout embedding representation based on the extracted text and its corresponding position information; obtaining a multimodal feature representation from these three embedding representations using a multimodal encoder that incorporates an attention refining module; and obtaining the text label recognition result from the multimodal feature representation using a decoder that incorporates the attention refining module. The attention refining module constructs an initial similarity matrix from the attention scores of the input data obtained by the attention mechanism, constructs a fine matrix based on the initial similarity matrix, and obtains its output from the initial similarity matrix and the fine matrix combined with the content vector obtained by the attention mechanism;
And the form content extraction unit is used for extracting the form content based on the obtained text and the text label.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the form content extraction method of any one of claims 1-6 when executing the program.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the form content extraction method according to any of claims 1-6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the form content extraction method of any one of claims 1-6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411417968.4A CN119181106A (en) | 2024-10-11 | 2024-10-11 | Form content extraction method and system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411417968.4A CN119181106A (en) | 2024-10-11 | 2024-10-11 | Form content extraction method and system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN119181106A | 2024-12-24 |
Family
ID=93896004
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411417968.4A Pending CN119181106A (en) | 2024-10-11 | 2024-10-11 | Form content extraction method and system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119181106A (en) |
- 2024-10-11: Application CN202411417968.4A filed in China; published as CN119181106A, status pending.
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10956673B1 (en) | Method and system for identifying citations within regulatory content | |
| CN108804530B (en) | Add subtitles to areas of the image | |
| CN112185520B (en) | Text structuring processing system and method for medical pathology report picture | |
| CN116304307A (en) | A graphic-text cross-modal retrieval network training method, application method and electronic device | |
| CN115881265B (en) | Method, system, equipment and storage medium for quality control of electronic medical records intelligent medical records | |
| CN114639109B (en) | Image processing method, device, electronic device and storage medium | |
| CN116611024A (en) | A Multimodal Irony Detection Method Based on Fact and Sentiment Opposition | |
| CN117079298A (en) | Information extraction method, training method of information extraction system and information extraction system | |
| CN118468996A (en) | A method for constructing a multimodal teaching knowledge graph based on medical imaging reports | |
| Jin et al. | A cross-modal deep metric learning model for disease diagnosis based on chest x-ray images | |
| CN119888761A (en) | Medical document semantic entity identification method and system based on spatial semantic enhancement | |
| CN119128076B (en) | A judicial case retrieval method and system based on course learning | |
| CN115205877A (en) | Irregular typesetting invoice document layout prediction method and device and storage medium | |
| CN119540971A (en) | Method, system, device and medium for financial form recognition | |
| KR20250031130A (en) | Method, electronic device and program for building formulation database automatically using artificial intelligence | |
| CN117115824B (en) | A visual text detection method based on stroke region segmentation strategy | |
| CN116797656B (en) | A 3D visual positioning method and system with relative position perception capability | |
| CN116227491B (en) | Nested named entity semantic enhancement method and system based on edge gradient | |
| CN118132746A (en) | Logically integrated explanatory visual question answering, device, electronic device and storage medium | |
| CN119181106A (en) | Form content extraction method and system | |
| Cheng et al. | Understanding document images by introducing explicit semantic information and short-range information interaction | |
| CN116012646A (en) | Image data labeling method, electronic device and computer readable medium | |
| Xiong et al. | Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation | |
| Bi et al. | SCC3: A Novel Structure-Connected Cognition Cube Network for Manchu Word Recognition | |
| US12300008B1 (en) | Sensitive pattern recognition of images, numbers and text |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |