CN116895072A - A table parsing method, device, equipment and storage medium

Info

Publication number: CN116895072A
Application number: CN202310921820.3A
Authority: CN
Inventors: 肖雪丽; 廖常辉; 谢洁芳; 廖旭明; 邵向潮; 李惠仪; 冷颖雄; 周彦吉; 叶海珍; 邓茵; 刘贯科; 钟荣富; 戴喜良
Original assignee: Guangdong Power Grid Co Ltd; Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Current assignee: Guangdong Power Grid Co Ltd; Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date: 2023-07-26
Filing date: 2023-07-26
Publication date: 2023-10-17
Anticipated expiration: 2043-07-26

Abstract

The invention discloses a table parsing method, device, equipment and storage medium. The method comprises: receiving image data whose content includes a table; extracting, from the image data, geometric position features of text blocks, appearance features of cells, and content features of text information; separating visual features on rows and columns from the appearance features; extracting semantic features from the geometric position features and the content features, respectively, as appearance semantic features and content semantic features; learning, according to the visual features, the appearance semantic features and the content semantic features, a first collaboration block representing the relationship between rows and columns, a second collaboration block representing the relationship between cells and text boxes, and a third collaboration block representing the relationship between text boxes and text information; fusing the first, second and third collaboration blocks into a collaborative graph encoding; and identifying structural information in the table according to the collaborative graph encoding. Shallow image representation information is thereby combined with the semantic information inside the cells, which improves the accuracy of table recognition.

Description

A table parsing method, device, equipment and storage medium

Technical field

The present invention relates to the technical field of computer vision, and in particular to a table parsing method, device, equipment and storage medium.

Background

Power grid archives come from a wide range of sources and are of many types. Production and construction activities generate a large number of construction project archive files, most of which store information in tables. Whenever archives are reviewed, the division table must undergo a consistency review: the file names corresponding to the division table are looked up through the file directory table, and each project name marked with "√" in the division table is then checked, one by one, for presence in the file directory.

To improve office efficiency, scripts are currently used to automate data analysis, modification, mining and visualization of table contents, so that the file names corresponding to the division table can be looked up through the file directory table and the project names can finally be checked against the file directory.

At present, table structure recognition is mainly based on deep learning and includes bottom-up methods, top-down methods, and image-to-text generation methods. Bottom-up methods first detect table cells and text blocks and then classify the relationships between cells. Top-down methods first segment the table into rows and columns and then perform operations such as merging cells. Image-to-text generation methods directly generate, from the table image, the sequence text corresponding to the table structure.

However, division tables in power grid production and construction archives often have complex structures and hundreds of cells. Parsing their structure is prone to confused merge relationships and incorrect row and column segmentation, which in turn leaves the affected cells with incomplete content or semantic errors. Missing even a few cells is likely to cause large-scale errors in the parsed table structure, so that the project names to be reviewed and compared in the division table are not fully recognized and some are omitted.

Summary of the invention

The present invention provides a table parsing method, device, equipment and storage medium, to solve the problem of how to improve the accuracy of parsing the structure of a table.

According to one aspect of the present invention, a table parsing method is provided, including:

receiving image data whose content includes a table;

extracting, from the image data, geometric position features of text blocks, appearance features of cells, and content features of text information, respectively;

separating visual features on rows and columns from the appearance features;

extracting semantic features from the geometric position features and the content features, respectively, as appearance semantic features and content semantic features;

learning, according to the visual features, the appearance semantic features and the content semantic features, a first collaboration block representing the relationship between rows and columns, a second collaboration block representing the relationship between cells and text boxes, and a third collaboration block representing the relationship between text boxes and text information;

fusing the first collaboration block, the second collaboration block and the third collaboration block into a collaborative graph encoding;

identifying structural information in the table according to the collaborative graph encoding.

Optionally, extracting the geometric position features of text blocks, the appearance features of cells, and the content features of text information from the image data includes:

loading a first fully connected layer, a residual network, a second fully connected layer, a word vector model, and a convolutional layer;

performing optical character recognition on the image data to obtain text blocks, where the text blocks have position information;

inputting the position information of the text blocks into the first fully connected layer for mapping, to obtain the geometric position features;

inputting the image data into the residual network to extract image features;

performing a region-of-interest aggregation operation on the image features, to obtain features representing the text boxes corresponding to the cells;

inputting the features of the text boxes into the second fully connected layer for mapping, to obtain the appearance features of the cells;

inputting the text information corresponding to the text boxes into the word vector model for encoding, to obtain text vectors;

inputting the text vectors into the convolutional layer to perform a convolution operation, to obtain the content features of the text information.

Optionally, separating the visual features on rows and columns from the appearance features includes:

loading a split-aggregation module and a long short-term memory network;

inputting the appearance features into the split-aggregation module to extract fused features on rows and columns;

extracting, from the fused features, a first separated feature on rows and a second separated feature on columns;

inputting the first separated feature and the second separated feature into the long short-term memory network, where they are fused into the visual features on rows and columns.

Optionally, extracting semantic features from the geometric position features and the content features, respectively, as appearance semantic features and content semantic features includes:

loading a self-semantic extractor, which contains a multi-head attention mechanism linked by residual connections;

constructing the geometric position features as a first directed graph;

inputting the first directed graph into the self-semantic extractor and extracting semantic features through the multi-head attention mechanism, as the appearance semantic features;

constructing the content features as a second directed graph;

inputting the second directed graph into the self-semantic extractor and extracting semantic features through the multi-head attention mechanism, as the content semantic features.

Optionally, learning, according to the visual features, the appearance semantic features and the content semantic features, the first collaboration block representing the relationship between rows and columns, the second collaboration block representing the relationship between cells and text boxes, and the third collaboration block representing the relationship between text boxes and text information includes:

loading a cross-context synthesizer, which contains multiple multi-head attention mechanisms;

inputting the visual features into the cross-context synthesizer and learning, through the multiple multi-head attention mechanisms in parallel, the first collaboration block representing the relationship between rows and columns;

inputting the appearance semantic features and the content semantic features into the cross-context synthesizer and learning, through the multiple multi-head attention mechanisms in parallel, the second collaboration block representing the relationship between cells and text boxes and the third collaboration block representing the relationship between text boxes and text information.

Optionally, identifying the structural information in the table according to the collaborative graph encoding includes:

loading a structure prediction network, which has multiple prediction blocks;

inputting the collaborative graph encoding into the multiple prediction blocks in sequence for processing, to identify the structural information in the table.

Optionally, there are three prediction blocks, and each prediction block has a third fully connected layer, a fourth fully connected layer and an activation layer.

According to another aspect of the present invention, a table parsing device is provided, including:

an image data receiving module, configured to receive image data whose content includes a table;

a feature extraction module, configured to extract, from the image data, geometric position features of text blocks, appearance features of cells, and content features of text information, respectively;

a visual feature separation module, configured to separate visual features on rows and columns from the appearance features;

a semantic feature recognition module, configured to extract semantic features from the geometric position features and the content features, respectively, as appearance semantic features and content semantic features;

a collaboration block generation module, configured to learn, according to the visual features, the appearance semantic features and the content semantic features, a first collaboration block representing the relationship between rows and columns, a second collaboration block representing the relationship between cells and text boxes, and a third collaboration block representing the relationship between text boxes and text information;

a collaborative graph encoding generation module, configured to fuse the first collaboration block, the second collaboration block and the third collaboration block into a collaborative graph encoding;

a structural information generation module, configured to identify the structural information in the table according to the collaborative graph encoding.

Optionally, the feature extraction module is further configured to:

Optionally, the visual feature separation module is further configured to:

Optionally, the semantic feature recognition module is further configured to:

Optionally, the collaboration block generation module is further configured to:

Optionally, the structural information generation module is further configured to:

Illustratively, there are three prediction blocks, and each prediction block has a third fully connected layer, a fourth fully connected layer and an activation layer.

According to another aspect of the present invention, an electronic device is provided, the electronic device including:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the table parsing method described in any embodiment of the present invention.

According to another aspect of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the table parsing method described in any embodiment of the present invention.

In this embodiment, image data whose content includes a table is received; geometric position features of text blocks, appearance features of cells, and content features of text information are extracted from the image data; visual features on rows and columns are separated from the appearance features; semantic features are extracted from the geometric position features and the content features, respectively, as appearance semantic features and content semantic features; a first collaboration block representing the relationship between rows and columns, a second collaboration block representing the relationship between cells and text boxes, and a third collaboration block representing the relationship between text boxes and text information are learned according to the visual features, the appearance semantic features and the content semantic features; the three collaboration blocks are fused into a collaborative graph encoding; and the structural information in the table is identified according to the collaborative graph encoding. This embodiment can generate a context for each modality of the table, and thereby fuse and modulate the inter-modality interaction information of different tables. By stacking contexts of different dimensions multiple times, intra-modality context generation and inter-modality collaboration alternate in a hierarchical manner, so that intra-modality interactions are generated continuously from the lower layers to the top layers. In other words, the low-level and high-level context information of the multiple modalities can cooperate throughout the network. The network therefore learns sufficient shallow image representation information while also making good use of the semantic information of the cell contents, which greatly improves the accuracy of identifying the structural information of the table.

It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present invention, nor to limit the scope of the present invention. Other features of the present invention will become readily understood from the following description.

Description of the drawings

To explain the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a flow chart of a table parsing method provided according to Embodiment 1 of the present invention;

Figure 2 is a schematic structural diagram of a neural collaborative graph machine provided according to Embodiment 1 of the present invention;

Figure 3 is a schematic structural diagram of a self-semantic extractor and a cross-context synthesizer provided according to Embodiment 1 of the present invention;

Figure 4 is a schematic structural diagram of a table parsing device provided according to Embodiment 2 of the present invention;

Figure 5 is a schematic structural diagram of an electronic device implementing Embodiment 3 of the present invention.

Detailed description

To help those skilled in the art better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the invention described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "include" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

Embodiment 1

Figure 1 is a flow chart of a table parsing method provided in Embodiment 1 of the present invention. The method can be executed by a table parsing device, which can be implemented in hardware and/or software and can be configured in an electronic device. As shown in Figure 1, the method includes:

Step 101: Receive image data whose content includes a table.

In this embodiment, a neural collaborative graph machine (NCGM) is proposed. As shown in Figure 2, the various archives in the power grid that contain tables (such as division tables) are converted into image data, which may also be called table image data (Table Image).

A table contains multiple cells, and the cells contain text information. In general, neither the distribution of the cells nor the text information follows any obvious pattern.

The image data is preprocessed, for example by pagination, and then input into the neural collaborative graph machine (NCGM) for processing.

Step 102: Extract, from the image data, the geometric position features of text blocks, the appearance features of cells, and the content features of text information, respectively.

In this embodiment, as shown in Figure 2, three-way feature extraction (Feature extraction) can be performed on the image data: one path extracts the geometric position features of text blocks, another extracts the appearance features of cells, and a third extracts the content features of text information.

A text block is a region in which text information is aggregated.

In a specific implementation, a d-dimensional first fully connected layer (FC), a residual network (ResNet, such as ResNet18), a d-dimensional second fully connected layer, a word vector model (such as word2vec), and a convolutional layer can be loaded.

Optical character recognition (OCR) is performed on the image data to obtain text blocks, where the text blocks have position information.

The position information of the text blocks is input into the first fully connected layer for mapping, to obtain the geometric position features (Geometry).

The image data is input into the residual network to extract image features.

A region-of-interest aggregation operation (RoIAlign) is performed on the image features, to obtain features representing the text boxes corresponding to the cells.

The features of the text boxes are input into the second fully connected layer for mapping, to obtain the appearance features of the cells (Appearance fseg).

The text information corresponding to the text boxes is input into the word vector model (such as word2vec), which encodes it into a distributed space, to obtain text vectors.

The text vectors are input into the convolutional layer, and a convolution with a kernel of size 7×1×d and a stride of 1 is performed, to obtain the content features (Content) of the text information.
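As a rough illustration of the geometry and content paths above, the following NumPy sketch maps text-block boxes through a fully connected layer and slides a width-7, stride-1 kernel over a sequence of word vectors. The weights are random stand-ins for trained parameters, and `d = 16`, the four-number box format, the number of output filters and the final max-pooling are assumptions made for illustration, not details given in the description. The ResNet and RoIAlign appearance path is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding width; hypothetical value chosen for the example


def fully_connected(x, weight, bias):
    """Affine map: (N, d_in) @ (d_in, d) + (d,) -> (N, d)."""
    return x @ weight + bias


def conv1d_over_words(word_vectors, kernel):
    """Valid 1-D convolution along the word axis with stride 1.

    word_vectors: (T, d_in); kernel: (k, d_in, d_out) -> (T - k + 1, d_out)
    """
    k = kernel.shape[0]
    steps = word_vectors.shape[0] - k + 1
    out = np.empty((steps, kernel.shape[2]))
    for t in range(steps):
        window = word_vectors[t:t + k]                    # (k, d_in)
        out[t] = np.einsum("ki,kio->o", window, kernel)   # sum over window and input channels
    return out


# Geometry path: normalized (x1, y1, x2, y2) boxes of three OCR text blocks (assumed format).
boxes = np.array([
    [0.05, 0.10, 0.30, 0.15],
    [0.35, 0.10, 0.60, 0.15],
    [0.05, 0.20, 0.30, 0.25],
])
w_geo, b_geo = rng.normal(size=(4, d)), np.zeros(d)
geometry = fully_connected(boxes, w_geo, b_geo)           # (3, d)

# Content path: ten word vectors of one text box, stand-ins for word2vec output.
word_vectors = rng.normal(size=(10, d))
kernel = rng.normal(size=(7, d, d))                       # width 7, d input channels, d filters (assumed)
content_steps = conv1d_over_words(word_vectors, kernel)   # (4, d)
content = content_steps.max(axis=0)                       # max-pool over positions (assumption)
```

With ten word vectors and a kernel of width 7, the stride-1 convolution yields four positions per text box, which the sketch pools into one d-dimensional content vector.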

Step 103: Separate visual features on rows and columns from the appearance features.

In this embodiment, high-level feature extraction can be performed on the appearance features, separating out the visual features on rows and columns.

In a specific implementation, as shown in Figure 2, a split-aggregation module (Split-aggregation Module) and a long short-term memory network (LSTM module) can be loaded.

The appearance features are input into the split-aggregation module to extract fused features on rows and columns. Using correlation relationships, a first separated feature on rows and a second separated feature on columns are then extracted from the fused features.

The first separated feature and the second separated feature are input into the long short-term memory network for feature enhancement, and are fused into the visual features on rows and columns.

Step 104: Extract semantic features from the geometric position features and the content features, respectively, as appearance semantic features and content semantic features.

In this embodiment, high-level feature extraction can be performed on the geometric position features and the content features. The semantic features extracted from the geometric position features are recorded as appearance semantic features, and the semantic features extracted from the content features are recorded as content semantic features.

In a specific implementation, as shown in Figure 2, a self-semantic extractor (ECE (Ego Context Extractor) module) is loaded.

As shown in Figure 3, each feature modality input to the ECE is constructed as a separate directed graph $\tilde{G} = \{v, \varepsilon\} \in \{G^{G}, G^{A}, G^{C}\}$. In each decoupled modality of the graph, the embedding of each text segment bounding box is treated as a node, $X = \{x_1, x_2, \ldots, x_N\} \in v$, and the nodes are connected to one another by edges $\varepsilon \in v \times v$. In the constructed directed graph, each node can be an anchor or part of the context of other nodes. For graph representations, a CNN (convolutional neural network) has a strong inductive bias and may not be the best choice. To address this, the self-semantic extractor ECE proposed in this embodiment contains a multi-head attention mechanism (Multi-head Attention, MHA) linked by residual connections. The multi-head attention mechanism aggregates the fully connected graph information of all three modalities. It makes few assumptions about its input and can learn to combine local behavior with global information according to the input content, so it handles structure parsing in large, complex tables better.

Further, a memory compression module is introduced into the MHA, in which case the MHA is also called CMHA (Compressed Multi-head Attention):

$$\mathrm{MC}(H) = \mathrm{Norm}(\mathrm{Reshape}(x, \epsilon)\, W^{h})$$

CMHA can compress the amount of "memory" to match the amount of the query (Q). This reduces the number of image pixels involved and greatly lowers the computational complexity of the multi-head attention operation.

In addition, in this embodiment the CMHA is equipped with residual connections, so that the query information is passed on well. This can be defined as:

$$P = \mathrm{MHA}(Q, \mathrm{MC}(K), \mathrm{MC}(V)),$$

where "FFN(·)" is the feed-forward layer and "Add&Norm(·)" denotes element-wise addition followed by layer normalization.
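The compressed attention described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the embodiment's implementation: the compression ratio, the head count and the random weight `w_mc` are hypothetical, the learned query, key and value projections and the feed-forward layer are omitted, and the final residual step is one plausible reading of "Add&Norm".

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def memory_compress(h, ratio, w):
    """Norm(Reshape(h, ratio) W): fold `ratio` rows into one, then project back to width d."""
    n, d = h.shape
    assert n % ratio == 0, "row count must be divisible by the compression ratio"
    reshaped = h.reshape(n // ratio, d * ratio)   # (n / ratio, d * ratio)
    return layer_norm(reshaped @ w)               # (n / ratio, d)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_attention(q, k, v, num_heads):
    """Scaled dot-product attention over `num_heads` channel groups (no learned projections)."""
    n_q, d = q.shape
    head_dim = d // num_heads

    def split(x):
        return x.reshape(x.shape[0], num_heads, head_dim).transpose(1, 0, 2)

    qh, kh, vh = split(q), split(k), split(v)
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(head_dim)   # (heads, n_q, n_k)
    out = softmax(scores) @ vh                                # (heads, n_q, head_dim)
    return out.transpose(1, 0, 2).reshape(n_q, d)


rng = np.random.default_rng(0)
n, d, ratio, heads = 8, 8, 2, 2            # hypothetical sizes
nodes = rng.normal(size=(n, d))            # node embeddings of one modality graph
w_mc = rng.normal(size=(d * ratio, d))     # stand-in for the compression weight

compressed_k = memory_compress(nodes, ratio, w_mc)
compressed_v = memory_compress(nodes, ratio, w_mc)
p = multi_head_attention(nodes, compressed_k, compressed_v, heads)  # P = MHA(Q, MC(K), MC(V))
out = layer_norm(nodes + p)                # residual add and norm (assumed form)
```

In this sketch the attention score matrix per head shrinks from 8×8 to 8×4, which is where the saving in computation comes from.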

On the one hand, the geometric position features can be constructed as a first directed graph, which is input into the self-semantic extractor; semantic features are extracted through the multi-head attention mechanism as the appearance semantic features.

On the other hand, the content features can be constructed as a second directed graph, which is input into the self-semantic extractor; semantic features are extracted through the multi-head attention mechanism as the content semantic features.

Step 105: According to the visual features, the appearance semantic features, and the content semantic features, learn a first collaboration block representing the relationship between rows and columns, a second collaboration block representing the relationship between cells and text boxes, and a third collaboration block representing the relationship between text boxes and text information.

In this embodiment, the first collaboration block between rows and columns can be learned from the visual features, while the second collaboration block between cells and text boxes and the third collaboration block between text boxes and text information can be learned from the appearance semantic features and the content semantic features.

In a specific implementation, as shown in Figure 2, a cross-context synthesizer (Cross Context Synthesizer, CCS; the inter-modality stage) can be loaded.

As shown in Figure 3, the cross-context synthesizer contains multiple multi-head attention mechanisms.

In each collaboration block (that is, the first, second, and third collaboration blocks), the extracted feature embeddings are built into context graphs, which the ECE processes separately to form "modality flows"; this embodiment fuses them collaboratively and learns the collaboration patterns between the modalities. The CCS has three parallel CMHAs. Each CMHA receives the three modalities, using one modality as the query while the other two jointly serve as the key K and the value V. In other words, the query modality mines useful information from the other two modalities, so that the CCS selectively fuses the individual context information of the different modalities into modality interactions that are maintained in the "modality flows".

As shown in Figure 2, the visual features are input into the cross-context synthesizer, and the first collaboration block Ap, representing the relationship between rows and columns, is learned through the multiple multi-head attention mechanisms in parallel.

The appearance semantic features and the content semantic features are input into the cross-context synthesizer, and the second collaboration block Ge, representing the relationship between cells and text boxes, and the third collaboration block Co, representing the relationship between text boxes and text information, are learned through the multiple multi-head attention mechanisms in parallel.
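
The query/key-value arrangement described above can be sketched as three parallel cross-attention branches. This is a hedged illustration, not the embodiment's implementation: it uses single-head attention without learned projections, assumes all three feature streams share one dimension, and the pairing of each stream with the block names Ap, Ge, and Co is an assumption made for readability.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query_feats, context_feats):
    """Each query vector attends over all context vectors (used as K and V)."""
    d = len(query_feats[0])
    out = []
    for q in query_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in context_feats]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, context_feats))
                    for c in range(d)])
    return out

def cross_context_synthesizer(visual, appearance_sem, content_sem):
    """Three parallel branches: one modality queries, the other two
    are concatenated along the element axis to act as keys and values."""
    return {
        "Ap": cross_attention(visual, appearance_sem + content_sem),
        "Ge": cross_attention(appearance_sem, visual + content_sem),
        "Co": cross_attention(content_sem, visual + appearance_sem),
    }
```

Because each output is a softmax-weighted average of the context vectors, a branch can only carry information that the other two modalities contain, which matches the idea that the query modality mines the others.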

Step 106: Fuse the first collaboration block, the second collaboration block, and the third collaboration block into a collaborative graph encoding.

In this embodiment, as shown in Figure 2, the first collaboration block Ap, the second collaboration block Ge, and the third collaboration block Co can be fused into the collaborative graph encoding (collaborative graph embeddings) through a function such as concat. On this basis, the i-th element and the j-th element are taken as a pair and concatenated along the channel direction to form a vector.
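
A minimal sketch of this fusion and pairing, under the assumption that the three blocks are aligned (one vector per element, in the same order) and that ordered pairs with i ≠ j are formed; the helper names are hypothetical.

```python
def fuse_blocks(ap, ge, co):
    """Concatenate the three blocks channel-wise, one fused vector per element."""
    return [a + g + c for a, g, c in zip(ap, ge, co)]

def pair_vectors(fused):
    """For each ordered pair (i, j) with i != j, concatenate element i and
    element j along the channel direction."""
    n = len(fused)
    return {(i, j): fused[i] + fused[j]
            for i in range(n) for j in range(n) if i != j}
```

For example, with two elements whose fused vectors are `[1, 3, 5]` and `[2, 4, 6]`, the pair `(0, 1)` yields the six-channel vector `[1, 3, 5, 2, 4, 6]`.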

Step 107: Identify the structural information in the table according to the collaborative graph encoding.

In this embodiment, as shown in Figure 2, structure prediction can be performed on the collaborative graph encoding to obtain the structural information in the table, for example the position information of each cell (Cell), the row index (Row) and column index (Col) of each cell, and so on, thereby completing the structural parsing of the table.

In a specific implementation, a structure prediction network having multiple prediction blocks can be loaded.

A prediction block is an encapsulation of a deep learning structure used for prediction. The prediction blocks may have the same structure or different structures, which this embodiment does not limit.

For example, there are three prediction blocks, each having a third fully connected layer, a fourth fully connected layer, and an activation layer (such as a softmax layer). Within each prediction block, processing proceeds in order: the third fully connected layer performs a fully connected operation, the fourth fully connected layer performs a fully connected operation, and the activation layer performs an activation operation.

The collaborative graph encoding is input into the multiple prediction blocks in turn for processing, so as to identify the structural information in the table.
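
The prediction block described above (fully connected, fully connected, softmax) could look roughly like the sketch below. The randomly initialized weights, the layer sizes, the two-class output, and the assignment of the three blocks to cell, row, and column relations are all assumptions for illustration; the source does not specify them.

```python
import math
import random

def linear(x, weight, bias):
    """Fully connected operation: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

class PredictionBlock:
    """Third FC layer -> fourth FC layer -> softmax, in that order."""

    def __init__(self, in_dim, hidden_dim, n_classes, seed=0):
        rng = random.Random(seed)  # untrained placeholder weights
        self.w3 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b3 = [0.0] * hidden_dim
        self.w4 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                   for _ in range(n_classes)]
        self.b4 = [0.0] * n_classes

    def __call__(self, x):
        hidden = linear(x, self.w3, self.b3)
        logits = linear(hidden, self.w4, self.b4)
        return softmax(logits)

# Hypothetical use: one block per relation, each fed the same pair vector.
blocks = {name: PredictionBlock(in_dim=6, hidden_dim=4, n_classes=2, seed=i)
          for i, name in enumerate(["cell", "row", "col"])}
pair_vector = [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]
predictions = {name: block(pair_vector) for name, block in blocks.items()}
```

Each entry of `predictions` is a probability distribution over the assumed two classes (for instance, "same row" versus "different row").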

Embodiment 2

Figure 4 is a schematic structural diagram of a table parsing device provided in Embodiment 2 of the present invention. As shown in Figure 4, the device includes:

an image data receiving module 401, configured to receive image data containing a table;

a feature extraction module 402, configured to extract, from the image data, geometric position features of text blocks, appearance features of cells, and content features of text information, respectively;

a visual feature separation module 403, configured to separate visual features on rows and columns from the appearance features;

a semantic feature recognition module 404, configured to extract semantic features from the geometric position features and the content features, respectively, as appearance semantic features and content semantic features;

a collaboration block generation module 405, configured to learn, according to the visual features, the appearance semantic features, and the content semantic features, a first collaboration block representing the relationship between rows and columns, a second collaboration block representing the relationship between cells and text boxes, and a third collaboration block representing the relationship between text boxes and text information;

a collaborative graph encoding generation module 406, configured to fuse the first collaboration block, the second collaboration block, and the third collaboration block into a collaborative graph encoding;

a structural information generation module 407, configured to identify the structural information in the table according to the collaborative graph encoding.

In one embodiment of the present invention, the feature extraction module 402 is further configured to:

In one embodiment of the present invention, the visual feature separation module 403 is further configured to:

In one embodiment of the present invention, the semantic feature recognition module 404 is further configured to:

In one embodiment of the present invention, the collaboration block generation module 405 is further configured to:

In one embodiment of the present invention, the structural information generation module 407 is further configured to:

The table parsing device provided by the embodiment of the present invention can execute the table parsing method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to executing the table parsing method.
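
To make the data flow between modules 401 to 407 concrete, the following sketch wires seven caller-supplied callables into one pipeline. The class name, field names, and signatures are hypothetical; the stages themselves are placeholders to be filled with real models.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TableParsingDevice:
    receive: Callable[[Any], Any]                   # module 401
    extract_features: Callable[[Any], tuple]        # module 402
    separate_visual: Callable[[Any], Any]           # module 403
    extract_semantics: Callable[[Any, Any], tuple]  # module 404
    build_blocks: Callable[[Any, Any, Any], tuple]  # module 405
    fuse: Callable[[Any, Any, Any], Any]            # module 406
    predict_structure: Callable[[Any], Any]         # module 407

    def parse(self, raw: Any) -> Any:
        image = self.receive(raw)
        geometry, appearance, content = self.extract_features(image)
        visual = self.separate_visual(appearance)
        appearance_sem, content_sem = self.extract_semantics(geometry, content)
        ap, ge, co = self.build_blocks(visual, appearance_sem, content_sem)
        encoding = self.fuse(ap, ge, co)
        return self.predict_structure(encoding)
```

Note that the geometric position features and content features bypass module 403 and go straight to module 404, while only the appearance features pass through the row/column separation.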

Embodiment 3

Figure 5 shows a schematic structural diagram of an electronic device 10 that can be used to implement embodiments of the present invention. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (for example, helmets, glasses, watches), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementation of the invention described and/or claimed herein.

As shown in Figure 5, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 can perform various appropriate actions and processing according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.

Multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or a mouse; an output unit 17, such as various types of displays and speakers; a storage unit 18, such as a magnetic disk or an optical disc; and a communication unit 19, such as a network card, a modem, or a wireless communication transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks.

The processor 11 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The processor 11 executes the methods and processes described above, such as the table parsing method.

In some embodiments, the table parsing method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the table parsing method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the table parsing method in any other suitable manner (for example, by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Computer programs for implementing the method of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the computer programs, when executed by the processor, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. A computer program may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by, or in combination with, an instruction execution system, apparatus, or device. The computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having: a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the electronic device. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual, auditory, or tactile feedback), and input from the user may be received in any form (including acoustic, speech, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes back-end components (for example, as a data server), or a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.

A computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the difficult management and weak business scalability of traditional physical hosts and VPS services.

Embodiment 4

An embodiment of the present invention further provides a computer program product. The computer program product includes a computer program that, when executed by a processor, implements the table parsing method provided by any embodiment of the present invention.

In implementing the computer program product, the computer program code for performing the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In situations involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

It should be understood that the various forms of flow shown above may be used with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of the present invention can be achieved; no limitation is imposed herein.

The above specific embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A table parsing method, characterized by comprising:

receiving image data containing a table;

extracting, from the image data, geometric position features of text blocks, appearance features of cells, and content features of text information, respectively;

separating visual features on rows and columns from the appearance features;

extracting semantic features from the geometric position features and the content features, respectively, as appearance semantic features and content semantic features;

learning, according to the visual features, the appearance semantic features, and the content semantic features, a first collaboration block representing the relationship between rows and columns, a second collaboration block representing the relationship between cells and text boxes, and a third collaboration block representing the relationship between text boxes and text information;

fusing the first collaboration block, the second collaboration block, and the third collaboration block into a collaborative graph encoding; and

identifying structural information in the table according to the collaborative graph encoding.

2. The method according to claim 1, characterized in that extracting, from the image data, the geometric position features of text blocks, the appearance features of cells, and the content features of text information, respectively, comprises:

loading a first fully connected layer, a residual network, a second fully connected layer, a word vector model, and a convolutional layer;

performing optical character recognition on the image data to obtain text blocks, the text blocks having position information;

inputting the position information of the text blocks into the first fully connected layer for mapping to obtain the geometric position features;

inputting the image data into the residual network to extract image features;

performing a region-of-interest aggregation operation on the image features to obtain features representing the text boxes corresponding to the cells;

inputting the features of the text boxes into the second fully connected layer for mapping to obtain the appearance features of the cells;

inputting the text information corresponding to the text boxes into the word vector model for encoding to obtain text vectors; and

inputting the text vectors into the convolutional layer to perform a convolution operation to obtain the content features of the text information.

3. The method according to claim 1, characterized in that separating the visual features on rows and columns from the appearance features comprises:

loading a separation-aggregation module and a long short-term memory network;

inputting the appearance features into the separation-aggregation module to extract fused features on rows and columns;

extracting a first separated feature on rows and a second separated feature on columns from the fused features; and

inputting the first separated feature and the second separated feature into the long short-term memory network to fuse them into the visual features on rows and columns.

4. The method according to claim 1, characterized in that extracting semantic features from the geometric position features and the content features, respectively, as the appearance semantic features and the content semantic features comprises:

loading a self-semantic extractor, the self-semantic extractor being provided with multi-head attention mechanisms associated through residual connections;

constructing the geometric position features as a first directed graph;

inputting the first directed graph into the self-semantic extractor, and extracting semantic features through the multi-head attention mechanisms as the appearance semantic features;

constructing the content features as a second directed graph; and

inputting the second directed graph into the self-semantic extractor, and extracting semantic features through the multi-head attention mechanisms as the content semantic features.

5. The method according to claim 1, characterized in that learning, according to the visual features, the appearance semantic features, and the content semantic features, the first collaboration block representing the relationship between rows and columns, the second collaboration block representing the relationship between cells and text boxes, and the third collaboration block representing the relationship between text boxes and text information comprises:

loading a cross-context synthesizer, the cross-context synthesizer being provided with multiple multi-head attention mechanisms;

inputting the visual features into the cross-context synthesizer, and learning the first collaboration block representing the relationship between rows and columns through the multiple multi-head attention mechanisms in parallel; and

inputting the appearance semantic features and the content semantic features into the cross-context synthesizer, and learning, through the multiple multi-head attention mechanisms in parallel, the second collaboration block representing the relationship between cells and text boxes and the third collaboration block representing the relationship between text boxes and text information.

6. The method according to any one of claims 1-5, characterized in that identifying the structural information in the table according to the collaborative graph encoding comprises:

loading a structure prediction network having multiple prediction blocks; and

inputting the collaborative graph encoding into the multiple prediction blocks in turn for processing, so as to identify the structural information in the table.

7. The method according to claim 6, characterized in that the number of the prediction blocks is three, and each prediction block has a third fully connected layer, a fourth fully connected layer and an activation layer.

8. A table parsing device, characterized by comprising:

an image data receiving module, configured to receive image data containing a table;

a feature extraction module, configured to extract, from the image data, geometric position features of text blocks, appearance features of cells, and content features of text information, respectively;

a visual feature separation module, configured to separate visual features on rows and columns from the appearance features;

a semantic feature recognition module, configured to extract semantic features from the geometric position features and the content features, respectively, as appearance semantic features and content semantic features;

a collaboration block generation module, configured to learn, according to the visual features, the appearance semantic features, and the content semantic features, a first collaboration block representing the relationship between rows and columns, a second collaboration block representing the relationship between cells and text boxes, and a third collaboration block representing the relationship between text boxes and text information;

a collaborative graph encoding generation module, configured to fuse the first collaboration block, the second collaboration block, and the third collaboration block into a collaborative graph encoding; and

a structural information generation module, configured to identify the structural information in the table according to the collaborative graph encoding.

9. An electronic device, characterized in that the electronic device comprises:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the table parsing method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements the table parsing method according to any one of claims 1-7.