CN116578706A

CN116578706A - Method for labeling waybill addresses using text proximity and address collision algorithms

Info

Publication number: CN116578706A
Application number: CN202310562329.6A
Authority: CN
Inventors: 朱晶熙; 马山虎
Original assignee: Shanghai Zhongtongji Network Technology Co Ltd
Current assignee: Shanghai Zhongtongji Network Technology Co Ltd
Priority date: 2023-05-18
Filing date: 2023-05-18
Publication date: 2023-08-11

Abstract

This application relates to the technical field of data labeling, and in particular to a method for labeling waybill addresses using text proximity and address collision algorithms. The method includes: obtaining a waybill image and performing text recognition on it to obtain the text content and the coordinate information of each piece of text content; determining, based on the text content and its coordinate information, the proximity value of each character in the text content; grouping the characters according to their proximity values to obtain an array of text fragments; comparing the text fragment array with a preset address comparison library and determining the text fragment array with the highest similarity as the address text; and taking the four-corner coordinates of the address text as the address label value of the waybill image. The technical solution can replace manual work by automatically labeling addresses in waybill images, greatly improving labeling efficiency.

Description

Method for labeling waybill addresses using text proximity and address collision algorithms

Technical Field

The present application relates to the technical field of data labeling, and in particular to a method for labeling waybill addresses using text proximity and address collision algorithms.

Background

Machine learning data labeling is the process of annotating metadata such as text and images; the labeled data is then used to train machine learning models. Different data labeling types suit different labeling scenarios, and different labeling scenarios serve different machine learning applications.

Waybill address data labeling is a form of text classification labeling. Text classification and content classification refer to the task of assigning predefined categories to documents; sentences or paragraphs in a document, such as waybill address data, can be labeled by topic.

When labeling waybills for machine learning, and especially when labeling addresses, which carry complex information, span multiple lines, and are not purely numeric or alphabetic, a large amount of manual labeling effort is often required. Selecting the address region by hand with a labeling tool is both inefficient and a waste of manpower.

Summary of the Invention

To overcome, at least to some extent, the problem in the related art that manual waybill address labeling is both inefficient and a waste of manpower, the present application provides a method for labeling waybill addresses using text proximity and address collision algorithms.

The solution of the present application is as follows:

A method for labeling waybill addresses using text proximity and address collision algorithms, comprising:

obtaining a waybill image, performing text recognition on the waybill image, and obtaining the text content therein and the coordinate information of each piece of text content;

determining, based on the text content and the coordinate information of each piece of text content, the proximity value of each character in the text content;

grouping the characters in the text content according to their proximity values to obtain an array of text fragments;

comparing the text fragment array with a preset address comparison library, and determining the text fragment array with the highest similarity as the address text;

taking the four-corner coordinates of the address text as the address label value of the waybill image.

Preferably, the method further comprises:

identifying the waybill number in the waybill image;

retrieving, according to the waybill number, the order center service address library as the preset address comparison library.

Preferably, the method further comprises:

using a standard address library as the preset address comparison library.

Preferably, the method further comprises:

merging the order center service address library and the standard address library to form the preset address comparison library.

Preferably, the method further comprises:

flattening the data in the address comparison library into a one-dimensional array.

Preferably, grouping the characters in the text content according to their proximity values comprises:

grouping characters that have the same proximity value.

Alternatively, grouping characters whose proximity values differ by an absolute value below a preset threshold.

Preferably, comparing the text fragment array with the preset address comparison library comprises:

colliding the text fragment array with the preset address comparison library using an address collision algorithm, wherein the address collision algorithm computes a score weighted by the proportion of identical characters and by the confidence.

The technical solution provided by the present application may have the following beneficial effects. The method for labeling waybill addresses using text proximity and address collision algorithms includes: obtaining a waybill image, performing text recognition on it, and obtaining the text content therein and the coordinate information of each piece of text content; determining, based on the text content and its coordinate information, the proximity value of each character in the text content; grouping the characters according to their proximity values to obtain an array of text fragments; comparing the text fragment array with a preset address comparison library and determining the text fragment array with the highest similarity as the address text; and taking the four-corner coordinates of the address text as the address label value of the waybill image. The technical solution can replace manual work by automatically labeling addresses in waybill images, greatly improving labeling efficiency.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present application.

Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain its principles.

Fig. 1 is a schematic flowchart of a method for labeling waybill addresses using text proximity and address collision algorithms, provided by an embodiment of the present application.

Fig. 2 is a schematic diagram of calculating the spacing between two adjacent characters in the method for labeling waybill addresses using text proximity and address collision algorithms, provided by an embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application as recited in the appended claims.

Fig. 1 is a schematic flowchart of a method for labeling waybill addresses using text proximity and address collision algorithms, provided by an embodiment of the present application. Referring to Fig. 1, the method includes:

S11: obtain a waybill image, perform text recognition on the waybill image, and obtain the text content therein and the coordinate information of each piece of text content;

S12: based on the text content and the coordinate information of each piece of text content, determine the proximity value of each character in the text content;

S13: group the characters in the text content according to their proximity values to obtain an array of text fragments;

S14: compare the text fragment array with a preset address comparison library, and determine the text fragment array with the highest similarity as the address text;

S15: take the four-corner coordinates of the address text as the address label value of the waybill image.
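As an illustration of step S15, the following minimal sketch derives four corner coordinates for an address region from the corner coordinates of its characters. It assumes an axis-aligned bounding box and a corner order of top-left, top-right, bottom-right, bottom-left; the function name `bounding_corners` and the sample coordinates are hypothetical and are not taken from the application.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def bounding_corners(char_boxes: Sequence[Sequence[Point]]) -> List[Point]:
    """Return the four corners of the axis-aligned box enclosing all characters.

    Assumption: image coordinates with y growing downward, so "top" is min(y).
    """
    if not char_boxes:
        raise ValueError("at least one character box is required")
    xs = [x for box in char_boxes for x, _ in box]
    ys = [y for box in char_boxes for _, y in box]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)
    # Assumed corner order: top-left, top-right, bottom-right, bottom-left.
    return [(left, top), (right, top), (right, bottom), (left, bottom)]


# Three characters of a two-line address (made-up coordinates).
address_chars = [
    [(10, 50), (30, 50), (30, 70), (10, 70)],
    [(32, 50), (52, 50), (52, 70), (32, 70)],
    [(10, 74), (30, 74), (30, 94), (10, 94)],
]
label = bounding_corners(address_chars)
```

An axis-aligned box is the simplest choice for a multi-line address; a rotated waybill would need a different enclosing shape.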

It should be noted that the technical solution in this embodiment relates to the technical field of data labeling, and is specifically applied to waybill address data labeling within machine learning data labeling.

In practice, text recognition can first be performed on the waybill image using a CNN (Convolutional Neural Network) + RNN (Recurrent Neural Network) model. The result of this step is raw waybill text content, which has not yet been given any semantic interpretation, together with the coordinate information of each piece of text content.
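To make the shape of the recognition output concrete, here is a minimal sketch of one possible record per recognized character. The class name `OcrChar`, its fields, and the corner order are assumptions for illustration only; the application does not specify a data format or a particular OCR library.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass
class OcrChar:
    """One recognized character (hypothetical structure)."""
    text: str
    # Assumed corner order: top-left, top-right, bottom-right, bottom-left.
    corners: Tuple[Point, Point, Point, Point]
    confidence: float  # recognition confidence in [0, 1]


def center(ch: OcrChar) -> Point:
    """Average of the four corners, used as the character's position."""
    xs = [p[0] for p in ch.corners]
    ys = [p[1] for p in ch.corners]
    return (sum(xs) / 4.0, sum(ys) / 4.0)


# Made-up sample: two characters on the same text line.
sample = [
    OcrChar("上", ((10, 10), (30, 10), (30, 30), (10, 30)), 0.98),
    OcrChar("海", ((32, 10), (52, 10), (52, 30), (32, 30)), 0.95),
]
```

At this stage the characters are just text plus positions; nothing yet marks which of them belong to the address.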

In practice, the proximity value of each character in the text content is determined by the text proximity algorithm from the text content and the coordinate information of each piece of text content. Specifically:

Let the four-corner coordinates of each character be (x1, y1), (x2, y2), (x3, y3), (x4, y4). The spacing between two adjacent characters is then:

where t1 denotes the first of the two adjacent characters and t2 denotes the second.
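The spacing formula itself does not appear in this text, so the following is only one plausible reading, stated as an assumption: the spacing between t1 and t2 is taken as the Euclidean distance between the centers of their four-corner boxes. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def box_center(corners: Sequence[Point]) -> Point:
    """Center of a character box given its four corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def char_spacing(t1: Sequence[Point], t2: Sequence[Point]) -> float:
    """Assumed spacing: Euclidean distance between the two box centers."""
    (cx1, cy1), (cx2, cy2) = box_center(t1), box_center(t2)
    return math.hypot(cx2 - cx1, cy2 - cy1)


# Two 10x10 characters on one line with a 2-pixel gap (made-up values).
t1 = [(0, 0), (10, 0), (10, 10), (0, 10)]
t2 = [(12, 0), (22, 0), (22, 10), (12, 10)]
spacing = char_spacing(t1, t2)
```

Other definitions, such as the gap between the right edge of t1 and the left edge of t2, would fit the surrounding description equally well.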

After the distances between all adjacent characters from t1 to tn are obtained, they are aggregated by proximity distance. The character spacings will show a large number of converging values; suppose these values are K1 to Km. An appropriate K value is selected from among them as the proximity value.
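The application does not say how the "appropriate" K is chosen. One simple possibility, shown here purely as an assumption, is to bin the spacings and take the mean of the most populated bin. The function name `pick_proximity_value` and the `bin_width` parameter are hypothetical.

```python
from collections import Counter
from typing import Sequence


def pick_proximity_value(spacings: Sequence[float], bin_width: float = 1.0) -> float:
    """Return the mean of the spacings that fall in the most common bin.

    Assumption: the dominant spacing on the waybill is the one that occurs
    most often after rounding to multiples of bin_width.
    """
    if not spacings:
        raise ValueError("spacings must not be empty")
    bins = Counter(round(s / bin_width) for s in spacings)
    best_bin, _count = bins.most_common(1)[0]
    members = [s for s in spacings if round(s / bin_width) == best_bin]
    return sum(members) / len(members)


# Made-up spacings: most values cluster near 12, with two outliers.
spacings = [12.1, 11.9, 12.0, 30.5, 12.2, 45.0]
k = pick_proximity_value(spacings)
```

With the sample data, the four values near 12 form the largest bin, so K lands close to 12.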

It should be noted that in this embodiment, characters with the same proximity value may be grouped together, or characters whose proximity values differ by an absolute value below a preset threshold may be grouped together. Let the resulting arrays be A1 to Ax.
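The threshold-based grouping can be sketched as follows. This is an interpretation rather than the application's stated procedure: it assumes the characters are already in reading order, that `spacings[i]` is the spacing between character i and character i+1, and that a new group starts whenever a spacing differs from the chosen proximity value K by at least the threshold. All names are hypothetical.

```python
from typing import List, Sequence


def group_characters(
    chars: Sequence[str],
    spacings: Sequence[float],
    k: float,
    threshold: float,
) -> List[List[str]]:
    """Split chars into groups A1..Ax wherever |spacing - k| >= threshold."""
    if len(spacings) != max(len(chars) - 1, 0):
        raise ValueError("need exactly one spacing per adjacent pair")
    groups: List[List[str]] = []
    current: List[str] = []
    for i, ch in enumerate(chars):
        # spacings[i - 1] is the distance between chars[i - 1] and chars[i].
        if i > 0 and abs(spacings[i - 1] - k) >= threshold:
            groups.append(current)
            current = []
        current.append(ch)
    if current:
        groups.append(current)
    return groups


# Made-up example: a large jump (40.0) separates two fragments.
groups = group_characters(
    list("上海市青浦区"),
    [12.0, 12.5, 40.0, 11.8, 12.1],
    k=12.0,
    threshold=2.0,
)
```

Setting the threshold to a very small value approximates the "same proximity value" variant described above.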

It should be noted that the method further includes:

identifying the waybill number in the waybill image;

retrieving, according to the waybill number, the order center service address library as the preset address comparison library;

merging the order center service address library and the standard address library to form the preset address comparison library;

flattening the data in the address comparison library into a one-dimensional array.

In practice, there are two options when selecting an address library:

One is a locally maintained address library (such as the order center service address library); the other is a general province-city-district address library (the standard address library). The address library is determined by choosing either one or both, and is then flattened into a one-dimensional array to facilitate the subsequent collision step.
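The flattening step might look like the sketch below. The storage format of either library is not described in the application, so this assumes the standard library is a nested province, city, district mapping and the order center library is already a list of address strings. The function names and sample data are hypothetical.

```python
from typing import Any, Iterable, List


def flatten_library(node: Any, prefix: str = "") -> List[str]:
    """Flatten a nested province/city/district structure into full strings."""
    flat: List[str] = []
    if isinstance(node, dict):
        for name, child in node.items():
            flat.extend(flatten_library(child, prefix + name))
    elif isinstance(node, (list, tuple)):
        for item in node:
            flat.extend(flatten_library(item, prefix))
    else:
        # Leaf: a single place name appended to its parents' names.
        flat.append(prefix + str(node))
    return flat


def build_comparison_library(
    order_center_addresses: Iterable[str], standard_library: Any
) -> List[str]:
    """Merge both sources into one 1-D list, dropping duplicates in order."""
    merged = list(order_center_addresses) + flatten_library(standard_library)
    return list(dict.fromkeys(merged))


# Made-up data for illustration.
standard = {"浙江省": {"杭州市": ["西湖区", "余杭区"]}}
order_center = ["浙江省杭州市西湖区文三路1号"]
library = build_comparison_library(order_center, standard)
```

Passing an empty list for either source corresponds to the "choose one" case; passing both corresponds to the merged case.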

It should be noted that comparing the text fragment array with the preset address comparison library includes:

colliding the text fragment array with the preset address comparison library using an address collision algorithm, wherein the address collision algorithm computes a score weighted by the proportion of identical characters and by the confidence.

In practice, the address collision algorithm is as follows:

where C(D) denotes the length of array D; L(Ai) denotes the length of field Ai; T(j,k) denotes the correctness of the match between the j-th character of array Ai and the k-th character of entry Dm in the address comparison library, taking the value 0 or 1; and Pj denotes the confidence of the j-th character of Ai. Using Mi, the address collision result value of array Ai can be computed, and the text fragment array with the highest similarity can then be determined.
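Because the formula for Mi does not appear in this text, the sketch below uses an assumed form built only from the symbols defined above: for each library entry Dm, sum Pj over the characters of Ai that occur somewhere in Dm (T(j,k) = 1 for some k), divide by L(Ai), and take the maximum over all C(D) entries. This is a guess at a confidence-weighted proportion of identical characters, not the application's actual formula; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def collision_score(
    fragment: str, confidences: Sequence[float], library: Sequence[str]
) -> float:
    """Assumed Mi: best confidence-weighted share of matching characters."""
    if not fragment or len(fragment) != len(confidences):
        raise ValueError("fragment and confidences must be non-empty and aligned")
    best = 0.0
    for entry in library:  # iterates over all C(D) entries Dm
        entry_chars = set(entry)
        # Sum Pj for every character j of Ai that appears in Dm.
        weighted_hits = sum(
            p for ch, p in zip(fragment, confidences) if ch in entry_chars
        )
        best = max(best, weighted_hits / len(fragment))  # divide by L(Ai)
    return best


def pick_address(
    fragments: Sequence[Tuple[str, Sequence[float]]], library: Sequence[str]
) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring fragment and all scores."""
    scores = [collision_score(text, conf, library) for text, conf in fragments]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Made-up fragments: a phone number and an address.
library = ["浙江省杭州市西湖区"]
fragments = [
    ("13800000000", [0.9] * 11),
    ("浙江省杭州市西湖区文三路", [1.0] * 12),
]
best_index, scores = pick_address(fragments, library)
```

In this sample the phone number shares no characters with the library, so the address fragment is selected.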

It can be understood that the method in this embodiment for labeling waybill addresses using text proximity and address collision algorithms includes: obtaining a waybill image, performing text recognition on it, and obtaining the text content therein and the coordinate information of each piece of text content; determining, based on the text content and its coordinate information, the proximity value of each character in the text content; grouping the characters according to their proximity values to obtain an array of text fragments; comparing the text fragment array with a preset address comparison library and determining the text fragment array with the highest similarity as the address text; and taking the four-corner coordinates of the address text as the address label value of the waybill image. The technical solution in this embodiment can replace manual work by automatically labeling addresses in waybill images, greatly improving labeling efficiency.

It can be understood that the same or similar parts of the above embodiments may be cross-referenced, and content not described in detail in one embodiment may be found in the same or similar content of other embodiments.

It should be noted that, in the description of the present application, terms such as "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance. In addition, in the description of the present application, unless otherwise specified, "a plurality of" means at least two.

Any process or method description in a flowchart or otherwise described herein may be understood to represent a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.

It should be understood that each part of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.

Those of ordinary skill in the art can understand that all or part of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, the functional units in the embodiments of the present application may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

In the description of this specification, references to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. In this specification, such references do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present application have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present application. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present application.

Claims

1. A method for labeling waybill addresses using text proximity and address collision algorithms, characterized in that it comprises:

obtaining a waybill image, performing text recognition on the waybill image, and obtaining the text content therein and the coordinate information of each piece of text content;

determining, based on the text content and the coordinate information of each piece of text content, the proximity value of each character in the text content;

grouping the characters in the text content according to their proximity values to obtain an array of text fragments;

comparing the text fragment array with a preset address comparison library, and determining the text fragment array with the highest similarity as the address text;

taking the four-corner coordinates of the address text as the address label value of the waybill image.

2. The method according to claim 1, characterized in that the method further comprises:

identifying the waybill number in the waybill image;

retrieving, according to the waybill number, the order center service address library as the preset address comparison library.

3. The method according to claim 2, characterized in that the method further comprises:

using a standard address library as the preset address comparison library.

4. The method according to claim 3, characterized in that the method further comprises:

merging the order center service address library and the standard address library to form the preset address comparison library.

5. The method according to claim 1, characterized in that the method further comprises:

flattening the data in the address comparison library into a one-dimensional array.

6. The method according to claim 1, characterized in that grouping the characters in the text content according to their proximity values comprises:

grouping characters that have the same proximity value.

7. The method according to claim 1, characterized in that grouping the characters in the text content according to their proximity values comprises:

grouping characters whose proximity values differ by an absolute value below a preset threshold.

8. The method according to claim 1, characterized in that comparing the text fragment array with the preset address comparison library comprises:

colliding the text fragment array with the preset address comparison library using an address collision algorithm, wherein the address collision algorithm computes a score weighted by the proportion of identical characters and by the confidence.