
WO2018000269A1 - Data annotation method and system based on data mining and crowdsourcing - Google Patents


Info

Publication number
WO2018000269A1
Authority
WO
WIPO (PCT)
Prior art keywords
labeling
crowdsourcing
result
labeling result
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2016/087754
Other languages
French (fr)
Chinese (zh)
Inventor
杨新宇
王昊奋
邱楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Gowild Robotics Co Ltd
Original Assignee
Shenzhen Gowild Robotics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Gowild Robotics Co Ltd
Priority to PCT/CN2016/087754
Priority to CN201680001749.5A (published as CN106489149A)
Publication of WO2018000269A1
Anticipated expiration
Legal status: Ceased


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting

Definitions

  • the invention relates to the technical field of data annotation, in particular to a data annotation method and system based on data mining and crowdsourcing.
  • Crowdsourcing technology is a distributed problem solving method. This technology uses the wisdom and strength of everyone to solve tasks that are difficult for computers to solve, especially data annotation, object recognition, etc., which are very simple for humans, but are very difficult tasks for computers.
  • Many annotation tasks such as text annotation, image classification, etc., can be published to the Internet through the crowdsourcing platform, and marked by ordinary users from the Internet. Ordinary users complete the data annotation task and receive the financial rewards provided by the publisher.
  • the advantage of the crowdsourcing platform is fine-grained processing: when the scale is large enough, comprehensive and in-depth data processing results can be obtained.
  • the disadvantages are large investment, low efficiency, and small processing volume.
  • moreover, the annotators are ordinary Internet users, and compared with traditional expert annotation the annotation quality is not guaranteed.
  • the object of the present invention is to provide a data annotation method and system based on data mining and crowdsourcing, so as to reduce the cost of labeling data and improve labeling efficiency and quality.
  • a data annotation method based on data mining and crowdsourcing includes: obtaining the raw data to be labeled; classifying the raw data and distributing it via crowdsourcing using an integrated algorithm; obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking them;
  • the automatically reviewed crowdsourcing labeling results are output, and these crowdsourcing labeling results include the problem labeling results.
  • the problem labeling results include low-quality labeling results
  • the step of using the integrated algorithm to automatically review the crowdsourcing labeling results, screen out problem labeling results, and mark them comprises:
  • analyzing the crowdsourcing labeling results according to the historical labeling database and comparison rules, and obtaining and marking low-quality labeling results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity check.
  • the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:
  • performing a similarity comparison on a crowdsourcing labeling result against the global historical labeling database; if the similarity between that result and a labeling result in the historical labeling database reaches a threshold, the remaining labeling results that conflict with it are marked as low-quality labeling results.
  • alternatively, the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:
  • performing cluster analysis on a crowdsourcing labeling result according to the annotator's historical labeling database; if that result belongs to a cluster category, the remaining labeling results whose deviation from it exceeds a threshold are marked as low-quality labeling results.
  • the problem labeling results include error labeling results
  • the step of automatically reviewing the crowdsourcing labeling results, obtaining problem labeling results, and marking them specifically includes:
  • comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking as error labeling results those where the machine classification conflicts with the crowdsourcing labeling result.
  • the step of comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking conflicting results as error labeling results, comprises: determining whether the crowdsourcing labeling result conforms to the intent sentence pattern matching template corresponding to its manual labeling intent; if it does not match, it is marked as an error labeling result.
  • alternatively, the step comprises: determining whether the crowdsourcing labeling result contains any word in the intent word bag corresponding to its manual labeling intent; if not, it is marked as an error labeling result.
  • the integrated algorithm includes at least a clustering algorithm and a labeling rule template
  • the step of classifying and distributing the raw data via crowdsourcing using the integrated algorithm specifically includes: classifying and distributing the raw data according to the clustering algorithm and the labeling rule template.
  • the step of outputting the automatically reviewed crowdsourcing labeling results specifically includes: outputting statistical results on each annotator's labeling-task completion and the problem labeling results in each annotator's labeling tasks.
  • a data annotation system based on data mining and crowdsourcing including:
  • a capture module for obtaining raw data to be labeled
  • a distribution module for classifying and crowdsourcing the original data using an integrated algorithm
  • a processing module configured to obtain the crowdsourcing labeling results, automatically review them using the integrated algorithm, screen out problem labeling results, and mark the problem labeling results;
  • an output module configured to output the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.
  • compared with the prior art, the present invention has the following advantages: in existing crowdsourcing technology, the annotators are ordinary Internet users and the annotation quality is not guaranteed, whereas the labeling method adopted in the present invention includes: obtaining the raw data to be labeled; distributing the raw data according to preset rules; obtaining the crowdsourcing labeling results, automatically reviewing them, and obtaining and marking problem labeling results; and outputting the crowdsourcing labeling results together with the problem labeling results.
  • in this way, the crowdsourcing labeling results can be reviewed, the potentially problematic labeling results can be found among all the crowdsourcing labeling results, and those problem labeling results can be marked, making them easy to review and modify.
  • the invention organically combines data mining technology with the crowdsourcing platform, yielding a large volume of accurately labeled data while effectively reducing labeling cost.
  • FIG. 1 is a flowchart of a data mining and crowdsourcing-based data annotation method according to Embodiment 1 of the present invention
  • FIG. 2 is a schematic diagram of a data mining and crowdsourcing based data annotation system in accordance with a second embodiment of the present invention.
  • Computer devices include user devices and network devices.
  • the user equipment or the client includes but is not limited to a computer, a smart phone, a PDA, etc.;
  • the network device includes but is not limited to a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing.
  • the computer device can operate alone to carry out the invention, and can also access the network and implement the invention through interoperation with other computer devices in the network.
  • the network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
  • the terms "first," "second," and the like may be used herein to describe various elements, but the elements should not be limited by these terms; the terms are used only to distinguish one element from another.
  • the term “and/or” used herein includes any and all combinations of one or more of the associated listed items. When a unit is referred to as being “connected” or “coupled” to another unit, it can be directly connected or coupled to the other unit, or an intermediate unit can be present.
  • a data annotation method based on data mining and crowdsourcing is disclosed in the embodiment, including:
  • S102 Perform classification and crowdsourcing distribution of the original data by using an integrated algorithm.
  • the range of data marked includes, but is not limited to, text, images, audio, statistics, and other data.
  • in existing crowdsourcing technology, the annotators are ordinary Internet users and the annotation quality is not guaranteed, whereas the labeling method adopted in the present invention includes: S101, acquiring the raw data to be labeled; S102, classifying the raw data and distributing it via crowdsourcing using an integrated algorithm; S103, obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking them; and S104, outputting the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.
  • the invention organically combines data mining technology with the crowdsourcing platform, yielding a large volume of accurately labeled data while effectively reducing labeling cost.
  • the invention can be applied to the technical field of robot interaction, making it convenient for a robot to collect labeled data, so that the robot can gather the required high-quality data and interact better with humans.
  • the problem labeling results include low-quality labeling results
  • the step of automatically reviewing the crowdsourcing labeling results, obtaining problem labeling results, and marking them specifically includes:
  • analyzing the crowdsourcing labeling results according to the historical labeling database and comparison rules, and obtaining and marking low-quality labeling results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity check.
  • a low-quality labeling result is a possible, i.e. suspected, low-quality annotation that is treated as a suspect and requires further specific inspection.
  • the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:
  • performing a similarity comparison on a crowdsourcing labeling result against the global historical labeling database; if the similarity between that result and a labeling result in the historical labeling database reaches a threshold, the remaining labeling results that conflict with it are marked as low-quality labeling results. In this way, low-quality labeling results can be screened out for further checking.
  • alternatively, performing cluster analysis on a crowdsourcing labeling result according to the annotator's historical labeling database; if that result belongs to a cluster category, the remaining labeling results whose deviation from it exceeds a threshold are marked as low-quality labeling results. In this way, low-quality labeling results can be screened out for further checking.
  • the problem labeling results include error labeling results
  • the step of automatically reviewing the crowdsourcing labeling results, obtaining problem labeling results, and marking them specifically includes:
  • comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking as error labeling results those where the machine classification conflicts with the crowdsourcing labeling result. In this way, error labeling results can be screened out for further checking.
  • the step of comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking conflicting results as error labeling results, comprises: determining whether the crowdsourcing labeling result conforms to the intent sentence pattern matching template corresponding to its manual labeling intent; if it does not match, it is marked as an error labeling result.
  • alternatively, it comprises: determining whether the crowdsourcing labeling result contains any word in the intent word bag corresponding to its manual labeling intent; if not, it is marked as an error labeling result. In this way, error labeling results can be screened out for further checking.
  • the integrated algorithm includes at least a clustering algorithm and a labeling rule template
  • the step of classifying and distributing the raw data via crowdsourcing using the integrated algorithm specifically includes: classifying and distributing the raw data according to the clustering algorithm and the labeling rule template.
  • the step of outputting the automatically reviewed crowdsourcing labeling results specifically includes: outputting statistical results on each annotator's labeling-task completion and the problem labeling results in each annotator's labeling tasks.
  • the specific process by which S103 automatically reviews the crowdsourcing labeling results includes:
  • through similarity comparison, clustering, validity checks, and the like, the historical annotation database is consulted to analyze whether there are "low-quality annotations"; as an example corresponding to the low-quality labeling results above, these can be automatically marked as "possibly low quality".
  • for instance, corpora of different types have the same label; or, obviously similar corpora have different labels.
  • for semantically similar corpora such as "I am going to eat" and "I am preparing to go eat", most annotators label them "go eat"; if one annotator labels "I am going to eat" as "go sing", that labeling result needs to be marked as a low-quality labeling result, such as the "possibly low quality" above;
  • as another example, the annotator's historical annotation database is consulted for clustering. If the current labeled corpus falls within a certain cluster category (i.e., the natural-language content is similar), its "crowdsourcing labeling result" is compared against the history. If the historical "crowdsourcing labeling results" deviate little from one another while the current labeling result clearly deviates from the range of the historical labeling set, it is marked as "possibly low quality".
  • the output includes statistical results on each annotator's labeling-task completion and the possibly problematic labeling entries in each annotator's task.
  • a data annotation system based on data mining and crowdsourcing including:
  • the capture module 201 is configured to obtain original data to be labeled
  • a distribution module 202 configured to perform classification and crowdsourcing distribution of the original data using an integrated algorithm
  • the processing module 203 is configured to obtain the crowdsourcing labeling result, use an integrated algorithm, automatically perform an audit on the crowdsourcing labeling result, filter out the problem labeling result, and mark the problem labeling result;
  • the output module 204 is configured to output the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.
  • the data annotation system disclosed in this embodiment includes: a capture module 201 for acquiring the raw data to be labeled; a distribution module 202 for classifying the raw data and distributing it via crowdsourcing using an integrated algorithm; a processing module 203 for obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking them; and an output module 204 for outputting the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.
  • in this way, the crowdsourcing labeling results can be reviewed, the potentially problematic labeling results can be found among all the crowdsourcing labeling results, and those problem labeling results can be marked, making them easy to review and modify. This greatly facilitates finding problematic labeling results and improves the labeling quality of the output.
  • the invention organically combines data mining technology with the crowdsourcing platform, yielding a large volume of accurately labeled data while effectively reducing labeling cost.
  • the question labeling result includes a low quality labeling result
  • the processing module is specifically configured to: analyze the crowdsourcing labeling results according to the historical labeling database and comparison rules, and obtain and mark low-quality labeling results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity check.
  • the processing module is specifically configured to perform a similarity comparison on a crowdsourcing labeling result against the global historical labeling database, and if the similarity between that result and a labeling result in the historical labeling database reaches a threshold, mark the remaining labeling results that conflict with it as low-quality labeling results.
  • cluster analysis is performed on a crowdsourcing labeling result according to the annotator's historical labeling database; if that result belongs to a cluster category, the remaining labeling results whose deviation from it exceeds a threshold are marked as low-quality labeling results.
  • the question labeling result includes an error labeling result
  • the processing module is specifically configured to:
  • compare the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screen out and mark as error labeling results those where the machine classification conflicts with the crowdsourcing labeling result.
  • the processing module is specifically configured to: determine whether the crowdsourcing labeling result conforms to the intent sentence pattern matching template corresponding to its manual labeling intent, and if it does not match, mark it as an error labeling result.
  • the processing module is specifically configured to: determine whether the crowdsourcing labeling result contains any word in the intent word bag corresponding to its manual labeling intent, and if not, mark it as an error labeling result.
  • the integrated algorithm includes at least a clustering algorithm and an annotation rule template
  • the distribution module is specifically configured to: classify and distribute the original data according to a clustering algorithm and an annotation rule template.
  • the output module is specifically configured to: output statistical results on each annotator's labeling-task completion and the problem labeling results in each annotator's labeling tasks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A data annotation method based on data mining and crowdsourcing, comprising: acquiring the raw data to be annotated (S101); classifying the raw data and distributing it via crowdsourcing using an integrated algorithm (S102); acquiring the crowdsourcing annotation results, automatically reviewing them using the integrated algorithm, and screening out and marking questionable annotation results (S103); and outputting the automatically reviewed crowdsourcing annotation results, which include the questionable annotation results (S104). Questionable annotation results are identified among all the crowdsourcing annotation results and marked, which facilitates reviewing and modifying them, greatly eases finding problematic annotations, and improves the annotation quality of the output. The method organically combines data mining techniques with crowdsourcing platforms, enabling a massive amount of accurately annotated data to be produced while effectively reducing annotation costs.

Description

Data annotation method and system based on data mining and crowdsourcing

Technical Field

The invention relates to the technical field of data annotation, and in particular to a data annotation method and system based on data mining and crowdsourcing.

Background Art

In recent years, with the development of crowdsourcing technology, using crowdsourcing for data annotation has attracted researchers' attention. Crowdsourcing is a distributed problem-solving approach that harnesses the wisdom and effort of many people to handle tasks that are hard for computers, in particular tasks such as data annotation and object recognition that are very simple for humans but very difficult for machines. Many annotation tasks, such as text annotation and image classification, can be published to the Internet through a crowdsourcing platform and annotated by ordinary users, who complete the annotation tasks and receive the financial rewards offered by the publisher.

The advantage of the crowdsourcing platform is fine-grained processing: when the scale is large enough, comprehensive and in-depth data processing results can be obtained. The disadvantages are large investment, low efficiency, and small processing volume. Moreover, the annotators are ordinary Internet users, so compared with traditional expert annotation the annotation quality is not guaranteed.

Therefore, how to reduce the cost of labeling data and improve labeling efficiency and quality is a technical problem that urgently needs to be solved in this field.

Summary of the Invention

The object of the present invention is to provide a data annotation method and system based on data mining and crowdsourcing, so as to reduce the cost of labeling data and improve labeling efficiency and quality.

The object of the present invention is achieved by the following technical solutions:

A data annotation method based on data mining and crowdsourcing, including:

obtaining the raw data to be labeled;

classifying the raw data and distributing it via crowdsourcing, using an integrated algorithm;

obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking the problem labeling results;

outputting the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.

Preferably, the problem labeling results include low-quality labeling results, and the step of using the integrated algorithm to automatically review the crowdsourcing labeling results, screen out problem labeling results, and mark them specifically includes:

analyzing the crowdsourcing labeling results according to the historical labeling database and comparison rules, and obtaining and marking low-quality labeling results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity check.

Preferably, the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:

performing a similarity comparison on a crowdsourcing labeling result against the global historical labeling database; if the similarity between that result and a labeling result in the historical labeling database reaches a threshold, the remaining labeling results that conflict with it are marked as low-quality labeling results.

Preferably, the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:

performing cluster analysis on a crowdsourcing labeling result according to the annotator's historical labeling database; if that result belongs to a cluster category, the remaining labeling results whose deviation from it exceeds a threshold are marked as low-quality labeling results.

Preferably, the problem labeling results include error labeling results, and the step of automatically reviewing the crowdsourcing labeling results, obtaining problem labeling results, and marking them specifically includes:

comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking as error labeling results those where the machine classification conflicts with the crowdsourcing labeling result.

Preferably, the step of comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking conflicting results as error labeling results, specifically includes:

determining whether the crowdsourcing labeling result conforms to the intent sentence pattern matching template corresponding to its manual labeling intent; if it does not match, it is marked as an error labeling result.

Preferably, the step of comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking conflicting results as error labeling results, specifically includes:

determining whether the crowdsourcing labeling result contains any word in the intent word bag corresponding to its manual labeling intent; if not, it is marked as an error labeling result.

Preferably, the integrated algorithm includes at least a clustering algorithm and a labeling rule template, and the step of classifying and distributing the raw data via crowdsourcing using the integrated algorithm specifically includes: classifying and distributing the raw data according to the clustering algorithm and the labeling rule template.

Preferably, the step of outputting the automatically reviewed crowdsourcing labeling results specifically includes:

outputting statistical results on each annotator's labeling-task completion and the problem labeling results in each annotator's labeling tasks.

A data annotation system based on data mining and crowdsourcing, including:

a capture module for obtaining the raw data to be labeled;

a distribution module for classifying the raw data and distributing it via crowdsourcing, using an integrated algorithm;

a processing module for obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking the problem labeling results;

an output module for outputting the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.

Compared with the prior art, the present invention has the following advantages. In existing crowdsourcing technology, the annotators are ordinary Internet users, and the annotation quality is not guaranteed. The labeling method adopted in the present invention includes: obtaining the raw data to be labeled; distributing the raw data according to preset rules; obtaining the crowdsourcing labeling results, automatically reviewing them, and obtaining and marking problem labeling results; and outputting the crowdsourcing labeling results together with the problem labeling results. In this way, the crowdsourcing labeling results can be reviewed, the potentially problematic labeling results can be found among all the crowdsourcing labeling results, and those problem labeling results can be marked, making them easy to review and modify. This greatly facilitates finding problematic labeling results and improves the labeling quality of the output. The invention organically combines data mining technology with the crowdsourcing platform, yielding a large volume of accurately labeled data while effectively reducing labeling cost.

Brief Description of the Drawings

FIG. 1 is a flowchart of a data annotation method based on data mining and crowdsourcing according to Embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of a data annotation system based on data mining and crowdsourcing according to Embodiment 2 of the present invention.

Detailed Description

Although the flowcharts describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently, and the order of the operations can be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figures. A process can correspond to a method, a function, a procedure, a subroutine, a subprogram, and the like.

Computer devices include user devices and network devices. The user device or client includes but is not limited to a computer, a smartphone, a PDA, and the like; the network device includes but is not limited to a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing. A computer device can operate alone to carry out the invention, or can access a network and carry out the invention through interoperation with other computer devices in the network. The network in which the computer device is located includes but is not limited to the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.

The terms "first," "second," and the like may be used herein to describe various elements, but the elements should not be limited by these terms; the terms are used only to distinguish one element from another. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items. When a unit is referred to as being "connected" or "coupled" to another unit, it can be directly connected or coupled to the other unit, or an intermediate unit can be present.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments. Unless the context clearly indicates otherwise, the singular forms "a" and "an" are intended to include the plural as well. It should also be understood that the terms "comprising" and/or "including" specify the presence of the stated features, integers, steps, operations, units, and/or components, without excluding the presence or addition of one or more other features, integers, steps, operations, units, components, and/or combinations thereof.

The invention is further described below with reference to the drawings and preferred embodiments.

Embodiment 1

As shown in FIG. 1, this embodiment discloses a data annotation method based on data mining and crowdsourcing, including:

S101. Acquire the raw data to be labeled;

S102. Classify the raw data and distribute it via crowdsourcing, using an integrated algorithm;

S103. Obtain the crowdsourcing labeling results, automatically review them using the integrated algorithm, screen out problem labeling results, and mark the problem labeling results;

S104. Output the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.

The range of data to be labeled includes, but is not limited to, text, images, audio, statistics, and other data.

In existing crowdsourcing technology, the annotators are ordinary Internet users, and the annotation quality is not guaranteed. The labeling method adopted in the present invention includes: S101, acquiring the raw data to be labeled; S102, classifying the raw data and distributing it via crowdsourcing using an integrated algorithm; S103, obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking them; and S104, outputting the automatically reviewed crowdsourcing labeling results, which include the problem labeling results. In this way, the crowdsourcing labeling results can be reviewed, the potentially problematic labeling results can be found among all the crowdsourcing labeling results, and those problem labeling results can be marked, making them easy to review and modify. This greatly facilitates finding problematic labeling results and improves the labeling quality of the output. The invention organically combines data mining technology with the crowdsourcing platform, yielding a large volume of accurately labeled data while effectively reducing labeling cost. The invention can be applied to the technical field of robot interaction, making it convenient for a robot to collect labeled data, so that the robot can gather the high-quality data it needs and interact better with humans.
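
As a concrete reading of steps S101–S104, the following is a minimal Python sketch of the pipeline. It is illustrative only and not part of the patent: the function names, the record layout, and the round-robin distribution standing in for the unspecified "integrated algorithm" are all assumptions.

```python
# Hypothetical sketch of the S101-S104 pipeline; all names are placeholders.

def acquire_raw_data(source):
    """S101: obtain the raw data to be labeled (text, images, audio, ...)."""
    return list(source)

def classify_and_distribute(items, annotators):
    """S102: hand raw items out to crowd annotators (round-robin stands in
    for the patent's clustering algorithm and labeling rule template)."""
    tasks = {a: [] for a in annotators}
    for i, item in enumerate(items):
        tasks[annotators[i % len(annotators)]].append(item)
    return tasks

def automated_review(results, review_rules):
    """S103: screen the crowdsourced results and mark problem annotations."""
    for r in results:
        r["flags"] = [name for name, rule in review_rules.items() if rule(r)]
    return results

def output_results(results):
    """S104: output all reviewed results, problem annotations included."""
    return results

# Usage: one review rule that flags empty labels as problem annotations.
items = acquire_raw_data(["I am going to eat", "I want to sing"])
tasks = classify_and_distribute(items, ["annotator_a", "annotator_b"])
rules = {"possibly wrong": lambda r: not r["label"].strip()}
crowd = [{"text": t, "label": ""} for t in tasks["annotator_a"]]
print(output_results(automated_review(crowd, rules)))
```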

According to one example, the problem labeling results include low-quality labeling results, and the step of automatically reviewing the crowdsourcing labeling results, obtaining problem labeling results, and marking them specifically includes:

analyzing the crowdsourcing labeling results according to the historical labeling database and comparison rules, and obtaining and marking low-quality labeling results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity check. A low-quality labeling result is a possible, i.e. suspected, low-quality annotation that is treated as a suspect and requires further specific inspection.

According to another example, the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:

performing a similarity comparison on a crowdsourcing labeling result against the global historical labeling database; if the similarity between that result and a labeling result in the historical labeling database reaches a threshold, the remaining labeling results that conflict with it are marked as low-quality labeling results. In this way, low-quality labeling results can be screened out for further checking.
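
A minimal sketch of this similarity check follows, using difflib from the Python standard library as a stand-in for whatever similarity measure the platform actually applies; the 0.9 threshold and the record layout are assumptions.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold):
    """Crude text similarity as a stand-in for the platform's measure."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def flag_low_quality(results, history, threshold=0.9):
    """Results corroborated by the global historical labeling database 'win':
    any other result for similar text with a conflicting label is marked."""
    confirmed = [r for r in results
                 if any(similar(r["text"], h["text"], threshold)
                        and h["label"] == r["label"] for h in history)]
    for r in results:
        if any(similar(r["text"], c["text"], threshold)
               and r["label"] != c["label"] for c in confirmed):
            r.setdefault("flags", []).append("possibly low quality")
    return results

history = [{"text": "I am going to eat", "label": "go eat"}]
results = [{"text": "I am going to eat", "label": "go eat"},
           {"text": "I am going to eat!", "label": "go sing"}]
print(flag_low_quality(results, history))  # second result gets flagged
```

Letting history-corroborated results "win" matches the rule's direction: the conflicting remainder is flagged, not the corroborated result itself.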

According to another example, the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:

performing cluster analysis on a crowdsourcing labeling result according to the annotator's historical labeling database; if that result belongs to a cluster category, the remaining labeling results whose deviation from it exceeds a threshold are marked as low-quality labeling results. In this way, low-quality labeling results can be screened out for further checking.
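
One way to read this cluster-based check is sketched below, assuming text similarity as the clustering criterion and disagreement with the cluster's majority label as the deviation measure; both thresholds and the record layout are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def flag_by_cluster(result, annotator_history,
                    sim_threshold=0.8, agree_threshold=0.8):
    """Gather the annotator's historical entries similar enough to fall in
    the same cluster as the new result; if those historical labels agree
    with each other but the new label deviates, flag it."""
    cluster = [h for h in annotator_history
               if SequenceMatcher(None, result["text"], h["text"]).ratio()
               >= sim_threshold]
    if not cluster:
        return None  # the result does not belong to any cluster category
    majority, count = Counter(h["label"] for h in cluster).most_common(1)[0]
    if count / len(cluster) >= agree_threshold and result["label"] != majority:
        return "possibly low quality"
    return None

history = [{"text": "I am going to eat", "label": "go eat"},
           {"text": "I am about to go eat", "label": "go eat"}]
print(flag_by_cluster({"text": "I am going to eat now", "label": "go sing"},
                      history))  # possibly low quality
```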

According to another example, the problem labeling results include error labeling results, and the step of automatically reviewing the crowdsourcing labeling results, obtaining problem labeling results, and marking them specifically includes:

comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking as error labeling results those where the machine classification conflicts with the crowdsourcing labeling result. In this way, error labeling results can be screened out for further checking.

According to another example, the step of comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking conflicting results as error labeling results, specifically includes:

determining whether the crowdsourcing labeling result conforms to the intent sentence pattern matching template corresponding to its manual labeling intent; if it does not match, it is marked as an error labeling result. In this way, error labeling results can be screened out for further checking.
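
The sentence pattern templates could be realized as one regular expression per intent, as in the sketch below; the two intents and their patterns are invented for illustration and are not taken from the patent.

```python
import re

# Hypothetical intent sentence-pattern templates, one regex per intent.
INTENT_TEMPLATES = {
    "go eat": re.compile(r"\b(going|go|about)\s+to\s+(eat|have\s+(lunch|dinner))"),
    "go sing": re.compile(r"\b(going|go)\s+to\s+sing"),
}

def check_template(text, labeled_intent):
    """Mark the entry as an error when the corpus does not match the sentence
    template registered for its manually labeled intent."""
    template = INTENT_TEMPLATES.get(labeled_intent)
    if template is None or not template.search(text):
        return "possibly wrong"
    return None

print(check_template("I am going to eat", "go sing"))  # possibly wrong
print(check_template("I am going to eat", "go eat"))   # None
```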

According to another example, the step of comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking conflicting results as error labeling results, specifically includes:

determining whether the crowdsourcing labeling result contains any word in the intent word bag corresponding to its manual labeling intent; if not, it is marked as an error labeling result. In this way, error labeling results can be screened out for further checking.
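
The word-bag variant reduces to a set intersection, as sketched below; the bags themselves are invented examples, since the patent does not enumerate them.

```python
# Hypothetical intent word bags; a real system would load them from its rules.
INTENT_WORD_BAGS = {
    "go eat": {"eat", "meal", "lunch", "dinner", "hungry"},
    "go sing": {"sing", "song", "karaoke"},
}

def check_word_bag(text, labeled_intent):
    """Mark the entry as an error when the corpus contains none of the words
    in the word bag of its manually labeled intent."""
    words = set(text.lower().split())
    bag = INTENT_WORD_BAGS.get(labeled_intent, set())
    return None if words & bag else "possibly wrong"

print(check_word_bag("I am going to eat", "go sing"))  # possibly wrong
print(check_word_bag("I am going to eat", "go eat"))   # None
```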

According to another example, the integrated algorithm includes at least a clustering algorithm and a labeling rule template, and the step of classifying and distributing the raw data via crowdsourcing using the integrated algorithm specifically includes: classifying and distributing the raw data according to the clustering algorithm and the labeling rule template.
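
The patent does not name a particular clustering algorithm. As one plausible reading for text data, the sketch below clusters corpora with TF-IDF features and k-means (scikit-learn), then distributes each cluster to an annotator; all parameter choices are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_and_distribute(texts, annotators, n_clusters=2):
    """Cluster raw corpora and hand each cluster out to the annotators.
    TF-IDF + k-means stands in for the unspecified clustering algorithm;
    the labeling rule template would further constrain the task format."""
    features = TfidfVectorizer().fit_transform(texts)
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=0).fit_predict(features)
    tasks = {a: [] for a in annotators}
    for text, cid in zip(texts, cluster_ids):
        tasks[annotators[cid % len(annotators)]].append(
            {"text": text, "cluster": int(cid)})
    return tasks

texts = ["I am going to eat", "I am about to have lunch",
         "I am going to sing", "karaoke tonight"]
print(cluster_and_distribute(texts, ["annotator_a", "annotator_b"]))
```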

According to another example, the step of outputting the automatically reviewed crowdsourcing labeling results specifically includes:

outputting statistical results on each annotator's labeling-task completion and the problem labeling results in each annotator's labeling tasks. In this way, both the labeling statistics and the problematic labeling results are delivered, so that not only the specific situation of the labeling can be understood, but also data such as each annotator's labeling accuracy can be derived, in order to screen for better annotators.
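
The per-annotator report can be a simple aggregation over the reviewed records, as in this sketch; the record fields (`annotator`, `flags`) follow the earlier hypothetical layout.

```python
from collections import Counter

def annotator_report(reviewed_results):
    """Summarize, per annotator, how many entries were completed and which
    flagged (problem) entries they produced."""
    completed = Counter(r["annotator"] for r in reviewed_results)
    problems = {}
    for r in reviewed_results:
        if r.get("flags"):
            problems.setdefault(r["annotator"], []).append(r)
    return {a: {"completed": completed[a],
                "problem_entries": problems.get(a, [])}
            for a in completed}

reviewed = [
    {"annotator": "a1", "text": "I am going to eat", "label": "go eat"},
    {"annotator": "a1", "text": "I am going to eat!", "label": "go sing",
     "flags": ["possibly low quality"]},
]
print(annotator_report(reviewed))  # a1: 2 completed, 1 problem entry
```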

In a case closer to practical application, for example when the labeled data is a corpus, the specific process by which S103 automatically reviews the crowdsourcing labeling results includes:

Before the automated review, the crowdsourcing labeling results are aggregated (i.e., taken as input);

Then the number of annotations is counted to obtain the annotator's actual completion of the current task;

Through similarity comparison, clustering, validity checks, and the like, the historical annotation database is consulted to analyze whether there are "low-quality annotations". As an example corresponding to the low-quality labeling results above, these can be automatically marked as "possibly low quality": for instance, corpora of different types have the same label, or obviously similar corpora have different labels. For semantically similar corpora such as "I am going to eat" and "I am preparing to go eat", most annotators label them "go eat"; if one annotator labels "I am going to eat" as "go sing", that labeling result needs to be marked as a low-quality labeling result, such as the "possibly low quality" above;

Intent recognition rules are used to compare against the manually labeled results, screening out the labeling entries where the machine classification conflicts with the manual annotation; like the error labeling results above, these can be automatically marked as "possibly wrong";

The data automatically marked in this round thus indicates the review focus for the subsequent manual spot check, greatly reducing the workload.

During the review, the number of annotations also needs to be counted; specifically, the total number of labeled entries is counted, as are the entries that must be labeled.

For labeling results marked as "possibly low quality", specifically:

The global historical annotation database is consulted for similarity comparison. If two labeled corpora are themselves similar beyond a certain threshold, their "crowdsourcing labeling results" are compared; if those results conflict, they are marked as "possibly low quality";

As another example, the annotator's own historical annotation database is consulted for clustering. If the current labeled corpus falls within a certain cluster category (i.e., the natural-language content is similar), its "crowdsourcing labeling result" is compared against the history. If the historical "crowdsourcing labeling results" deviate little from one another while the current labeling result clearly deviates from the range of the historical labeling set, it is marked as "possibly low quality".

For labeling results marked as "possibly wrong", specifically:

It is determined whether the labeled corpus conforms to the intent sentence pattern matching template corresponding to its manual labeling intent; if it does not match, it is marked as "possibly wrong";

As another example, it is determined whether the labeled corpus contains any word in the intent word bag corresponding to its manual labeling intent; if not, it is marked as "possibly wrong".

During the review, statistical tools are used to compile statistics on the crowdsourcing labeling results, including using natural language processing tools to conduct a "preliminary review" of the results and to automatically mark and categorize "annotations with a high probability of being wrong".

The output includes statistical results on each annotator's labeling-task completion and the possibly problematic labeling entries in each annotator's task.

Embodiment 2

As shown in FIG. 2, this embodiment discloses a data annotation system based on data mining and crowdsourcing, including:

a capture module 201 for obtaining the raw data to be labeled;

a distribution module 202 for classifying the raw data and distributing it via crowdsourcing, using an integrated algorithm;

a processing module 203 for obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking the problem labeling results;

an output module 204 for outputting the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.

本实施例公开的数据标注系统由于包括:抓取模块201,用于获取待标注的原始数据;分发模块202,用于使用整合的算法,对所述原始数据进行分类与众包分发;处理模块203,用于获取众包标注结果,使用整合的算法,对众包标注结果进行自动化审核,筛选出问题标注结果,并对问题标注结果进行标记;输出模块204,用于输出经过自动化审核的众包标注结果,所述众包标注结果中包括问题标注结果。这样就可以对众包标注结果进行审核,这样就从所有的众包标注结果中找出可能存在问题的问题标注结果,并且将这些问题标注结果标记起来,这样就可以方便对问题标注结果进行审核和修改,极大的方便了找出有问题的标注结果,提高了输出的结果的标注质量。本发明将数据挖掘技术与众包平台进行有机结合,使拥有海量精确标注数据的同时,有效的降低标注成本。The data annotation system disclosed in this embodiment includes: a capture module 201 for acquiring original data to be labeled; a distribution module 202 for classifying and distributing the original data using an integrated algorithm; and processing module 203, used to obtain the crowdsourcing labeling result, use an integrated algorithm, automatically review the crowdsourcing labeling result, filter out the problem labeling result, and mark the problem labeling result; the output module 204 is used to output the automated auditing public. The package labeling result includes the problem labeling result in the crowdsourcing labeling result. In this way, the crowdsourcing labeling results can be reviewed, so that the results of the problematic labeling can be found out from all the crowdsourcing labeling results, and the result labeling results can be marked, so that the problem labeling results can be easily reviewed. And modification, great convenience to find out the problematic labeling results, improve the quality of the output of the output. The invention organically combines the data mining technology with the crowdsourcing platform, so as to have a large amount of accurate labeling data, and effectively reduce the labeling cost.

根据其中一个示例,所述问题标注结果包括低质量标注结果,所述处 理模块具体用于:根据历史标注数据库和对比规则,对众包标注结果进行分析,获取低质量标注结果并标记,其中所述对比规则包括相似度对比、聚类分析和有效性检验中的至少一种。According to one example, the question labeling result includes a low quality labeling result, where The management module is specifically configured to: analyze the crowdsourcing labeling result according to the historical labeling database and the comparison rule, and obtain and label the low quality labeling result, wherein the comparison rule includes at least the similarity comparison, the cluster analysis, and the validity check. One.

根据其中另一个示例,所述处理模块具体用于:根据全局历史标注数据库,对众包标注结果进行相似度对比,若该众包标注结果与在历史标注数据库中的标注结果的相似度达到阈值,则将其余与该众包标注结果相冲突的标注结果标记为低质量标注结果。According to another example, the processing module is specifically configured to perform a similarity comparison on the crowdsourcing labeling result according to the global history labeling database, and if the similarity between the crowdsourcing labeling result and the labeling result in the historical labeling database reaches a threshold , the remaining labeling results that conflict with the crowdsourcing labeling result are marked as low quality labeling results.

根据其中另一个示例,根据标注者的历史标注数据库,对众包标注结果进行聚类分析,若该众包标注结果属于该聚类类别中,则将其余与该众包标注结果的偏离度超过阈值的标注结果标记为低质量标注结果。According to another example, clustering analysis is performed on the crowdsourcing labeling result according to the labeler's historical labeling database. If the crowdsourcing labeling result belongs to the clustering category, the remaining deviation from the crowdsourcing labeling result is exceeded. The labeling result of the threshold is marked as a low quality labeling result.

根据其中另一个示例,所述问题标注结果包括错误标注结果,所述处理模块具体用于:According to another example, the question labeling result includes an error labeling result, and the processing module is specifically configured to:

根据意图识别规则对数据意图与众包标注结果进行比对,筛选机器分类与众包标注结果冲突的为错误标注结果并标记。According to the intent recognition rule, the data intent is compared with the crowdsourcing labeling result, and the result of the screening machine classification conflicts with the crowdsourcing labeling result is an error labeling result and is marked.

根据其中另一个示例,所述处理模块具体用于:判断该众包标注结果是否符合其人工标注意图所对应的意图句式匹配模板,若不匹配,则标记为错误标注结果。According to another example, the processing module is specifically configured to: determine whether the crowdsourcing labeling result meets the intent sentence matching template corresponding to the manual labeling intention, and if not, mark the error labeling result.

According to another example, the processing module is specifically configured to: determine whether a crowdsourced annotation result contains any word from the intent bag-of-words corresponding to its manually annotated intent; if it does not, it is marked as an erroneous annotation result.
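A corresponding sketch of the bag-of-words check, with illustrative vocabularies standing in for the undisclosed intent word bags, could be:

# Hypothetical intent word bags; the actual vocabularies are not part of
# the disclosure.
INTENT_BAGS = {
    "weather_query": {"天气", "下雨", "温度", "weather", "rain"},
    "music_play": {"播放", "音乐", "歌", "play", "song"},
}

def is_bag_mismatch(result_text, labeled_intent):
    # Marked as erroneous when the text contains no word from the bag
    # associated with its manually annotated intent.
    bag = INTENT_BAGS.get(labeled_intent, set())
    return not any(word in result_text for word in bag)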

According to another example, the integrated algorithms include at least a clustering algorithm and annotation rule templates, and the distribution module is specifically configured to: classify and distribute the raw data according to the clustering algorithm and the annotation rule templates.
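As an illustration, classification and distribution could be realized with TF-IDF features and KMeans, cutting each cluster into fixed-size crowdsourcing batches; the feature choice, the batch size, and the mapping from categories to rule templates are all assumptions, since the application only requires a clustering algorithm and annotation rule templates:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def classify_and_dispatch(raw_texts, rule_templates, n_clusters=10, batch=50):
    # TF-IDF features, KMeans, the batch size of 50, and the dict mapping
    # cluster ids to rule templates are all illustrative assumptions.
    features = TfidfVectorizer().fit_transform(raw_texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    jobs = []
    for c in range(n_clusters):
        items = [t for t, l in zip(raw_texts, labels) if l == c]
        for start in range(0, len(items), batch):
            jobs.append({
                "category": c,
                "rule_template": rule_templates.get(c),  # guides annotators
                "items": items[start:start + batch],     # one crowdsourced task
            })
    return jobs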

According to another example, the output module is specifically configured to: output statistics on each annotator's task completion and the problem annotation results within each annotator's tasks.
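A minimal sketch of such per-annotator reporting follows, assuming each audited result carries an annotator id, a task id, and the problem flag set during the audit step; this record schema is an assumption for illustration:

from collections import defaultdict

def summarize(audited_results):
    # audited_results is assumed to be an iterable of dicts with
    # "annotator", "task_id" and the boolean "is_problem" flag written by
    # the audit step.
    stats = defaultdict(lambda: {"completed": 0, "problems": []})
    for r in audited_results:
        entry = stats[r["annotator"]]
        entry["completed"] += 1
        if r["is_problem"]:
            entry["problems"].append(r["task_id"])
    return dict(stats)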

The above is a further detailed description of the invention in conjunction with specific preferred embodiments, and the specific implementation of the invention shall not be deemed limited to these descriptions. A person of ordinary skill in the art to which the invention belongs may make several simple deductions or substitutions without departing from the concept of the invention, and all such variations shall be deemed to fall within the protection scope of the invention.

Claims (18)

1. A data annotation method based on data mining and crowdsourcing, characterized in that it comprises:
acquiring raw data to be annotated;
classifying the raw data and distributing it by crowdsourcing using integrated algorithms;
obtaining crowdsourced annotation results, automatically auditing the crowdsourced annotation results using integrated algorithms, filtering out problem annotation results, and marking the problem annotation results; and
outputting the automatically audited crowdsourced annotation results, the crowdsourced annotation results including the problem annotation results.

2. The data annotation method according to claim 1, characterized in that the problem annotation results include low-quality annotation results, and the step of automatically auditing the crowdsourced annotation results using integrated algorithms, filtering out problem annotation results, and marking the problem annotation results specifically comprises:
analyzing the crowdsourced annotation results according to a historical annotation database and comparison rules, and obtaining and marking low-quality annotation results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity checking.

3. The data annotation method according to claim 2, characterized in that the step of analyzing the crowdsourced annotation results according to a historical annotation database and comparison rules, and obtaining and marking low-quality annotation results specifically comprises:
performing a similarity comparison on a crowdsourced annotation result according to a global historical annotation database, and if the similarity between that crowdsourced annotation result and an annotation result in the historical annotation database reaches a threshold, marking the remaining annotation results that conflict with that crowdsourced annotation result as low-quality annotation results.

4. The data annotation method according to claim 2, characterized in that the step of analyzing the crowdsourced annotation results according to a historical annotation database and comparison rules, and obtaining and marking low-quality annotation results specifically comprises:
performing cluster analysis on the crowdsourced annotation results according to an annotator's historical annotation database, and if a crowdsourced annotation result belongs to the cluster category, marking the remaining annotation results whose deviation from that crowdsourced annotation result exceeds a threshold as low-quality annotation results.

5. The data annotation method according to claim 1, characterized in that the problem annotation results include erroneous annotation results, and the step of automatically auditing the crowdsourced annotation results using a preset algorithm, filtering out problem annotation results, and marking the problem annotation results specifically comprises:
comparing the data intent with the crowdsourced annotation results according to intent recognition rules, and filtering out and marking, as erroneous annotation results, those results where the machine classification conflicts with the crowdsourced annotation.

6. The data annotation method according to claim 5, characterized in that the step of comparing the data intent with the crowdsourced annotation results according to intent recognition rules, and filtering out and marking as erroneous annotation results those results where the machine classification conflicts with the crowdsourced annotation specifically comprises:
determining whether a crowdsourced annotation result conforms to the intent sentence-pattern matching template corresponding to its manually annotated intent, and if it does not match, marking it as an erroneous annotation result.

7. The data annotation method according to claim 5, characterized in that the step of comparing the data intent with the crowdsourced annotation results according to intent recognition rules, and filtering out and marking as erroneous annotation results those results where the machine classification conflicts with the crowdsourced annotation specifically comprises:
determining whether a crowdsourced annotation result contains any word from the intent bag-of-words corresponding to its manually annotated intent, and if it does not, marking it as an erroneous annotation result.

8. The data annotation method according to claim 1, characterized in that the integrated algorithms include at least a clustering algorithm and annotation rule templates, and the step of classifying the raw data and distributing it by crowdsourcing using integrated algorithms specifically comprises: classifying and distributing the raw data according to the clustering algorithm and the annotation rule templates.

9. The data annotation method according to claim 1, characterized in that the step of outputting the automatically audited crowdsourced annotation results specifically comprises:
outputting statistics on each annotator's task completion and the problem annotation results within each annotator's tasks.

10. A data annotation system based on data mining and crowdsourcing, characterized in that it comprises:
a capture module for acquiring raw data to be annotated;
a distribution module for classifying the raw data and distributing it by crowdsourcing using integrated algorithms;
a processing module for obtaining crowdsourced annotation results, automatically auditing the crowdsourced annotation results using integrated algorithms, filtering out problem annotation results, and marking the problem annotation results; and
an output module for outputting the automatically audited crowdsourced annotation results, the crowdsourced annotation results including the problem annotation results.

11. The data annotation system according to claim 10, characterized in that the problem annotation results include low-quality annotation results, and the processing module is specifically configured to: analyze the crowdsourced annotation results according to a historical annotation database and comparison rules, and obtain and mark low-quality annotation results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity checking.

12. The data annotation system according to claim 11, characterized in that the processing module is specifically configured to: perform a similarity comparison on a crowdsourced annotation result according to a global historical annotation database, and if the similarity between that crowdsourced annotation result and an annotation result in the historical annotation database reaches a threshold, mark the remaining annotation results that conflict with that crowdsourced annotation result as low-quality annotation results.

13. The data annotation system according to claim 11, characterized in that cluster analysis is performed on the crowdsourced annotation results according to an annotator's historical annotation database, and if a crowdsourced annotation result belongs to the cluster category, the remaining annotation results whose deviation from that crowdsourced annotation result exceeds a threshold are marked as low-quality annotation results.

14. The data annotation system according to claim 10, characterized in that the problem annotation results include erroneous annotation results, and the processing module is specifically configured to:
compare the data intent with the crowdsourced annotation results according to intent recognition rules, and filter out and mark, as erroneous annotation results, those results where the machine classification conflicts with the crowdsourced annotation.

15. The data annotation system according to claim 14, characterized in that the processing module is specifically configured to: determine whether a crowdsourced annotation result conforms to the intent sentence-pattern matching template corresponding to its manually annotated intent, and if it does not match, mark it as an erroneous annotation result.

16. The data annotation system according to claim 14, characterized in that the processing module is specifically configured to: determine whether a crowdsourced annotation result contains any word from the intent bag-of-words corresponding to its manually annotated intent, and if it does not, mark it as an erroneous annotation result.

17. The data annotation system according to claim 10, characterized in that the integrated algorithms include at least a clustering algorithm and annotation rule templates, and the distribution module is specifically configured to: classify and distribute the raw data according to the clustering algorithm and the annotation rule templates.

18. The data annotation system according to claim 10, characterized in that the output module is specifically configured to: output statistics on each annotator's task completion and the problem annotation results within each annotator's tasks.
PCT/CN2016/087754 2016-06-29 2016-06-29 Data annotation method and system based on data mining and crowdsourcing Ceased WO2018000269A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2016/087754 WO2018000269A1 (en) 2016-06-29 2016-06-29 Data annotation method and system based on data mining and crowdsourcing
CN201680001749.5A CN106489149A (en) 2016-06-29 2016-06-29 A kind of data mask method based on data mining and mass-rent and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/087754 WO2018000269A1 (en) 2016-06-29 2016-06-29 Data annotation method and system based on data mining and crowdsourcing

Publications (1)

Publication Number Publication Date
WO2018000269A1 (en) 2018-01-04

Family

ID=58286058

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/087754 Ceased WO2018000269A1 (en) 2016-06-29 2016-06-29 Data annotation method and system based on data mining and crowdsourcing

Country Status (2)

Country Link
CN (1) CN106489149A (en)
WO (1) WO2018000269A1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273492B (en) * 2017-06-15 2021-07-23 复旦大学 An interactive method for image annotation tasks based on crowdsourcing platform
CN107273698B (en) * 2017-07-06 2020-03-27 北京大学人民医院 Processing and detecting method and system for artificial intelligence training standard library
CN107705034B (en) * 2017-10-26 2021-06-29 医渡云(北京)技术有限公司 Crowdsourcing platform implementation method and device, storage medium and electronic equipment
CN108536662B (en) * 2018-04-16 2022-04-12 苏州大学 A data labeling method and device
CN108829652B (en) * 2018-04-28 2021-06-08 河海大学 Picture labeling system based on crowdsourcing
CN109033220B (en) * 2018-06-29 2022-09-06 北京京东尚科信息技术有限公司 Automatic selection method, system, equipment and storage medium of labeled data
CN109063043B (en) * 2018-07-17 2021-09-28 北京猎户星空科技有限公司 Data processing method, device, medium and equipment
CN109086814B (en) * 2018-07-23 2021-05-14 腾讯科技(深圳)有限公司 Data processing method and device and network equipment
CN109376260B (en) * 2018-09-26 2021-10-01 四川长虹电器股份有限公司 Method and system for deep learning image annotation
CN109065177A (en) * 2018-10-15 2018-12-21 平安科技(深圳)有限公司 A kind of processing method of medical data, device, server and storage medium
CN109740622A (en) * 2018-11-20 2019-05-10 众安信息技术服务有限公司 Image labeling task crowdsourcing method and system based on the logical card award method of block chain
CN109670727B (en) * 2018-12-30 2023-06-23 湖南网数科技有限公司 Crowd-sourcing-based word segmentation annotation quality evaluation system and evaluation method
CN109934266A (en) * 2019-02-19 2019-06-25 清华大学 Visual analysis system and method for improving the quality of crowdsourced annotation data
CN109993315B (en) * 2019-03-29 2021-05-18 联想(北京)有限公司 Data processing method and device and electronic equipment
CN110297914A (en) * 2019-06-14 2019-10-01 中译语通科技股份有限公司 Corpus labeling method and device
CN110309309B (en) * 2019-07-03 2021-04-13 中国搜索信息科技股份有限公司 Method and system for evaluating quality of manual labeling data
CN110457687A (en) * 2019-07-23 2019-11-15 福建奇点时空数字科技有限公司 A kind of data mining and mask method based on complex neural network modeling
CN110647985A (en) * 2019-08-02 2020-01-03 杭州电子科技大学 A crowdsourced data annotation method based on artificial intelligence model library
CN111078908B (en) * 2019-11-28 2023-06-09 北京云聚智慧科技有限公司 Method and device for detecting data annotation
CN113553144B (en) * 2020-04-24 2023-09-26 杭州海康威视数字技术股份有限公司 Data distribution method, device and system
CN111833872B (en) * 2020-07-08 2021-04-30 北京声智科技有限公司 Voice control method, device, equipment, system and medium for elevator
CN119204007B (en) * 2024-11-27 2025-09-02 龙岩学院 A word segmentation and annotation quality assessment system based on crowdsourcing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120158732A1 (en) * 2010-12-17 2012-06-21 Microsoft Corporation Business application publication
US20130237243A1 (en) * 2012-03-09 2013-09-12 Microsoft Corporation Wireless beacon filtering and untrusted data detection
CN103324620A (en) * 2012-03-20 2013-09-25 北京百度网讯科技有限公司 Method and device for rectifying marking results
CN103824448A (en) * 2014-01-28 2014-05-28 交通运输部公路科学研究所 Crowd-sourcing mode-based traffic information push service method and system
CN104573359A (en) * 2014-12-31 2015-04-29 浙江大学 Method for integrating crowdsource annotation data based on task difficulty and annotator ability
US20160041958A1 (en) * 2014-08-05 2016-02-11 Linkedin Corporation Leveraging annotation bias to improve annotations

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100426305C (en) * 2005-08-31 2008-10-15 鸿富锦精密工业(深圳)有限公司 Dimensioning automatic avoiding system and method
CN105404896B (en) * 2015-11-03 2019-04-19 北京旷视科技有限公司 Label data processing method and label data processing system

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284315B (en) * 2018-08-24 2021-04-23 深圳莫比嗨客树莓派智能机器人有限公司 Label data statistical inference method in crowdsourcing mode
CN109284315A (en) * 2018-08-24 2019-01-29 大连莫比嗨客智能科技有限公司 A kind of label data Statistical Inference under crowdsourcing model
CN109902285B (en) * 2019-01-08 2023-09-22 平安科技(深圳)有限公司 Corpus classification method, corpus classification device, computer equipment and storage medium
CN109902285A (en) * 2019-01-08 2019-06-18 平安科技(深圳)有限公司 Corpus classification method, device, computer equipment and storage medium
CN111368929B (en) * 2020-03-09 2023-05-02 西安中科长青医疗科技研究院有限公司 Picture marking method
CN111368929A (en) * 2020-03-09 2020-07-03 西安中科长青医疗科技研究院有限公司 Picture labeling method
CN111783391B (en) * 2020-05-28 2024-06-07 孙炜 Online artificial text marking system and method
CN111783391A (en) * 2020-05-28 2020-10-16 孙炜 Online artificial text marking system and method
CN113297902A (en) * 2021-04-14 2021-08-24 中国科学院计算机网络信息中心 Method and device for generating sample data set by marking remote sensing image on line based on crowdsourcing mode
CN113297902B (en) * 2021-04-14 2023-08-08 中国科学院计算机网络信息中心 A method and device for online annotation of remote sensing images to generate sample data sets based on crowdsourcing mode
CN113220827A (en) * 2021-04-23 2021-08-06 哈尔滨工业大学 Construction method and device of agricultural corpus
CN113344387A (en) * 2021-06-08 2021-09-03 浙江工商大学 Crowdsourcing-based intelligent data processing method and device
CN113591016B (en) * 2021-09-30 2022-01-11 中国地质环境监测院(自然资源部地质灾害技术指导中心) Landslide labeling contour generation method based on multi-user cooperation
CN113591016A (en) * 2021-09-30 2021-11-02 中国地质环境监测院(自然资源部地质灾害技术指导中心) Landslide labeling contour generation method based on multi-user cooperation
CN116825212A (en) * 2023-08-29 2023-09-29 山东大学 A data collection and annotation method and system based on a biomedical crowdsourcing platform
CN116825212B (en) * 2023-08-29 2023-11-28 山东大学 Data collection labeling method and system based on biomedical crowdsourcing platform

Also Published As

Publication number Publication date
CN106489149A (en) 2017-03-08

Similar Documents

Publication Publication Date Title
WO2018000269A1 (en) Data annotation method and system based on data mining and crowdsourcing
CN110209764B (en) Corpus annotation set generation method and device, electronic equipment and storage medium
CN112035653A (en) A method and device for extracting key policy information, storage medium, and electronic device
WO2021047186A1 (en) Method, apparatus, device, and storage medium for processing consultation dialogue
US9104709B2 (en) Cleansing a database system to improve data quality
CN108153729B (en) A Knowledge Extraction Method Oriented to the Financial Field
US20140012855A1 (en) Systems and Methods for Calculating Category Proportions
CN112163424A (en) Data labeling method, device, equipment and medium
CN111753541B (en) Method and system for carrying out natural language processing NLP on contract text data
CN109389418A (en) Electric service client's demand recognition methods based on LDA model
WO2022205768A1 (en) Random contrast test identification method for integrating multiple bert models on the basis of lightgbm
KR20230160596A (en) Method and device for automatic review construction contract documents using semantic text analysis
CN114925674A (en) File compliance checking method and device, electronic equipment and storage medium
CN110674631A (en) A method and system for automatic assignment of software defects based on version submission information
CN111209375A (en) Universal clause and document matching method
CN118674169A (en) Intelligent analysis method, system, device and medium for deep mining of enterprise data
CN112905589A (en) Scientific and technological talent data processing method, system, storage medium and terminal
CN109710730B (en) Patrol information system and analysis method based on natural language analysis processing
CN117610570A (en) Expert consultation intelligent system and method
CN110737749B (en) Entrepreneurship plan evaluation method, entrepreneurship plan evaluation device, computer equipment and storage medium
CN117055842A (en) NLP-based software scale measurement method and system
Kumaresh et al. Mining software repositories for defect categorization
CN116304112A (en) Intelligent monitoring method based on big data technology
CN114579750A (en) Information processing method and device, computer equipment and storage medium
CN114528399A (en) Work order text classification method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16906670

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16906670

Country of ref document: EP

Kind code of ref document: A1