
WO2018000269A1 - Data annotation method and system based on data mining and crowdsourcing - Google Patents


Info

Publication number
WO2018000269A1
Authority
WO
WIPO (PCT)
Prior art keywords
labeling
crowdsourcing
result
labeling result
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2016/087754
Other languages
French (fr)
Chinese (zh)
Inventor
杨新宇
王昊奋
邱楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Gowild Robotics Co Ltd
Original Assignee
Shenzhen Gowild Robotics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Gowild Robotics Co Ltd
Priority to PCT/CN2016/087754
Priority to CN201680001749.5A (published as CN106489149A)
Publication of WO2018000269A1
Anticipated expiration
Legal status: Ceased


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting

Definitions

  • the invention relates to the technical field of data annotation, in particular to a data annotation method and system based on data mining and crowdsourcing.
  • Crowdsourcing technology is a distributed problem solving method. This technology uses the wisdom and strength of everyone to solve tasks that are difficult for computers to solve, especially data annotation, object recognition, etc., which are very simple for humans, but are very difficult tasks for computers.
  • Many annotation tasks such as text annotation, image classification, etc., can be published to the Internet through the crowdsourcing platform, and marked by ordinary users from the Internet. Ordinary users complete the data annotation task and receive the financial rewards provided by the publisher.
  • the advantage of the crowdsourcing platform is fine-grained processing: when the scale is large enough, comprehensive and in-depth data processing results can be obtained.
  • the disadvantages are large investment, low efficiency, and small processing volume.
  • moreover, the annotators are ordinary Internet users, and compared with traditional expert annotation the annotation quality is not guaranteed.
  • the object of the present invention is to provide a data annotation method and system based on data mining and crowdsourcing, so as to reduce the cost of labeling data and improve labeling efficiency and quality.
  • a data annotation method based on data mining and crowdsourcing includes: obtaining the raw data to be labeled; classifying the raw data and distributing it via crowdsourcing using an integrated algorithm; obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking them;
  • the automatically reviewed crowdsourcing labeling results are output, and these crowdsourcing labeling results include the problem labeling results.
  • the problem labeling results include low-quality labeling results
  • the step of using the integrated algorithm to automatically review the crowdsourcing labeling results, screen out problem labeling results, and mark them comprises:
  • analyzing the crowdsourcing labeling results according to the historical labeling database and comparison rules, and obtaining and marking low-quality labeling results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity check.
  • the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:
  • performing a similarity comparison on a crowdsourcing labeling result against the global historical labeling database; if the similarity between that result and a labeling result in the historical labeling database reaches a threshold, the remaining labeling results that conflict with it are marked as low-quality labeling results.
  • alternatively, the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:
  • performing cluster analysis on a crowdsourcing labeling result according to the annotator's historical labeling database; if that result belongs to a cluster category, the remaining labeling results whose deviation from it exceeds a threshold are marked as low-quality labeling results.
  • the problem labeling results include error labeling results
  • the step of automatically reviewing the crowdsourcing labeling results, obtaining problem labeling results, and marking them specifically includes:
  • comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking as error labeling results those where the machine classification conflicts with the crowdsourcing labeling result.
  • the step of comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking conflicting results as error labeling results, comprises: determining whether the crowdsourcing labeling result conforms to the intent sentence pattern matching template corresponding to its manual labeling intent; if it does not match, it is marked as an error labeling result.
  • alternatively, the step comprises: determining whether the crowdsourcing labeling result contains any word in the intent word bag corresponding to its manual labeling intent; if not, it is marked as an error labeling result.
  • the integrated algorithm includes at least a clustering algorithm and a labeling rule template
  • the step of classifying and distributing the raw data via crowdsourcing using the integrated algorithm specifically includes: classifying and distributing the raw data according to the clustering algorithm and the labeling rule template.
  • the step of outputting the automatically reviewed crowdsourcing labeling results specifically includes: outputting statistical results on each annotator's labeling-task completion and the problem labeling results in each annotator's labeling tasks.
  • a data annotation system based on data mining and crowdsourcing including:
  • a capture module for obtaining raw data to be labeled
  • a distribution module for classifying and crowdsourcing the original data using an integrated algorithm
  • a processing module configured to obtain the crowdsourcing labeling results, automatically review them using the integrated algorithm, screen out problem labeling results, and mark the problem labeling results;
  • an output module configured to output the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.
  • compared with the prior art, the present invention has the following advantages: in existing crowdsourcing technology, the annotators are ordinary Internet users and the annotation quality is not guaranteed, whereas the labeling method adopted in the present invention includes: obtaining the raw data to be labeled; distributing the raw data according to preset rules; obtaining the crowdsourcing labeling results, automatically reviewing them, and obtaining and marking problem labeling results; and outputting the crowdsourcing labeling results together with the problem labeling results.
  • in this way, the crowdsourcing labeling results can be reviewed, the potentially problematic labeling results can be found among all the crowdsourcing labeling results, and those problem labeling results can be marked, making them easy to review and modify.
  • the invention organically combines data mining technology with the crowdsourcing platform, yielding a large volume of accurately labeled data while effectively reducing labeling cost.
  • FIG. 1 is a flowchart of a data mining and crowdsourcing-based data annotation method according to Embodiment 1 of the present invention
  • FIG. 2 is a schematic diagram of a data mining and crowdsourcing based data annotation system in accordance with a second embodiment of the present invention.
  • Computer devices include user devices and network devices.
  • the user equipment or the client includes but is not limited to a computer, a smart phone, a PDA, etc.;
  • the network device includes but is not limited to a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing.
  • the computer device can operate alone to carry out the invention, and can also access the network and implement the invention through interoperation with other computer devices in the network.
  • the network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
  • the terms "first," "second," and the like may be used herein to describe various elements, but the elements should not be limited by these terms; the terms are used only to distinguish one element from another.
  • the term “and/or” used herein includes any and all combinations of one or more of the associated listed items. When a unit is referred to as being “connected” or “coupled” to another unit, it can be directly connected or coupled to the other unit, or an intermediate unit can be present.
  • a data annotation method based on data mining and crowdsourcing is disclosed in the embodiment, including:
  • S102 Perform classification and crowdsourcing distribution of the original data by using an integrated algorithm.
  • the range of data marked includes, but is not limited to, text, images, audio, statistics, and other data.
  • in existing crowdsourcing technology, the annotators are ordinary Internet users and the annotation quality is not guaranteed, whereas the labeling method adopted in the present invention includes: S101, acquiring the raw data to be labeled; S102, classifying the raw data and distributing it via crowdsourcing using an integrated algorithm; S103, obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking them; and S104, outputting the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.
  • the invention organically combines data mining technology with the crowdsourcing platform, yielding a large volume of accurately labeled data while effectively reducing labeling cost.
  • the invention can be applied to the technical field of robot interaction, making it convenient for a robot to collect labeled data, so that the robot can gather the required high-quality data and interact better with humans.
  • the problem labeling results include low-quality labeling results
  • the step of automatically reviewing the crowdsourcing labeling results, obtaining problem labeling results, and marking them specifically includes:
  • analyzing the crowdsourcing labeling results according to the historical labeling database and comparison rules, and obtaining and marking low-quality labeling results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity check.
  • a low-quality labeling result is a possible, i.e. suspected, low-quality annotation that is treated as a suspect and requires further specific inspection.
  • the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:
  • performing a similarity comparison on a crowdsourcing labeling result against the global historical labeling database; if the similarity between that result and a labeling result in the historical labeling database reaches a threshold, the remaining labeling results that conflict with it are marked as low-quality labeling results. In this way, low-quality labeling results can be screened out for further checking.
  • alternatively, performing cluster analysis on a crowdsourcing labeling result according to the annotator's historical labeling database; if that result belongs to a cluster category, the remaining labeling results whose deviation from it exceeds a threshold are marked as low-quality labeling results. In this way, low-quality labeling results can be screened out for further checking.
  • the problem labeling results include error labeling results
  • the step of automatically reviewing the crowdsourcing labeling results, obtaining problem labeling results, and marking them specifically includes:
  • comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking as error labeling results those where the machine classification conflicts with the crowdsourcing labeling result. In this way, error labeling results can be screened out for further checking.
  • the step of comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking conflicting results as error labeling results, comprises: determining whether the crowdsourcing labeling result conforms to the intent sentence pattern matching template corresponding to its manual labeling intent; if it does not match, it is marked as an error labeling result.
  • alternatively, it comprises: determining whether the crowdsourcing labeling result contains any word in the intent word bag corresponding to its manual labeling intent; if not, it is marked as an error labeling result. In this way, error labeling results can be screened out for further checking.
  • the integrated algorithm includes at least a clustering algorithm and a labeling rule template
  • the step of classifying and distributing the raw data via crowdsourcing using the integrated algorithm specifically includes: classifying and distributing the raw data according to the clustering algorithm and the labeling rule template.
  • the step of outputting the automatically reviewed crowdsourcing labeling results specifically includes: outputting statistical results on each annotator's labeling-task completion and the problem labeling results in each annotator's labeling tasks.
  • the specific process by which S103 automatically reviews the crowdsourcing labeling results includes:
  • through similarity comparison, clustering, validity checks, and the like, the historical annotation database is consulted to analyze whether there are "low-quality annotations"; as an example corresponding to the low-quality labeling results above, these can be automatically marked as "possibly low quality".
  • for instance, corpora of different types have the same label; or, obviously similar corpora have different labels.
  • for semantically similar corpora such as "I am going to eat" and "I am preparing to go eat", most annotators label them "go eat"; if one annotator labels "I am going to eat" as "go sing", that labeling result needs to be marked as a low-quality labeling result, such as the "possibly low quality" above;
  • as another example, the annotator's historical annotation database is consulted for clustering. If the current labeled corpus falls within a certain cluster category (i.e., the natural-language content is similar), its "crowdsourcing labeling result" is compared against the history. If the historical "crowdsourcing labeling results" deviate little from one another while the current labeling result clearly deviates from the range of the historical labeling set, it is marked as "possibly low quality".
  • the output includes statistical results on each annotator's labeling-task completion and the possibly problematic labeling entries in each annotator's task.
  • a data annotation system based on data mining and crowdsourcing including:
  • the capture module 201 is configured to obtain original data to be labeled
  • a distribution module 202 configured to perform classification and crowdsourcing distribution of the original data using an integrated algorithm
  • the processing module 203 is configured to obtain the crowdsourcing labeling result, use an integrated algorithm, automatically perform an audit on the crowdsourcing labeling result, filter out the problem labeling result, and mark the problem labeling result;
  • the output module 204 is configured to output the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.
  • the data annotation system disclosed in this embodiment includes: a capture module 201 for acquiring the raw data to be labeled; a distribution module 202 for classifying the raw data and distributing it via crowdsourcing using an integrated algorithm; a processing module 203 for obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking them; and an output module 204 for outputting the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.
  • in this way, the crowdsourcing labeling results can be reviewed, the potentially problematic labeling results can be found among all the crowdsourcing labeling results, and those problem labeling results can be marked, making them easy to review and modify. This greatly facilitates finding problematic labeling results and improves the labeling quality of the output.
  • the invention organically combines data mining technology with the crowdsourcing platform, yielding a large volume of accurately labeled data while effectively reducing labeling cost.
  • the question labeling result includes a low quality labeling result
  • the processing module is specifically configured to: analyze the crowdsourcing labeling results according to the historical labeling database and comparison rules, and obtain and mark low-quality labeling results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity check.
  • the processing module is specifically configured to perform a similarity comparison on a crowdsourcing labeling result against the global historical labeling database, and if the similarity between that result and a labeling result in the historical labeling database reaches a threshold, mark the remaining labeling results that conflict with it as low-quality labeling results.
  • cluster analysis is performed on a crowdsourcing labeling result according to the annotator's historical labeling database; if that result belongs to a cluster category, the remaining labeling results whose deviation from it exceeds a threshold are marked as low-quality labeling results.
  • the question labeling result includes an error labeling result
  • the processing module is specifically configured to:
  • compare the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screen out and mark as error labeling results those where the machine classification conflicts with the crowdsourcing labeling result.
  • the processing module is specifically configured to: determine whether the crowdsourcing labeling result conforms to the intent sentence pattern matching template corresponding to its manual labeling intent, and if it does not match, mark it as an error labeling result.
  • the processing module is specifically configured to: determine whether the crowdsourcing labeling result contains any word in the intent word bag corresponding to its manual labeling intent, and if not, mark it as an error labeling result.
  • the integrated algorithm includes at least a clustering algorithm and an annotation rule template
  • the distribution module is specifically configured to: classify and distribute the original data according to a clustering algorithm and an annotation rule template.
  • the output module is specifically configured to: output statistical results on each annotator's labeling-task completion and the problem labeling results in each annotator's labeling tasks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A data annotation method based on data mining and crowdsourcing, comprising: acquiring the raw data to be annotated (S101); classifying the raw data and distributing it via crowdsourcing using an integrated algorithm (S102); acquiring the crowdsourcing annotation results, automatically reviewing them using the integrated algorithm, and screening out and marking questionable annotation results (S103); and outputting the automatically reviewed crowdsourcing annotation results, which include the questionable annotation results (S104). Questionable annotation results are identified among all the crowdsourcing annotation results and marked, which facilitates reviewing and modifying them, greatly eases finding problematic annotations, and improves the annotation quality of the output. The method organically combines data mining techniques with crowdsourcing platforms, enabling a massive amount of accurately annotated data to be produced while effectively reducing annotation costs.

Description

Data annotation method and system based on data mining and crowdsourcing

Technical Field

The invention relates to the technical field of data annotation, and in particular to a data annotation method and system based on data mining and crowdsourcing.

Background Art

In recent years, with the development of crowdsourcing technology, using crowdsourcing for data annotation has attracted researchers' attention. Crowdsourcing is a distributed problem-solving approach that harnesses the wisdom and effort of many people to handle tasks that are hard for computers, in particular tasks such as data annotation and object recognition that are very simple for humans but very difficult for machines. Many annotation tasks, such as text annotation and image classification, can be published to the Internet through a crowdsourcing platform and annotated by ordinary users, who complete the annotation tasks and receive the financial rewards offered by the publisher.

The advantage of the crowdsourcing platform is fine-grained processing: when the scale is large enough, comprehensive and in-depth data processing results can be obtained. The disadvantages are large investment, low efficiency, and small processing volume. Moreover, the annotators are ordinary Internet users, so compared with traditional expert annotation the annotation quality is not guaranteed.

Therefore, how to reduce the cost of labeling data and improve labeling efficiency and quality is a technical problem that urgently needs to be solved in this field.

Summary of the Invention

The object of the present invention is to provide a data annotation method and system based on data mining and crowdsourcing, so as to reduce the cost of labeling data and improve labeling efficiency and quality.

The object of the present invention is achieved by the following technical solutions:

A data annotation method based on data mining and crowdsourcing, including:

obtaining the raw data to be labeled;

classifying the raw data and distributing it via crowdsourcing, using an integrated algorithm;

obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking the problem labeling results;

outputting the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.

Preferably, the problem labeling results include low-quality labeling results, and the step of using the integrated algorithm to automatically review the crowdsourcing labeling results, screen out problem labeling results, and mark them specifically includes:

analyzing the crowdsourcing labeling results according to the historical labeling database and comparison rules, and obtaining and marking low-quality labeling results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity check.

Preferably, the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:

performing a similarity comparison on a crowdsourcing labeling result against the global historical labeling database; if the similarity between that result and a labeling result in the historical labeling database reaches a threshold, the remaining labeling results that conflict with it are marked as low-quality labeling results.

Preferably, the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:

performing cluster analysis on a crowdsourcing labeling result according to the annotator's historical labeling database; if that result belongs to a cluster category, the remaining labeling results whose deviation from it exceeds a threshold are marked as low-quality labeling results.

Preferably, the problem labeling results include error labeling results, and the step of automatically reviewing the crowdsourcing labeling results, obtaining problem labeling results, and marking them specifically includes:

comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking as error labeling results those where the machine classification conflicts with the crowdsourcing labeling result.

Preferably, the step of comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking conflicting results as error labeling results, specifically includes:

determining whether the crowdsourcing labeling result conforms to the intent sentence pattern matching template corresponding to its manual labeling intent; if it does not match, it is marked as an error labeling result.

Preferably, the step of comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking conflicting results as error labeling results, specifically includes:

determining whether the crowdsourcing labeling result contains any word in the intent word bag corresponding to its manual labeling intent; if not, it is marked as an error labeling result.

Preferably, the integrated algorithm includes at least a clustering algorithm and a labeling rule template, and the step of classifying and distributing the raw data via crowdsourcing using the integrated algorithm specifically includes: classifying and distributing the raw data according to the clustering algorithm and the labeling rule template.

Preferably, the step of outputting the automatically reviewed crowdsourcing labeling results specifically includes:

outputting statistical results on each annotator's labeling-task completion and the problem labeling results in each annotator's labeling tasks.

A data annotation system based on data mining and crowdsourcing, including:

a capture module for obtaining the raw data to be labeled;

a distribution module for classifying the raw data and distributing it via crowdsourcing, using an integrated algorithm;

a processing module for obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking the problem labeling results;

an output module for outputting the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.

Compared with the prior art, the present invention has the following advantages. In existing crowdsourcing technology, the annotators are ordinary Internet users, and the annotation quality is not guaranteed. The labeling method adopted in the present invention includes: obtaining the raw data to be labeled; distributing the raw data according to preset rules; obtaining the crowdsourcing labeling results, automatically reviewing them, and obtaining and marking problem labeling results; and outputting the crowdsourcing labeling results together with the problem labeling results. In this way, the crowdsourcing labeling results can be reviewed, the potentially problematic labeling results can be found among all the crowdsourcing labeling results, and those problem labeling results can be marked, making them easy to review and modify. This greatly facilitates finding problematic labeling results and improves the labeling quality of the output. The invention organically combines data mining technology with the crowdsourcing platform, yielding a large volume of accurately labeled data while effectively reducing labeling cost.

Brief Description of the Drawings

FIG. 1 is a flowchart of a data annotation method based on data mining and crowdsourcing according to Embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of a data annotation system based on data mining and crowdsourcing according to Embodiment 2 of the present invention.

Detailed Description

Although the flowcharts describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently, and the order of the operations can be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figures. A process can correspond to a method, a function, a procedure, a subroutine, a subprogram, and the like.

Computer devices include user devices and network devices. The user device or client includes but is not limited to a computer, a smartphone, a PDA, and the like; the network device includes but is not limited to a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing. A computer device can operate alone to carry out the invention, or can access a network and carry out the invention through interoperation with other computer devices in the network. The network in which the computer device is located includes but is not limited to the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.

The terms "first," "second," and the like may be used herein to describe various elements, but the elements should not be limited by these terms; the terms are used only to distinguish one element from another. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items. When a unit is referred to as being "connected" or "coupled" to another unit, it can be directly connected or coupled to the other unit, or an intermediate unit can be present.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments. Unless the context clearly indicates otherwise, the singular forms "a" and "an" are intended to include the plural as well. It should also be understood that the terms "comprising" and/or "including" specify the presence of the stated features, integers, steps, operations, units, and/or components, without excluding the presence or addition of one or more other features, integers, steps, operations, units, components, and/or combinations thereof.

The invention is further described below with reference to the drawings and preferred embodiments.

Embodiment 1

As shown in FIG. 1, this embodiment discloses a data annotation method based on data mining and crowdsourcing, including:

S101. Acquire the raw data to be labeled;

S102. Classify the raw data and distribute it via crowdsourcing, using an integrated algorithm;

S103. Obtain the crowdsourcing labeling results, automatically review them using the integrated algorithm, screen out problem labeling results, and mark the problem labeling results;

S104. Output the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.

The range of data to be labeled includes, but is not limited to, text, images, audio, statistics, and other data.

In existing crowdsourcing technology, the annotators are ordinary Internet users, and the annotation quality is not guaranteed. The labeling method adopted in the present invention includes: S101, acquiring the raw data to be labeled; S102, classifying the raw data and distributing it via crowdsourcing using an integrated algorithm; S103, obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking them; and S104, outputting the automatically reviewed crowdsourcing labeling results, which include the problem labeling results. In this way, the crowdsourcing labeling results can be reviewed, the potentially problematic labeling results can be found among all the crowdsourcing labeling results, and those problem labeling results can be marked, making them easy to review and modify. This greatly facilitates finding problematic labeling results and improves the labeling quality of the output. The invention organically combines data mining technology with the crowdsourcing platform, yielding a large volume of accurately labeled data while effectively reducing labeling cost. The invention can be applied to the technical field of robot interaction, making it convenient for a robot to collect labeled data, so that the robot can gather the high-quality data it needs and interact better with humans.
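
As a concrete reading of steps S101–S104, the following is a minimal Python sketch of the pipeline. It is illustrative only and not part of the patent: the function names, the record layout, and the round-robin distribution standing in for the unspecified "integrated algorithm" are all assumptions.

```python
# Hypothetical sketch of the S101-S104 pipeline; all names are placeholders.

def acquire_raw_data(source):
    """S101: obtain the raw data to be labeled (text, images, audio, ...)."""
    return list(source)

def classify_and_distribute(items, annotators):
    """S102: hand raw items out to crowd annotators (round-robin stands in
    for the patent's clustering algorithm and labeling rule template)."""
    tasks = {a: [] for a in annotators}
    for i, item in enumerate(items):
        tasks[annotators[i % len(annotators)]].append(item)
    return tasks

def automated_review(results, review_rules):
    """S103: screen the crowdsourced results and mark problem annotations."""
    for r in results:
        r["flags"] = [name for name, rule in review_rules.items() if rule(r)]
    return results

def output_results(results):
    """S104: output all reviewed results, problem annotations included."""
    return results

# Usage: one review rule that flags empty labels as problem annotations.
items = acquire_raw_data(["I am going to eat", "I want to sing"])
tasks = classify_and_distribute(items, ["annotator_a", "annotator_b"])
rules = {"possibly wrong": lambda r: not r["label"].strip()}
crowd = [{"text": t, "label": ""} for t in tasks["annotator_a"]]
print(output_results(automated_review(crowd, rules)))
```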

According to one example, the problem labeling results include low-quality labeling results, and the step of automatically reviewing the crowdsourcing labeling results, obtaining problem labeling results, and marking them specifically includes:

analyzing the crowdsourcing labeling results according to the historical labeling database and comparison rules, and obtaining and marking low-quality labeling results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity check. A low-quality labeling result is a possible, i.e. suspected, low-quality annotation that is treated as a suspect and requires further specific inspection.

According to another example, the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:

performing a similarity comparison on a crowdsourcing labeling result against the global historical labeling database; if the similarity between that result and a labeling result in the historical labeling database reaches a threshold, the remaining labeling results that conflict with it are marked as low-quality labeling results. In this way, low-quality labeling results can be screened out for further checking.
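
A minimal sketch of this similarity check follows, using difflib from the Python standard library as a stand-in for whatever similarity measure the platform actually applies; the 0.9 threshold and the record layout are assumptions.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold):
    """Crude text similarity as a stand-in for the platform's measure."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def flag_low_quality(results, history, threshold=0.9):
    """Results corroborated by the global historical labeling database 'win':
    any other result for similar text with a conflicting label is marked."""
    confirmed = [r for r in results
                 if any(similar(r["text"], h["text"], threshold)
                        and h["label"] == r["label"] for h in history)]
    for r in results:
        if any(similar(r["text"], c["text"], threshold)
               and r["label"] != c["label"] for c in confirmed):
            r.setdefault("flags", []).append("possibly low quality")
    return results

history = [{"text": "I am going to eat", "label": "go eat"}]
results = [{"text": "I am going to eat", "label": "go eat"},
           {"text": "I am going to eat!", "label": "go sing"}]
print(flag_low_quality(results, history))  # second result gets flagged
```

Letting history-corroborated results "win" matches the rule's direction: the conflicting remainder is flagged, not the corroborated result itself.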

According to another example, the step of analyzing the crowdsourcing labeling results according to the historical labeling database and the comparison rules, and obtaining and marking low-quality labeling results, specifically includes:

performing cluster analysis on a crowdsourcing labeling result according to the annotator's historical labeling database; if that result belongs to a cluster category, the remaining labeling results whose deviation from it exceeds a threshold are marked as low-quality labeling results. In this way, low-quality labeling results can be screened out for further checking.
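
One way to read this cluster-based check is sketched below, assuming text similarity as the clustering criterion and disagreement with the cluster's majority label as the deviation measure; both thresholds and the record layout are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def flag_by_cluster(result, annotator_history,
                    sim_threshold=0.8, agree_threshold=0.8):
    """Gather the annotator's historical entries similar enough to fall in
    the same cluster as the new result; if those historical labels agree
    with each other but the new label deviates, flag it."""
    cluster = [h for h in annotator_history
               if SequenceMatcher(None, result["text"], h["text"]).ratio()
               >= sim_threshold]
    if not cluster:
        return None  # the result does not belong to any cluster category
    majority, count = Counter(h["label"] for h in cluster).most_common(1)[0]
    if count / len(cluster) >= agree_threshold and result["label"] != majority:
        return "possibly low quality"
    return None

history = [{"text": "I am going to eat", "label": "go eat"},
           {"text": "I am about to go eat", "label": "go eat"}]
print(flag_by_cluster({"text": "I am going to eat now", "label": "go sing"},
                      history))  # possibly low quality
```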

According to another example, the problem labeling results include error labeling results, and the step of automatically reviewing the crowdsourcing labeling results, obtaining problem labeling results, and marking them specifically includes:

comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking as error labeling results those where the machine classification conflicts with the crowdsourcing labeling result. In this way, error labeling results can be screened out for further checking.

According to another example, the step of comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking conflicting results as error labeling results, specifically includes:

determining whether the crowdsourcing labeling result conforms to the intent sentence pattern matching template corresponding to its manual labeling intent; if it does not match, it is marked as an error labeling result. In this way, error labeling results can be screened out for further checking.
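
The sentence pattern templates could be realized as one regular expression per intent, as in the sketch below; the two intents and their patterns are invented for illustration and are not taken from the patent.

```python
import re

# Hypothetical intent sentence-pattern templates, one regex per intent.
INTENT_TEMPLATES = {
    "go eat": re.compile(r"\b(going|go|about)\s+to\s+(eat|have\s+(lunch|dinner))"),
    "go sing": re.compile(r"\b(going|go)\s+to\s+sing"),
}

def check_template(text, labeled_intent):
    """Mark the entry as an error when the corpus does not match the sentence
    template registered for its manually labeled intent."""
    template = INTENT_TEMPLATES.get(labeled_intent)
    if template is None or not template.search(text):
        return "possibly wrong"
    return None

print(check_template("I am going to eat", "go sing"))  # possibly wrong
print(check_template("I am going to eat", "go eat"))   # None
```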

According to another example, the step of comparing the data intent with the crowdsourcing labeling results according to the intent recognition rules, and screening out and marking conflicting results as error labeling results, specifically includes:

determining whether the crowdsourcing labeling result contains any word in the intent word bag corresponding to its manual labeling intent; if not, it is marked as an error labeling result. In this way, error labeling results can be screened out for further checking.
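
The word-bag variant reduces to a set intersection, as sketched below; the bags themselves are invented examples, since the patent does not enumerate them.

```python
# Hypothetical intent word bags; a real system would load them from its rules.
INTENT_WORD_BAGS = {
    "go eat": {"eat", "meal", "lunch", "dinner", "hungry"},
    "go sing": {"sing", "song", "karaoke"},
}

def check_word_bag(text, labeled_intent):
    """Mark the entry as an error when the corpus contains none of the words
    in the word bag of its manually labeled intent."""
    words = set(text.lower().split())
    bag = INTENT_WORD_BAGS.get(labeled_intent, set())
    return None if words & bag else "possibly wrong"

print(check_word_bag("I am going to eat", "go sing"))  # possibly wrong
print(check_word_bag("I am going to eat", "go eat"))   # None
```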

According to another example, the integrated algorithm includes at least a clustering algorithm and a labeling rule template, and the step of classifying and distributing the raw data via crowdsourcing using the integrated algorithm specifically includes: classifying and distributing the raw data according to the clustering algorithm and the labeling rule template.
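
The patent does not name a particular clustering algorithm. As one plausible reading for text data, the sketch below clusters corpora with TF-IDF features and k-means (scikit-learn), then distributes each cluster to an annotator; all parameter choices are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_and_distribute(texts, annotators, n_clusters=2):
    """Cluster raw corpora and hand each cluster out to the annotators.
    TF-IDF + k-means stands in for the unspecified clustering algorithm;
    the labeling rule template would further constrain the task format."""
    features = TfidfVectorizer().fit_transform(texts)
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=0).fit_predict(features)
    tasks = {a: [] for a in annotators}
    for text, cid in zip(texts, cluster_ids):
        tasks[annotators[cid % len(annotators)]].append(
            {"text": text, "cluster": int(cid)})
    return tasks

texts = ["I am going to eat", "I am about to have lunch",
         "I am going to sing", "karaoke tonight"]
print(cluster_and_distribute(texts, ["annotator_a", "annotator_b"]))
```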

According to another example, the step of outputting the automatically reviewed crowdsourcing labeling results specifically includes:

outputting statistical results on each annotator's labeling-task completion and the problem labeling results in each annotator's labeling tasks. In this way, both the labeling statistics and the problematic labeling results are delivered, so that not only the specific situation of the labeling can be understood, but also data such as each annotator's labeling accuracy can be derived, in order to screen for better annotators.
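
The per-annotator report can be a simple aggregation over the reviewed records, as in this sketch; the record fields (`annotator`, `flags`) follow the earlier hypothetical layout.

```python
from collections import Counter

def annotator_report(reviewed_results):
    """Summarize, per annotator, how many entries were completed and which
    flagged (problem) entries they produced."""
    completed = Counter(r["annotator"] for r in reviewed_results)
    problems = {}
    for r in reviewed_results:
        if r.get("flags"):
            problems.setdefault(r["annotator"], []).append(r)
    return {a: {"completed": completed[a],
                "problem_entries": problems.get(a, [])}
            for a in completed}

reviewed = [
    {"annotator": "a1", "text": "I am going to eat", "label": "go eat"},
    {"annotator": "a1", "text": "I am going to eat!", "label": "go sing",
     "flags": ["possibly low quality"]},
]
print(annotator_report(reviewed))  # a1: 2 completed, 1 problem entry
```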

In a case closer to practical application, for example when the labeled data is a corpus, the specific process by which S103 automatically reviews the crowdsourcing labeling results includes:

Before the automated review, the crowdsourcing labeling results are aggregated (i.e., taken as input);

Then the number of annotations is counted to obtain the annotator's actual completion of the current task;

Through similarity comparison, clustering, validity checks, and the like, the historical annotation database is consulted to analyze whether there are "low-quality annotations". As an example corresponding to the low-quality labeling results above, these can be automatically marked as "possibly low quality": for instance, corpora of different types have the same label, or obviously similar corpora have different labels. For semantically similar corpora such as "I am going to eat" and "I am preparing to go eat", most annotators label them "go eat"; if one annotator labels "I am going to eat" as "go sing", that labeling result needs to be marked as a low-quality labeling result, such as the "possibly low quality" above;

Intent recognition rules are used to compare against the manually labeled results, screening out the labeling entries where the machine classification conflicts with the manual annotation; like the error labeling results above, these can be automatically marked as "possibly wrong";

The data automatically marked in this round thus indicates the review focus for the subsequent manual spot check, greatly reducing the workload.

During the review, the number of annotations also needs to be counted; specifically, the total number of labeled entries is counted, as are the entries that must be labeled.

For labeling results marked as "possibly low quality", specifically:

The global historical annotation database is consulted for similarity comparison. If two labeled corpora are themselves similar beyond a certain threshold, their "crowdsourcing labeling results" are compared; if those results conflict, they are marked as "possibly low quality";

As another example, the annotator's own historical annotation database is consulted for clustering. If the current labeled corpus falls within a certain cluster category (i.e., the natural-language content is similar), its "crowdsourcing labeling result" is compared against the history. If the historical "crowdsourcing labeling results" deviate little from one another while the current labeling result clearly deviates from the range of the historical labeling set, it is marked as "possibly low quality".

For labeling results marked as "possibly wrong", specifically:

It is determined whether the labeled corpus conforms to the intent sentence pattern matching template corresponding to its manual labeling intent; if it does not match, it is marked as "possibly wrong";

As another example, it is determined whether the labeled corpus contains any word in the intent word bag corresponding to its manual labeling intent; if not, it is marked as "possibly wrong".

During the review, statistical tools are used to compile statistics on the crowdsourcing labeling results, including using natural language processing tools to conduct a "preliminary review" of the results and to automatically mark and categorize "annotations with a high probability of being wrong".

The output includes statistical results on each annotator's labeling-task completion and the possibly problematic labeling entries in each annotator's task.

Embodiment 2

As shown in FIG. 2, this embodiment discloses a data annotation system based on data mining and crowdsourcing, including:

a capture module 201 for obtaining the raw data to be labeled;

a distribution module 202 for classifying the raw data and distributing it via crowdsourcing, using an integrated algorithm;

a processing module 203 for obtaining the crowdsourcing labeling results, automatically reviewing them using the integrated algorithm, screening out problem labeling results, and marking the problem labeling results;

an output module 204 for outputting the automatically reviewed crowdsourcing labeling results, which include the problem labeling results.

本实施例公开的数据标注系统由于包括:抓取模块201,用于获取待标注的原始数据;分发模块202,用于使用整合的算法,对所述原始数据进行分类与众包分发;处理模块203,用于获取众包标注结果,使用整合的算法,对众包标注结果进行自动化审核,筛选出问题标注结果,并对问题标注结果进行标记;输出模块204,用于输出经过自动化审核的众包标注结果,所述众包标注结果中包括问题标注结果。这样就可以对众包标注结果进行审核,这样就从所有的众包标注结果中找出可能存在问题的问题标注结果,并且将这些问题标注结果标记起来,这样就可以方便对问题标注结果进行审核和修改,极大的方便了找出有问题的标注结果,提高了输出的结果的标注质量。本发明将数据挖掘技术与众包平台进行有机结合,使拥有海量精确标注数据的同时,有效的降低标注成本。The data annotation system disclosed in this embodiment includes: a capture module 201 for acquiring original data to be labeled; a distribution module 202 for classifying and distributing the original data using an integrated algorithm; and processing module 203, used to obtain the crowdsourcing labeling result, use an integrated algorithm, automatically review the crowdsourcing labeling result, filter out the problem labeling result, and mark the problem labeling result; the output module 204 is used to output the automated auditing public. The package labeling result includes the problem labeling result in the crowdsourcing labeling result. In this way, the crowdsourcing labeling results can be reviewed, so that the results of the problematic labeling can be found out from all the crowdsourcing labeling results, and the result labeling results can be marked, so that the problem labeling results can be easily reviewed. And modification, great convenience to find out the problematic labeling results, improve the quality of the output of the output. The invention organically combines the data mining technology with the crowdsourcing platform, so as to have a large amount of accurate labeling data, and effectively reduce the labeling cost.

根据其中一个示例,所述问题标注结果包括低质量标注结果,所述处 理模块具体用于:根据历史标注数据库和对比规则,对众包标注结果进行分析,获取低质量标注结果并标记,其中所述对比规则包括相似度对比、聚类分析和有效性检验中的至少一种。According to one example, the question labeling result includes a low quality labeling result, where The management module is specifically configured to: analyze the crowdsourcing labeling result according to the historical labeling database and the comparison rule, and obtain and label the low quality labeling result, wherein the comparison rule includes at least the similarity comparison, the cluster analysis, and the validity check. One.

根据其中另一个示例,所述处理模块具体用于:根据全局历史标注数据库,对众包标注结果进行相似度对比,若该众包标注结果与在历史标注数据库中的标注结果的相似度达到阈值,则将其余与该众包标注结果相冲突的标注结果标记为低质量标注结果。According to another example, the processing module is specifically configured to perform a similarity comparison on the crowdsourcing labeling result according to the global history labeling database, and if the similarity between the crowdsourcing labeling result and the labeling result in the historical labeling database reaches a threshold , the remaining labeling results that conflict with the crowdsourcing labeling result are marked as low quality labeling results.

根据其中另一个示例,根据标注者的历史标注数据库,对众包标注结果进行聚类分析,若该众包标注结果属于该聚类类别中,则将其余与该众包标注结果的偏离度超过阈值的标注结果标记为低质量标注结果。According to another example, clustering analysis is performed on the crowdsourcing labeling result according to the labeler's historical labeling database. If the crowdsourcing labeling result belongs to the clustering category, the remaining deviation from the crowdsourcing labeling result is exceeded. The labeling result of the threshold is marked as a low quality labeling result.

根据其中另一个示例,所述问题标注结果包括错误标注结果,所述处理模块具体用于:According to another example, the question labeling result includes an error labeling result, and the processing module is specifically configured to:

根据意图识别规则对数据意图与众包标注结果进行比对,筛选机器分类与众包标注结果冲突的为错误标注结果并标记。According to the intent recognition rule, the data intent is compared with the crowdsourcing labeling result, and the result of the screening machine classification conflicts with the crowdsourcing labeling result is an error labeling result and is marked.

根据其中另一个示例,所述处理模块具体用于:判断该众包标注结果是否符合其人工标注意图所对应的意图句式匹配模板,若不匹配,则标记为错误标注结果。According to another example, the processing module is specifically configured to: determine whether the crowdsourcing labeling result meets the intent sentence matching template corresponding to the manual labeling intention, and if not, mark the error labeling result.

According to another example, the processing module is specifically configured to: determine whether a crowdsourced annotation result contains any word from the intent bag-of-words corresponding to its manually annotated intent; if it does not, it is marked as an erroneous annotation result.
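A corresponding sketch of the bag-of-words check, with illustrative vocabularies standing in for the undisclosed intent word bags, could be:

# Hypothetical intent word bags; the actual vocabularies are not part of
# the disclosure.
INTENT_BAGS = {
    "weather_query": {"天气", "下雨", "温度", "weather", "rain"},
    "music_play": {"播放", "音乐", "歌", "play", "song"},
}

def is_bag_mismatch(result_text, labeled_intent):
    # Marked as erroneous when the text contains no word from the bag
    # associated with its manually annotated intent.
    bag = INTENT_BAGS.get(labeled_intent, set())
    return not any(word in result_text for word in bag)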

According to another example, the integrated algorithms include at least a clustering algorithm and annotation rule templates, and the distribution module is specifically configured to: classify and distribute the raw data according to the clustering algorithm and the annotation rule templates.
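As an illustration, classification and distribution could be realized with TF-IDF features and KMeans, cutting each cluster into fixed-size crowdsourcing batches; the feature choice, the batch size, and the mapping from categories to rule templates are all assumptions, since the application only requires a clustering algorithm and annotation rule templates:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def classify_and_dispatch(raw_texts, rule_templates, n_clusters=10, batch=50):
    # TF-IDF features, KMeans, the batch size of 50, and the dict mapping
    # cluster ids to rule templates are all illustrative assumptions.
    features = TfidfVectorizer().fit_transform(raw_texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    jobs = []
    for c in range(n_clusters):
        items = [t for t, l in zip(raw_texts, labels) if l == c]
        for start in range(0, len(items), batch):
            jobs.append({
                "category": c,
                "rule_template": rule_templates.get(c),  # guides annotators
                "items": items[start:start + batch],     # one crowdsourced task
            })
    return jobs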

According to another example, the output module is specifically configured to: output statistics on each annotator's task completion and the problem annotation results within each annotator's tasks.
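A minimal sketch of such per-annotator reporting follows, assuming each audited result carries an annotator id, a task id, and the problem flag set during the audit step; this record schema is an assumption for illustration:

from collections import defaultdict

def summarize(audited_results):
    # audited_results is assumed to be an iterable of dicts with
    # "annotator", "task_id" and the boolean "is_problem" flag written by
    # the audit step.
    stats = defaultdict(lambda: {"completed": 0, "problems": []})
    for r in audited_results:
        entry = stats[r["annotator"]]
        entry["completed"] += 1
        if r["is_problem"]:
            entry["problems"].append(r["task_id"])
    return dict(stats)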

The above is a further detailed description of the invention in conjunction with specific preferred embodiments, and the specific implementation of the invention shall not be deemed limited to these descriptions. A person of ordinary skill in the art to which the invention belongs may make several simple deductions or substitutions without departing from the concept of the invention, and all such variations shall be deemed to fall within the protection scope of the invention.

Claims (18)

1. A data annotation method based on data mining and crowdsourcing, characterized in that it comprises:
acquiring raw data to be annotated;
classifying the raw data and distributing it by crowdsourcing using integrated algorithms;
obtaining crowdsourced annotation results, automatically auditing the crowdsourced annotation results using integrated algorithms, filtering out problem annotation results, and marking the problem annotation results; and
outputting the automatically audited crowdsourced annotation results, the crowdsourced annotation results including the problem annotation results.

2. The data annotation method according to claim 1, characterized in that the problem annotation results include low-quality annotation results, and the step of automatically auditing the crowdsourced annotation results using integrated algorithms, filtering out problem annotation results, and marking the problem annotation results specifically comprises:
analyzing the crowdsourced annotation results according to a historical annotation database and comparison rules, and obtaining and marking low-quality annotation results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity checking.

3. The data annotation method according to claim 2, characterized in that the step of analyzing the crowdsourced annotation results according to a historical annotation database and comparison rules, and obtaining and marking low-quality annotation results specifically comprises:
performing a similarity comparison on a crowdsourced annotation result according to a global historical annotation database, and if the similarity between that crowdsourced annotation result and an annotation result in the historical annotation database reaches a threshold, marking the remaining annotation results that conflict with that crowdsourced annotation result as low-quality annotation results.

4. The data annotation method according to claim 2, characterized in that the step of analyzing the crowdsourced annotation results according to a historical annotation database and comparison rules, and obtaining and marking low-quality annotation results specifically comprises:
performing cluster analysis on the crowdsourced annotation results according to an annotator's historical annotation database, and if a crowdsourced annotation result belongs to the cluster category, marking the remaining annotation results whose deviation from that crowdsourced annotation result exceeds a threshold as low-quality annotation results.

5. The data annotation method according to claim 1, characterized in that the problem annotation results include erroneous annotation results, and the step of automatically auditing the crowdsourced annotation results using a preset algorithm, filtering out problem annotation results, and marking the problem annotation results specifically comprises:
comparing the data intent with the crowdsourced annotation results according to intent recognition rules, and filtering out and marking, as erroneous annotation results, those results where the machine classification conflicts with the crowdsourced annotation.

6. The data annotation method according to claim 5, characterized in that the step of comparing the data intent with the crowdsourced annotation results according to intent recognition rules, and filtering out and marking as erroneous annotation results those results where the machine classification conflicts with the crowdsourced annotation specifically comprises:
determining whether a crowdsourced annotation result conforms to the intent sentence-pattern matching template corresponding to its manually annotated intent, and if it does not match, marking it as an erroneous annotation result.

7. The data annotation method according to claim 5, characterized in that the step of comparing the data intent with the crowdsourced annotation results according to intent recognition rules, and filtering out and marking as erroneous annotation results those results where the machine classification conflicts with the crowdsourced annotation specifically comprises:
determining whether a crowdsourced annotation result contains any word from the intent bag-of-words corresponding to its manually annotated intent, and if it does not, marking it as an erroneous annotation result.

8. The data annotation method according to claim 1, characterized in that the integrated algorithms include at least a clustering algorithm and annotation rule templates, and the step of classifying the raw data and distributing it by crowdsourcing using integrated algorithms specifically comprises: classifying and distributing the raw data according to the clustering algorithm and the annotation rule templates.

9. The data annotation method according to claim 1, characterized in that the step of outputting the automatically audited crowdsourced annotation results specifically comprises:
outputting statistics on each annotator's task completion and the problem annotation results within each annotator's tasks.

10. A data annotation system based on data mining and crowdsourcing, characterized in that it comprises:
a capture module for acquiring raw data to be annotated;
a distribution module for classifying the raw data and distributing it by crowdsourcing using integrated algorithms;
a processing module for obtaining crowdsourced annotation results, automatically auditing the crowdsourced annotation results using integrated algorithms, filtering out problem annotation results, and marking the problem annotation results; and
an output module for outputting the automatically audited crowdsourced annotation results, the crowdsourced annotation results including the problem annotation results.

11. The data annotation system according to claim 10, characterized in that the problem annotation results include low-quality annotation results, and the processing module is specifically configured to: analyze the crowdsourced annotation results according to a historical annotation database and comparison rules, and obtain and mark low-quality annotation results, wherein the comparison rules include at least one of similarity comparison, cluster analysis, and validity checking.

12. The data annotation system according to claim 11, characterized in that the processing module is specifically configured to: perform a similarity comparison on a crowdsourced annotation result according to a global historical annotation database, and if the similarity between that crowdsourced annotation result and an annotation result in the historical annotation database reaches a threshold, mark the remaining annotation results that conflict with that crowdsourced annotation result as low-quality annotation results.

13. The data annotation system according to claim 11, characterized in that cluster analysis is performed on the crowdsourced annotation results according to an annotator's historical annotation database, and if a crowdsourced annotation result belongs to the cluster category, the remaining annotation results whose deviation from that crowdsourced annotation result exceeds a threshold are marked as low-quality annotation results.

14. The data annotation system according to claim 10, characterized in that the problem annotation results include erroneous annotation results, and the processing module is specifically configured to:
compare the data intent with the crowdsourced annotation results according to intent recognition rules, and filter out and mark, as erroneous annotation results, those results where the machine classification conflicts with the crowdsourced annotation.

15. The data annotation system according to claim 14, characterized in that the processing module is specifically configured to: determine whether a crowdsourced annotation result conforms to the intent sentence-pattern matching template corresponding to its manually annotated intent, and if it does not match, mark it as an erroneous annotation result.

16. The data annotation system according to claim 14, characterized in that the processing module is specifically configured to: determine whether a crowdsourced annotation result contains any word from the intent bag-of-words corresponding to its manually annotated intent, and if it does not, mark it as an erroneous annotation result.

17. The data annotation system according to claim 10, characterized in that the integrated algorithms include at least a clustering algorithm and annotation rule templates, and the distribution module is specifically configured to: classify and distribute the raw data according to the clustering algorithm and the annotation rule templates.

18. The data annotation system according to claim 10, characterized in that the output module is specifically configured to: output statistics on each annotator's task completion and the problem annotation results within each annotator's tasks.
PCT/CN2016/087754 2016-06-29 2016-06-29 Data annotation method and system based on data mining and crowdsourcing Ceased WO2018000269A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2016/087754 WO2018000269A1 (en) 2016-06-29 2016-06-29 Data annotation method and system based on data mining and crowdsourcing
CN201680001749.5A CN106489149A (en) 2016-06-29 2016-06-29 A kind of data mask method based on data mining and mass-rent and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/087754 WO2018000269A1 (en) 2016-06-29 2016-06-29 Data annotation method and system based on data mining and crowdsourcing

Publications (1)

Publication Number Publication Date
WO2018000269A1 (en) 2018-01-04

Family

ID=58286058

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/087754 Ceased WO2018000269A1 (en) 2016-06-29 2016-06-29 Data annotation method and system based on data mining and crowdsourcing

Country Status (2)

Country Link
CN (1) CN106489149A (en)
WO (1) WO2018000269A1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273492B (en) * 2017-06-15 2021-07-23 复旦大学 An interactive method for image annotation tasks based on crowdsourcing platform
CN107273698B (en) * 2017-07-06 2020-03-27 北京大学人民医院 Processing and detecting method and system for artificial intelligence training standard library
CN107705034B (en) * 2017-10-26 2021-06-29 医渡云(北京)技术有限公司 Crowdsourcing platform implementation method and device, storage medium and electronic equipment
CN108536662B (en) * 2018-04-16 2022-04-12 苏州大学 A data labeling method and device
CN108829652B (en) * 2018-04-28 2021-06-08 河海大学 Picture labeling system based on crowdsourcing
CN109033220B (en) * 2018-06-29 2022-09-06 北京京东尚科信息技术有限公司 Automatic selection method, system, equipment and storage medium of labeled data
CN109063043B (en) * 2018-07-17 2021-09-28 北京猎户星空科技有限公司 Data processing method, device, medium and equipment
CN109086814B (en) * 2018-07-23 2021-05-14 腾讯科技(深圳)有限公司 Data processing method and device and network equipment
CN109376260B (en) * 2018-09-26 2021-10-01 四川长虹电器股份有限公司 Method and system for deep learning image annotation
CN109065177A (en) * 2018-10-15 2018-12-21 平安科技(深圳)有限公司 A kind of processing method of medical data, device, server and storage medium
CN109740622A (en) * 2018-11-20 2019-05-10 众安信息技术服务有限公司 Image labeling task crowdsourcing method and system based on the logical card award method of block chain
CN109670727B (en) * 2018-12-30 2023-06-23 湖南网数科技有限公司 Crowd-sourcing-based word segmentation annotation quality evaluation system and evaluation method
CN109934266A (en) * 2019-02-19 2019-06-25 清华大学 Visual analysis system and method for improving the quality of crowdsourced annotation data
CN109993315B (en) * 2019-03-29 2021-05-18 联想(北京)有限公司 Data processing method and device and electronic equipment
CN110297914A (en) * 2019-06-14 2019-10-01 中译语通科技股份有限公司 Corpus labeling method and device
CN110309309B (en) * 2019-07-03 2021-04-13 中国搜索信息科技股份有限公司 Method and system for evaluating quality of manual labeling data
CN110457687A (en) * 2019-07-23 2019-11-15 福建奇点时空数字科技有限公司 A kind of data mining and mask method based on complex neural network modeling
CN110647985A (en) * 2019-08-02 2020-01-03 杭州电子科技大学 A crowdsourced data annotation method based on artificial intelligence model library
CN111078908B (en) * 2019-11-28 2023-06-09 北京云聚智慧科技有限公司 Method and device for detecting data annotation
CN113553144B (en) * 2020-04-24 2023-09-26 杭州海康威视数字技术股份有限公司 Data distribution method, device and system
CN111833872B (en) * 2020-07-08 2021-04-30 北京声智科技有限公司 Voice control method, device, equipment, system and medium for elevator
CN119204007B (en) * 2024-11-27 2025-09-02 龙岩学院 A word segmentation and annotation quality assessment system based on crowdsourcing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120158732A1 (en) * 2010-12-17 2012-06-21 Microsoft Corporation Business application publication
US20130237243A1 (en) * 2012-03-09 2013-09-12 Microsoft Corporation Wireless beacon filtering and untrusted data detection
CN103324620A (en) * 2012-03-20 2013-09-25 北京百度网讯科技有限公司 Method and device for rectifying marking results
CN103824448A (en) * 2014-01-28 2014-05-28 交通运输部公路科学研究所 Crowd-sourcing mode-based traffic information push service method and system
CN104573359A (en) * 2014-12-31 2015-04-29 浙江大学 Method for integrating crowdsource annotation data based on task difficulty and annotator ability
US20160041958A1 (en) * 2014-08-05 2016-02-11 Linkedin Corporation Leveraging annotation bias to improve annotations

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100426305C (en) * 2005-08-31 2008-10-15 鸿富锦精密工业(深圳)有限公司 Dimensioning automatic avoiding system and method
CN105404896B (en) * 2015-11-03 2019-04-19 北京旷视科技有限公司 Label data processing method and label data processing system

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284315B (en) * 2018-08-24 2021-04-23 深圳莫比嗨客树莓派智能机器人有限公司 Label data statistical inference method in crowdsourcing mode
CN109284315A (en) * 2018-08-24 2019-01-29 大连莫比嗨客智能科技有限公司 A kind of label data Statistical Inference under crowdsourcing model
CN109902285B (en) * 2019-01-08 2023-09-22 平安科技(深圳)有限公司 Corpus classification method, corpus classification device, computer equipment and storage medium
CN109902285A (en) * 2019-01-08 2019-06-18 平安科技(深圳)有限公司 Corpus classification method, device, computer equipment and storage medium
CN111368929B (en) * 2020-03-09 2023-05-02 西安中科长青医疗科技研究院有限公司 Picture marking method
CN111368929A (en) * 2020-03-09 2020-07-03 西安中科长青医疗科技研究院有限公司 Picture labeling method
CN111783391B (en) * 2020-05-28 2024-06-07 孙炜 Online artificial text marking system and method
CN111783391A (en) * 2020-05-28 2020-10-16 孙炜 Online artificial text marking system and method
CN113297902A (en) * 2021-04-14 2021-08-24 中国科学院计算机网络信息中心 Method and device for generating sample data set by marking remote sensing image on line based on crowdsourcing mode
CN113297902B (en) * 2021-04-14 2023-08-08 中国科学院计算机网络信息中心 A method and device for online annotation of remote sensing images to generate sample data sets based on crowdsourcing mode
CN113220827A (en) * 2021-04-23 2021-08-06 哈尔滨工业大学 Construction method and device of agricultural corpus
CN113344387A (en) * 2021-06-08 2021-09-03 浙江工商大学 Crowdsourcing-based intelligent data processing method and device
CN113591016B (en) * 2021-09-30 2022-01-11 中国地质环境监测院(自然资源部地质灾害技术指导中心) Landslide labeling contour generation method based on multi-user cooperation
CN113591016A (en) * 2021-09-30 2021-11-02 中国地质环境监测院(自然资源部地质灾害技术指导中心) Landslide labeling contour generation method based on multi-user cooperation
CN116825212A (en) * 2023-08-29 2023-09-29 山东大学 A data collection and annotation method and system based on a biomedical crowdsourcing platform
CN116825212B (en) * 2023-08-29 2023-11-28 山东大学 Data collection labeling method and system based on biomedical crowdsourcing platform

Also Published As

Publication number Publication date
CN106489149A (en) 2017-03-08

Similar Documents

Publication Publication Date Title
WO2018000269A1 (en) Data annotation method and system based on data mining and crowdsourcing
CN110209764B (en) Corpus annotation set generation method and device, electronic equipment and storage medium
CN112035653A (en) A method and device for extracting key policy information, storage medium, and electronic device
WO2021047186A1 (en) Method, apparatus, device, and storage medium for processing consultation dialogue
US9104709B2 (en) Cleansing a database system to improve data quality
CN108153729B (en) A Knowledge Extraction Method Oriented to the Financial Field
US20140012855A1 (en) Systems and Methods for Calculating Category Proportions
CN112163424A (en) Data labeling method, device, equipment and medium
CN111753541B (en) Method and system for carrying out natural language processing NLP on contract text data
CN109389418A (en) Electric service client's demand recognition methods based on LDA model
WO2022205768A1 (en) Random contrast test identification method for integrating multiple bert models on the basis of lightgbm
KR20230160596A (en) Method and device for automatic review construction contract documents using semantic text analysis
CN114925674A (en) File compliance checking method and device, electronic equipment and storage medium
CN110674631A (en) A method and system for automatic assignment of software defects based on version submission information
CN111209375A (en) Universal clause and document matching method
CN118674169A (en) Intelligent analysis method, system, device and medium for deep mining of enterprise data
CN112905589A (en) Scientific and technological talent data processing method, system, storage medium and terminal
CN109710730B (en) Patrol information system and analysis method based on natural language analysis processing
CN117610570A (en) Expert consultation intelligent system and method
CN110737749B (en) Entrepreneurship plan evaluation method, entrepreneurship plan evaluation device, computer equipment and storage medium
CN117055842A (en) NLP-based software scale measurement method and system
Kumaresh et al. Mining software repositories for defect categorization
CN116304112A (en) Intelligent monitoring method based on big data technology
CN114579750A (en) Information processing method and device, computer equipment and storage medium
CN114528399A (en) Work order text classification method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16906670

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16906670

Country of ref document: EP

Kind code of ref document: A1