CN116910050A - Data processing method, device, system and storage medium - Google Patents

Data processing method, device, system and storage medium

Info

Publication number
CN116910050A
CN116910050A CN202311048216.0A CN202311048216A CN116910050A CN 116910050 A CN116910050 A CN 116910050A CN 202311048216 A CN202311048216 A CN 202311048216A CN 116910050 A CN116910050 A CN 116910050A
Authority
CN
China
Prior art keywords
data
target
judgment
field
authority
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311048216.0A
Other languages
Chinese (zh)
Inventor
丁洪鑫
曹扬
支婷
苑建坤
董厚泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Big Data Research Institute Co Ltd
Original Assignee
CETC Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Big Data Research Institute Co Ltd
Priority to CN202311048216.0A
Publication of CN116910050A
Priority to LU507965A (LU507965B1)
Priority to PCT/CN2023/126885 (WO2025039361A1)
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/174 Form filling; Merging
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a data processing method, device, system and storage medium. The method includes: constructing a quality problem dictionary based on an experience database, the quality problem dictionary containing data quality problems and corresponding problem identifiers; obtaining data to be processed and identifying the target data problems present in it; querying the quality problem dictionary for the target data problems to obtain the corresponding target problem identifiers; adding the target problem identifiers to the target data through predefined fields; reading multiple target fields in the target data, judging the multiple target fields according to preset fusion rules, and determining from the judgment result whether conflicting data exists; when conflicting data exists, identifying the business application scenario of the current data and selecting an authoritative source judgment strategy based on the identification result; and processing the trusted-source authority level of each field in the conflicting data according to the authoritative source judgment strategy, and updating the database.

Description

A data processing method, device, system and storage medium

Technical field

The present application relates to the field of data processing, and in particular to a data processing method, device, system and storage medium.

Background art

In modern enterprises and organizations, data is generated and recorded through a variety of data sources and systems. Different departments and individuals typically contribute to these data sources, which can lead to inconsistent and inaccurate data. This poses significant challenges for data analysis, decision-making, and overall business efficiency. The prior art has attempted several approaches to data management, including manual data validation, data deduplication algorithms, and basic data integration techniques.

However, these traditional approaches are limited because they generally lack a comprehensive way to address all aspects of data management, including data cleaning, entity merging, and data aggregation. Manual verification is time-consuming and prone to human error, while existing deduplication algorithms may not adequately handle complex data relationships. In addition, current data integration techniques may not make full use of authoritative sources or adequately address missing data.

Therefore, there is a need for an innovative data management and integration system that can efficiently clean, merge, and aggregate data from various sources while ensuring high accuracy, consistency, and availability of the data.

Summary of the invention

To solve the above technical problems, this application provides a data processing method, the method comprising:

constructing a quality problem dictionary based on an experience database, the quality problem dictionary containing data quality problems and corresponding problem identifiers;

obtaining data to be processed, and identifying target data problems present in the data to be processed;

querying the quality problem dictionary for the target data problems to obtain the corresponding target problem identifiers;

adding the target problem identifiers to the target data through a predefined field;

reading multiple target fields in the target data, the multiple target fields including: name, ID number, mobile phone number, date of birth, organization identification code and company name, and generating four judgment factors from these target fields, the four judgment factors being:

a first judgment factor: the combination of name and ID number; a second judgment factor: the combination of name and mobile phone number; a third judgment factor: the combination of name, date of birth and organization identification code; and a fourth judgment factor: the combination of name and company name;

wherein the judgment strategy is: comparing the data to be compared using the four judgment factors, and, if one of the four judgment factors matches between the data to be compared, determining that the data to be compared is conflicting data;

wherein the method for judging whether the fourth judgment factor matches is:

after determining that the names are identical, calculating the semantic similarity between the company names in the data to be compared;

if the semantic similarity meets a preset threshold, determining that the fourth judgment factor matches;

when conflicting data exists, identifying the business application scenario of the current data, and selecting an authoritative source judgment strategy based on the identification result;

processing the trusted-source authority level of each field in the conflicting data according to the authoritative source judgment strategy, and updating the database;

if no conflicting data exists, generating a unique identification code for the current data and storing it in the database.
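
The conflict check described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the record keys (`name`, `id_number`, `mobile`, `birth_date`, `org_code`, `company`), the threshold value 0.8, and the function names are all assumptions. In particular, the source calls for a semantic similarity between company names; the sketch substitutes the character-level ratio from Python's `difflib` as a stand-in.

```python
from difflib import SequenceMatcher
from uuid import uuid4

# Hypothetical record keys; the source names the fields but not their keys.
EXACT_FACTORS = [
    ("name", "id_number"),               # first judgment factor
    ("name", "mobile"),                  # second judgment factor
    ("name", "birth_date", "org_code"),  # third judgment factor
]
SIMILARITY_THRESHOLD = 0.8  # assumed value; the source only says "preset threshold"


def name_similarity(a: str, b: str) -> float:
    """Stand-in for semantic similarity: character-level ratio from difflib."""
    return SequenceMatcher(None, a, b).ratio()


def factor_matches(a: dict, b: dict, fields: tuple) -> bool:
    """A factor matches only if every field is present and equal in both records."""
    return all(a.get(f) and a.get(f) == b.get(f) for f in fields)


def is_conflict(a: dict, b: dict) -> bool:
    """Two records conflict if at least one of the four judgment factors matches."""
    if any(factor_matches(a, b, fields) for fields in EXACT_FACTORS):
        return True
    # Fourth factor: names must agree before company names are compared.
    if a.get("name") and a.get("name") == b.get("name"):
        ca, cb = a.get("company"), b.get("company")
        if ca and cb and name_similarity(ca, cb) >= SIMILARITY_THRESHOLD:
            return True
    return False


def ingest(record: dict, store: list) -> list:
    """Return the stored records that conflict; if none, assign a unique ID and store."""
    conflicts = [r for r in store if is_conflict(record, r)]
    if not conflicts:
        store.append({**record, "uid": uuid4().hex})
    return conflicts
```

Treating a missing field as a non-match is a design choice made here so that two records lacking, say, an ID number are not flagged as conflicting on that factor alone.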

Optionally, identifying the business application scenario of the current data and selecting an authoritative source judgment strategy based on the identification result includes:

if the current business application scenario is processing data in an existing historical database, selecting a first authoritative source judgment strategy, the first authoritative source judgment strategy including:

querying a pre-built data authoritative source level table, taking, according to the query result, the data with the highest authority level among the conflicting data as the business primary key, and updating the historical database.

Optionally, identifying the business application scenario of the current data and selecting an authoritative source judgment strategy based on the identification result includes:

if the current business application scenario is storing real-time data in the database, selecting a second authoritative source judgment strategy, the second authoritative source judgment strategy including:

querying a pre-built field authoritative source level table and comparing the authoritative source levels corresponding to each field in the conflicting data; according to the comparison result, selecting for each field the value with the highest authoritative source level, generating result data, and storing the result data in the database, wherein the field authoritative source level table records the authoritative source level of each field.
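
The two strategies differ mainly in granularity: the first ranks whole records by source, the second ranks sources per field. The sketch below illustrates one possible reading under stated assumptions: the source names, the numeric levels (a larger number is taken to mean more authoritative), the scenario labels `"historical"` and `"realtime"`, and the `source` key on each record are all hypothetical.

```python
# Hypothetical authority tables; a larger number is assumed to mean more authoritative.
SOURCE_LEVEL = {"gov_registry": 3, "hr_system": 2, "web_form": 1}
FIELD_SOURCE_LEVEL = {
    "mobile": {"web_form": 3, "hr_system": 2, "gov_registry": 1},
    "id_number": {"gov_registry": 3, "hr_system": 2, "web_form": 1},
}


def pick_record(conflicts: list) -> dict:
    """First strategy (historical data): keep the record from the highest-ranked source."""
    return max(conflicts, key=lambda r: SOURCE_LEVEL.get(r["source"], 0))


def merge_fields(conflicts: list) -> dict:
    """Second strategy (real-time ingestion): take each field from its best source."""
    merged = {}
    fields = {f for r in conflicts for f in r if f != "source"}
    for field in sorted(fields):
        # Fall back to the record-level table when a field has no table of its own.
        levels = FIELD_SOURCE_LEVEL.get(field, SOURCE_LEVEL)
        candidates = [r for r in conflicts if r.get(field) is not None]
        best = max(candidates, key=lambda r: levels.get(r["source"], 0))
        merged[field] = best[field]
    return merged


def resolve(conflicts: list, scenario: str) -> dict:
    """Dispatch on the business scenario; the scenario labels are assumed names."""
    if scenario == "historical":
        return pick_record(conflicts)
    if scenario == "realtime":
        return merge_fields(conflicts)
    raise ValueError(f"unknown scenario: {scenario}")
```

With these tables, a web form outranks a government registry for the mobile number but not for the ID number, so the merged record can combine values from both sources.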

Optionally, after the result data is stored in the database, modification rules are set for the stored data based on the pre-built field authoritative source level table, the modification rules including:

identifying the target field currently being modified, and querying the corresponding field authoritative source level table according to the modified field;

comparing, according to the query result, the authoritative source level of the current modifying user with the current authoritative source level of the target field;

if the authoritative source level of the modifying user is greater than or equal to the current authoritative source level of the target field, allowing the modification; otherwise rejecting the modification.

Optionally, if the modification is allowed, after the modification the current authoritative source level of the target field is updated to the authoritative source level of the modifying user.
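
The modification rule amounts to a per-field comparison followed by a level update. A minimal sketch, assuming each stored field carries its own integer authority level (the `FieldValue` type and `try_modify` function are illustrative names, not from the source):

```python
from dataclasses import dataclass


@dataclass
class FieldValue:
    value: str
    level: int  # authority level of whoever last set the value


def try_modify(record: dict, field: str, new_value: str, user_level: int) -> bool:
    """Allow the edit only if the user's level is >= the field's current level."""
    current = record[field]
    if user_level < current.level:
        return False  # rejected: the modifier is less authoritative
    # Accepted: store the new value and set the field's level to the user's level.
    record[field] = FieldValue(new_value, user_level)
    return True
```

Because the level is updated on every accepted edit, a field touched by a high-authority user can no longer be overwritten by a lower-authority one.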

Optionally, identifying the target data problems present in the data to be processed includes:

performing unified pre-cleaning on the data to be processed according to predetermined pre-cleaning rules, and, during the pre-cleaning, marking the target data that cannot be uniformly pre-cleaned together with the corresponding target data problems.

A second aspect of this application provides a data processing device, the device comprising:

a quality problem dictionary construction module, configured to construct a quality problem dictionary based on an experience database, the quality problem dictionary containing data quality problems and corresponding problem identifiers;

a target data problem identification module, configured to obtain data to be processed and identify target data problems present in the data to be processed;

a target problem identifier query module, configured to query the quality problem dictionary for the target data problems and obtain the corresponding target problem identifiers;

a target problem identifier adding module, configured to add the target problem identifiers to the target data through a predefined field;

a target field judgment and conflict detection module, configured to read multiple target fields in the target data, the multiple target fields including: name, ID number, mobile phone number, date of birth, organization identification code and company name, and to generate four judgment factors from these target fields, the four judgment factors being:

a first judgment factor: the combination of name and ID number; a second judgment factor: the combination of name and mobile phone number; a third judgment factor: the combination of name, date of birth and organization identification code; and a fourth judgment factor: the combination of name and company name;

wherein the judgment strategy is: comparing the data to be compared using the four judgment factors, and, if one of the four judgment factors matches between the data to be compared, determining that the data to be compared is conflicting data;

wherein the method for judging whether the fourth judgment factor matches is:

after determining that the names are identical, calculating the semantic similarity between the company names in the data to be compared;

if the semantic similarity meets a preset threshold, determining that the fourth judgment factor matches;

an authoritative source judgment strategy selection module, configured to, when conflicting data exists, identify the business application scenario of the current data and select an authoritative source judgment strategy based on the identification result;

a conflicting data processing module, configured to process the trusted-source authority level of each field in the conflicting data according to the authoritative source judgment strategy, and update the database;

a unique identification code generation module, configured to, when no conflicting data exists, generate a unique identification code for the current data and store it in the database.

A third aspect of this application provides a data processing system, the system comprising:

a processor, a memory, an input/output unit, and a bus;

the processor is connected to the memory, the input/output unit, and the bus;

the memory stores a program, and the processor calls the program to execute the method of the first aspect or of any optional implementation of the first aspect.

A fourth aspect of this application provides a computer-readable storage medium on which a program is stored; when executed on a computer, the program performs the method of the first aspect or of any optional implementation of the first aspect.

As can be seen from the above technical solutions, this application has the following advantages:

By constructing a quality problem dictionary with identifiers, and by identifying and marking the problems present in the data, the method helps to identify and handle data quality problems accurately and improves the accuracy and credibility of the data.

By reading multiple fields in the target data and judging them according to preset fusion rules, duplicate data and entities can be identified and merged into a single unified entity, which eliminates conflicts and duplication and improves the consistency and completeness of the data.

By identifying the business application scenario of the current data and selecting an appropriate authoritative source judgment strategy based on the identification result, the trusted-source authority level of each field in the conflicting data can be processed, ensuring the accuracy and credibility of the data.

Selecting the authoritative source judgment strategy according to the business application scenario makes data processing more flexible and adaptable to different business needs, and ensures that the data processing method remains effective across scenarios.

Adding the target problem identifiers to the target data through predefined fields allows problems in the data to be identified and handled quickly, improving the efficiency of data processing.

Description of the drawings

To explain the technical solutions of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic flow diagram of an embodiment of the data processing method provided in this application;

Figure 2 is a schematic diagram of the combinations of judgment factors in this application;

Figure 3 is a schematic flow diagram of the method of updating authoritative source levels based on modification rules in this application;

Figure 4 is a schematic structural diagram of the data processing device provided in this application;

Figure 5 is a schematic structural diagram of the data processing system provided in this application.

Detailed description

It should be noted that the method provided by this application can be applied to a terminal, a system, or a server. For example, the terminal may be a mobile terminal such as a smartphone, tablet computer, smart TV, smart watch, or portable computer, or a fixed terminal such as a desktop computer. For ease of explanation, this application uses a terminal as the executing entity in its examples.

The applicant declares: personal information collected through this scheme is collected lawfully through public channels or with the individual's permission, and the collected personal information will not be illegally disclosed. All personal information involved in this application has been obtained with the permission of the relevant parties.

Specific embodiments of the method provided in this application are described in detail below.

Referring to Figure 1, this application first provides an embodiment of a data processing method, which includes:

S101. Construct a quality problem dictionary based on an experience database, the quality problem dictionary containing data quality problems and corresponding problem identifiers;

In this step, a quality problem dictionary is constructed from an experience database. The dictionary contains the data quality problems that may occur together with their problem identifiers, which are later used to identify and mark quality problems in the target data. The experience database must be established first; it records the data quality problems encountered in past data processing and their corresponding identifiers. It can be built up by a professional data management team, by data quality experts, or from accumulated data processing experience. The experience database collects and organizes the quality problems common to different data sources and data types, such as missing data, erroneous data, duplicate data, and format errors. Every data quality problem should have a corresponding problem identifier.

For example, a quality problem dictionary is first constructed by collecting the data quality problems of information items such as company names, personal names, and mobile phone numbers and assigning each a problem identifier. The following is an example of a quality problem dictionary:
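
As a purely hypothetical illustration of such a dictionary, the sketch below assigns one identifier to each distinct pair of information item and problem drawn from experience records. The row contents, the `Q001`-style identifier format, and the function name are assumptions made for this example and are not taken from the application.

```python
from itertools import count


def build_quality_dict(experience_rows):
    """Assign one identifier per distinct (information item, problem) pair."""
    ids = count(1)
    quality_dict = {}
    for item, problem in experience_rows:
        if (item, problem) not in quality_dict:
            quality_dict[(item, problem)] = f"Q{next(ids):03d}"
    return quality_dict


# Hypothetical experience records: (information item, observed problem).
rows = [
    ("company_name", "missing"),
    ("person_name", "contains_digits"),
    ("mobile", "wrong_length"),
    ("mobile", "wrong_length"),  # repeated observations map to the same identifier
]
quality_dict = build_quality_dict(rows)
```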

S102. Obtain the data to be processed, and identify target data problems present in the data to be processed;

In this step, the data to be processed is obtained and then examined, using the algorithms and rules of the data processing method, to determine the target data problems it may contain. The purpose is to discover possible quality problems in the target data.

First, the data to be processed is obtained from the various data sources and systems. The data may be structured, such as data tables or database records, or unstructured, such as text, images, or audio. Once obtained, the data is examined with predefined algorithms and rules to determine the target data problems it may contain. Target data problems include, but are not limited to, the following:

Missing data: certain fields or attributes lack required values.

Erroneous data: values do not conform to the specified format, range, or logical conditions.

Duplicate data: multiple identical or nearly identical records exist.

Format errors: the format of the data does not match the expected format.

Logical errors: the logical relationships between data items are incorrect or inconsistent.

Once a target data problem has been identified, the system can add a problem identifier at the corresponding position or field in the data. A problem identifier can be a predefined code, symbol, or specific value that marks the problematic part of the data.

This step can also assess the quality of the data, for example by calculating the proportion of missing data or checking the logical consistency of the data. These assessments give an overall picture of data quality and of the problems that may exist.

One way to implement this step is to perform unified pre-cleaning on the data to be processed according to predetermined pre-cleaning rules and, during the pre-cleaning, to mark the target data that cannot be uniformly pre-cleaned together with the corresponding target data problems.

First, a set of predetermined pre-cleaning rules is drawn up. These rules can include formatting data, removing null values or outliers, and repairing common errors. The rules can be based on the experience database or on domain expertise, or can be derived through data analysis and data mining.

The predetermined pre-cleaning rules are then applied uniformly to the data to be processed. In this step the system processes the data automatically according to the rules: it converts the data to a unified format and standard, repairs common errors, and removes or fills missing values. During pre-cleaning, the system marks target data that cannot be uniformly pre-cleaned using the predefined problem identifiers. Such target data problems include data that does not comply with the predetermined rules as well as other data quality problems that cannot be handled automatically.
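
The pre-cleaning pass described above can be sketched as a function that repairs what a uniform rule can repair and reports the rest. The rules here are assumptions chosen for illustration: whitespace trimming for all strings, and an 11-digit mobile number beginning with 1 as the expected format. The field names and problem labels are likewise hypothetical.

```python
import re

MOBILE_RE = re.compile(r"^1\d{10}$")  # assumed format: 11 digits starting with 1


def preclean(record: dict) -> tuple:
    """Normalize what can be fixed automatically; report what cannot."""
    cleaned, problems = dict(record), []
    # Uniform rule: trim surrounding whitespace from every string value.
    for key, value in cleaned.items():
        if isinstance(value, str):
            cleaned[key] = value.strip()
    # Uniform rule: strip spaces and hyphens from mobile numbers.
    mobile = re.sub(r"[\s-]", "", cleaned.get("mobile") or "")
    if not mobile:
        problems.append(("mobile", "missing"))
    elif MOBILE_RE.match(mobile):
        cleaned["mobile"] = mobile
    else:
        problems.append(("mobile", "format_error"))  # cannot be repaired automatically
    if not cleaned.get("name"):
        problems.append(("name", "missing"))
    return cleaned, problems
```

The returned problem list is what a later step would look up in the quality problem dictionary.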

S103. Query the quality problem dictionary for the target data problems and obtain the corresponding target problem identifiers;

The preceding steps have identified the data quality problems that may exist in the target data. In this step, the system traverses the quality problem dictionary and checks each entry against the target data problems. When a data quality problem in the dictionary matches a target data problem, the system obtains the corresponding problem identifier. The identifier is then added to the corresponding field or position in the target data to mark the quality problem. If a target data problem has no match in the quality problem dictionary, the system can handle it separately, for example by generating a new problem identifier or by recording it as a specific unknown problem. To improve query efficiency and reduce processing time, the dictionary lookup can be implemented with suitable data structures and algorithms, such as hash tables, binary search trees, or tries.
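
A hash-table lookup with a fallback for unmatched problems might look like the sketch below. A Python `dict` is a hash table, so each lookup is constant time on average instead of a traversal of the whole dictionary. The identifier `Q999` for unknown problems and the key format are hypothetical.

```python
UNKNOWN_ID = "Q999"  # hypothetical identifier reserved for unmatched problems


def lookup_problem_ids(problems, quality_dict):
    """Map each detected problem to its identifier, falling back to UNKNOWN_ID."""
    return [quality_dict.get(problem, UNKNOWN_ID) for problem in problems]
```

For example, with a dictionary containing only `("mobile", "missing")`, a detected `("name", "garbled")` problem would be recorded under the unknown identifier rather than dropped.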

S104. Add the target problem identifiers to the target data through a predefined field;

In step S104, the target problem identifiers obtained in step S103 are added to the target data through predefined fields, marking the existing problems and providing a basis for subsequent processing. A specific implementation is as follows:

One or more fields can be predefined to store the target problem identifiers. These fields can be newly created or already existing. In step S103, the problems present in the target data were identified by querying the quality problem dictionary, and the corresponding target problem identifiers were obtained. These identifiers are then added, one by one, to the corresponding positions or fields of the target data through the predefined fields.

The target problem identifiers can be added in different ways, for example by recording the identifier's code or symbol in a new field, or by marking the identifier as part of the target field itself. The marking method should be chosen according to actual needs and the data processing scenario.

After the target problem identifiers have been added to the target data, the data needs to be organized and stored. This may involve data cleaning, format conversion, and data archiving to ensure the consistency and integrity of the data.

With this implementation, step S104 adds the target problem identifiers to the target data and thereby marks the problems present in the data. In subsequent processing stages the data can be classified, filtered, and fused according to the problem identifiers, which improves the accuracy and efficiency of data processing. The added identifiers also provide a convenient basis for data quality assessment and troubleshooting.

S105、读取所述目标数据中的多个目标字段,根据预设的融合规则对所述多个目标字段进行判断,并根据判断结果确定是否有冲突数据;S105. Read multiple target fields in the target data, judge the multiple target fields according to the preset fusion rules, and determine whether there is conflicting data based on the judgment results;

In step S105, the goal is to read multiple target fields from the target data and judge them according to the preset fusion rules, in order to determine whether conflicting data exists, that is, inconsistent data describing the same entity. First, the target fields that require fusion judgment are read from the target data. These fields may contain data from different sources, from different times, or in different formats. The fields are then compared and judged one by one according to the preset fusion rules. If the judgment finds that the entity information represented by the fields is inconsistent, that is, conflicting data exists, the system marks or records the conflicting data, for example with a specific identifier, a status value, or some other means.

In this application, the identified target fields can be name, ID number, mobile phone number, date of birth, organization identification code, and company name. When the conflicting-data judgment strategy is executed, these fields can be combined into judgment conditions for duplicate checking. This application does not limit the specific field combination rules. A preferred embodiment, which includes a preferred judgment strategy, is given below:

参阅图2,根据前述多个目标字段生成四种判断因子,所述四种判断因子分别为:Referring to Figure 2, four judgment factors are generated based on the aforementioned multiple target fields. The four judgment factors are:

The first judgment factor: the combination of name and ID number; the second judgment factor: the combination of name and mobile phone number; the third judgment factor: the combination of name, date of birth, and organization identification code; and the fourth judgment factor: the combination of name and company name.

判断策略为:通过上述四种判断因子对待比较数据进行比较,若在待比较数据中,上述四种判断因子中有一个判断因子相匹配,则确定待比较数据为冲突数据。The judgment strategy is: compare the data to be compared through the above four judgment factors. If in the data to be compared, one of the above four judgment factors matches, the data to be compared is determined to be conflict data.

该实施例中,根据前述多个目标字段生成四种判断因子,每种判断因子由不同的目标字段组合而成。然后,通过这四种判断因子对待比较数据进行比较,如果在待比较数据中,有一个判断因子与目标数据的任意一个判断因子相匹配,则确定待比较数据为冲突数据。In this embodiment, four judgment factors are generated based on the foregoing multiple target fields, and each judgment factor is composed of different target fields. Then, the data to be compared is compared through these four judgment factors. If one judgment factor in the data to be compared matches any judgment factor of the target data, the data to be compared is determined to be conflicting data.

Generating judgment factors: according to the preferred judgment strategy, the name, ID number, mobile phone number, date of birth, organization identification code, and company name fields are extracted from the target data, and four judgment factors are generated by combining them as defined for the first, second, third, and fourth judgment factors.

第一判断因子:由姓名和身份证号组成。The first judgment factor: consists of name and ID number.

第二判断因子:由姓名和手机号组成。The second judgment factor: consists of name and mobile phone number.

第三判断因子:由姓名、出生日期和所属组织标识码组成。The third judgment factor: consists of name, date of birth and identification code of the organization to which it belongs.

第四判断因子:由姓名和企业名称组成。The fourth judgment factor: consists of name and company name.

在进行数据比较时,对于待比较数据,同样按照第一判断因子、第二判断因子、第三判断因子和第四判断因子的组合方式,生成四个判断因子。When comparing data, for the data to be compared, four judgment factors are also generated according to the combination of the first judgment factor, the second judgment factor, the third judgment factor and the fourth judgment factor.

冲突数据判断:将待比较数据的四个判断因子与目标数据的四个判断因子进行比较。如果在待比较数据中,有一个判断因子与目标数据的任意一个判断因子相匹配,则确定待比较数据为冲突数据。Conflict data judgment: Compare the four judgment factors of the data to be compared with the four judgment factors of the target data. If there is a judgment factor in the data to be compared that matches any judgment factor of the target data, the data to be compared is determined to be conflicting data.
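The judgment strategy above can be sketched as follows. The field names are hypothetical, factors are compared like with like (first against first, and so on), and a factor with a missing field is skipped. The last two points are assumptions, since the application does not say how empty fields are treated.

```python
# Field combinations for the four judgment factors (field names are assumed).
FACTOR_FIELDS = [
    ("name", "id_number"),               # first judgment factor
    ("name", "phone"),                   # second judgment factor
    ("name", "birth_date", "org_code"),  # third judgment factor
    ("name", "company"),                 # fourth judgment factor
]


def judgment_factors(record):
    """Build the four factors; a factor is None if any of its fields is empty."""
    factors = []
    for fields in FACTOR_FIELDS:
        values = tuple(record.get(f) for f in fields)
        factors.append(values if all(values) else None)
    return factors


def is_conflict(candidate, target):
    """True if at least one factor is present in both records and equal."""
    pairs = zip(judgment_factors(candidate), judgment_factors(target))
    return any(a is not None and a == b for a, b in pairs)


target = {"name": "Li Lei", "id_number": "110101199001011234", "phone": "13800000000"}
same_phone = {"name": "Li Lei", "phone": "13800000000", "company": "Acme"}
other_phone = {"name": "Li Lei", "phone": "13900000000"}
print(is_conflict(same_phone, target), is_conflict(other_phone, target))
```

This sketch uses exact equality for all four factors, including the company name.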

通过这个优选的判断策略,可以有效地识别出冲突数据,即存在相同实体但信息不一致的数据。这样的数据处理方式有助于提高数据质量和可信度,在后续的数据管理和应用过程中具有重要意义。Through this optimal judgment strategy, conflicting data can be effectively identified, that is, data with the same entity but inconsistent information. Such data processing method helps to improve data quality and credibility, and is of great significance in subsequent data management and application processes.

Furthermore, this embodiment provides a more specific implementation for the fourth judgment factor. Company names are compared by similarity because, in practice, a company name may appear in variant forms, with typos, or as an abbreviation, so that differently written names actually refer to the same company. This diversity can cause misjudgments or omissions during data fusion or matching and affect the accuracy and consistency of the data. Therefore, for the fourth judgment factor, once the names are determined to be identical, the semantic similarity between the company names is calculated. If the calculated semantic similarity meets the preset threshold, the fourth judgment factor is determined to match, and the data to be compared is confirmed to be conflicting data.

根据上述优选的判断策略,进行姓名的匹配。在确定姓名一致后,对待比较数据中的企业名称与目标数据的企业名称进行语义相似度计算。语义相似度计算可以采用自然语言处理技术,如文本匹配算法、词向量模型等,来衡量两个企业名称之间的语义相似程度。可以设定一个预设的语义相似度阈值,用于判断待比较数据的企业名称与目标数据的企业名称是否相似到一定程度。将计算得到的语义相似度与预设阈值进行比较。如果语义相似度满足预设阈值,即企业名称之间的语义相似度高于设定的阈值,那么确定第四判断因子匹配,将待比较数据视为冲突数据。否则,第四判断因子不匹配,结束处理。Name matching is performed according to the above-mentioned preferred judgment strategy. After it is determined that the names are consistent, semantic similarity calculation is performed between the company name in the comparison data and the company name in the target data. Semantic similarity calculation can use natural language processing technology, such as text matching algorithms, word vector models, etc., to measure the semantic similarity between two business names. A preset semantic similarity threshold can be set to determine whether the company name of the data to be compared is similar to the company name of the target data to a certain extent. Compare the calculated semantic similarity with a preset threshold. If the semantic similarity meets the preset threshold, that is, the semantic similarity between the company names is higher than the set threshold, then the fourth judgment factor is determined to match, and the data to be compared is regarded as conflicting data. Otherwise, the fourth judgment factor does not match, and the processing ends.
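A minimal sketch of this thresholded comparison is shown below. It substitutes character-level similarity from Python's standard `difflib` for a true semantic model such as word vectors, so it only illustrates the control flow. The threshold of 0.8 and the field names are assumed values.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # assumed preset threshold


def company_similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for semantic similarity."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def fourth_factor_matches(candidate, target, threshold=SIMILARITY_THRESHOLD):
    """Names must be identical first; company names are then compared by similarity."""
    if not candidate.get("name") or candidate.get("name") != target.get("name"):
        return False
    a, b = candidate.get("company", ""), target.get("company", "")
    if not a or not b:
        return False
    return company_similarity(a, b) >= threshold


target = {"name": "Li Lei", "company": "Acme Data Co."}
variant = {"name": "Li Lei", "company": "Acme Data Co"}
unrelated = {"name": "Li Lei", "company": "zzz"}
print(fourth_factor_matches(variant, target), fourth_factor_matches(unrelated, target))
```

A character-level measure handles typos and small variants but not abbreviations or synonyms, which is where the word-vector approaches mentioned in the text would be needed.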

S106、当有冲突数据时,识别当前的数据的业务应用场景,并根据识别结果选择权威源判断策略;S106. When there is conflicting data, identify the business application scenario of the current data, and select an authoritative source judgment strategy based on the identification results;

在S106步骤中,当发现冲突数据时,需要识别当前数据的业务应用场景,并根据这一场景选择适当的权威源判断策略来处理冲突数据,从而确定字段的可信源权威等级。这样的处理可以确保数据在不同业务应用场景下的准确性和可信度。In step S106, when conflicting data is discovered, the business application scenario of the current data needs to be identified, and an appropriate authority source judgment strategy is selected based on this scenario to process the conflicting data, thereby determining the trusted source authority level of the field. Such processing can ensure the accuracy and credibility of data in different business application scenarios.

该步骤中,需要根据数据管理的业务需求和应用场景,确定当前数据的业务应用场景。这个场景可能涉及到不同的数据处理目标、数据使用方式以及对数据准确性的要求。根据识别到的业务应用场景,预先定义一套权威源判断策略。这些策略可能与不同业务应用场景的需求相匹配,以确保数据在该场景下的权威性和可信度。根据选择的权威源判断策略,对冲突数据中的各个字段进行权威性判断。可能采用不同的权威性评估算法、数据来源考量等方法来判断字段的可信源权威等级。根据对冲突数据字段的权威性判断结果,更新数据库中相应字段的权威等级信息。这样,在后续数据处理和应用过程中,可以根据字段的权威等级来做出相应的决策和使用。In this step, it is necessary to determine the business application scenarios of the current data based on the business needs and application scenarios of data management. This scenario may involve different data processing objectives, data usage methods, and requirements for data accuracy. Based on the identified business application scenarios, a set of authoritative source judgment strategies are predefined. These strategies may match the needs of different business application scenarios to ensure the authority and credibility of the data in that scenario. Based on the selected authoritative source judgment strategy, authoritative judgment is made on each field in the conflict data. Different authoritative evaluation algorithms, data source considerations, and other methods may be used to determine the authoritative level of a field's trusted source. Based on the authoritative judgment results of the conflicting data fields, the authority level information of the corresponding fields in the database is updated. In this way, during subsequent data processing and application, corresponding decisions and uses can be made based on the authority level of the field.

For step S106, this application provides two data processing scenarios to which it may be applied. The first is a scenario in which a historical database already exists. Here the data processing goal is to insert new data into the historical database, which requires adjusting the authoritative source level of the data. The authoritative source judgment strategy for this scenario is:

若当前的业务应用场景为用于对已有历史数据库的数据进行处理,则选择第一权威源判断策略,所述第一权威源判断策略包括:If the current business application scenario is to process data from an existing historical database, select the first authoritative source judgment strategy. The first authoritative source judgment strategy includes:

查询预先构建的数据权威源等级表,并根据查询结果从所述冲突数据中将最高权威等级的数据作为业务主键,并更新所述历史数据库。Query the pre-built data authority source level table, use the data with the highest authority level from the conflict data as the business primary key according to the query results, and update the historical database.

This application provides a first authoritative source judgment strategy, which is described below:

预先构建一个数据权威源等级表,记录不同数据源的权威等级信息。该表包含不同数据源的标识,以及对应的权威等级,用于指示数据来源的可信程度。对于发现的冲突数据,通过查询数据权威源等级表,获得涉及冲突数据的各个数据源的权威等级信息。Build a data authority source level table in advance to record the authority level information of different data sources. This table contains the identification of different data sources and the corresponding authority levels, which are used to indicate the trustworthiness of the data source. For the discovered conflicting data, the authority level information of each data source involved in the conflicting data is obtained by querying the data authority source level table.

根据数据权威源等级表的查询结果,选择最高权威等级的数据源作为业务主键的来源。即在冲突数据中,以最高权威等级的数据源提供的信息为准,并将该数据作为业务主键。随后将冲突数据中使用业务主键所确定的信息更新至历史数据库。这样,历史数据库中的相应数据将被更新为来自最高权威等级数据源的信息,保证数据的权威性和准确性。Based on the query results of the data authority source level table, select the data source with the highest authority level as the source of the business primary key. That is, in conflicting data, the information provided by the data source with the highest authoritative level shall prevail, and this data shall be used as the business primary key. The information in the conflicting data, identified using the business primary key, is then updated to the historical database. In this way, the corresponding data in the historical database will be updated with information from the highest authoritative data source, ensuring the authority and accuracy of the data.
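One possible reading of the first authoritative source judgment strategy is sketched below. The source names, their levels, and the use of `id_number` as the key into the historical database are all assumptions made for illustration.

```python
# Hypothetical data authority source level table: source -> level (higher wins).
SOURCE_AUTHORITY = {"civil_registry": 3, "hr_system": 2, "web_form": 1}


def pick_authoritative(conflicting_records):
    """Select the record whose data source has the highest authority level."""
    return max(conflicting_records, key=lambda r: SOURCE_AUTHORITY.get(r["source"], 0))


def update_history(history, conflicting_records, key_field="id_number"):
    """Overwrite the historical entry with the most authoritative record."""
    best = pick_authoritative(conflicting_records)
    history[best[key_field]] = best
    return best


records = [
    {"source": "web_form", "id_number": "A1", "phone": "111"},
    {"source": "civil_registry", "id_number": "A1", "phone": "222"},
]
history = {}
update_history(history, records)
print(history["A1"]["phone"])
```

Sources missing from the table default to level 0 here, so unknown sources never override known ones; that default is a design choice of this sketch.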

通过选择第一权威源判断策略,并根据数据权威源等级表来确定业务主键,可以确保历史数据库中的数据处理具有权威性和可信度。该策略充分利用了数据的权威等级信息,优先选择最可信的数据来源进行数据更新,避免了数据的混乱和错误,同时提高了历史数据库数据的质量和可用性。By selecting the first authoritative source judgment strategy and determining the business primary key based on the data authority source level table, you can ensure that the data processing in the historical database is authoritative and credible. This strategy makes full use of the authoritative level information of the data, giving priority to the most credible data sources for data updates, avoiding data confusion and errors, and at the same time improving the quality and availability of historical database data.

The other data processing scenario targets real-time data. In this scenario, the authoritative source judgment strategy must be able to keep up with real-time processing. This application therefore also provides a strategy for this case, defined here as the second authoritative source judgment strategy and described below:

If the current business application scenario is storing real-time data in the database, the second authoritative source judgment strategy is selected. The second authoritative source judgment strategy includes:

Querying the pre-built field authoritative source level table and comparing the authoritative source levels corresponding to each field in the conflicting data; according to the comparison results, selecting for each field the value with the highest authoritative source level, generating result data, and storing the result data in the database. The field authoritative source level table records the authoritative source level of each field.

首先预先构建一个字段权威源等级表,记录各个字段的权威源等级信息。该表包含各个字段的标识,以及对应的权威等级,用于指示每个字段的可信程度。对于发现的冲突数据,查询字段权威源等级表,获得涉及冲突数据中各个字段的权威等级信息。根据查询结果,对冲突数据中的各个字段进行比较,选择最高权威源等级的字段作为结果数据。即在冲突数据中,以拥有最高权威等级的字段的信息为准,生成结果数据。将经过选择的结果数据进行入库处理,将其存储至数据库中。这样,数据库中存储的数据将以具有最高权威等级的字段信息为准,确保数据入库的权威性和可信度。First, a field authority source level table is built in advance to record the authority source level information of each field. The table contains the identification of each field and the corresponding authority level, which is used to indicate the trustworthiness of each field. For the discovered conflicting data, query the field authority source level table to obtain the authority level information of each field in the conflicting data. Based on the query results, compare each field in the conflicting data and select the field with the highest authoritative source level as the result data. That is, in the conflict data, the information of the field with the highest authority level shall prevail to generate the result data. The selected result data is processed and stored in the database. In this way, the data stored in the database will be based on the field information with the highest authority level, ensuring the authority and credibility of the data stored in the database.
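The field-by-field selection can be sketched as below. The structure of the field authoritative source level table (field, then source, then level) and all names in it are assumptions, since the application does not specify the table layout.

```python
# Hypothetical field authoritative source level table: field -> source -> level.
FIELD_AUTHORITY = {
    "phone": {"hr_system": 3, "web_form": 2, "civil_registry": 1},
    "birth_date": {"civil_registry": 3, "hr_system": 2, "web_form": 1},
}


def merge_by_field(conflicting_records, fields):
    """For each field, keep the value supplied by the highest-level source."""
    result = {}
    for field in fields:
        levels = FIELD_AUTHORITY.get(field, {})
        candidates = [r for r in conflicting_records if r.get(field)]
        if not candidates:
            continue  # no record supplies this field
        best = max(candidates, key=lambda r: levels.get(r["source"], 0))
        result[field] = best[field]
    return result


records = [
    {"source": "civil_registry", "phone": "111", "birth_date": "1990-01-01"},
    {"source": "hr_system", "phone": "222", "birth_date": "1990-02-02"},
]
merged = merge_by_field(records, ["phone", "birth_date"])
print(merged)
```

Unlike the first strategy, no single record wins outright: the result can combine the phone number from one source with the birth date from another.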

第二权威源判断策略适用于实时数据处理场景,能够及时地对冲突数据进行处理和入库。由于实时数据可能在短时间内频繁变化,采用第二权威源判断策略可以在数据到达时快速处理,及时更新数据库,保持数据的实时性和灵活性。The second authoritative source judgment strategy is suitable for real-time data processing scenarios and can process and store conflicting data in a timely manner. Since real-time data may change frequently in a short period of time, using a second authoritative source judgment strategy can quickly process the data when it arrives, update the database in a timely manner, and maintain the real-time nature and flexibility of the data.

该策略根据实际冲突数据中各个字段的权威源等级进行动态选择,适用于不同数据场景和业务需求。这种自适应性能够应对数据源的变化和业务场景的多样性,提高数据处理的灵活性和适应性。This strategy is dynamically selected based on the authoritative source level of each field in the actual conflict data, and is suitable for different data scenarios and business needs. This adaptability can cope with changes in data sources and diversity of business scenarios, improving the flexibility and adaptability of data processing.

通过选择最高权威源等级的字段作为结果数据进行入库,可以优化数据库的存储结构。只保留最可信的数据信息,减少了冗余数据的存储,优化了数据库的空间利用效率。By selecting fields with the highest authoritative source level as result data to be stored in the database, the storage structure of the database can be optimized. Only the most credible data information is retained, reducing the storage of redundant data and optimizing the space utilization efficiency of the database.

S107、根据所述权威源判断策略处理所述冲突数据中的各个字段的可信源权威等级,并更新数据库。S107. Process the trusted source authority level of each field in the conflict data according to the authoritative source judgment strategy, and update the database.

在S107步骤中,根据选择的权威源判断策略,对冲突数据中的各个字段的可信源权威等级进行处理,并将处理后的数据更新至数据库。这样的处理流程有助于确保数据库中的数据具有权威性和可靠性。In step S107, the authoritative source authority level of each field in the conflict data is processed according to the selected authoritative source judgment strategy, and the processed data is updated to the database. Such a process helps ensure that the data in the database is authoritative and reliable.

该实施例中,根据选择的权威源判断策略,对冲突数据中的各个字段的可信源权威等级进行处理。将经过处理后的冲突数据中的各个字段的可信源权威等级信息更新至数据库。这样,数据库中存储的数据将包含最新的可信源权威等级信息,确保数据库中的数据权威性和可靠性。In this embodiment, the authoritative level of the trusted source of each field in the conflicting data is processed according to the selected authoritative source judgment strategy. Update the trusted source authority level information of each field in the processed conflict data to the database. In this way, the data stored in the database will contain the latest authoritative level information of trusted sources, ensuring the authority and reliability of the data in the database.

在更新数据库时,需要对冲突数据进行合并,将相同实体的不一致数据合并为一条记录。合并过程中,优先选择具有最高权威等级的字段信息,确保最终数据的权威性和准确性。When updating the database, conflicting data needs to be merged to merge inconsistent data of the same entity into one record. During the merging process, field information with the highest authority level is given priority to ensure the authority and accuracy of the final data.

S108、若不存在冲突数据,则为当前数据生成唯一标识码,并进行入库。S108. If there is no conflicting data, generate a unique identification code for the current data and store it in the database.

In step S108, if no conflicting data was found in step S105, the current data needs no further fusion or processing and is already unique. In this step, a unique identification code can be generated for the current data, and the data is then saved to the database.

针对当前数据,可以使用一定的算法或规则生成一个唯一的标识码。该标识码可以是一个数字、字符串或其他形式的唯一标识符,用于区分不同的数据实体。For the current data, a certain algorithm or rule can be used to generate a unique identification code. The identification code can be a number, string or other form of unique identifier used to distinguish different data entities.
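As one example of such a rule, a random UUID can serve as the unique identification code. The application does not prescribe any particular algorithm, and the in-memory `database` dictionary below is only a placeholder for a real store.

```python
import uuid


def store_new_record(database, record):
    """Generate a unique identification code and save the record under it."""
    uid = uuid.uuid4().hex  # 32-character hexadecimal string
    database[uid] = {**record, "uid": uid}
    return uid


database = {}
uid = store_new_record(database, {"name": "Han Meimei"})
print(len(uid), database[uid]["name"])
```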

上述提供了一种本申请中数据处理方法的实施例,旨在解决多数据源数据管理过程中可能出现的数据质量问题和信息错乱。该方法包括清洗、融合和归集三个阶段,每个阶段都配置了相应的算法和步骤,以确保数据的准确性和一致性。The above provides an embodiment of the data processing method in this application, aiming to solve data quality problems and information confusion that may occur in the process of data management from multiple data sources. The method includes three stages: cleaning, fusion, and aggregation. Each stage is configured with corresponding algorithms and steps to ensure the accuracy and consistency of the data.

该实施例的优势在于:The advantages of this embodiment are:

有效处理数据质量问题:通过预先构建质量问题字典和融合规则,能够高效识别和处理数据质量问题,提高数据的准确性和可信度。Effectively handle data quality issues: By pre-building a dictionary of quality issues and fusion rules, data quality issues can be efficiently identified and dealt with, improving the accuracy and credibility of the data.

自适应处理不同数据场景:根据业务应用场景选择不同的权威源判断策略,使得数据处理灵活适应不同数据源和业务需求。Adaptive processing of different data scenarios: Select different authoritative source judgment strategies according to business application scenarios, so that data processing can flexibly adapt to different data sources and business needs.

数据归集和融合提高数据完整性:通过归集规则和融合规则,保证数据的完整性和一致性,提高数据质量。Data aggregation and fusion improve data integrity: through aggregation rules and fusion rules, the integrity and consistency of data are ensured and data quality is improved.

数据库优化和唯一标识码生成:优化数据库存储结构,提高数据库的空间利用效率,并为每条数据生成唯一标识码,保证数据的唯一性。Database optimization and unique identification code generation: Optimize the database storage structure, improve the space utilization efficiency of the database, and generate a unique identification code for each piece of data to ensure the uniqueness of the data.

For the second authoritative source judgment strategy provided in the above embodiment, this application also provides a method for dynamically updating authoritative source levels. The method automatically updates the authoritative source level of each field to a higher level, so that modification permissions on the data do not become confused. In this method, after the result data has been stored in the database, modification rules are set for the stored data based on the pre-built field authoritative source level table. The modification rules include:

S001. Identify the target field currently being modified, and query the corresponding field authoritative source level table based on the modified field;

S002、根据查询结果比对当前的修改用户的权威源等级与目标字段当前的权威源等级的大小;S002. Compare the current authoritative source level of the modified user with the current authoritative source level of the target field according to the query results;

S003、若所述修改用户的权威源等级大于或者等于目标字段当前的权威源等级,则允许修改;S003. If the authoritative source level of the modifying user is greater than or equal to the current authoritative source level of the target field, the modification is allowed;

S004、若修改用户的权威源等级小于目标字段当前的权威源等级,则拒绝修改。S004. If the authoritative source level of the modifying user is less than the current authoritative source level of the target field, the modification will be rejected.

S005、在修改后,将目标字段的当前权威源等级更新为所述修改用户的权威源等级。S005. After modification, update the current authoritative source level of the target field to the authoritative source level of the modifying user.

In the above embodiment, the dynamic update method ensures that modification permissions on the data do not become confused and prevents users with a lower authoritative source level from mistakenly modifying fields with a higher level, thereby maintaining the authority and integrity of the data. The method is described in detail below:

基于字段权威源等级表设定修改规则:首先在数据入库后,根据预先构建的字段权威源等级表,为入库的数据设定修改规则。字段权威源等级表记录了各个字段的权威源等级信息,用于指示每个字段的可信程度。Set modification rules based on the field authority source level table: First, after the data is entered into the database, set modification rules for the entered data based on the pre-built field authority source level table. The field authority source level table records the authority source level information of each field and is used to indicate the credibility of each field.

当用户对数据库中的数据进行修改时,系统会识别当前被修改的目标字段。根据被修改的目标字段,在字段权威源等级表中查询对应的字段权威源等级信息。获取当前修改用户的权威源等级,并与目标字段当前的权威源等级进行比较。若当前修改用户的权威源等级大于或等于目标字段当前的权威源等级,则允许修改。否则,拒绝修改。When a user modifies data in the database, the system will identify the currently modified target field. According to the modified target field, query the corresponding field authority source level information in the field authority source level table. Get the authoritative source level of the current modifying user and compare it with the current authoritative source level of the target field. If the authoritative source level of the current modifying user is greater than or equal to the current authoritative source level of the target field, the modification is allowed. Otherwise, the modification is rejected.

在前述动态更新方法中,若经过比对权限等级后确认允许修改,则执行对目标字段的修改操作。In the aforementioned dynamic update method, if it is confirmed that modification is allowed after comparing the permission level, then the modification operation on the target field is performed.

After the modification, the current authoritative source level of the target field is updated to the authoritative source level of the modifying user. The field's level therefore follows the level of the user who last modified it, which keeps data management consistent and ensures that the data cannot subsequently be modified by a user with a lower authoritative source level.
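Steps S001 to S005 can be sketched as a single guarded update. The record layout and the integer levels are assumptions, and `field_levels` stands in for the field authoritative source level table.

```python
def try_modify(record, field_levels, field, new_value, user_level):
    """Apply a modification only if the user's level is high enough."""
    current_level = field_levels.get(field, 0)  # S001: look up the field's level
    if user_level < current_level:              # S002 and S004: compare, reject
        return False
    record[field] = new_value                   # S003: modification allowed
    field_levels[field] = user_level            # S005: field takes the user's level
    return True


record = {"phone": "111"}
field_levels = {"phone": 2}
print(try_modify(record, field_levels, "phone", "000", user_level=1))
print(try_modify(record, field_levels, "phone", "222", user_level=3))
print(record, field_levels)
```

Because a modification succeeds only when the user's level is at least the field's current level, the stored level can stay the same or rise but never fall.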

通过将目标字段的当前权威源等级更新为修改用户的权威源等级,实现了数据权限的动态调整。只有具有足够高权限的用户才能对数据进行修改,并且修改后字段的权威源等级会自动更新,确保数据的权威性和完整性得到维护。这样的动态更新方法有助于确保数据管理的安全性和可靠性,并提供了更灵活的数据权限控制方式。By updating the current authoritative source level of the target field to the authoritative source level of the modifying user, dynamic adjustment of data permissions is achieved. Only users with sufficiently high permissions can modify the data, and the authority source level of the modified field will be automatically updated to ensure that the authority and integrity of the data are maintained. Such a dynamic update method helps ensure the security and reliability of data management and provides a more flexible way to control data permissions.

上述实施例对本申请中的数据处理方法进行了描述,下面对本申请中所涉及的装置、系统及存储介质进行描述:The above embodiments describe the data processing method in this application. The following describes the devices, systems and storage media involved in this application:

参阅图4,本申请提供了一种数据处理装置,包括:Referring to Figure 4, this application provides a data processing device, including:

构建质量问题字典模块401,用于根据经验数据库,构建质量问题字典,所述质量问题字典中包含有数据质量问题以及对应的问题标识;The quality problem dictionary building module 401 is used to build a quality problem dictionary based on the experience database. The quality problem dictionary contains data quality problems and corresponding problem identifiers;

目标数据问题识别模块402,用于获取待处理数据,识别所述待处理数据中存在的目标数据问题;The target data problem identification module 402 is used to obtain the data to be processed and identify the target data problems existing in the data to be processed;

目标问题标识查询模块403,用于在所述质量问题字典中查询所述目标数据问题,获得对应的目标问题标识;The target problem identification query module 403 is used to query the target data problem in the quality problem dictionary and obtain the corresponding target problem identification;

目标问题标识添加模块404,用于通过预定义的字段将所述目标问题标识添加至所述目标数据中;The target problem identification adding module 404 is used to add the target problem identification to the target data through a predefined field;

目标字段判断与冲突检测模块405,用于读取所述目标数据中的多个目标字段,根据预设的融合规则对所述多个目标字段进行判断,并根据判断结果确定是否有冲突数据;The target field judgment and conflict detection module 405 is used to read multiple target fields in the target data, judge the multiple target fields according to the preset fusion rules, and determine whether there is conflicting data based on the judgment results;

权威源判断策略选择模块406,用于当有冲突数据时,识别当前的数据的业务应用场景,并根据识别结果选择权威源判断策略;The authoritative source judgment strategy selection module 406 is used to identify the business application scenario of the current data when there is conflicting data, and select an authoritative source judgment strategy based on the identification results;

冲突数据处理模块407,用于根据所述权威源判断策略处理所述冲突数据中的各个字段的可信源权威等级,并更新数据库。The conflict data processing module 407 is configured to process the trusted source authority level of each field in the conflict data according to the authoritative source judgment policy, and update the database.

可选的,目标字段判断与冲突检测模块405具体用于:Optionally, the target field judgment and conflict detection module 405 is specifically used to:

根据前述多个目标字段生成四种判断因子,所述四种判断因子分别为:Four judgment factors are generated based on the aforementioned multiple target fields. The four judgment factors are:

The first judgment factor: the combination of name and ID number; the second judgment factor: the combination of name and mobile phone number; the third judgment factor: the combination of name, date of birth, and organization identification code; and the fourth judgment factor: the combination of name and company name.

判断策略为:通过上述四种判断因子对待比较数据进行比较,若在待比较数据中,上述四种判断因子中有一个判断因子相匹配,则确定待比较数据为冲突数据。The judgment strategy is: compare the data to be compared through the above four judgment factors. If in the data to be compared, one of the above four judgment factors matches, the data to be compared is determined to be conflict data.

可选的,目标字段判断与冲突检测模块405具体用于:Optionally, the target field judgment and conflict detection module 405 is specifically used to:

当确定姓名一致后,计算待比较数据中企业名称之间的语义相似度;When it is determined that the names are consistent, the semantic similarity between the company names in the data to be compared is calculated;

若所述语义相似度满足预设阈值,则确定所述第四判断因子匹配。If the semantic similarity meets the preset threshold, it is determined that the fourth judgment factor matches.

Optionally, the authoritative source judgment strategy selection module 406 is specifically configured to: if the current business application scenario is processing data of an existing historical database, select the first authoritative source judgment strategy. The first authoritative source judgment strategy includes:

查询预先构建的数据权威源等级表,并根据查询结果从所述冲突数据中将最高权威等级的数据作为业务主键,并更新所述历史数据库。Query the pre-built data authority source level table, use the data with the highest authority level from the conflict data as the business primary key according to the query results, and update the historical database.

Optionally, the authoritative source judgment strategy selection module 406 is specifically configured to: if the current business application scenario is the warehousing of real-time data, select a second authoritative judgment strategy, the second authoritative judgment strategy including:

querying a pre-built field authoritative source level table, comparing the authoritative source levels corresponding to each field in the conflict data, selecting for each field the value with the highest authoritative source level according to the comparison result, generating result data, and storing the result data in the database, wherein the field authoritative source level table records the authoritative source level of each field.
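The field-by-field selection in the second strategy can be illustrated with a short sketch. The layout of the field authoritative source level table (field name mapped to per-source levels), the source names, and the record shape (`source` plus a `fields` dict) are assumptions made for this example.

```python
# Hypothetical field authoritative source level table: field -> {source: level}.
FIELD_SOURCE_LEVELS = {
    "mobile": {"hr_system": 1, "self_reported": 3},
    "id_number": {"public_security": 3, "self_reported": 1},
}


def merge_by_field(conflicts: list) -> dict:
    """Build result data by keeping, for each field, the value from the
    source with the highest level for that field."""
    merged = {}
    best_level = {}
    for rec in conflicts:
        source = rec["source"]
        for field, value in rec["fields"].items():
            # Unknown field/source pairs default to level 0 (an assumption).
            level = FIELD_SOURCE_LEVELS.get(field, {}).get(source, 0)
            if field not in best_level or level > best_level[field]:
                best_level[field] = level
                merged[field] = value
    return merged
```

Unlike the first strategy, no single record wins outright: the result data may combine a mobile number from one source with an ID card number from another.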

Optionally, the apparatus further includes a modification rule setting module 408, configured to:

after the result data is stored in the database, set modification rules for the stored data based on the pre-built field authoritative source level table, the modification rules including:

identifying the target field currently being modified, and querying the corresponding field authoritative source registration table according to the modified field;

comparing, according to the query result, the authoritative source level of the current modifying user with the current authoritative source level of the target field;

if the authoritative source level of the modifying user is greater than or equal to the current authoritative source level of the target field, allowing the modification; otherwise, rejecting the modification.

The module is also configured to:

if the modification is allowed, update the current authoritative source level of the target field to the authoritative source level of the modifying user after the modification.
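The modification rule amounts to a level comparison followed by a level update. A minimal sketch, assuming each stored field carries its current authoritative source level alongside its value (the `FieldValue` type and `try_modify` function are hypothetical names):

```python
from dataclasses import dataclass


@dataclass
class FieldValue:
    value: str
    level: int  # current authoritative source level of this field


def try_modify(record: dict, field: str, new_value: str, user_level: int) -> bool:
    """Apply the modification only if the user's level is >= the field's level.

    On success the field's level is updated to the modifying user's level.
    """
    current = record[field]
    if user_level >= current.level:
        record[field] = FieldValue(new_value, user_level)
        return True
    return False  # modification rejected, record left unchanged
```

Because the field's level is raised to that of the modifying user, a later edit from a lower-level source is rejected, which keeps more authoritative values from being overwritten.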

Optionally, the apparatus further includes a unique identification code generation module 409, configured to:

if no conflict data exists, generate a unique identification code for the current data and store the data in the database.
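The source does not say how the unique identification code is produced. One possible sketch uses a random UUID from the Python standard library and an in-memory dict standing in for the database; both choices, and the `store_if_new` name, are assumptions.

```python
import uuid
from typing import Optional


def store_if_new(record: dict, database: dict, has_conflict: bool) -> Optional[str]:
    """Store a non-conflicting record under a freshly generated identifier.

    Returns the identifier, or None when the record conflicts and must be
    handled by an authoritative source judgment strategy instead.
    """
    if has_conflict:
        return None
    uid = uuid.uuid4().hex  # 32-character hexadecimal string
    database[uid] = record
    return uid
```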

The apparatus further includes a unified pre-cleaning module 410, configured to:

perform unified pre-cleaning on the data to be processed according to predetermined pre-cleaning rules and, during the pre-cleaning, mark the target data that cannot be handled by the unified pre-cleaning together with the corresponding target data problems.
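A pre-cleaning pass that both normalizes records and marks the ones it cannot repair might look like the sketch below. The two rules shown (trimming whitespace, and normalizing an 11-digit mobile number beginning with 1) and the problem labels `invalid_mobile` and `missing_name` are illustrative assumptions; the source does not list the actual pre-cleaning rules.

```python
import re


def pre_clean(records: list) -> tuple:
    """Apply simple cleaning rules; return (cleaned, flagged).

    flagged holds (record, problems) pairs for records the rules cannot fix.
    """
    cleaned, flagged = [], []
    for rec in records:
        # Rule 1: trim surrounding whitespace from every string field.
        rec = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
        problems = []
        # Rule 2: drop spaces and hyphens from the mobile number, then validate.
        mobile = re.sub(r"[\s-]", "", rec.get("mobile") or "")
        if re.fullmatch(r"1\d{10}", mobile):
            rec["mobile"] = mobile
        else:
            problems.append("invalid_mobile")
        if not rec.get("name"):
            problems.append("missing_name")
        if problems:
            flagged.append((rec, problems))
        else:
            cleaned.append(rec)
    return cleaned, flagged
```

The problem labels attached to flagged records play the role of the marked target data problems, which a later step could look up in a quality problem dictionary.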

Referring to Figure 5, this application also provides a data processing system, including:

a processor 501, a memory 502, an input/output unit 503 and a bus 504;

the processor 501 is connected to the memory 502, the input/output unit 503 and the bus 504;

the memory 502 stores a program, and the processor 501 calls the program to execute any of the data processing methods described above.

This application also relates to a computer-readable storage medium on which a program is stored, characterized in that, when the program runs on a computer, it causes the computer to execute any of the methods described above.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices or units, and may be electrical, mechanical or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of this application may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.

Claims (9)

1. A method of data processing, the method comprising:
constructing a quality problem dictionary according to an experience database, wherein the quality problem dictionary comprises data quality problems and corresponding problem identifiers;
acquiring data to be processed, and identifying a target data problem in the data to be processed;
inquiring the target data problem in the quality problem dictionary to obtain a corresponding target problem identifier;
adding the target problem identification to the target data through a predefined field;
reading a plurality of target fields in the target data, the plurality of target fields comprising: name, ID card number, mobile phone number, birth date, affiliated organization identification code and enterprise name, and generating four judgment factors according to the target fields, wherein the four judgment factors are respectively:
first judgment factor: combination of name and ID card number; second judgment factor: combination of name and mobile phone number; third judgment factor: combination of name, birth date and affiliated organization identification code; and fourth judgment factor: combination of name and enterprise name;
the judgment strategy is as follows: comparing the data to be compared through the four judging factors, and if one judging factor of the four judging factors is matched in the data to be compared, determining the data to be compared as conflict data;
the method for judging whether the fourth judgment factor matches comprises the following steps:
after the names are determined to be consistent, calculating semantic similarity between enterprise names in the data to be compared;
if the semantic similarity meets a preset threshold, determining that the fourth judgment factor is matched;
when conflict data exists, identifying the business application scenario of the current data, and selecting an authoritative source judgment strategy according to the identification result;
processing the authority level of the trusted source of each field in the conflict data according to the authoritative source judgment strategy, and updating a database;
if no conflict data exists, generating a unique identification code for the current data, and storing the current data in the database.
2. The data processing method according to claim 1, wherein identifying the business application scenario of the current data, and selecting an authoritative source judgment strategy according to the identification result comprises:
if the current business application scenario is the processing of data in an existing historical database, selecting a first authoritative source judgment strategy, wherein the first authoritative source judgment strategy comprises:
querying a pre-constructed data authoritative source level table, taking the data with the highest authority level among the conflict data as a business primary key according to the query result, and updating the historical database.
3. The data processing method according to claim 1, wherein identifying the business application scenario of the current data, and selecting an authoritative source judgment strategy according to the identification result comprises:
if the current business application scenario is the warehousing of real-time data, selecting a second authoritative judgment strategy, wherein the second authoritative judgment strategy comprises:
querying a pre-constructed field authoritative source level table, comparing the authoritative source levels corresponding to each field in the conflict data, selecting for each field the value with the highest authoritative source level according to the comparison result, generating result data, and storing the result data in the database, wherein the field authoritative source level table records the authoritative source level of each field.
4. The data processing method according to claim 3, wherein, after the result data is stored in the database, a modification rule is set for the stored data based on the pre-constructed field authoritative source level table, the modification rule comprising:
identifying a currently modified target field, and querying a corresponding field authoritative source registry according to the modified field;
comparing the authoritative source level of the current modifying user with the current authoritative source level of the target field according to the query result;
if the authoritative source level of the modifying user is greater than or equal to the current authoritative source level of the target field, allowing the modification, and otherwise rejecting the modification.
5. The data processing method of claim 4, wherein, if the modification is allowed, the current authoritative source level of the target field is updated to the authoritative source level of the modifying user after the modification.
6. The data processing method according to claim 1, wherein identifying the target data problem existing in the data to be processed comprises:
performing unified pre-cleaning on the data to be processed according to a preset pre-cleaning rule, and marking, during the pre-cleaning, the target data that cannot be handled by the unified pre-cleaning and the corresponding target data problems.
7. A data processing apparatus, the apparatus comprising:
the quality problem dictionary module is used for constructing a quality problem dictionary according to the experience database, wherein the quality problem dictionary comprises data quality problems and corresponding problem identifiers;
the target data problem identification module is used for acquiring data to be processed and identifying target data problems in the data to be processed;
the target problem identifier query module is used for querying the target data problem in the quality problem dictionary to obtain a corresponding target problem identifier;
the target problem identifier adding module is used for adding the target problem identifier to the target data through a predefined field;
the target field judgment and conflict detection module is used for reading a plurality of target fields in the target data, wherein the target fields comprise: name, ID card number, mobile phone number, birth date, affiliated organization identification code and enterprise name, and generating four judgment factors according to the target fields, wherein the four judgment factors are respectively:
first judgment factor: combination of name and ID card number; second judgment factor: combination of name and mobile phone number; third judgment factor: combination of name, birth date and affiliated organization identification code; and fourth judgment factor: combination of name and enterprise name;
the judgment strategy is as follows: comparing the data to be compared through the four judgment factors, and if any one of the four judgment factors is matched in the data to be compared, determining the data to be compared as conflict data;
the method for judging whether the fourth judgment factor matches comprises the following steps:
after the names are determined to be consistent, calculating semantic similarity between enterprise names in the data to be compared;
if the semantic similarity meets a preset threshold, determining that the fourth judgment factor is matched;
the authoritative source judgment strategy selection module is used for identifying the business application scenario of the current data when conflict data exists, and selecting an authoritative source judgment strategy according to the identification result;
the conflict data processing module is used for processing the authority level of the trusted source of each field in the conflict data according to the authoritative source judgment strategy and updating a database;
the unique identification code generation module is used for generating a unique identification code for the current data and storing the current data in the database when no conflict data exists.
8. A system for data processing, the system comprising:
a processor, a memory, an input-output unit, and a bus;
the processor is connected with the memory, the input/output unit and the bus;
the memory holds a program which the processor invokes to perform the method of any one of claims 1 to 6.
9. A computer readable storage medium having a program stored thereon, which when executed on a computer performs the method of any of claims 1 to 6.
CN202311048216.0A 2023-08-21 2023-08-21 Data processing method, device, system and storage medium Pending CN116910050A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202311048216.0A CN116910050A (en) 2023-08-21 2023-08-21 Data processing method, device, system and storage medium
LU507965A LU507965B1 (en) 2023-08-21 2023-10-26 Data processing method, apparatus, and system, and storage medium
PCT/CN2023/126885 WO2025039361A1 (en) 2023-08-21 2023-10-26 Data processing method, device and system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311048216.0A CN116910050A (en) 2023-08-21 2023-08-21 Data processing method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN116910050A true CN116910050A (en) 2023-10-20

Family

ID=88360268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311048216.0A Pending CN116910050A (en) 2023-08-21 2023-08-21 Data processing method, device, system and storage medium

Country Status (3)

Country Link
CN (1) CN116910050A (en)
LU (1) LU507965B1 (en)
WO (1) WO2025039361A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119069101A (en) * 2024-11-01 2024-12-03 宁波芯联心医疗科技有限公司 Sensor monitoring data optimization method and system for smart medical care
CN119477059A (en) * 2024-10-30 2025-02-18 中国标准化研究院 A method and system for determining enterprise ESG index based on data fusion
WO2025039361A1 (en) * 2023-08-21 2025-02-27 中电科大数据研究院有限公司 Data processing method, device and system, and storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120371885A (en) * 2025-06-26 2025-07-25 京发云数智科技(江西)有限公司 Enterprise background investigation method and system
CN120371926A (en) * 2025-06-26 2025-07-25 恒丰银行股份有限公司 Report data asset identification method, system, equipment and medium
CN120470003A (en) * 2025-07-11 2025-08-12 武汉大数据产业发展有限公司 Data processing method and device, electronic device and computer-readable storage medium
CN120578658B (en) * 2025-08-05 2025-09-26 中国科学技术大学 Cross-regional data acquisition method and system based on intelligent decision

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130129490A (en) * 2012-05-21 2013-11-29 인포뱅크 주식회사 Method and apparatus for merging contacts by using sns
CN108255788A (en) * 2016-12-27 2018-07-06 方正国际软件(北京)有限公司 A kind of method and device for the confidence level for assessing data
CN111711623A (en) * 2020-06-15 2020-09-25 深圳前海微众银行股份有限公司 Method and device for data verification
CN112506897A (en) * 2020-11-17 2021-03-16 贵州电网有限责任公司 Method and system for analyzing and positioning data quality problem
CN114595236A (en) * 2022-01-26 2022-06-07 浙江绿城未来数智科技有限公司 Population data management method applied to basic level management

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7155612B2 (en) * 2003-04-30 2006-12-26 International Business Machines Corporation Desktop database data administration tool with row level security
EP1866808A2 (en) * 2005-03-19 2007-12-19 ActivePrime, Inc. Systems and methods for manipulation of inexact semi-structured data
US9195725B2 (en) * 2012-07-23 2015-11-24 International Business Machines Corporation Resolving database integration conflicts using data provenance
US20140222793A1 (en) * 2013-02-07 2014-08-07 Parlance Corporation System and Method for Automatically Importing, Refreshing, Maintaining, and Merging Contact Sets
US9576036B2 (en) * 2013-03-15 2017-02-21 International Business Machines Corporation Self-analyzing data processing job to determine data quality issues
CN113468161A (en) * 2021-07-23 2021-10-01 杭州数梦工场科技有限公司 Data management method and device and electronic equipment
CN116910050A (en) * 2023-08-21 2023-10-20 中电科大数据研究院有限公司 Data processing method, device, system and storage medium

Also Published As

Publication number Publication date
WO2025039361A1 (en) 2025-02-27
LU507965B1 (en) 2025-02-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination