CN111797078A - Data cleaning method, model training method, device, storage medium and equipment - Google Patents

Publication number
CN111797078A
Authority
CN
China
Prior art keywords
cleaning
data
cleaned
rule
cleaning rule
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910282171.0A
Other languages
Chinese (zh)
Inventor
陈仲铭
何明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201910282171.0A
Publication of CN111797078A

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application disclose a data cleaning method, a model training method, an apparatus, a storage medium and a device. Data to be cleaned is first obtained, together with its cleaning requirement. A target cleaning rule for cleaning that data is then determined from the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model. Finally, the data is cleaned according to the target cleaning rule, so that the cleaning effect meets the cleaning requirement. Thus, once the cleaning rule classification model has been trained, it can be used to clean data automatically without extensive manual involvement, which reduces the labor cost of data cleaning and improves its efficiency.

Description

Data cleaning method, model training method, apparatus, storage medium and device

Technical Field

The present application relates to the technical field of data processing, and in particular to a data cleaning method, a model training method, an apparatus, a storage medium and a device.

Background

At present, processing massive amounts of data has become a challenge that electronic devices must face, and the first task in data processing is data cleaning: in plain terms, identifying and filtering out "dirty data" while retaining "clean data". In the related art, however, data cleaning often relies on human domain knowledge and experience, which consumes a large amount of human resources and makes the labor cost of data cleaning high.

Summary of the Invention

Embodiments of the present application provide a data cleaning method, a model training method, an apparatus, a storage medium and a device, which can reduce the labor cost of data cleaning.

In a first aspect, an embodiment of the present application provides a data cleaning method applied to an electronic device, the data cleaning method including:

obtaining data to be cleaned on which data cleaning needs to be performed;

obtaining a cleaning requirement of the data to be cleaned;

determining, according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model, a target cleaning rule for performing data cleaning on the data to be cleaned;

performing data cleaning on the data to be cleaned according to the target cleaning rule, so that the cleaning effect on the data to be cleaned meets the cleaning requirement;

wherein the cleaning rule classification model is obtained by model training that uses cleaning rule features, which characterize cleaning rules, as the target output, and joint features, which characterize the sample data to be cleaned corresponding to each cleaning rule together with its cleaning effect, as the training input.

In a second aspect, an embodiment of the present application provides a model training method applied to an electronic device, the model training method including:

obtaining a plurality of cleaning rules, and obtaining sample data to be cleaned corresponding to each cleaning rule;

obtaining the cleaning effect achieved when each cleaning rule performs data cleaning on its corresponding sample data to be cleaned;

obtaining a joint feature of each piece of sample data to be cleaned and its corresponding cleaning effect, and obtaining a cleaning rule feature of each cleaning rule;

performing model training with each joint feature as the training input and the cleaning rule feature corresponding to that joint feature as the target output, to obtain a cleaning rule classification model.

In a third aspect, an embodiment of the present application provides a data cleaning apparatus applied to an electronic device, the data cleaning apparatus including:

a data acquisition module, configured to obtain data to be cleaned on which data cleaning needs to be performed;

a requirement acquisition module, configured to obtain a cleaning requirement of the data to be cleaned;

a rule determination module, configured to determine, according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model, a target cleaning rule for performing data cleaning on the data to be cleaned;

a data cleaning module, configured to perform data cleaning on the data to be cleaned according to the target cleaning rule, so that the cleaning effect on the data to be cleaned meets the cleaning requirement;

wherein the cleaning rule classification model is obtained by model training that uses cleaning rule features, which characterize cleaning rules, as the target output, and joint features, which characterize the sample data to be cleaned corresponding to each cleaning rule together with its cleaning effect, as the training input.

In a fourth aspect, an embodiment of the present application provides a model training apparatus applied to an electronic device, the model training apparatus including:

a first acquisition module, configured to obtain a plurality of cleaning rules, and to obtain sample data to be cleaned corresponding to each cleaning rule;

a second acquisition module, configured to obtain the cleaning effect achieved when each cleaning rule performs data cleaning on its corresponding sample data to be cleaned;

a third acquisition module, configured to obtain a joint feature of each piece of sample data to be cleaned and its corresponding cleaning effect, and to obtain a cleaning rule feature of each cleaning rule;

a model training module, configured to perform model training with each joint feature as the training input and the cleaning rule feature corresponding to that joint feature as the target output, to obtain a cleaning rule classification model.

In a fifth aspect, an embodiment of the present application provides a storage medium on which a computer program is stored. When the computer program runs on a computer, it causes the computer to perform the steps of the data cleaning method provided by the embodiments of the present application, or the steps of the model training method provided by the embodiments of the present application.

In a sixth aspect, an embodiment of the present application provides an electronic device including a processor and a memory, the memory storing a computer program. By invoking the computer program, the processor performs the steps of the data cleaning method provided by the embodiments of the present application, or the steps of the model training method provided by the embodiments of the present application.

In the embodiments of the present application, the electronic device may first obtain the data to be cleaned and the cleaning requirement of that data, then determine a target cleaning rule for cleaning the data according to the obtained data, the cleaning requirement and a pre-trained cleaning rule classification model, and finally clean the data according to the determined target cleaning rule, so that the cleaning effect meets the cleaning requirement. Thus, once the cleaning rule classification model has been trained, it can be used to clean data automatically without extensive manual involvement, which both reduces the labor cost of data cleaning and improves its efficiency.

Brief Description of the Drawings

To explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic structural diagram of the panoramic perception architecture provided by an embodiment of the present application.

FIG. 2 is a schematic flowchart of the data cleaning method provided by an embodiment of the present application.

FIG. 3 is another schematic flowchart of the data cleaning method provided by an embodiment of the present application.

FIG. 4 is a schematic diagram of an electronic device obtaining a target cleaning rule from the cleaning rule classification model in an embodiment of the present application.

FIG. 5 is a schematic flowchart of the model training method provided by an embodiment of the present application.

FIG. 6 is another schematic flowchart of the model training method provided by an embodiment of the present application.

FIG. 7 is a schematic diagram of an application scenario of model training in an embodiment of the present application.

FIG. 8 is a schematic structural diagram of the data cleaning apparatus provided by an embodiment of the present application.

FIG. 9 is a schematic structural diagram of the model training apparatus provided by an embodiment of the present application.

FIG. 10 is a schematic structural diagram of the electronic device provided by an embodiment of the present application.

FIG. 11 is another schematic structural diagram of the electronic device provided by an embodiment of the present application.

Detailed Description

Please refer to the drawings, in which like reference numerals denote like components. The principles of the present application are illustrated by implementation in a suitable computing environment. The following description is based on the illustrated specific embodiments of the present application and should not be construed as limiting other specific embodiments not detailed herein.

As sensors become smaller and more intelligent, electronic devices such as mobile phones and tablet computers integrate more and more of them, for example light sensors, distance sensors, position sensors, acceleration sensors and gravity sensors. Through these sensors, an electronic device can collect more data at lower power consumption. During operation, the electronic device also collects data related to its own state and to the state of the user. Broadly speaking, the electronic device can obtain data related to the external environment (such as temperature, light, location, sound and weather), data related to the user's state (such as posture, speed, usage habits and basic personal information) and data related to the state of the electronic device (such as power consumption, resource usage and network conditions). In the embodiments of the present application, the data that the electronic device can obtain in this way is referred to as panoramic data.

To process this data, the embodiments of the present application propose a panoramic perception architecture. Please refer to FIG. 1, which is a schematic structural diagram of the panoramic perception architecture provided by an embodiment of the present application. The architecture is applied to an electronic device and includes, from bottom to top, an information perception layer, a data processing layer, a feature extraction layer, a context modeling layer and an intelligent service layer.

As the bottom layer of the panoramic perception architecture, the information perception layer obtains raw data capable of describing the user's various contexts, that is, panoramic data. The information perception layer consists of multiple sensors for data collection, including but not limited to those illustrated: a distance sensor for detecting the distance between the electronic device and an external object, a magnetic field sensor for detecting magnetic field information of the environment in which the electronic device is located, a light sensor for detecting light information of that environment, an acceleration sensor for detecting acceleration data of the electronic device, a fingerprint sensor for collecting the user's fingerprint information, a Hall sensor for sensing magnetic field information, a position sensor for detecting the current geographic location of the electronic device, a gyroscope for detecting the angular velocity of the electronic device in each direction, an inertial sensor for detecting motion data of the electronic device, an attitude sensor for sensing attitude information of the electronic device, a barometer for detecting the air pressure of the environment in which the electronic device is located, and a heart rate sensor for detecting the user's heart rate information.

As the second-lowest layer of the panoramic perception architecture, the data processing layer processes the raw data obtained by the information perception layer to eliminate problems such as noise and inconsistency. The data processing layer can perform data cleaning, data integration, data transformation, data reduction and similar processing on the data obtained by the information perception layer.

As the middle layer of the panoramic perception architecture, the feature extraction layer performs feature extraction on the data processed by the data processing layer, to extract the features contained in that data. The feature extraction layer can extract features, or process the extracted features, by means of filter methods, wrapper methods, ensemble methods and the like.

The filter method filters the extracted features to delete redundant feature data. The wrapper method screens the extracted features. The ensemble method integrates multiple feature extraction methods to build a more efficient and more accurate feature extraction method.
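As a rough illustration of the filter method described above, the following sketch drops any feature that is almost perfectly correlated with a feature already kept. The function names, the sample feature names and the 0.95 threshold are assumptions made for illustration; the application does not specify how redundancy is measured.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def filter_redundant(features, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with one already kept.

    `features` maps a feature name to its list of values. The threshold is an
    illustrative assumption, not a value taken from the application.
    """
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


features = {
    "accel_x": [0.1, 0.4, 0.9, 1.6],
    "accel_x_scaled": [1.0, 4.0, 9.0, 16.0],  # linear copy of accel_x, hence redundant
    "light": [300.0, 120.0, 310.0, 90.0],
}
selected = filter_redundant(features)
```

Here `selected` retains `accel_x` and `light`, because `accel_x_scaled` carries no information beyond `accel_x`.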

As the second-highest layer of the panoramic perception architecture, the context modeling layer builds models from the features extracted by the feature extraction layer. The resulting models can represent the state of the electronic device, the state of the user, the state of the environment and so on. For example, the context modeling layer can build a key-value model, a pattern identification model, a graph model, an entity-relationship model, an object-oriented model and the like from the extracted features.

As the top layer of the panoramic perception architecture, the intelligent service layer provides intelligent services based on the models built by the context modeling layer. For example, the intelligent service layer can provide basic application services for the user, perform intelligent system optimization for the electronic device, and provide personalized intelligent services for the user.

In addition, the panoramic perception architecture includes an algorithm library. The algorithm library includes, but is not limited to, the illustrated algorithms: Markov algorithm, latent Dirichlet allocation, Bayesian classification, support vector machine, K-means clustering, K-nearest neighbors, conditional random field, residual network, long short-term memory network, convolutional neural network and recurrent neural network.

An embodiment of the present application first provides a data cleaning method. The method may be executed by the data cleaning apparatus provided by the embodiments of the present application, or by an electronic device integrating that apparatus, where the data cleaning apparatus may be implemented in hardware or software. The electronic device may be any device that is equipped with a processor and therefore has processing capability, such as a smartphone, tablet computer, palmtop computer, notebook computer or desktop computer.

With the data cleaning method provided by the embodiments of the present application, the information perception layer supplies the collected panoramic data to the data processing layer. The data processing layer treats the panoramic data from the information perception layer as the data to be cleaned, cleans it, and supplies the cleaned data to the feature extraction layer. The feature extraction layer performs feature extraction on the data from the data processing layer to obtain features that characterize the data, and supplies those features to the context modeling layer. The context modeling layer builds models from the features supplied by the feature extraction layer and uses those models to represent the state of the electronic device, the user or the environment. Finally, the intelligent service layer provides corresponding intelligent services based on the models built by the context modeling layer, such as basic application services, system optimization services and personalized services.
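The bottom-up data flow through the five layers can be pictured as a chain of functions, each consuming the output of the layer below. The sketch below is only a toy model of that flow: the function names mirror the layer names, but the bodies (temperature readings, a mean feature, a "warm"/"cool" state) are invented for illustration and are not taken from the application.

```python
def information_perception_layer():
    """Return raw panoramic data (hard-coded here; real data would come from sensors)."""
    return [21.5, None, 22.0, 21.8]


def data_processing_layer(raw):
    """Clean the raw data; here, simply drop missing readings."""
    return [value for value in raw if value is not None]


def feature_extraction_layer(cleaned):
    """Extract features that characterize the cleaned data."""
    return {"mean": sum(cleaned) / len(cleaned), "count": len(cleaned)}


def context_modeling_layer(features):
    """Build a model of the environment state from the features."""
    return {"environment": "warm" if features["mean"] > 20 else "cool"}


def intelligent_service_layer(model):
    """Provide a service based on the modeled state."""
    return f"Suggested mode for a {model['environment']} environment"


# Each layer receives the output of the layer below it.
layers = [data_processing_layer, feature_extraction_layer,
          context_modeling_layer, intelligent_service_layer]
result = information_perception_layer()
for layer in layers:
    result = layer(result)
```

The data cleaning method discussed in the rest of this description lives entirely inside the second function, the data processing layer.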

Please refer to FIG. 2, which is a schematic flowchart of the data cleaning method provided by an embodiment of the present application. The method is implemented in the data processing layer of the panoramic perception architecture. As shown in FIG. 2, the flow of the data cleaning method may be as follows:

In 101, data to be cleaned, on which data cleaning needs to be performed, is obtained.

For example, the electronic device may obtain the data to be cleaned locally, from another electronic device, or from a network.

In 102, the cleaning requirement of the data to be cleaned is obtained.

Those of ordinary skill in the art will understand that real-world data is often multi-dimensional, incomplete, noisy and inconsistent. The purpose of data cleaning is to fill in missing values, smooth noise, identify outliers, correct inconsistencies in the data, and so on.
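The cleaning goals just listed (filling missing values, identifying outliers, smoothing noise) can each be sketched in a few lines. The specific techniques below, namely mean imputation, the 1.5 × IQR outlier rule and a centered moving average, are common choices assumed for illustration; the application does not prescribe any particular technique, and all function names are hypothetical.

```python
import statistics


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def find_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v < low or v > high]


def smooth(values, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    return [statistics.fmean(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


readings = [20.0, 21.0, None, 22.0, 95.0, 21.0, 20.0]
filled = fill_missing(readings)
outliers = find_outliers(filled)
smoothed = smooth([v for i, v in enumerate(filled) if i not in outliers])
```

In this run the reading 95.0 is flagged as the only outlier. Note that it also inflates the imputed value, which is one reason a real pipeline might handle outliers before imputing.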

In the embodiments of the present application, after obtaining the data to be cleaned, the electronic device further obtains the cleaning requirement of that data. In plain terms, the cleaning requirement describes the cleaning effect that data cleaning is intended to achieve. For example, the original data to be cleaned contains data in multiple dimensions, and these dimensions are often not independent: several of them may be correlated, so that one makes another redundant. In that case, the cleaning requirement of the data to be cleaned may be to reduce the data to a specified number of dimensions.

Those of ordinary skill in the art will understand that the cleaning requirement depends on what the electronic device actually needs for its data processing; the embodiments of the present application place no specific limitation on it.
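One way to make a cleaning requirement such as "reduce to a specified number of dimensions" concrete is to encode it as a small data structure together with a check of whether cleaned data satisfies it. The `CleaningRequirement` class and its fields below are hypothetical; the application leaves the representation of the requirement open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleaningRequirement:
    """Hypothetical description of the desired cleaning effect."""
    target_dimensions: int        # number of dimensions to keep per record
    allow_missing: bool = False   # whether cleaned records may contain None


def meets_requirement(records, requirement):
    """Check whether every cleaned record satisfies the requirement."""
    for record in records:
        if len(record) != requirement.target_dimensions:
            return False
        if not requirement.allow_missing and any(v is None for v in record):
            return False
    return True


requirement = CleaningRequirement(target_dimensions=2)
raw_records = [(1.0, 2.0, 3.0), (4.0, 8.0, 6.0)]
# Drop the second dimension, assumed here to be redundant.
reduced_records = [(a, c) for a, _, c in raw_records]
```

With these values, `raw_records` fails the check because each record has three dimensions, while `reduced_records` passes.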

In 103, a target cleaning rule for performing data cleaning on the data to be cleaned is determined according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model.

It should be noted that, in the embodiments of the present application, the electronic device is configured with a cleaning rule classification model for selecting which cleaning rule to apply to the data to be cleaned. The model is obtained by model training that uses cleaning rule features, which characterize cleaning rules, as the target output, and joint features, which characterize the sample data to be cleaned corresponding to each cleaning rule together with its cleaning effect, as the training input.

For example, all possible cleaning rules can be gathered in advance, and the sample data to be cleaned corresponding to each cleaning rule, together with its cleaning effect, can be collected. Cleaning rule features that characterize the cleaning rules are then obtained, along with joint features that characterize the sample data to be cleaned and its cleaning effect. Finally, with each joint feature as the training input and the corresponding cleaning rule feature as the target output, model training is performed according to a preset training algorithm, yielding the cleaning rule classification model used to select which cleaning rule to apply to the data to be cleaned.
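The training procedure above can be sketched with a deliberately simple classifier. The example below assumes a nearest-centroid model, represents each cleaning rule feature by a plain rule name, and builds the joint feature by concatenating two hand-picked statistics of the sample data with two statistics of the cleaning effect. All of these are illustrative assumptions: the application only states that a preset training algorithm is used and does not define the feature layout.

```python
from collections import defaultdict


def joint_feature(sample_features, effect_features):
    """Concatenate sample-data features with cleaning-effect features."""
    return list(sample_features) + list(effect_features)


def train_centroids(training_set):
    """Train a nearest-centroid classifier.

    `training_set` is a list of (joint_feature, rule_label) pairs; the result
    maps each rule label to the element-wise mean of its joint features.
    """
    grouped = defaultdict(list)
    for feature, rule in training_set:
        grouped[rule].append(feature)
    return {
        rule: [sum(column) / len(column) for column in zip(*features)]
        for rule, features in grouped.items()
    }


# Hypothetical samples: [missing ratio, dimension count] before cleaning and
# [missing ratio, dimension count] after applying the rule named on the right.
training_set = [
    (joint_feature([0.30, 4], [0.00, 4]), "fill_missing"),
    (joint_feature([0.20, 4], [0.00, 4]), "fill_missing"),
    (joint_feature([0.00, 8], [0.00, 2]), "reduce_dimensions"),
    (joint_feature([0.00, 6], [0.00, 2]), "reduce_dimensions"),
]
model = train_centroids(training_set)
```

The resulting `model` holds one centroid per cleaning rule; any classifier that maps a joint feature to a rule could be substituted for it.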

Thus, after obtaining the data to be cleaned and its cleaning requirement, the electronic device can input both into the cleaning rule classification model. The model outputs a cleaning rule that can clean the data to be cleaned with a cleaning effect that meets the cleaning requirement, and this rule is used as the target cleaning rule for the data to be cleaned.
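At inference time, the cleaning requirement takes the place of the cleaning effect in the joint feature, since it describes the effect that is wanted. The sketch below assumes a hypothetical pre-trained nearest-centroid model with hard-coded centroids and an invented four-element feature layout; none of these values come from the application.

```python
import math

# Hypothetical pre-trained model: one centroid of joint features per cleaning rule.
# Assumed layout: [missing ratio, dimensions, target missing ratio, target dimensions].
MODEL = {
    "fill_missing": [0.25, 4.0, 0.0, 4.0],
    "reduce_dimensions": [0.0, 7.0, 0.0, 2.0],
}


def describe(records):
    """Features of the data to be cleaned: missing ratio and dimension count."""
    cells = [value for record in records for value in record]
    missing_ratio = sum(value is None for value in cells) / len(cells)
    return [missing_ratio, float(len(records[0]))]


def select_rule(records, requirement_features, model=MODEL):
    """Return the rule whose centroid is closest to the joint feature."""
    joint = describe(records) + list(requirement_features)
    return min(model, key=lambda rule: math.dist(joint, model[rule]))


data_to_clean = [(1.0, None, 3.0, 4.0), (2.0, 2.5, None, 4.5)]
# Requirement: no missing values, keep all four dimensions.
target_rule = select_rule(data_to_clean, requirement_features=[0.0, 4.0])
```

For this input the selected rule is `fill_missing`, because the data has missing cells and the requirement keeps all four dimensions.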

In 104, data cleaning is performed on the data to be cleaned according to the target cleaning rule, so that the cleaning effect on the data to be cleaned meets the aforementioned cleaning requirement.

In the embodiments of the present application, after determining the target cleaning rule, the electronic device can clean the data to be cleaned according to that rule, so that the cleaning effect meets the aforementioned cleaning requirement and the required data is finally obtained.

As can be seen from the above, in the embodiments of the present application, the electronic device may first obtain the data to be cleaned and the cleaning requirement of that data, then determine a target cleaning rule for cleaning the data according to the obtained data, the cleaning requirement and a pre-trained cleaning rule classification model, and finally clean the data according to the determined target cleaning rule, so that the cleaning effect meets the cleaning requirement. Thus, once the cleaning rule classification model has been trained, it can be used to clean data automatically without extensive manual involvement, which both reduces the labor cost of data cleaning and improves its efficiency.
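Steps 101 to 104 can be tied together in a short end-to-end sketch. Everything named here is hypothetical: the two cleaning rules, the `RULES` registry, the dictionary form of the requirement, and in particular `choose_rule`, which is a hand-written stand-in for the pre-trained cleaning rule classification model rather than a trained model.

```python
def fill_with_zero(records):
    """Replace None with 0.0 (a deliberately simple placeholder rule)."""
    return [tuple(0.0 if v is None else v for v in record) for record in records]


def drop_incomplete(records):
    """Discard any record that contains a missing value."""
    return [record for record in records if None not in record]


# Hypothetical registry mapping rule names to executable cleaning rules.
RULES = {"fill_with_zero": fill_with_zero, "drop_incomplete": drop_incomplete}


def choose_rule(records, requirement):
    """Stand-in for the pre-trained cleaning rule classification model."""
    return "fill_with_zero" if requirement["keep_all_records"] else "drop_incomplete"


def clean(records, requirement):
    """Select a rule (step 103), apply it (step 104) and verify the effect."""
    rule_name = choose_rule(records, requirement)
    cleaned = RULES[rule_name](records)
    if any(None in record for record in cleaned):
        raise ValueError("cleaning effect does not meet the requirement")
    return rule_name, cleaned


records = [(1.0, None), (2.0, 3.0)]        # step 101: data to be cleaned
requirement = {"keep_all_records": True}   # step 102: cleaning requirement
rule_name, cleaned = clean(records, requirement)
```

Because the requirement asks to keep every record, the stand-in model selects `fill_with_zero`; with `keep_all_records` set to `False` it would select `drop_incomplete` instead.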

请参照图3,图3为本申请实施例提供的数据清洗方法的另一种流程示意图。该数据清洗方法可以应用于电子设备,该数据清洗方法的流程可以包括:Please refer to FIG. 3 , which is another schematic flowchart of a data cleaning method provided by an embodiment of the present application. The data cleaning method can be applied to electronic equipment, and the process of the data cleaning method can include:

在201中,获取传感器采集的传感器数据,将获取到的传感器数据作为待清洗数据。In 201, sensor data collected by a sensor is acquired, and the acquired sensor data is used as data to be cleaned.

应当说明的是,电子设备通常配置有多种传感器,通过这些传感器可以感知自身所处的环境、自身的运动等等。其中,电子设备配置的传感器包括但不限于重力传感器、加速度传感器、定位传感器(如卫星定位传感器、基站定位传感器等)、声音传感器以及光线传感器等。It should be noted that an electronic device is usually configured with a variety of sensors, through which the environment where it is located, its own movement, and the like can be sensed. The sensors configured in the electronic device include but are not limited to gravity sensors, acceleration sensors, positioning sensors (such as satellite positioning sensors, base station positioning sensors, etc.), sound sensors, and light sensors.

However, not all of the sensor data collected by these sensors is needed by the electronic device, so the electronic device has to clean the sensor data and extract the data it actually requires.

Therefore, in this embodiment of the present application, the electronic device may acquire the sensor data collected by its own sensors and use the acquired sensor data as the data to be cleaned.

In 202, the cleaning requirement of the data to be cleaned is acquired.

Those of ordinary skill in the art will understand that real-world data is often multi-dimensional, incomplete, noisy, and inconsistent. The purpose of data cleaning is to fill in missing values, smooth noise, identify outliers, correct inconsistencies in the data, and so on.

In this embodiment of the present application, after acquiring the data to be cleaned, the electronic device further acquires the cleaning requirement of that data. Put simply, the cleaning requirement describes the cleaning effect that cleaning the data is meant to achieve. For example, the original data to be cleaned may contain many dimensions that are not independent of one another: several dimensions may be correlated, so that one of them can stand in for another. In that case, the cleaning requirement of the data to be cleaned may be to reduce the data to a specified number of dimensions.

It should be noted that, as those of ordinary skill in the art will understand, the cleaning requirement depends on what the electronic device actually needs for its data processing, and this embodiment of the present application places no specific limitation on it.

In 203, the joint feature of the data to be cleaned and the cleaning requirement is acquired.

In 204, the acquired joint feature is input into the cleaning rule classification model, and the cleaning rule feature output by the cleaning rule classification model is obtained.

In 205, the cleaning rule that matches the cleaning rule feature output by the cleaning rule classification model is determined and used as the target cleaning rule for cleaning the data to be cleaned.

It should be noted that, in this embodiment of the present application, the electronic device is configured with a cleaning rule classification model that selects which cleaning rule to apply to the data to be cleaned. The model is obtained by model training in which the training input is the joint feature characterizing the sample data to be cleaned for a cleaning rule together with its cleaning effect, and the target output is the cleaning rule feature characterizing that cleaning rule.

For example, all possible cleaning rules can be gathered in advance, together with the sample data to be cleaned for each cleaning rule and the corresponding cleaning effect. Next, a cleaning rule feature characterizing each cleaning rule is acquired, along with a joint feature characterizing the sample data to be cleaned and its cleaning effect. Then, with each joint feature as training input and its corresponding cleaning rule feature as target output, model training is performed according to a preset training algorithm to obtain the cleaning rule classification model that selects which cleaning rule to apply to the data to be cleaned.

Thus, after acquiring the data to be cleaned and its cleaning requirement, the electronic device can input them into the cleaning rule classification model, which outputs a cleaning rule that can clean the data with a cleaning effect that meets the cleaning requirement. That cleaning rule is used as the target cleaning rule for cleaning the data to be cleaned.

It should be noted that, in this embodiment of the present application, inputting the data to be cleaned and the cleaning requirement into the cleaning rule classification model does not mean inputting the data and the requirement themselves. What is input is a feature that characterizes the data to be cleaned and the cleaning requirement.

Therefore, in this embodiment of the present application, after acquiring the data to be cleaned and its cleaning requirement, the electronic device further acquires the joint feature of the two and uses that joint feature as a joint deep representation of the data to be cleaned and its cleaning requirement.

After acquiring the joint feature of the data to be cleaned and its cleaning requirement, the electronic device inputs the joint feature into the pre-trained cleaning rule classification model. The model, in turn, processes the input joint feature and outputs the corresponding cleaning rule feature, which characterizes a cleaning rule that can clean the data to be cleaned with a cleaning effect that meets the cleaning requirement.

After obtaining the cleaning rule feature output by the cleaning rule classification model, the electronic device determines the cleaning rule that matches that feature and uses it as the target cleaning rule for cleaning the data to be cleaned.

For example, referring to FIG. 4, the electronic device acquires joint feature A of the data to be cleaned and its cleaning requirement, inputs joint feature A into the cleaning rule classification model, obtains cleaning rule feature A output by the model, and matches cleaning rule A as the target cleaning rule.

In 206, the data to be cleaned is cleaned according to the target cleaning rule, so that the cleaning effect on the data meets the cleaning requirement.

In this embodiment of the present application, once the electronic device has determined the target cleaning rule, it cleans the data to be cleaned according to that rule, so that the resulting cleaning effect meets the corresponding cleaning requirement.
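The flow of 201 to 206 can be sketched as a small pipeline. This is a minimal illustration only: the function `run_cleaning`, the toy feature extractor, the toy classifier, and the rule names are all hypothetical stand-ins, since the application does not fix how any of these components is implemented.

```python
from typing import Callable, Dict, List

Vector = List[float]
CleaningRule = Callable[[List[float]], List[float]]


def run_cleaning(
    raw: List[float],                                      # 201: data to be cleaned
    requirement: str,                                      # 202: cleaning requirement
    joint_feature: Callable[[List[float], str], Vector],   # 203: feature construction
    classify: Callable[[Vector], Vector],                  # 204: classification model
    match_rule: Callable[[Vector], str],                   # 205: rule matching
    rules: Dict[str, CleaningRule],
) -> List[float]:
    feature = joint_feature(raw, requirement)
    rule_feature = classify(feature)
    rule_name = match_rule(rule_feature)
    return rules[rule_name](raw)                           # 206: apply target rule


# Toy stand-ins (assumptions for illustration, not the trained components).
def toy_joint_feature(raw: List[float], requirement: str) -> Vector:
    return [float(len(raw)), float(len(requirement))]


def toy_classify(feature: Vector) -> Vector:
    return [1.0, 0.0]


def toy_match(rule_feature: Vector) -> str:
    return "drop_negative" if rule_feature[0] > rule_feature[1] else "identity"


RULES: Dict[str, CleaningRule] = {
    "drop_negative": lambda xs: [x for x in xs if x >= 0],
    "identity": lambda xs: list(xs),
}

cleaned = run_cleaning(
    [3.0, -1.0, 2.5], "remove invalid readings",
    toy_joint_feature, toy_classify, toy_match, RULES,
)
print(cleaned)
```

Passing each stage in as a callable keeps the order of the steps visible while leaving every stage replaceable by a real implementation.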

In one embodiment, "acquiring the joint feature of the data to be cleaned and the cleaning requirement" may include:

acquiring the joint feature of the data to be cleaned and the cleaning requirement by means of a generative adversarial network.

In this embodiment of the present application, considering that a generative adversarial network can generate additional sample data from existing data and has strong feature learning ability, the electronic device may use a generative adversarial network to acquire the joint feature of the data to be cleaned and the cleaning requirement.

When acquiring the joint feature, the electronic device combines the data to be cleaned and the cleaning requirement into a data pair, denoted <data to be cleaned, cleaning requirement>, and then uses the generative adversarial network to construct the joint feature of that pair.

It should be noted that, in other embodiments, a person of ordinary skill in the art may choose any suitable feature construction method according to actual needs to obtain the joint feature of the data to be cleaned and the cleaning requirement.

In one embodiment, "determining the cleaning rule that matches the cleaning rule feature output by the cleaning rule classification model" includes:

(1) obtaining the similarity between the cleaning rule feature output by the cleaning rule classification model and the cleaning rule feature of each of a plurality of pre-stored cleaning rules;

(2) taking the cleaning rule whose similarity reaches a preset similarity as the cleaning rule that matches the cleaning rule feature output by the cleaning rule classification model.

It should be noted that, in this embodiment of the present application, a cleaning rule matches the cleaning rule feature output by the cleaning rule classification model when the similarity between that rule's own cleaning rule feature and the feature output by the model reaches the preset similarity.

Therefore, to determine the matching cleaning rule, the electronic device first obtains the similarity between the cleaning rule feature output by the cleaning rule classification model and the cleaning rule feature of each pre-stored cleaning rule. It then takes the cleaning rule whose similarity reaches the preset similarity as the matching cleaning rule, that is, the target cleaning rule subsequently used to clean the data to be cleaned.

For example, suppose the electronic device has pre-stored cleaning rule feature A of cleaning rule A, cleaning rule feature B of cleaning rule B, and cleaning rule feature C of cleaning rule C, and the preset similarity is configured as 85%. If the similarity between the feature output by the cleaning rule classification model and cleaning rule feature A is 40%, with cleaning rule feature B it is 45%, and with cleaning rule feature C it is 86%, then only cleaning rule feature C reaches the preset similarity (85%). The electronic device therefore determines cleaning rule C to be the cleaning rule that matches the feature output by the cleaning rule classification model.
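The threshold check in this example can be written out directly. In the sketch below the function name `match_rules` is hypothetical, and returning the matches sorted from most to least similar is an assumption, since the text does not say what happens when several rules reach the preset similarity.

```python
from typing import Dict, List


def match_rules(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Return the rules whose similarity reaches the threshold, best first."""
    hits = [(score, name) for name, score in similarities.items() if score >= threshold]
    return [name for score, name in sorted(hits, reverse=True)]


# Similarities from the example: rule A 40%, rule B 45%, rule C 86%.
similarities = {"A": 0.40, "B": 0.45, "C": 0.86}
matched = match_rules(similarities, threshold=0.85)
print(matched)  # ['C']
```

An empty result means that no pre-stored rule reaches the preset similarity; how to handle that case is left open by the text.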

When calculating the similarity between two cleaning rule features, the electronic device may use the feature distance between them as the measure of similarity. A person of ordinary skill in the art may choose any feature distance according to actual needs, such as the Euclidean distance, Manhattan distance, Chebyshev distance, or cosine distance.
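The four distances named above can be computed as follows. Note that a distance shrinks as two features become more alike, so comparing against a preset similarity such as 85% needs some conversion from distance to similarity. The text does not specify one; the mapping `1 / (1 + d)` in `distance_to_similarity` is purely an assumption for illustration.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a: Sequence[float], b: Sequence[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # 1 minus the cosine of the angle between a and b; undefined for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distance_to_similarity(d: float) -> float:
    # Assumed conversion: distance 0 maps to similarity 1, large distances approach 0.
    return 1.0 / (1.0 + d)


p, q = [0.0, 0.0], [3.0, 4.0]
print(euclidean(p, q), manhattan(p, q), chebyshev(p, q))
```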

In one embodiment, "cleaning the data to be cleaned according to the target cleaning rule" includes:

calling the one or more cleaning functions corresponding to the target cleaning rule to clean the data to be cleaned.

It should be noted that, in this embodiment of the present application, each cleaning rule consists of one or more cleaning functions. The cleaning functions perform the actual cleaning operations, including but not limited to missing value handling, standardization, and noise removal. A cleaning function can be written by the relevant technical personnel in a programming language (such as C, Java, or Python) and may take the form of, for example, a regular expression, a filter function, or an SQL expression.

Therefore, when cleaning the data to be cleaned according to the target cleaning rule, the electronic device calls the one or more cleaning functions corresponding to that rule, so that the cleaning effect meets the aforementioned cleaning requirement and the required data is finally obtained.
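As a minimal sketch of a cleaning rule built from cleaning functions, the example below treats a rule as an ordered list of functions and applies them one after another. The function names, the choice of mean imputation and min-max scaling, and the sample values are assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, List, Optional, Sequence

CleaningFunction = Callable[[list], list]


def fill_missing(values: Sequence[Optional[float]]) -> List[float]:
    """Missing value handling: replace None with the mean of the present values.

    Assumes at least one value is present.
    """
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Standardization: rescale the values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def apply_rule(functions: Sequence[CleaningFunction], data: list) -> list:
    """Call each cleaning function of the rule in order."""
    for fn in functions:
        data = fn(data)
    return data


rule_a = [fill_missing, min_max_normalize]
cleaned = apply_rule(rule_a, [2.0, None, 4.0, 6.0])
print(cleaned)  # [0.0, 0.5, 0.5, 1.0]
```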

Please refer to FIG. 5, which shows the model training method provided by an embodiment of the present application. The model training method is used to train the cleaning rule classification model required by the data cleaning method provided by the embodiments of the present application. The method may be executed by the model training apparatus provided by the embodiments of the present application, or by an electronic device that integrates the model training apparatus, where the apparatus may be implemented in hardware or software. As shown in FIG. 5, the flow of the model training method may be as follows:

In 301, a plurality of cleaning rules are acquired, together with the sample data to be cleaned corresponding to each cleaning rule.

In this embodiment of the present application, a cleaning-rule-oriented database may be created on the electronic device in advance. This database includes a cleaning rule sub-database, a sub-database of sample data to be cleaned, and a cleaning effect sub-database.

During model training, the electronic device can gather all possible cleaning rules and store them in the cleaning rule sub-database. For example, the electronic device stores the acquired cleaning rules in the cleaning rule sub-database as character strings.

In addition, for the cleaning rules stored in the cleaning rule sub-database, the electronic device further acquires the sample data to be cleaned corresponding to each rule and stores it in the sub-database of sample data to be cleaned. The sample data is stored as it is; for example, numeric sample data is still stored as a numeric type.

It should be noted that the electronic device can obtain the sample data to be cleaned locally, from other electronic devices, or from the Internet.

In 302, the cleaning effect achieved by each cleaning rule on its corresponding sample data to be cleaned is acquired.

In this embodiment of the present application, after acquiring the cleaning rules and the sample data to be cleaned corresponding to each rule, the electronic device further acquires the cleaning effect achieved by each cleaning rule on its corresponding sample data and stores these cleaning effects in the cleaning effect sub-database. For example, the cleaning effects can be stored in the cleaning effect sub-database in the form of a table.
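One possible layout for the three sub-databases is sketched below with an in-memory SQLite database. The table names, column names, the rule string, and the effect metric are all hypothetical; the text only states that rules are stored as strings, numeric sample data keeps its numeric type, and cleaning effects are stored as a table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Cleaning rule sub-database: each rule stored as a character string.
    CREATE TABLE cleaning_rule (
        id   INTEGER PRIMARY KEY,
        body TEXT NOT NULL
    );
    -- Sample data sub-database: numeric samples keep a numeric type.
    CREATE TABLE sample_data (
        id      INTEGER PRIMARY KEY,
        rule_id INTEGER NOT NULL REFERENCES cleaning_rule(id),
        value   REAL
    );
    -- Cleaning effect sub-database: one row per measured effect (assumed shape).
    CREATE TABLE cleaning_effect (
        rule_id INTEGER NOT NULL REFERENCES cleaning_rule(id),
        metric  TEXT NOT NULL,
        value   REAL NOT NULL
    );
    """
)

rule_id = conn.execute(
    "INSERT INTO cleaning_rule (body) VALUES (?)",
    ("fill_missing;min_max_normalize",),
).lastrowid
conn.executemany(
    "INSERT INTO sample_data (rule_id, value) VALUES (?, ?)",
    [(rule_id, 2.0), (rule_id, None), (rule_id, 4.0)],
)
conn.execute(
    "INSERT INTO cleaning_effect (rule_id, metric, value) VALUES (?, ?, ?)",
    (rule_id, "missing_ratio_after", 0.0),
)

# Count the stored samples per rule.
rows = conn.execute(
    "SELECT r.body, COUNT(s.id) FROM cleaning_rule r "
    "JOIN sample_data s ON s.rule_id = r.id GROUP BY r.id"
).fetchall()
print(rows)
```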

In 303, the joint feature of each set of sample data to be cleaned and its corresponding cleaning effect is acquired, and the cleaning rule feature of each cleaning rule is acquired.

In this embodiment of the present application, for each set of sample data to be cleaned and its corresponding cleaning effect, the electronic device also acquires their joint feature and uses it as a joint deep representation of the sample data and its cleaning effect.

In addition, the electronic device acquires the cleaning rule feature of each cleaning rule and uses it to characterize the rule. For example, the electronic device may take the lexical features of the one or more cleaning functions corresponding to a cleaning rule as that rule's cleaning rule feature.

In 304, model training is performed with each joint feature as training input and the corresponding cleaning rule feature as target output, yielding the cleaning rule classification model.

In this embodiment of the present application, after acquiring the joint feature of each set of sample data to be cleaned and its cleaning effect, and the cleaning rule feature of each cleaning rule, the electronic device uses each joint feature as training input and the corresponding cleaning rule feature as target output, and performs model training according to a preset training algorithm to obtain a cleaning rule classification model that selects cleaning rules automatically.
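To make the input and output of training concrete, the sketch below fits the simplest possible model: a one-nearest-neighbor lookup that maps a joint feature to the cleaning rule feature of the closest training example. This is only an illustrative choice of algorithm, not the one prescribed by the application, which leaves the training algorithm open. The class name and the feature values are hypothetical.

```python
import math
from typing import List, Sequence


class NearestNeighborRuleModel:
    """Maps a joint feature to the rule feature of its nearest training example."""

    def __init__(self) -> None:
        self._inputs: List[List[float]] = []
        self._targets: List[List[float]] = []

    def fit(
        self,
        joint_features: Sequence[Sequence[float]],
        rule_features: Sequence[Sequence[float]],
    ) -> "NearestNeighborRuleModel":
        if len(joint_features) != len(rule_features):
            raise ValueError("each joint feature needs one target rule feature")
        self._inputs = [list(f) for f in joint_features]    # training input
        self._targets = [list(f) for f in rule_features]    # target output
        return self

    def predict(self, joint_feature: Sequence[float]) -> List[float]:
        if not self._inputs:
            raise RuntimeError("model has not been fitted")
        best = min(
            range(len(self._inputs)),
            key=lambda i: math.dist(self._inputs[i], joint_feature),
        )
        return self._targets[best]


# Hypothetical joint features and the rule features they should map to.
joint = [[1.0, 0.0], [0.0, 1.0]]
rule = [[1.0, 0.0], [0.0, 1.0]]
model = NearestNeighborRuleModel().fit(joint, rule)
predicted = model.predict([0.9, 0.1])
print(predicted)  # [1.0, 0.0]
```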

The training algorithm is a machine learning algorithm. Through continued feature learning, a machine learning algorithm can realize various functions; for example, given data to be cleaned and its cleaning requirement, it can automatically select a cleaning rule that can clean the data with a cleaning effect that meets the requirement. Machine learning algorithms may include decision tree models, logistic regression models, Bayesian models, neural network models, clustering models, and so on.

In addition, machine learning algorithms can be categorized in various ways. For example, by learning style they can be divided into supervised learning algorithms, unsupervised learning algorithms, semi-supervised learning algorithms, reinforcement learning algorithms, and so on.

In supervised learning, the input data is called "training data", and each set of training data has an explicit label or result, such as "spam" and "not spam" in an anti-spam system, or "1", "2", "3", "4", and so on in handwritten digit recognition. When building a model, supervised learning sets up a learning process that compares the model's predictions with the actual results of the training data and keeps adjusting the model until its predictions reach an expected accuracy. Common application scenarios for supervised learning are classification and regression problems. Common algorithms include logistic regression and back-propagation neural networks.
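The supervised learning loop described above (predict, compare with the labels, adjust, stop at an expected accuracy) can be illustrated with a small logistic regression trained by stochastic gradient descent. The data set, learning rate, and stopping threshold are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0


def accuracy(w, b, samples, labels) -> float:
    hits = sum(predict(w, b, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)


def train(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    target_accuracy: float = 1.0,
    lr: float = 0.5,
    max_epochs: int = 1000,
) -> Tuple[List[float], float, float]:
    w = [0.0] * len(samples[0])
    b = 0.0
    acc = accuracy(w, b, samples, labels)
    for _ in range(max_epochs):
        if acc >= target_accuracy:      # expected accuracy reached: stop adjusting
            break
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = sigmoid(z) - y        # prediction compared with the actual label
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
        acc = accuracy(w, b, samples, labels)
    return w, b, acc


# Labeled training data: small values are class 0, large values are class 1.
samples = [[0.0], [1.0], [3.0], [4.0]]
labels = [0, 0, 1, 1]
w, b, acc = train(samples, labels)
print(acc)
```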

In unsupervised learning, the data is not labeled, and the model is built to infer some intrinsic structure of the data. Common application scenarios include association rule learning and clustering. Common algorithms include the Apriori algorithm and the k-Means algorithm.

In semi-supervised learning, the input data is partially labeled. Such a model can be used for prediction, but it first needs to learn the intrinsic structure of the data so that the data can be organized sensibly for prediction. Application scenarios include classification and regression. The algorithms include extensions of common supervised learning algorithms that first try to model the unlabeled data and then make predictions for the labeled data on that basis, such as graph inference algorithms and the Laplacian support vector machine (Laplacian SVM).

In reinforcement learning, the input data serves as feedback to the model. Unlike in supervised learning, where the input data is only a means of checking whether the model is right or wrong, in reinforcement learning the input data is fed back directly to the model, and the model must adjust to it immediately. Common application scenarios include dynamic systems and robot control. Common algorithms include Q-Learning and temporal difference learning.

In addition, machine learning algorithms can also be grouped by similarity of function and form:

Regression algorithms. Common regression algorithms include Ordinary Least Squares, Logistic Regression, Stepwise Regression, Multivariate Adaptive Regression Splines, and Locally Estimated Scatterplot Smoothing.

Instance-based algorithms, including k-Nearest Neighbor (KNN), Learning Vector Quantization (LVQ), and the Self-Organizing Map (SOM).

Regularization methods. Common algorithms include Ridge Regression, the Least Absolute Shrinkage and Selection Operator (LASSO), and the Elastic Net.

Decision tree algorithms. Common algorithms include Classification And Regression Tree (CART), ID3 (Iterative Dichotomiser 3), C4.5, Chi-squared Automatic Interaction Detection (CHAID), Decision Stump, Random Forest, Multivariate Adaptive Regression Splines (MARS), and the Gradient Boosting Machine (GBM).

Bayesian algorithms, including the Naive Bayes algorithm, Averaged One-Dependence Estimators (AODE), and the Bayesian Belief Network (BBN).

Please refer to FIG. 6, which is another schematic flowchart of the model training method provided by an embodiment of the present application. The model training method can be applied to an electronic device, and its flow may include:

In 401, a plurality of cleaning rules are acquired, together with the sample data to be cleaned corresponding to each cleaning rule.

In this embodiment of the present application, a cleaning-rule-oriented database may be created on the electronic device in advance. This database includes a cleaning rule sub-database, a sub-database of sample data to be cleaned, and a cleaning effect sub-database.

During model training, the electronic device can gather all possible cleaning rules and store them in the cleaning rule sub-database. For example, the electronic device stores the acquired cleaning rules in the cleaning rule sub-database as character strings.

In addition, for the cleaning rules stored in the cleaning rule sub-database, the electronic device further acquires the sample data to be cleaned corresponding to each rule and stores it in the sub-database of sample data to be cleaned. The sample data is stored as it is; for example, numeric sample data is still stored as a numeric type.

It should be noted that the electronic device can obtain the sample data to be cleaned locally, from other electronic devices, or from the Internet.

In 402, the cleaning effect achieved by each cleaning rule on its corresponding sample data to be cleaned is acquired.

In this embodiment of the present application, after acquiring the cleaning rules and the sample data to be cleaned corresponding to each rule, the electronic device further acquires the cleaning effect achieved by each cleaning rule on its corresponding sample data and stores these cleaning effects in the cleaning effect sub-database. For example, the cleaning effects can be stored in the cleaning effect sub-database in the form of a table.

In 403, the joint feature of each set of sample data to be cleaned and its corresponding cleaning effect is acquired by means of a generative adversarial network.

In this embodiment of the present application, for each set of sample data to be cleaned and its corresponding cleaning effect, the electronic device also acquires their joint feature and uses it as a joint deep representation of the sample data and its cleaning effect.

Considering that a generative adversarial network can generate additional sample data from existing data and has strong feature learning ability, the electronic device may use a generative adversarial network to acquire the joint feature of the sample data to be cleaned and the cleaning effect.

When acquiring the joint feature, the electronic device combines the sample data to be cleaned and its corresponding cleaning effect into a data pair, denoted <sample data to be cleaned, cleaning effect>, and then uses the generative adversarial network to construct the joint feature of that pair.

In 404, the lexical features of the one or more cleaning functions corresponding to each cleaning rule are acquired by means of an encoder neural network and used as the cleaning rule feature of that cleaning rule.

本申请实施例中,电子设备还获取各清洗规则的清洗规则特征,使用清洗规则特征来对清洗规则进行表征。In the embodiment of the present application, the electronic device further acquires the cleaning rule feature of each cleaning rule, and uses the cleaning rule feature to characterize the cleaning rule.

应当说明的是,在本申请实施例中,每一清洗规则均由一个或多个清洗函数构成,清洗函数用于实际实现清洗操作,包括但不限于缺失值处理、标准化处理、噪声消除处理等等。其中,清洗函数本身可由相关技术人员采用计算机程序语言(比如C语言、Java语言以及Python语言等)编写得到,比如正则表达式、过滤函数、SQL表达式等。It should be noted that, in the embodiments of the present application, each cleaning rule is composed of one or more cleaning functions, and the cleaning functions are used to actually implement cleaning operations, including but not limited to missing value processing, normalization processing, noise elimination processing, etc. Wait. Wherein, the cleaning function itself can be written by the relevant technical personnel using computer programming languages (such as C language, Java language, Python language, etc.), such as regular expressions, filter functions, SQL expressions, and the like.
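As an illustration of a cleaning rule built from cleaning functions, the following minimal Python sketch treats a rule as an ordered list of functions applied in sequence. The function names, the mean-fill strategy, and the three-sigma threshold are assumptions made for this example; the application does not prescribe them.

```python
from statistics import mean, pstdev


def fill_missing(values):
    """Missing value handling: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if observed else 0.0
    return [fill if v is None else v for v in values]


def drop_outliers(values, k=3.0):
    """Noise elimination: drop values more than k standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)
    return [v for v in values if abs(v - mu) <= k * sigma]


def standardize(values):
    """Standardization: rescale to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical cleaning rule: an ordered list of cleaning functions.
SENSOR_RULE = [fill_missing, drop_outliers, standardize]


def apply_rule(rule, values):
    """Call each cleaning function of the rule in order."""
    for cleaning_function in rule:
        values = cleaning_function(values)
    return values


cleaned = apply_rule(SENSOR_RULE, [1.0, None, 2.0, 3.0])
```

Representing a rule as a plain list keeps the rule easy to serialize as text, which matters later when the rule is tokenized for the encoder.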

When obtaining the cleaning rule feature of each cleaning rule, for any cleaning rule the electronic device performs word segmentation on the one or more cleaning functions corresponding to that rule to obtain the rule's lexical sequence, and then inputs the lexical sequence into the encoder neural network for encoding. The result is a lexical feature vector with representational capability, which serves as the cleaning rule feature of that rule.

For example, for a given cleaning rule, the electronic device performs word segmentation to obtain the lexical sequence C = (c_1, c_2, ..., c_n), inputs C into the encoder neural network, and obtains the lexical feature vector V = (v_1, v_2, ..., v_n), which is used as the cleaning rule feature of that rule.
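The segmentation and encoding steps can be sketched as follows. This is only a toy stand-in: the regular-expression tokenizer, the vocabulary builder, and the fixed, untrained recurrent update in `encode` are all assumptions for illustration, whereas the application relies on a trained encoder neural network whose structure it leaves open.

```python
import math
import re


def tokenize(function_text):
    """Split a cleaning function's text into identifiers, numbers, and symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S", function_text)


def build_vocab(token_sequences):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in token_sequences:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, dim=8):
    """Toy recurrent encoder with fixed, untrained weights (illustrative only).

    Returns one hidden state per token, i.e. V = (v_1, ..., v_n).
    """
    hidden = [0.0] * dim
    states = []
    for tok in tokens:
        idx = vocab[tok]
        # Deterministic pseudo-embedding derived from the token id.
        emb = [math.sin((idx + 1) * (j + 1)) for j in range(dim)]
        hidden = [math.tanh(0.5 * h + e) for h, e in zip(hidden, emb)]
        states.append(hidden)
    return states


# Hypothetical rule made of two cleaning functions, given as source text.
rule_functions = ["x.fillna(0)", "x.strip()"]
C = [tok for fn in rule_functions for tok in tokenize(fn)]
vocab = build_vocab([C])
V = encode(C, vocab)
```

Here `C` plays the role of the lexical sequence and `V` the role of the lexical feature vector; a real system would replace `encode` with the trained encoder.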

It should be noted that the embodiments of the present application do not limit the specific model or topology of the encoder neural network. The encoder neural network may be obtained by training a single-layer recurrent neural network, a multi-layer recurrent neural network, a convolutional neural network or a variant thereof, or a neural network with another structure. For example, a recurrent neural network may be used to construct the encoder neural network in this embodiment.

In 405, each joint feature is used as a training input and the cleaning rule feature corresponding to that joint feature is used as the target output; model training is performed with a conditional recurrent neural network to obtain the cleaning rule classification model.

In this embodiment, after obtaining the joint feature of each piece of sample data to be cleaned and its corresponding cleaning effect, and the cleaning rule feature of each cleaning rule, the electronic device uses each joint feature as a training input and the corresponding cleaning rule feature as the target output, and trains a conditional recurrent neural network to obtain a cleaning rule classification model that selects cleaning rules automatically.

For a clearer understanding of this embodiment, refer to FIG. 7, which is a schematic diagram of an application scenario of model training in the embodiment of the present application.

First, a database oriented to cleaning rules is constructed. It comprises three sub-databases: a cleaning rule sub-database, a sub-database of sample data to be cleaned, and a cleaning effect sub-database. All possible cleaning rules are gathered, and the sample data to be cleaned and the cleaning effect corresponding to each cleaning rule are collected. The cleaning rules are stored as strings in the cleaning rule sub-database, the sample data itself is stored in the sub-database of sample data to be cleaned, and the cleaning effects are stored as tables in the cleaning effect sub-database.
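One possible layout for the three sub-databases is sketched below with Python's built-in `sqlite3` module. The table names, column names, JSON encoding of samples, and the two effect metrics (`missing_ratio`, `noise_ratio`) are hypothetical; the application only states that rules are stored as strings and effects in tabular form.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cleaning_rules (
        rule_id   INTEGER PRIMARY KEY,
        rule_text TEXT NOT NULL            -- rule stored as a string
    );
    CREATE TABLE sample_data (
        sample_id INTEGER PRIMARY KEY,
        rule_id   INTEGER NOT NULL REFERENCES cleaning_rules(rule_id),
        payload   TEXT NOT NULL            -- raw sample, JSON-encoded here
    );
    CREATE TABLE cleaning_effects (
        sample_id     INTEGER PRIMARY KEY REFERENCES sample_data(sample_id),
        missing_ratio REAL,                -- hypothetical effect metric
        noise_ratio   REAL                 -- hypothetical effect metric
    );
    """
)

rule_id = conn.execute(
    "INSERT INTO cleaning_rules (rule_text) VALUES (?)",
    ("fill_missing; standardize",),
).lastrowid
sample_id = conn.execute(
    "INSERT INTO sample_data (rule_id, payload) VALUES (?, ?)",
    (rule_id, json.dumps([1.0, None, 3.0])),
).lastrowid
conn.execute(
    "INSERT INTO cleaning_effects (sample_id, missing_ratio, noise_ratio) "
    "VALUES (?, ?, ?)",
    (sample_id, 0.0, 0.05),
)

# Join the three tables to recover each rule with its <sample, effect> pair.
pairs = conn.execute(
    """
    SELECT r.rule_text, s.payload, e.missing_ratio, e.noise_ratio
    FROM cleaning_rules r
    JOIN sample_data s ON s.rule_id = r.rule_id
    JOIN cleaning_effects e ON e.sample_id = s.sample_id
    """
).fetchall()
```

The final query returns one row per <sample data to be cleaned, cleaning effect> pair together with the rule it belongs to, which is the grouping the training steps consume.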

Second, an encoder neural network built from a recurrent neural network encodes all the cleaning rules to obtain the corresponding lexical feature vectors, which serve as the cleaning rule features of the respective cleaning rules. In parallel, the sample data to be cleaned and the cleaning effect corresponding to each cleaning rule are formed into a data pair, denoted <sample data to be cleaned, cleaning effect>, and a generative adversarial network learns from the pairs of each cleaning rule to obtain the <sample data to be cleaned, cleaning effect> joint feature.

Finally, the <sample data to be cleaned, cleaning effect> joint feature corresponding to each cleaning rule is used as the training input, the rule's lexical feature vector is used as the target output, and a conditional recurrent neural network is trained to obtain the cleaning rule classification model.

Thus, once the data to be cleaned and the cleaning requirement are input into the trained cleaning rule classification model, the cleaning rule output by the model is obtained, and cleaning the data with that rule yields a cleaning effect that meets the cleaning requirement.

An embodiment of the present application further provides a data cleaning device. Refer to FIG. 8, which is a schematic structural diagram of the data cleaning device provided by an embodiment of the present application. The data cleaning device is applied to an electronic device and includes a data acquisition module 501, a requirement acquisition module 502, a rule determination module 503, and a data cleaning module 504, as follows:

The data acquisition module 501 is configured to acquire data to be cleaned that requires data cleaning;

The requirement acquisition module 502 is configured to acquire the cleaning requirement of the data to be cleaned;

The rule determination module 503 is configured to determine, according to the data to be cleaned, the cleaning requirement, and a pre-trained cleaning rule classification model, a target cleaning rule for performing data cleaning on the data to be cleaned;

The data cleaning module 504 is configured to perform data cleaning on the data to be cleaned according to the target cleaning rule, so that the cleaning effect on the data to be cleaned meets the cleaning requirement;

The cleaning rule classification model is obtained through model training that uses the cleaning rule feature characterizing a cleaning rule as the target output, and the joint feature characterizing the sample data to be cleaned corresponding to that cleaning rule and its cleaning effect as the training input.

In one implementation, when determining the target cleaning rule for performing data cleaning on the data to be cleaned according to the data to be cleaned, the cleaning requirement, and the pre-trained cleaning rule classification model, the rule determination module 503 may be configured to:

acquire the joint feature of the data to be cleaned and the cleaning requirement;

input the acquired joint feature into the cleaning rule classification model to obtain the cleaning rule feature output by the cleaning rule classification model;

determine the cleaning rule that matches the cleaning rule feature output by the cleaning rule classification model as the target cleaning rule for performing data cleaning on the data to be cleaned.

In one implementation, when determining the cleaning rule that matches the cleaning rule feature output by the cleaning rule classification model, the rule determination module 503 may be configured to:

acquire the similarity between the cleaning rule feature output by the cleaning rule classification model and the cleaning rule features of a plurality of pre-stored cleaning rules;

take a cleaning rule whose similarity reaches a preset similarity as the cleaning rule that matches the cleaning rule feature output by the cleaning rule classification model.
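The matching step can be sketched as a thresholded nearest-neighbor lookup. Cosine similarity, the 0.9 threshold, and the choice to return only the best-scoring rule when several reach the threshold are assumptions for this example; the application does not name a similarity measure or a tie-breaking policy.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_rule(predicted_feature, stored_features, threshold=0.9):
    """Return the name of the best stored rule reaching the threshold, else None."""
    best_name, best_score = None, threshold
    for name, feature in stored_features.items():
        score = cosine_similarity(predicted_feature, feature)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical pre-stored cleaning rule features.
stored = {
    "fill_then_standardize": [1.0, 0.0, 0.0],
    "drop_outliers": [0.0, 1.0, 0.0],
}
target = match_rule([0.9, 0.1, 0.0], stored)
```

Returning `None` when no rule reaches the threshold leaves the caller free to fall back to a default rule or to report that no suitable rule exists.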

In one implementation, when performing data cleaning on the data to be cleaned according to the target cleaning rule, the data cleaning module 504 may be configured to:

call the one or more cleaning functions corresponding to the target cleaning rule to perform data cleaning on the data to be cleaned.

In one implementation, when acquiring the data to be cleaned that requires data cleaning, the data acquisition module 501 may be configured to:

acquire sensor data collected by a sensor, and use the acquired sensor data as the data to be cleaned.

An embodiment of the present application further provides a model training device. Refer to FIG. 9, which is a schematic structural diagram of the model training device provided by an embodiment of the present application. The model training device is applied to an electronic device and includes a first acquisition module 601, a second acquisition module 602, a third acquisition module 603, and a model training module 604, as follows:

The first acquisition module 601 is configured to acquire a plurality of cleaning rules and to acquire the sample data to be cleaned corresponding to each cleaning rule;

The second acquisition module 602 is configured to acquire the cleaning effect that each cleaning rule achieves when cleaning its corresponding sample data to be cleaned;

The third acquisition module 603 is configured to acquire the joint feature of each piece of sample data to be cleaned and its corresponding cleaning effect, and to acquire the cleaning rule feature of each cleaning rule;

The model training module 604 is configured to perform model training with each joint feature as a training input and the cleaning rule feature corresponding to that joint feature as the target output, to obtain a cleaning rule classification model.

In one implementation, when acquiring the cleaning rule feature of each cleaning rule, the third acquisition module 603 may be configured to:

acquire the lexical features of the one or more cleaning functions corresponding to each cleaning rule as the cleaning rule feature of that cleaning rule.

In one implementation, when acquiring the lexical features of the one or more cleaning functions corresponding to each cleaning rule, the third acquisition module 603 may be configured to:

acquire, using an encoder neural network, the lexical features of the one or more cleaning functions corresponding to each cleaning rule.

In one implementation, when acquiring the joint feature of each piece of sample data to be cleaned and its corresponding cleaning effect, the third acquisition module 603 may be configured to:

acquire, using a generative adversarial network, the joint feature of each piece of sample data to be cleaned and its corresponding cleaning effect.

In one implementation, when performing model training with each joint feature as a training input and the corresponding cleaning rule feature as the target output to obtain the cleaning rule classification model, the model training module 604 may be configured to:

use each joint feature as a training input and the cleaning rule feature corresponding to that joint feature as the target output, and perform model training with a conditional recurrent neural network to obtain the cleaning rule classification model.

An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the stored computer program is executed on a computer, it causes the computer to perform the steps of the data cleaning method provided by this embodiment, or the steps of the model training method provided by this embodiment. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

An embodiment of the present application further provides an electronic device including a memory and a processor. By calling a computer program stored in the memory, the processor performs the steps of the data cleaning method provided by this embodiment, or the steps of the model training method provided by this embodiment.

In one embodiment, an electronic device is also provided. Referring to FIG. 10, the electronic device includes a processor 701 and a memory 702. The processor 701 is electrically connected to the memory 702.

The processor 701 is the control center of the electronic device. It connects the various parts of the electronic device through various interfaces and lines, and performs the functions of the electronic device and processes data by running or loading the computer programs stored in the memory 702 and calling the data stored in the memory 702.

The memory 702 may be used to store software programs and modules. The processor 701 performs various functional applications and data processing by running the computer programs and modules stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the computer programs required for at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the electronic device. In addition, the memory 702 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 702 may also include a memory controller to provide the processor 701 with access to the memory 702.

In this embodiment, the processor 701 in the electronic device loads the instructions corresponding to the processes of one or more computer programs into the memory 702, and runs the computer programs stored in the memory 702 to implement the following steps:

acquiring data to be cleaned that requires data cleaning;

acquiring the cleaning requirement of the data to be cleaned;

determining, according to the data to be cleaned, the cleaning requirement, and a pre-trained cleaning rule classification model, a target cleaning rule for performing data cleaning on the data to be cleaned;

performing data cleaning on the data to be cleaned according to the target cleaning rule, so that the cleaning effect on the data to be cleaned meets the cleaning requirement;

wherein the cleaning rule classification model is obtained through model training that uses the cleaning rule feature characterizing a cleaning rule as the target output, and the joint feature characterizing the sample data to be cleaned corresponding to that cleaning rule and its cleaning effect as the training input.

Alternatively, the processor 701 in the electronic device loads the instructions corresponding to the processes of one or more computer programs into the memory 702, and runs the computer programs stored in the memory 702 to implement the following steps:

acquiring a plurality of cleaning rules, and acquiring the sample data to be cleaned corresponding to each cleaning rule;

acquiring the cleaning effect that each cleaning rule achieves when cleaning its corresponding sample data to be cleaned;

acquiring the joint feature of each piece of sample data to be cleaned and its corresponding cleaning effect, and acquiring the cleaning rule feature of each cleaning rule;

performing model training with each joint feature as a training input and the cleaning rule feature corresponding to that joint feature as the target output, to obtain a cleaning rule classification model.

Refer to FIG. 11, which is another schematic structural diagram of the electronic device provided by an embodiment of the present application. It differs from the electronic device shown in FIG. 10 in that the electronic device further includes components such as an input unit 703 and an output unit 704.

The input unit 703 may be used to receive input digits, character information, or user characteristic information (such as a fingerprint), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.

The output unit 704, for example a screen, may be used to display information entered by the user or information provided to the user.

In this embodiment, the processor 701 in the electronic device loads the instructions corresponding to the processes of one or more computer programs into the memory 702, and runs the computer programs stored in the memory 702 to implement the following steps:

acquiring data to be cleaned that requires data cleaning;

acquiring the cleaning requirement of the data to be cleaned;

determining, according to the data to be cleaned, the cleaning requirement, and a pre-trained cleaning rule classification model, a target cleaning rule for performing data cleaning on the data to be cleaned;

performing data cleaning on the data to be cleaned according to the target cleaning rule, so that the cleaning effect on the data to be cleaned meets the cleaning requirement.

In one implementation, when determining the target cleaning rule for performing data cleaning on the data to be cleaned according to the data to be cleaned, the cleaning requirement, and the pre-trained cleaning rule classification model, the processor 701 may perform:

acquiring the joint feature of the data to be cleaned and the cleaning requirement;

inputting the acquired joint feature into the cleaning rule classification model to obtain the cleaning rule feature output by the cleaning rule classification model;

determining the cleaning rule that matches the cleaning rule feature output by the cleaning rule classification model as the target cleaning rule for performing data cleaning on the data to be cleaned.

In one implementation, when determining the cleaning rule that matches the cleaning rule feature output by the cleaning rule classification model, the processor 701 may perform:

acquiring the similarity between the cleaning rule feature output by the cleaning rule classification model and the cleaning rule features of a plurality of pre-stored cleaning rules;

taking a cleaning rule whose similarity reaches a preset similarity as the cleaning rule that matches the cleaning rule feature output by the cleaning rule classification model.

In one implementation, when performing data cleaning on the data to be cleaned according to the target cleaning rule, the processor 701 may perform:

calling the one or more cleaning functions corresponding to the target cleaning rule to perform data cleaning on the data to be cleaned.

In one implementation, when acquiring the data to be cleaned that requires data cleaning, the processor 701 may perform:

acquiring sensor data collected by a sensor, and using the acquired sensor data as the data to be cleaned.

Alternatively, the processor 701 in the electronic device loads the instructions corresponding to the processes of one or more computer programs into the memory 702, and runs the computer programs stored in the memory 702 to implement the following steps:

acquiring a plurality of cleaning rules, and acquiring the sample data to be cleaned corresponding to each cleaning rule;

acquiring the cleaning effect that each cleaning rule achieves when cleaning its corresponding sample data to be cleaned;

acquiring the joint feature of each piece of sample data to be cleaned and its corresponding cleaning effect, and acquiring the cleaning rule feature of each cleaning rule;

performing model training with each joint feature as a training input and the cleaning rule feature corresponding to that joint feature as the target output, to obtain a cleaning rule classification model.

In one implementation, when acquiring the cleaning rule feature of each cleaning rule, the processor 701 may perform:

acquiring the lexical features of the one or more cleaning functions corresponding to each cleaning rule as the cleaning rule feature of that cleaning rule.

In one implementation, when acquiring the lexical features of the one or more cleaning functions corresponding to each cleaning rule, the processor 701 may perform:

acquiring, using an encoder neural network, the lexical features of the one or more cleaning functions corresponding to each cleaning rule.

In one implementation, when acquiring the joint feature of each piece of sample data to be cleaned and its corresponding cleaning effect, the processor 701 may perform:

acquiring, using a generative adversarial network, the joint feature of each piece of sample data to be cleaned and its corresponding cleaning effect.

In one implementation, when performing model training with each joint feature as a training input and the corresponding cleaning rule feature as the target output to obtain the cleaning rule classification model, the processor 701 may perform:

using each joint feature as a training input and the cleaning rule feature corresponding to that joint feature as the target output, and performing model training with a conditional recurrent neural network to obtain the cleaning rule classification model.

In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed in one embodiment, refer to the relevant descriptions of the other embodiments.

It should be noted that, for the data cleaning method and the model training method of the embodiments of the present application, a person of ordinary skill in the art will understand that all or part of the process of implementing either method can be completed by a computer program controlling the relevant hardware. The computer program may be stored in a computer-readable storage medium, such as the memory of an electronic device, and executed by at least one processor in that electronic device; its execution may include the processes of the embodiments of the data cleaning method or the model training method. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.

For the data cleaning device and the model training device of the embodiments of the present application, the functional modules may be integrated in one processing chip, each module may exist physically on its own, or two or more modules may be integrated in one module. An integrated module may be implemented in the form of hardware or in the form of a software functional module. If an integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.

The data cleaning method, model training method, device, storage medium, and equipment provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help in understanding the method of the present application and its core idea. A person skilled in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (15)

1. A data cleaning method, applied to an electronic device, characterized by comprising the following steps:
acquiring data to be cleaned, which needs to be cleaned;
acquiring the cleaning requirement of the data to be cleaned;
determining a target cleaning rule for cleaning the data to be cleaned according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model;
performing data cleaning on the data to be cleaned according to the target cleaning rule, so that the cleaning effect on the data to be cleaned meets the cleaning requirement;
the cleaning rule classification model is obtained by performing model training by using a cleaning rule characteristic representing a cleaning rule as a target output and a combined characteristic representing sample data to be cleaned corresponding to the cleaning rule and a cleaning effect of the sample data to be cleaned as a training input.
2. The data cleaning method of claim 1, wherein the determining a target cleaning rule for data cleaning of the data to be cleaned according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model comprises:
acquiring the data to be cleaned and the combined characteristics of the cleaning requirements;
inputting the combined features into the cleaning rule classification model to obtain cleaning rule features output by the cleaning rule classification model;
and determining the cleaning rule matched with the cleaning rule characteristic as the target cleaning rule.
3. The data cleaning method of claim 2, wherein determining the cleaning rule that matches the cleaning rule feature comprises:
acquiring the similarity between the output cleaning rule feature and the cleaning rule feature of each of a plurality of pre-stored cleaning rules; and
determining a cleaning rule whose similarity reaches a preset similarity as the cleaning rule that matches the cleaning rule feature.
4. The data cleaning method of claim 1, wherein performing data cleaning on the data to be cleaned according to the target cleaning rule comprises:
calling one or more cleaning functions corresponding to the target cleaning rule to perform data cleaning on the data to be cleaned.
5. The data cleaning method of claim 1, wherein acquiring the data to be cleaned comprises:
acquiring sensor data collected by a sensor, and taking the sensor data as the data to be cleaned.
6. A model training method, applied to an electronic device, characterized by comprising the following steps:
acquiring a plurality of cleaning rules, and acquiring sample data to be cleaned corresponding to each cleaning rule;
acquiring the cleaning effect obtained when each cleaning rule is used to clean its corresponding sample data to be cleaned;
acquiring a joint feature of each sample data to be cleaned and its corresponding cleaning effect, and acquiring a cleaning rule feature of each cleaning rule; and
performing model training with each joint feature as the training input and the cleaning rule feature corresponding to each joint feature as the target output, to obtain a cleaning rule classification model.
7. The model training method of claim 6, wherein acquiring the cleaning rule feature of each cleaning rule comprises:
acquiring vocabulary features of the one or more cleaning functions corresponding to each cleaning rule as the cleaning rule feature of that cleaning rule.
8. The model training method of claim 7, wherein acquiring the vocabulary features of the one or more cleaning functions corresponding to each cleaning rule comprises:
acquiring the vocabulary features of the one or more cleaning functions corresponding to each cleaning rule by means of an encoder neural network.
9. The model training method of claim 8, wherein acquiring the joint feature of each sample data to be cleaned and its corresponding cleaning effect comprises:
acquiring the joint feature of each sample data to be cleaned and its corresponding cleaning effect by means of a generative adversarial network.
10. The model training method of claim 6, wherein performing model training with each joint feature as the training input and the cleaning rule feature corresponding to each joint feature as the target output, to obtain the cleaning rule classification model, comprises:
taking each joint feature as the training input and the cleaning rule feature corresponding to each joint feature as the target output, and performing model training with a conditional recurrent neural network to obtain the cleaning rule classification model.
11. A data cleaning apparatus, applied to an electronic device, characterized by comprising:
a data acquisition module, configured to acquire data to be cleaned;
a requirement acquisition module, configured to acquire a cleaning requirement for the data to be cleaned;
a rule determining module, configured to determine a target cleaning rule for cleaning the data to be cleaned according to the data to be cleaned, the cleaning requirement, and a pre-trained cleaning rule classification model; and
a data cleaning module, configured to clean the data to be cleaned according to the target cleaning rule, so that the cleaning effect on the data to be cleaned meets the cleaning requirement;
wherein the cleaning rule classification model is obtained by model training that takes a cleaning rule feature representing a cleaning rule as the target output and takes, as the training input, a joint feature representing both the sample data to be cleaned corresponding to that cleaning rule and the cleaning effect on that sample data.
12. A model training apparatus, applied to an electronic device, characterized by comprising:
a first acquisition module, configured to acquire a plurality of cleaning rules and to acquire sample data to be cleaned corresponding to each cleaning rule;
a second acquisition module, configured to acquire the cleaning effect obtained when each cleaning rule is used to clean its corresponding sample data to be cleaned;
a third acquisition module, configured to acquire a joint feature of each sample data to be cleaned and its corresponding cleaning effect, and to acquire a cleaning rule feature of each cleaning rule; and
a model training module, configured to perform model training with each joint feature as the training input and the cleaning rule feature corresponding to each joint feature as the target output, to obtain a cleaning rule classification model.
13. A storage medium storing a computer program which, when run on a computer, causes the computer to perform the data cleaning method of any one of claims 1 to 5 or the model training method of any one of claims 6 to 10.
14. An electronic device comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured to perform the data cleaning method of any one of claims 1 to 5 by invoking the computer program.
15. An electronic device comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured to perform the model training method of any one of claims 6 to 10 by invoking the computer program.
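The flow recited in claims 1 to 4 (build a joint feature, obtain a cleaning rule feature from the model, match it against pre-stored rules by similarity, then call the matched rule's cleaning functions) might be sketched as follows. This is only an illustrative sketch under stated assumptions, not the applicant's implementation: the rule store `RULES`, the functions `drop_missing` and `clip_outliers`, the concatenation in `joint_feature`, the cosine similarity measure, the 0.8 threshold, and `stub_model` are all hypothetical. In particular, `stub_model` is a hand-written placeholder standing in for the trained cleaning rule classification model, and picking the single best rule above the threshold is one possible reading of "reaching the preset similarity".

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]
CleaningFn = Callable[[list], list]


def drop_missing(values: list) -> list:
    """Hypothetical cleaning function: remove missing (None) readings."""
    return [v for v in values if v is not None]


def clip_outliers(values: list, low: float = 0.0, high: float = 100.0) -> list:
    """Hypothetical cleaning function: clamp readings into [low, high]."""
    return [min(max(v, low), high) for v in values]


# Hypothetical pre-stored rules: name -> (cleaning rule feature, cleaning functions).
RULES: Dict[str, Tuple[List[float], List[CleaningFn]]] = {
    "drop_missing": ([1.0, 0.0], [drop_missing]),
    "drop_then_clip": ([0.0, 1.0], [drop_missing, clip_outliers]),
}


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def joint_feature(data: list, requirement: Vector) -> List[float]:
    """Assumed joint feature: missing-value ratio concatenated with the requirement."""
    missing_ratio = sum(v is None for v in data) / len(data) if data else 0.0
    return [missing_ratio] + list(requirement)


def match_rule(predicted: Vector, rules=RULES, threshold: float = 0.8) -> Optional[str]:
    """Return the most similar stored rule whose similarity reaches the threshold."""
    best_name, best_sim = None, threshold
    for name, (feature, _functions) in rules.items():
        sim = cosine_similarity(predicted, feature)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name


def clean(data: list, requirement: Vector, model, rules=RULES, threshold: float = 0.8) -> list:
    """Select a target cleaning rule via the model, then apply its cleaning functions in order."""
    predicted = model(joint_feature(data, requirement))
    name = match_rule(predicted, rules, threshold)
    if name is None:
        return list(data)  # no stored rule is similar enough; leave the data unchanged
    for fn in rules[name][1]:
        data = fn(data)
    return data


def stub_model(joint: Vector) -> List[float]:
    """Placeholder for the trained model: maps the last requirement flag to a rule feature."""
    return [1.0 - joint[-1], joint[-1]]


readings = [12.0, None, 250.0, -5.0]
print(clean(readings, [0.0, 1.0], stub_model))  # [12.0, 100.0, 0.0]
```

Storing each rule as a feature vector next to its list of functions keeps the matching step of claim 3 separate from the function invocation of claim 4, so a trained model could replace `stub_model` without changing the rest of the sketch.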
CN201910282171.0A 2019-04-09 2019-04-09 Data cleaning method, model training method, device, storage medium and equipment Pending CN111797078A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910282171.0A CN111797078A (en) 2019-04-09 2019-04-09 Data cleaning method, model training method, device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910282171.0A CN111797078A (en) 2019-04-09 2019-04-09 Data cleaning method, model training method, device, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN111797078A true CN111797078A (en) 2020-10-20

Family

ID=72805340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910282171.0A Pending CN111797078A (en) 2019-04-09 2019-04-09 Data cleaning method, model training method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN111797078A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632051A (en) * 2020-12-25 2021-04-09 中国工商银行股份有限公司 Neural network-based database cleaning method and system
CN112860676A (en) * 2021-02-06 2021-05-28 高云 Data cleaning method applied to big data mining and business analysis and cloud server
CN113190542A (en) * 2021-05-19 2021-07-30 西安图迹信息科技有限公司 Big data cleaning and denoising method and system for power grid and computer storage medium
CN113420623A (en) * 2021-06-09 2021-09-21 山东师范大学 5G base station detection method and system based on self-organizing mapping neural network
WO2021189960A1 (en) * 2020-10-22 2021-09-30 平安科技(深圳)有限公司 Method and apparatus for training adversarial network, method and apparatus for supplementing medical data, and device and medium
CN115144025A (en) * 2022-06-22 2022-10-04 大庆恒驰电气有限公司 A sand condition detection system
CN115423115A (en) * 2022-07-28 2022-12-02 名日之梦(北京)科技有限公司 Data processing method, computer readable storage medium and electronic device
CN115438183A (en) * 2022-08-31 2022-12-06 广州宝立科技有限公司 Business website monitoring system based on natural language processing
CN116061189A (en) * 2023-03-08 2023-05-05 国网瑞嘉(天津)智能机器人有限公司 A robot operation data processing system, method, device, equipment and medium
CN116775639A (en) * 2023-08-08 2023-09-19 阿里巴巴(中国)有限公司 Data processing method, storage medium and electronic device
CN116842317A (en) * 2023-06-28 2023-10-03 中国平安财产保险股份有限公司 Data cleaning methods, devices, equipment and computer-readable storage media
CN118520229A (en) * 2024-07-23 2024-08-20 北京海天瑞声科技股份有限公司 Data cleaning method, device, product and medium based on large language model
CN120994653A (en) * 2025-10-15 2025-11-21 杭州微风企科技有限公司 Data intelligent cleaning methods, devices, computer equipment and storage media

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016165378A1 (en) * 2015-04-16 2016-10-20 国网新源张家口风光储示范电站有限公司 Energy storage power station mass data cleaning method and system
CN108734330A (en) * 2017-04-24 2018-11-02 北京京东尚科信息技术有限公司 Data processing method and device
CN108764372A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Data set construction method and device, mobile terminal, readable storage medium
CN108875821A (en) * 2018-06-08 2018-11-23 Oppo广东移动通信有限公司 Training method and device of classification model, mobile terminal and readable storage medium
CN109299233A (en) * 2018-09-19 2019-02-01 平安科技(深圳)有限公司 Text data processing method, device, computer equipment and storage medium

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021189960A1 (en) * 2020-10-22 2021-09-30 平安科技(深圳)有限公司 Method and apparatus for training adversarial network, method and apparatus for supplementing medical data, and device and medium
CN112632051B (en) * 2020-12-25 2024-06-14 中国工商银行股份有限公司 Database cleaning method and system based on neural network
CN112632051A (en) * 2020-12-25 2021-04-09 中国工商银行股份有限公司 Neural network-based database cleaning method and system
CN112860676A (en) * 2021-02-06 2021-05-28 高云 Data cleaning method applied to big data mining and business analysis and cloud server
CN113190542B (en) * 2021-05-19 2023-02-24 西安图迹信息科技有限公司 Big data cleaning and denoising method and system for power grid and computer storage medium
CN113190542A (en) * 2021-05-19 2021-07-30 西安图迹信息科技有限公司 Big data cleaning and denoising method and system for power grid and computer storage medium
CN113420623A (en) * 2021-06-09 2021-09-21 山东师范大学 5G base station detection method and system based on self-organizing mapping neural network
CN115144025A (en) * 2022-06-22 2022-10-04 大庆恒驰电气有限公司 A sand condition detection system
CN115423115A (en) * 2022-07-28 2022-12-02 名日之梦(北京)科技有限公司 Data processing method, computer readable storage medium and electronic device
CN115438183A (en) * 2022-08-31 2022-12-06 广州宝立科技有限公司 Business website monitoring system based on natural language processing
CN115438183B (en) * 2022-08-31 2023-07-04 广州宝立科技有限公司 Business website monitoring system based on natural language processing
CN116061189A (en) * 2023-03-08 2023-05-05 国网瑞嘉(天津)智能机器人有限公司 A robot operation data processing system, method, device, equipment and medium
CN116842317A (en) * 2023-06-28 2023-10-03 中国平安财产保险股份有限公司 Data cleaning methods, devices, equipment and computer-readable storage media
CN116775639A (en) * 2023-08-08 2023-09-19 阿里巴巴(中国)有限公司 Data processing method, storage medium and electronic device
CN118520229A (en) * 2024-07-23 2024-08-20 北京海天瑞声科技股份有限公司 Data cleaning method, device, product and medium based on large language model
CN120994653A (en) * 2025-10-15 2025-11-21 杭州微风企科技有限公司 Data intelligent cleaning methods, devices, computer equipment and storage media

Similar Documents

Publication Publication Date Title
CN111797078A (en) Data cleaning method, model training method, device, storage medium and equipment
CN110827129B (en) Commodity recommendation method and device
Nandedkar et al. A fuzzy min-max neural network classifier with compensatory neuron architecture
US20170344884A1 (en) Semantic class localization in images
CN110414550B (en) Training method, device and system of face recognition model and computer readable medium
CN111191136A (en) An information recommendation method and related equipment
CN111798018A (en) Behavior prediction method, behavior prediction device, storage medium and electronic equipment
CN105549885A (en) Method and device for recognizing user emotion during screen sliding operation
WO2020168451A1 (en) Sleep prediction method and apparatus, storage medium, and electronic device
Steyer et al. Elastic analysis of irregularly or sparsely sampled curves
CN114139630A (en) Gesture recognition method and device, storage medium and electronic equipment
WO2025039385A1 (en) Prediction model training method and apparatus, and storage medium and electronic device
CN112418256A (en) Classification, model training and information searching method, system and equipment
CN111797849B (en) User activity identification method, device, storage medium and electronic device
CN111816211B (en) Emotion recognition method and device, storage medium and electronic equipment
CN113164056A (en) Sleep prediction method, device, storage medium and electronic equipment
CN111797862A (en) Task processing method, device, storage medium and electronic device
CN113901880A (en) A real-time event stream identification method and system
US12314305B1 (en) System and method for generating an updated terminal node projection
Kasaei et al. An adaptive object perception system based on environment exploration and Bayesian learning
CN111797080A (en) Model training method, data recovery method, device, storage medium and equipment
CN111798000A (en) Data optimization method, device, storage medium and electronic device
CN111814812A (en) Modeling method, device, storage medium, electronic device and scene recognition method
CN111797075B (en) Data recovery method and device, storage medium and electronic equipment
CN111797856A (en) Modeling method, device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 2024-12-27)