CN116894073A - Sensitive data identification method, device and storage medium - Google Patents
- Publication number
- CN116894073A (application CN202310833297.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- data set
- fields
- sensitive
- sampled
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6227—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Bioethics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Medical Informatics (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application discloses a sensitive data identification method, device and storage medium. In the method, an electronic device acquires a first data set and a second data set. The first data set includes sensitive data in N fields, and the second data set includes data to be detected in M fields; the first data set is a pre-stored, labeled sensitive data set, and the second data set is a collected, unlabeled data set. The electronic device merges the data of fields that have the same character type in the first data set and the second data set to obtain a third data set, which includes data in S fields. The electronic device clusters the data in the third data set to obtain R classes of data. The electronic device determines the distribution difference between the sensitive data and the data to be detected across the R classes of the third data set; if the distribution difference is less than a first preset threshold, it determines that the data in the third data set is sensitive data. This method can improve the efficiency of sensitive data identification.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a sensitive data identification method, device and storage medium.
Background
Structured data refers to data stored in databases or tabular text files (for example, Excel or CSV files). Structured data is usually identified by fields, and the field name can be used to judge whether the data under that field is sensitive. For example, if a bank's database stores numerical data under a field named "amount", the data in that field can be identified as sensitive user data. In some cases, however, field names are missing, so sensitivity has to be judged from the data itself. At present this identification is mainly done manually, which is inefficient.
Summary of the invention
This application provides a sensitive data identification method, device and storage medium to address the low efficiency of sensitive data identification.
In a first aspect, this application provides a sensitive data identification method that can be applied to an electronic device with processing capability. The method includes the following. The electronic device acquires a first data set and a second data set. The first data set includes sensitive data in N fields, and the second data set includes data to be detected in M fields; the first data set is a pre-stored, labeled sensitive data set, the second data set is a collected, unlabeled data set, and M and N are both positive integers. The electronic device merges the data of fields that have the same character type in the first data set and the second data set to obtain a third data set, which includes data in S fields, where S is a positive integer. The electronic device clusters the data in the third data set to obtain R classes of data, where R is a positive integer. The electronic device determines the distribution difference between the sensitive data and the data to be detected across the R classes of the third data set; if the distribution difference is less than a first preset threshold, it determines that the data in the third data set is sensitive data.
In this embodiment of the present application, a field among the M fields of the second data set that has the same character type (for example, floating-point numbers or letters) as one of the N fields of the first data set is merged with that field. This preliminarily screens the fields of the second data set, removes fields that clearly do not hold sensitive data, and yields the third data set. The electronic device then clusters the data in the third data set into R classes and determines whether the data in the third data set is sensitive from the difference between how the sensitive data and the data to be detected are distributed across the R classes. Merging the first and second data sets removes data that is clearly not sensitive, which improves identification efficiency. Clustering the merged third data set then gives the distributions of the sensitive data and the data to be detected, from which the identification result is obtained efficiently.
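The clustering step can be illustrated with a minimal sketch. It assumes the merged field holds one-dimensional numeric values and uses plain k-means with R clusters; the application does not name a clustering algorithm at this point, so k-means, the function name `kmeans_1d` and the sample values are illustrative assumptions only.

```python
import random


def kmeans_1d(values, r, iterations=50, seed=0):
    """Cluster 1-D numeric values into r clusters; return one label per value.

    Plain Lloyd-style k-means, chosen here only for illustration.
    Assumes the input contains at least r distinct values.
    """
    rng = random.Random(seed)
    # Use r distinct values as the initial cluster centers.
    centers = rng.sample(sorted(set(values)), r)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center.
        new_labels = [
            min(range(r), key=lambda k: abs(v - centers[k])) for v in values
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(r):
            members = [v for v, lab in zip(values, new_labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
    return labels


# Hypothetical merged field: labeled sensitive values plus values to be detected.
sensitive = [4256.27, 54212.65, 1684.74, 3900.10]
to_detect = [4100.00, 52000.50, 1700.25]
merged = sensitive + to_detect
labels = kmeans_1d(merged, r=2)
```

Keeping the order of `merged` makes it possible to split `labels` back into the sensitive part and the part to be detected, which is what the distribution comparison needs.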
Optionally, merging the data of fields that have the same character type in the first data set and the second data set includes the following. The electronic device samples a first field among the N fields of the first data set and a second field among the M fields of the second data set, obtaining a plurality of first sampled data and a plurality of second sampled data. The electronic device determines whether the first sampled data and the second sampled data have the same character type. If they do, the electronic device merges the data of the first field with the data of the second field, and the merged data forms the data of one field of the third data set.
In this embodiment of the present application, sampling is used to compare whether two fields, one from each data set, have the same character type; if so, the data of the two fields is merged into one field of the third data set. Sampling improves the efficiency of merging.
Optionally, determining whether the first sampled data and the second sampled data have the same character type includes the following. The electronic device determines statistical parameters of the first sampled data and of the second sampled data, where the statistical parameters include the mean and/or the variance. The electronic device determines whether the character types are the same according to whether the difference between the statistical parameters of the first sampled data and the second sampled data is less than a second preset threshold.
In this embodiment of the present application, because the electronic device can determine character types only with limited accuracy, the statistical parameters of the sampled data can be used to judge whether the first sampled data and the second sampled data have the same character type. If the difference between the statistical parameters is less than the second preset threshold, the statistical distributions of the first sampled data and the second sampled data can be regarded as roughly the same, and their character types can be determined to be the same.
Optionally, determining the distribution difference between the sensitive data and the data to be detected across the R classes of the third data set includes the following. The electronic device determines a first set from the proportion of the sensitive data that falls in each of the R classes; the first set characterizes the distribution of the sensitive data across the R classes. The electronic device determines a second set from the proportion of the data to be detected that falls in each of the R classes; the second set characterizes the distribution of the data to be detected across the R classes. The electronic device takes the Euclidean distance between the first set and the second set as the distribution difference between the sensitive data and the data to be detected across the R classes.
In this embodiment of the present application, clustering the data in the third data set yields the first set for the sensitive data and the second set for the data to be detected. Whether the data to be detected in the third data set is sensitive can then be determined efficiently from the Euclidean distance between the first set and the second set.
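The distribution difference described above can be sketched as follows. The sketch assumes cluster labels are integers from 0 to R-1; the function names, the sample labels and the value of `FIRST_THRESHOLD` are hypothetical and not taken from the application.

```python
import math
from collections import Counter


def cluster_proportions(labels, r):
    """Return the share of items that fall in each of the r classes."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(r)]


def distribution_difference(sensitive_labels, detect_labels, r):
    """Euclidean distance between the two proportion vectors."""
    first_set = cluster_proportions(sensitive_labels, r)
    second_set = cluster_proportions(detect_labels, r)
    return math.dist(first_set, second_set)


# Hypothetical class labels (R = 3) for the two groups within one merged field.
sensitive_labels = [0, 0, 1, 2, 2, 2, 1, 0, 2, 2]  # proportions 0.3, 0.2, 0.5
detect_labels = [0, 1, 2, 2, 2, 0, 2, 2, 1, 0]     # proportions 0.3, 0.2, 0.5
FIRST_THRESHOLD = 0.1  # assumed value; the text does not fix the first threshold

diff = distribution_difference(sensitive_labels, detect_labels, r=3)
is_sensitive = diff < FIRST_THRESHOLD
```

Because both vectors are proportions, the distance does not depend on how many rows each data set contributes, only on how the rows spread across the classes.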
Optionally, before the electronic device clusters the data in the third data set, the method further includes: the electronic device converts character data in the third data set into numerical data.
In this embodiment of the present application, converting the character data in the third data set into numerical data can improve the processing efficiency of the electronic device.
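One simple way to turn character data into numerical data is to describe each string by a few character statistics. This is a sketch under stated assumptions, not the conversion used in the application: the function name `encode_text` and the chosen features (length and the shares of digits, letters and other characters) are hypothetical.

```python
def encode_text(text):
    """Map a string to a numeric feature vector.

    Features: length, share of digits, share of letters, share of other
    characters. An empty string maps to all zeros.
    """
    length = len(text)
    n = length or 1  # avoid division by zero for empty strings
    digits = sum(ch.isdigit() for ch in text)
    letters = sum(ch.isalpha() for ch in text)
    others = length - digits - letters
    return [length, digits / n, letters / n, others / n]


# Hypothetical values from unlabeled fields.
time_vector = encode_text("10:34")
name_vector = encode_text("SAFGJAIOJIAJDO")
```

Fixed-length numeric vectors like these can be passed directly to a distance-based clustering routine, which is why the conversion is done before clustering.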
In a second aspect, this application provides a sensitive data identification device. The device includes an acquisition module, a merging module, a clustering module and a determination module. The acquisition module is used to acquire a first data set and a second data set. The first data set includes sensitive data in N fields, and the second data set includes data to be detected in M fields; the first data set is a pre-stored, labeled sensitive data set, the second data set is a collected, unlabeled data set, and M and N are both positive integers. The merging module is used to merge the data of fields that have the same character type in the first data set and the second data set to obtain a third data set, which includes data in S fields, where S is a positive integer. The clustering module is used to cluster the data in the third data set to obtain R classes of data, where R is a positive integer. The determination module is used to determine the distribution difference between the sensitive data and the data to be detected across the R classes of the third data set; if the distribution difference is less than a first preset threshold, it determines that the data in the third data set is sensitive data.
Optionally, the merging module is specifically used to: sample a first field among the N fields of the first data set and a second field among the M fields of the second data set, obtaining a plurality of first sampled data and a plurality of second sampled data; determine whether the first sampled data and the second sampled data have the same character type; and, if they do, merge the data of the first field with the data of the second field, the merged data forming the data of one field of the third data set.
Optionally, the merging module is specifically used to: determine statistical parameters of the first sampled data and of the second sampled data, where the statistical parameters include the mean and/or the variance; and determine whether the first sampled data and the second sampled data have the same character type according to whether the difference between their statistical parameters is less than a second preset threshold.
Optionally, the determination module is specifically used to: determine a first set from the proportion of the sensitive data that falls in each of the R classes, the first set characterizing the distribution of the sensitive data across the R classes; determine a second set from the proportion of the data to be detected that falls in each of the R classes, the second set characterizing the distribution of the data to be detected across the R classes; and take the Euclidean distance between the first set and the second set as the distribution difference between the sensitive data and the data to be detected across the R classes.
Optionally, the clustering module is further used to convert character data in the third data set into numerical data.
In a third aspect, embodiments of the present application provide an electronic device that includes a processor and a memory communicatively connected to the processor. The memory stores computer-executable instructions which, when executed by the processor, enable the processor to perform the method of any implementation of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium that stores computer-executable instructions. When the computer-executable instructions are executed by a processor, the processor performs the method of any implementation of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product that includes a computer program stored in a computer-readable storage medium. A processor can read the computer program from the computer-readable storage medium, and when the processor executes the computer program, the method of any implementation of the first aspect is carried out.
Description of the drawings
Figure 1 is a schematic flow chart of a sensitive data identification method provided by an embodiment of the present application;
Figure 2 is a schematic structural diagram of a sensitive data identification device provided by an embodiment of the present application;
Figure 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed description
To make the solutions provided by the embodiments of the present invention easier to understand, some of the technical concepts involved are introduced first. It should be noted that, in the technical solution of this application, the collection, dissemination and use of data all comply with the relevant national laws and regulations.
As noted above, structured data in a database table can be identified by its fields to judge whether it is sensitive. In some cases, however, field names are missing and the data has to be identified from its content. Tables 1 and 2 below serve as examples.
Table 1
Table 2
Table 1 above shows how data is normally stored in a database table. The fields in Table 1 are clearly named, so whether data is sensitive can be told from its field. For example, the "name" field holds user names, among which is the sensitive value "Hu_Jingtao" (the name of a head of state). As another example, all data under the "amount" field is sensitive because it involves user privacy. The field that data belongs to therefore shows quickly whether the data is sensitive. As shown in Table 2, when field names are missing (a missing field is marked "?"), it is no longer possible to judge directly whether the data is sensitive. For example, the meaning of the values "4256.27", "54212.65" and "1684.74" in Table 2 cannot be determined by inspection.
The simplest approach is to identify the data manually, item by item, but manual identification is inefficient when the amount of data is large. Some machine learning algorithms exist for identifying sensitive data, but they all require the data to be fed into a model and identified item by item, so their identification efficiency is also low.
In view of this, the embodiments of this application provide a sensitive data identification method: the data to be detected is merged with the sensitive data, the merged data is clustered, and whether the data to be detected is sensitive is determined from the difference between the distributions of the data to be detected and the sensitive data after clustering. Because sensitive data is identified from the distribution of the data, identification is efficient.
Figure 1 shows a schematic flow chart of the sensitive data identification method provided by an embodiment of the present application. The flow in Figure 1 takes an electronic device performing sensitive data identification as an example. The electronic device may be any device with a data storage function, for example a terminal such as a smartphone, tablet computer or desktop computer.
For ease of description, the following distinguishes a data set whose fields are clearly named from a data set whose fields are missing. The data set with clearly named fields is called the first data set, as in Table 1. The data set with missing fields is called the second data set, as in Table 2. In other words, it is the data in the second data set that needs to be identified. It should be understood that "first" and "second" here only distinguish the data sets and do not limit their size, content, order, timing, priority or importance.
S101. The electronic device acquires a first data set and a second data set. The first data set includes sensitive data in N fields, and the second data set includes data to be detected in M fields. The first data set is a pre-stored, labeled sensitive data set, the second data set is a collected, unlabeled data set, and M and N are both positive integers.
The electronic device acquires the first data set and the second data set. The data in the first data set is sensitive data with clearly identified fields. The first data set may be a data set collected in advance by the electronic device. The pre-collected data set can be identified manually on the electronic device, and the fields that hold sensitive data can be labeled. After the first data set has been labeled, it contains sensitive data in N fields, where N is a positive integer. The electronic device can store the labeled first data set in advance.
The second data set is a data set that the electronic device has collected but that has not been labeled; some or all of its field names may be missing. The second data set may therefore contain both sensitive and non-sensitive data, and the sensitive data in it needs to be identified. For ease of description, the data in the second data set is called the data to be detected. The number of fields in the second data set is M; that is, the second data set includes data to be detected in M fields.
S102. The electronic device merges the data of fields that have the same character type in the first data set and the second data set to obtain a third data set. The third data set includes data in S fields, where S is a positive integer.
In this embodiment of the present application, identifying the second data set from its own data alone is inefficient. Since the first data set is known, labeled sensitive data, it can be used to help identify the second data set. The first data set includes data in N fields, all of which is sensitive; the second data set includes data to be detected in M fields, not all of which is sensitive. The data in the second data set may have a character type entirely different from the data in the first data set, or it may have the same character type and still not be sensitive. When identifying the data to be detected in the second data set, the electronic device can therefore disregard fields that clearly do not hold sensitive data, which reduces the amount of data to be identified and improves detection efficiency. For example, if an unlabeled field in the second data set holds the values "10:34", "09:42", "16:23" and so on, the character type indicates that the field is a "time" field and that its data is not sensitive. As another example, if an unlabeled field in the second data set holds the values "SAFGJAIOJIAJDO", "OPQJTEQIOTAD", "POQTEUPODFDAF" and so on, the data under that field is also clearly not sensitive.
To determine whether the character type of the data in the second data set matches that of the first data set, the electronic device can use sampling to compare, pair by pair, one of the N fields of the first data set (for example, a first field) with one of the M fields of the second data set (for example, a second field).
The electronic device can sample the first data set to obtain a plurality of first sampled data and sample the second data set to obtain a plurality of second sampled data. If the field from which the second sampled data is drawn is not a sensitive data field, its character type may be entirely different from that of the first sampled data. For example, the second sampled data may be strings of English letters, such as "SAFGJAIOJIAJDO" and "OPQJTEQIOTAD", while the first sampled data may be floating-point numbers, such as "4256.27" and "54212.65". In that case the character types of the first sampled data and the second sampled data can be determined to be different. It should be understood that the number of samples should be kept within a reasonable range: too few samples may not be representative, and too many may reduce the processing efficiency of the electronic device. The specific number can be set according to actual needs and is not limited by the embodiments of this application.
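A minimal sketch of comparing the character types of two sampled fields follows. The coarse character classes, the regular expressions, the default sample size and the function names are illustrative assumptions; the application does not specify how character types are recognized.

```python
import random
import re

# Coarse character classes, checked in order; the patterns are illustrative.
PATTERNS = [
    ("integer", re.compile(r"[+-]?\d+")),
    ("float", re.compile(r"[+-]?\d+\.\d+")),
    ("time", re.compile(r"\d{1,2}:\d{2}")),
    ("letters", re.compile(r"[A-Za-z_]+")),
]


def char_type(value):
    """Return the first coarse class whose pattern matches the whole value."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(value):
            return name
    return "other"


def field_char_type(field_values, sample_size=100, seed=0):
    """Sample a field and return the most common character class in the sample."""
    if not field_values:
        return "other"
    rng = random.Random(seed)
    sample = rng.sample(field_values, min(sample_size, len(field_values)))
    types = [char_type(v) for v in sample]
    return max(set(types), key=types.count)


# Values quoted in the text: a labeled amount field and an unlabeled field.
first_field = ["4256.27", "54212.65", "1684.74"]
second_field = ["SAFGJAIOJIAJDO", "OPQJTEQIOTAD", "POQTEUPODFDAF"]
same_type = field_char_type(first_field) == field_char_type(second_field)
```

Here `same_type` is false, so the second field would be left out of the third data set. The `sample_size` parameter reflects the trade-off noted above between representativeness and processing cost.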
Specifically, the electronic device can use statistical parameters of the first sampled data and the second sampled data, including the mean and/or the variance, to determine whether their character types are the same. That is, the electronic device can calculate the mean and/or variance of the first sampled data and the mean and/or variance of the second sampled data. Because statistical parameters characterize the distribution of data, the more similar the distributions indicated by the statistical parameters of the two samples, the higher the probability that their character types are the same. For example, if the first sampled data and the second sampled data are both "height" values, the means and/or variances calculated for the two samples should be close, both lying within the normal range of human height, and the electronic device can determine that the two have the same character type. As another example, if the first sampled data and the second sampled data are both letter-type "name" data, the electronic device can vectorize both samples with word2vec and calculate the mean and/or variance of the vectorized first sampled data and of the vectorized second sampled data; if the two are close, the electronic device can determine that the character types are the same. On the other hand, if the first sampled data consists of floating-point numbers and the second sampled data consists of strings of English letters, their statistical parameters differ markedly, and the electronic device can determine that the character types are different.
The statistical parameters characterize how closely the distributions of the plurality of first sampled data and the plurality of second sampled data approximate each other; in other words, the two distributions may differ. The electronic device may therefore set a second preset threshold for the statistical parameters of the plurality of first sampled data and the plurality of second sampled data. When the difference between the two sets of statistical parameters is less than the second preset threshold, the two distributions are close enough that the two character types can be considered the same. The second preset threshold is a preset threshold on the mean and/or variance of the sampled data. As an example, the second preset threshold for the mean may be 2 and the second preset threshold for the variance may be 3. That is, if the difference between the means of the plurality of first sampled data and the plurality of second sampled data is less than 2, and the difference between their variances is less than 3, the character types of the two can be considered the same. Correspondingly, if the difference between the means is greater than or equal to 2, or the difference between the variances is greater than or equal to 3, the character types of the two can be considered different, and the data corresponding to the second field can be disregarded.
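The threshold comparison described above can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the function name `same_character_type`, the use of population variance, and the sample values are assumptions; only the example thresholds 2 and 3 come from the text.

```python
from statistics import fmean, pvariance


def same_character_type(first_samples, second_samples,
                        mean_threshold=2.0, variance_threshold=3.0):
    """Return True when both the mean gap and the variance gap are below
    their thresholds (the "second preset threshold" of the text)."""
    mean_gap = abs(fmean(first_samples) - fmean(second_samples))
    # Population variance is an assumption; the text does not say which
    # variance estimator is used.
    variance_gap = abs(pvariance(first_samples) - pvariance(second_samples))
    return mean_gap < mean_threshold and variance_gap < variance_threshold


# Hypothetical samples: two "height" fields (metres) and one "amount" field.
heights_a = [1.62, 1.75, 1.80, 1.68]
heights_b = [1.70, 1.66, 1.79, 1.73]
amounts = [4256.27, 54212.65, 120.00, 987.50]

print(same_character_type(heights_a, heights_b))
print(same_character_type(heights_a, amounts))
```

Letter-type data would first need to be vectorized (the text mentions word2vec) before the same comparison applies; that step is omitted here.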
The electronic device may compare each field of the second data set with each field of the first data set one by one, and merge the fields whose data have the same character type. It should be noted that the electronic device determines whether the character types are the same from the plurality of first sampled data and the plurality of second sampled data obtained by sampling, but when merging it merges the data of the first field, to which the first sampled data belong, with the data of the second field, to which the second sampled data belong. In other words, sampling is used only to judge whether the character types of the two fields are the same; the merge itself covers all the data in the two fields. The electronic device merges the data of the first field with the data of the second field, and the merged data serves as the data of one field in the third data set. By merging the data of the fields with the same character type among the N fields of the first data set and the M fields of the second data set, the electronic device obtains the third data set. The number of fields in the third data set may be denoted S, where S is a positive integer. It should be understood that not all data in the third data set are sensitive data, so the data in the third data set require further judgment.
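The sample-then-merge step can be sketched as below, under stated assumptions: each data set is modelled as a dictionary from a field name to a list of numeric values, and `samples_look_alike`, `merge_matching_fields`, the field names, and the sample size are all hypothetical. The point illustrated is that the comparison runs on samples while the merge uses every value of both fields.

```python
import random
from statistics import fmean, pvariance


def samples_look_alike(first_samples, second_samples,
                       mean_threshold=2.0, variance_threshold=3.0):
    """Threshold test on the mean and variance gaps (example values 2 and 3)."""
    mean_gap = abs(fmean(first_samples) - fmean(second_samples))
    variance_gap = abs(pvariance(first_samples) - pvariance(second_samples))
    return mean_gap < mean_threshold and variance_gap < variance_threshold


def merge_matching_fields(first_set, second_set, sample_size=3, seed=0):
    """Compare every field pair on samples; merge the full fields on a match."""
    rng = random.Random(seed)
    third_set = {}
    for first_name, first_values in first_set.items():
        first_samples = rng.sample(first_values, min(sample_size, len(first_values)))
        for second_name, second_values in second_set.items():
            second_samples = rng.sample(second_values, min(sample_size, len(second_values)))
            if samples_look_alike(first_samples, second_samples):
                # Merge all values of both fields, not only the sampled ones.
                third_set[(first_name, second_name)] = first_values + second_values
    return third_set


first_set = {"height_m": [1.62, 1.75, 1.80, 1.68]}           # labelled sensitive data
second_set = {"column_a": [1.70, 1.66, 1.79, 1.73],          # data to be detected
              "column_b": [4256.27, 54212.65, 120.00, 987.50]}

third_set = merge_matching_fields(first_set, second_set)
print(list(third_set))
```

Here the number of keys in `third_set` plays the role of S.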
S103. The electronic device clusters the data in the third data set to obtain R classes of data, where R is a positive integer.
The electronic device may cluster the data in the third data set using the density-based spatial clustering of applications with noise (DBSCAN) algorithm. To aid understanding of this solution, the DBSCAN algorithm is briefly introduced below.
DBSCAN is a density-based clustering algorithm that generally assumes a class can be determined by how tightly the samples are distributed. DBSCAN requires a distance metric to be chosen. In the data set to be clustered, the distance between any two points reflects the density between them, and this distance is used to judge whether the points can be clustered into the same class. That is, if the distance between two points satisfies a certain condition, the two points are considered closely connected and can be assigned to the same class. DBSCAN requires the user to supply two parameters. One is the radius (EPS), which defines the circular neighborhood centered on a given point P. The other is the minimum number of points (minpts) in the neighborhood centered on point P. If the number of points in the neighborhood of radius EPS centered on point P is not less than minpts, point P is called a core point. Starting from point P, points within neighborhoods of radius EPS are searched for repeatedly until the above condition can no longer be satisfied; the points found are then grouped into the same class, that is, one cluster.
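The core-point expansion described above can be illustrated with a compact, unoptimized DBSCAN sketch. It assumes Euclidean distance, counts a point as belonging to its own neighborhood, and labels unreachable points as noise with -1; the function name `dbscan` and the toy points are hypothetical, and a production system would more likely rely on an existing library implementation.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""

    def neighbours(i):
        # Indices of all points within distance eps of point i (including i).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1               # provisional noise, may become a border point
            continue
        cluster += 1                     # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)   # j is also a core point: keep expanding
    return labels


points = [(1.0,), (1.1,), (1.2,), (5.0,), (5.1,), (5.2,), (9.0,)]
labels = dbscan(points, eps=0.3, min_pts=3)
print(labels)
```

In this toy input the two dense groups form two clusters and the isolated point at 9.0 is left as noise.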
Using the DBSCAN algorithm, the electronic device may calculate the distances between samples in the third data set. By setting the radius parameter EPS, it can identify the closely connected samples in the third data set and assign them to one class, which yields one cluster class. Assigning every group of closely connected samples to its own class yields the final clustering result. The electronic device may thus cluster the third data set into R clusters, that is, R classes, and the third data set can be regarded as yielding R classes of data after clustering. It should be noted that, before clustering the data in the third data set, the electronic device may convert character-type data in the third data set into numerical data, which can improve the efficiency with which the electronic device clusters the third data set.
S104. The electronic device determines the distribution difference between the sensitive data and the data to be detected across the R classes of data of the third data set; if the distribution difference is less than a first preset threshold, the data in the third data set are determined to be sensitive data.
The data in any one of the S fields of the third data set are merely data of the same character type; having the same character type does not mean that the data represent the same parameter. That is, non-sensitive data from the second data set may have been merged into the third data set, so that a given field among the S fields of the third data set may contain a large amount of non-sensitive data. For example, if in step S102 the floating-point fields of the first data set are merged with fields of the same character type in the second data set, the data in one of the S fields of the resulting third data set may include "amount", "height", "weight" and similar data from the first and second data sets. The R classes of data obtained by clustering the third data set can therefore be used to further identify the data in the third data set.
Because the third data set contains both the sensitive data from the first data set and the data to be detected from the second data set, after the data in the third data set are clustered, the sensitive data and the data to be detected may be distributed differently across the R classes of data. For example, if fields of non-sensitive data from the second data set were merged in, the data to be detected may contain a large amount of non-sensitive data. During clustering, the non-sensitive data form a cluster class far from the other data, that is, a relatively independent cluster, which makes the distributions of the sensitive data and the data to be detected differ considerably. The electronic device may therefore determine the distribution difference between the sensitive data and the data to be detected across the R classes of data of the third data set. If the distribution difference is too large, the data to be detected can be considered to include non-sensitive fields, and not all of the data to be detected are sensitive data.
Because the electronic device clusters the third data set into R classes of data, the distribution of a given kind of data can be determined from the proportion of that data falling in each of the R classes. A proportion may be expressed as a fraction or a percentage. Taking fractions as an example, the electronic device obtains R fractions, which may be written (n1, n2, …, nR). The set formed by these R fractions characterizes the distribution of the data across the R classes. It should be understood that the elements of this set sum to 1, that is, Σ ni = 1.
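Computing the proportions (n1, n2, …, nR) from cluster labels can be sketched as follows. The function name `cluster_proportions` and the label values are hypothetical; the sketch assumes every item has already been assigned one of the listed cluster ids.

```python
from collections import Counter


def cluster_proportions(labels, cluster_ids):
    """Return the fraction of items that fall in each cluster, in cluster_ids order."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[cluster_id] / total for cluster_id in cluster_ids]


# Hypothetical cluster labels for eight data items over R = 3 clusters.
labels = [0, 0, 1, 2, 1, 0, 0, 2]
proportions = cluster_proportions(labels, cluster_ids=[0, 1, 2])
print(proportions)        # the fractions sum to 1
```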
The electronic device may determine a first set from the proportion of the sensitive data in each of the R classes of data; the first set characterizes the distribution of the sensitive data across the R classes and may be written (a1, a2, …, aR). Likewise, the electronic device may determine a second set from the proportion of the data to be detected in each of the R classes of data; the second set characterizes the distribution of the data to be detected across the R classes and may be written (b1, b2, …, bR). The electronic device may take the Euclidean distance between the first set and the second set as the distribution difference between the sensitive data and the data to be detected across the R classes of data. The electronic device may set a first preset threshold to judge whether the distribution difference between the first set and the second set is too large. For example, the electronic device may set the first preset threshold to 4. When the distribution difference between the first set and the second set is greater than or equal to the first preset threshold, the data to be detected differ substantially from the sensitive data and can be considered not to be sensitive data. Correspondingly, when the distribution difference is less than the first preset threshold, the difference between the data to be detected and the sensitive data is small enough that the data to be detected can be considered sensitive data. It should be understood that the embodiments of the present application do not limit the specific value of the first preset threshold. The electronic device may set the first preset threshold small enough that the distributions of the data to be detected and the sensitive data are as consistent as possible, which helps ensure that the data to be detected are accurately identified as sensitive data.
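The distribution comparison of step S104 can be sketched as below. The function names, cluster labels, and threshold are illustrative assumptions. Because the first set and the second set each sum to 1, their Euclidean distance cannot exceed √2, so this sketch uses a hypothetical threshold of 0.5 rather than the example value 4 given in the text.

```python
import math
from collections import Counter


def distribution(labels, cluster_ids):
    """Fraction of items in each cluster, in cluster_ids order."""
    counts = Counter(labels)
    return [counts[cluster_id] / len(labels) for cluster_id in cluster_ids]


def distribution_difference(sensitive_labels, candidate_labels, cluster_ids):
    """Euclidean distance between the first set (a1..aR) and the second set (b1..bR)."""
    first_set = distribution(sensitive_labels, cluster_ids)
    second_set = distribution(candidate_labels, cluster_ids)
    return math.dist(first_set, second_set)


cluster_ids = [0, 1, 2]
sensitive = [0, 0, 1, 1]          # cluster labels of the labelled sensitive data
candidate_close = [0, 1, 0, 1]    # data to be detected, spread like the sensitive data
candidate_far = [2, 2, 2, 2]      # data to be detected, isolated in its own cluster

first_threshold = 0.5             # hypothetical value of the first preset threshold

for name, candidate in [("close", candidate_close), ("far", candidate_far)]:
    difference = distribution_difference(sensitive, candidate, cluster_ids)
    print(name, difference < first_threshold)
```

A candidate whose items land in the same clusters as the sensitive data yields a small distance and is treated as sensitive; one that forms its own isolated cluster yields a large distance and is not.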
Referring to FIG. 2, based on the same inventive concept, an embodiment of the present application provides a sensitive data identification apparatus 200. The apparatus 200 includes an acquisition module 201, a merging module 202, a clustering module 203 and a determination module 204. The acquisition module 201 is configured to acquire a first data set and a second data set, where the first data set includes sensitive data of N fields and the second data set includes data to be detected of M fields; the first data set is a pre-stored, labeled sensitive data set, the second data set is a collected, unlabeled data set, and M and N are both positive integers. The merging module 202 is configured to merge the data of fields with the same character type in the first data set and the second data set to obtain a third data set, where the third data set includes data of S fields and S is a positive integer. The clustering module 203 is configured to cluster the data in the third data set to obtain R classes of data, where R is a positive integer. The determination module 204 is configured to determine the distribution difference between the sensitive data and the data to be detected across the R classes of data of the third data set, and, if the distribution difference is less than a first preset threshold, to determine that the data in the third data set are sensitive data.
Optionally, the merging module 202 is specifically configured to: sample a first field among the N fields of the first data set and a second field among the M fields of the second data set, respectively, to obtain a plurality of first sampled data and a plurality of second sampled data; determine whether the character types of the plurality of first sampled data and the plurality of second sampled data are the same; and, if the character types are the same, merge the data of the first field with the data of the second field, the merged data serving as the data of one field in the third data set.
Optionally, the merging module 202 is specifically configured to: determine the statistical parameters of the plurality of first sampled data and of the plurality of second sampled data, respectively, where the statistical parameters include the mean and/or the variance; and determine whether the character types of the plurality of first sampled data and the plurality of second sampled data are the same according to whether the difference between their statistical parameters is less than a second preset threshold.
Optionally, the determination module 204 is specifically configured to: determine a first set from the proportion of the sensitive data in each of the R classes of data, the first set characterizing the distribution of the sensitive data across the R classes; determine a second set from the proportion of the data to be detected in each of the R classes of data, the second set characterizing the distribution of the data to be detected across the R classes; and determine the Euclidean distance between the first set and the second set as the distribution difference between the sensitive data and the data to be detected across the R classes of data.
Optionally, the clustering module 203 is further configured to convert character-type data in the third data set into numerical data.
Referring to FIG. 3, based on the same inventive concept, an embodiment of the present application provides an electronic device 300. The electronic device includes at least one processor 301, at least one memory 302, and computer program instructions stored in the memory; when the computer program instructions are executed by the processor, the sensitive data identification method described above is implemented.
Optionally, the processor 301 may be a central processing unit or an application-specific integrated circuit (ASIC), may be one or more integrated circuits for controlling program execution, may be a hardware circuit developed using a field-programmable gate array (FPGA), or may be a baseband processor.
Optionally, the electronic device further includes a memory 302 connected to the at least one processor 301. The memory 302 may include read-only memory (ROM), random access memory (RAM) and disk storage. The memory 302 is used to store the data required by the processor 301 at run time. There may be one or more memories 302. The memory 302 is shown in FIG. 3, but it should be noted that the memory 302 is not a mandatory functional module and is therefore drawn with a dashed line in FIG. 3.
Based on the same inventive concept, an embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to execute the sensitive data identification method described above.
In a specific implementation, the computer-readable storage medium includes any storage medium that can store program code, such as a Universal Serial Bus (USB) flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional modules above is used only as an example. In practice, the functions described above may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. For the specific working processes of the systems, apparatuses and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a Universal Serial Bus flash disk, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its scope. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310833297.9A CN116894073A (en) | 2023-07-07 | 2023-07-07 | Sensitive data identification method, device and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116894073A (en) | 2023-10-17 |
Family
ID=88312951
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310833297.9A Pending CN116894073A (en) | 2023-07-07 | 2023-07-07 | Sensitive data identification method, device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116894073A (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140115710A1 (en) * | 2012-10-19 | 2014-04-24 | Pearson Education, Inc. | Privacy Server for Protecting Personally Identifiable Information |
| CN111783126A (en) * | 2020-07-21 | 2020-10-16 | 支付宝(杭州)信息技术有限公司 | A privacy data identification method, apparatus, device and readable medium |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140115710A1 (en) * | 2012-10-19 | 2014-04-24 | Pearson Education, Inc. | Privacy Server for Protecting Personally Identifiable Information |
| CN111783126A (en) * | 2020-07-21 | 2020-10-16 | 支付宝(杭州)信息技术有限公司 | A privacy data identification method, apparatus, device and readable medium |
| US20220027505A1 (en) * | 2020-07-21 | 2022-01-27 | Alipay (Hangzhou) Information Technology Co., Ltd. | Method, apparatus, device, and readable medium for identifying private data |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||