CN113488127B - Sensitivity processing method and system for population health data set - Google Patents

Sensitivity processing method and system for population health data set

Info

Publication number
CN113488127B
Authority
CN
China
Prior art keywords
sensitive
data
sensitivity
information
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110856219.1A
Other languages
Chinese (zh)
Other versions
CN113488127A (en
Inventor
吴思竹
邬金鸣
钱庆
修晓蕾
钟明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Medical Information CAMS
Original Assignee
Institute of Medical Information CAMS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Medical Information CAMS
Priority to CN202110856219.1A
Publication of CN113488127A
Application granted
Publication of CN113488127B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a sensitivity processing method and a sensitivity processing system for population health data sets, wherein the sensitivity processing method comprises the following steps: acquiring a population health data set to be evaluated; carrying out sensitive information identification on each feature of the population health data set to obtain a sensitive feature corresponding to each feature, wherein the features comprise metadata features, data item features and data value features; analyzing each sensitive feature to obtain an analysis result corresponding to each sensitive feature; calculating based on the analysis result corresponding to each sensitive feature to obtain a sensitivity comprehensive evaluation result of the population health data set; and generating a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result. The invention realizes the discovery, identification, analysis and processing of the sensitive information, meets the application requirements of the sensitivity evaluation of the population health data set through multidimensional analysis, and improves the efficiency and the safety of the subsequent population health data application.

Description

A population health data set sensitivity processing method and system

Technical field

The present invention relates to the field of data processing technology, and in particular to a sensitivity processing method and system for population health data sets.

Background

Sharing population health data plays an important role in improving the quality and effectiveness of medical care, supporting medical management decisions, raising the level and transparency of medical research, and controlling medical costs. However, population health data contains a large amount of highly sensitive information about individuals or groups, such as identity information, health information, and genetic information. If this information leaks during data sharing, it exposes the country, society, or individuals to varying degrees of security risk and property loss. Data sensitivity processing is therefore an important foundation and prerequisite for desensitizing population health data and for collecting and sharing it.

Existing research concentrates on techniques and schemes for identifying and removing sensitive information in population health data. Methods for assessing data sensitivity are under-researched: there is no effective system of assessment methods, no comprehensive data sensitivity assessment process, and no systematic, comprehensive definition of the categories of sensitive information in population health data. In addition, existing sensitivity assessment work mostly measures sensitive information at the level of data records, attribute classes, attributes, and attribute values; little of it assesses overall sensitivity at the data set level. Furthermore, most existing methods only compute a sensitivity level or risk score; the calculation process and results are hard to interpret and understand, and are not presented in a form that both humans and machines can understand. Existing sensitivity processing therefore cannot meet practical needs for assessing the sensitivity of population health data, which reduces the security of subsequent data sharing and the efficiency of data processing.

Summary of the invention

In view of the above problems, the present invention provides a sensitivity processing method and system for population health data sets that meets practical assessment needs for population health data and improves the efficiency and security of data applications.

To achieve the above object, the present invention provides the following technical solutions:

A sensitivity processing method for population health data sets, including:

obtaining a population health data set to be evaluated;

performing sensitive information identification on each feature of the population health data set to obtain the sensitive feature corresponding to each feature, where the features include metadata features, data item features, and data value features;

analyzing each sensitive feature to obtain the analysis result corresponding to each sensitive feature;

performing a calculation based on the analysis result corresponding to each sensitive feature to obtain a comprehensive sensitivity assessment result for the population health data set;

generating a sensitivity assessment report for the population health data set based on the comprehensive sensitivity assessment result.

Optionally, performing sensitive information identification on each feature of the population health data set to obtain the sensitive feature corresponding to each feature includes:

obtaining the feature dimensions of the population health data, where the feature dimensions include metadata features, data item features, and data value features;

performing sensitive information identification on the metadata features based on target determination rules to obtain metadata sensitive features, where the target determination rules are determined from annotated metadata;

determining, based on a sensitive information type dictionary, whether the data item features include sensitive information type item words and, if so, obtaining data item sensitive features;

identifying the data values of the data item sensitive features and, if an identified data value meets the identification condition for the corresponding sensitive information value, obtaining data value sensitive features.

Optionally, analyzing each sensitive feature to obtain the analysis result corresponding to each sensitive feature includes:

analyzing the data volume, time span, object characteristics, topic type, and number of data subjects of the metadata sensitive features to obtain a metadata sensitive feature analysis result;

analyzing the sensitive information type characteristics and the sensitive information quantity characteristics of the data item sensitive features to obtain a data item sensitive feature analysis result;

analyzing the value quantity characteristics, value distribution characteristics, and value precision characteristics of the data value sensitive features to obtain a data value sensitive feature analysis result.
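The data value analysis above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `decimal_places` and `analyze_value_features`, the choice of statistics (non-null count, distinct count, share of singleton values, maximum number of decimal places), and the use of decimal places as a proxy for value precision are all assumptions made for illustration. Values are assumed to be hashable scalars.

```python
import math
from collections import Counter
from decimal import Decimal


def decimal_places(value):
    """Digits after the decimal point in the number's shortest string form."""
    exponent = Decimal(str(value)).as_tuple().exponent
    return max(0, -exponent)


def analyze_value_features(values):
    """Summarize one sensitive column (hypothetical statistics, see lead-in)."""
    # Value quantity: keep only cells that are neither None nor blank.
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    total = len(non_null)

    # Value distribution: how often each distinct value occurs.
    counts = Counter(non_null)
    singletons = sum(1 for c in counts.values() if c == 1)

    # Value precision: only finite int/float cells are considered.
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float))
        and not isinstance(v, bool)
        and math.isfinite(v)
    ]

    return {
        "non_null_count": total,
        "distinct_count": len(counts),
        # Share of non-null cells whose value appears exactly once.
        "unique_ratio": singletons / total if total else 0.0,
        "max_decimal_places": max((decimal_places(v) for v in numeric), default=0),
    }
```

For example, `analyze_value_features([36.5, 36.5, 37.25, None, "", 38])` counts four non-null cells and three distinct values; a high share of singleton values or a high precision would suggest that the column identifies individuals more strongly.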

Optionally, performing a calculation based on the analysis result corresponding to each sensitive feature to obtain the comprehensive sensitivity assessment result for the population health data set includes:

calculating a leakage loss degree value based on the metadata sensitive feature analysis result;

calculating an identification degree value based on the data item sensitive feature analysis result and the data value sensitive feature analysis result;

calculating the comprehensive sensitivity assessment result for the population health data set based on the leakage loss degree value and the identification degree value.

Optionally, generating the sensitivity assessment report for the population health data set based on the comprehensive sensitivity assessment result includes:

determining basic information of the population health data set;

determining, based on the comprehensive sensitivity assessment result, the information to be displayed for the sensitivity assessment result, where the information to be displayed includes the degree of data set expression, the data set leakage loss degree, the data set sensitivity reference value, and sensitive feature data;

determining sensitive feature marking information based on the comprehensive sensitivity assessment result;

generating the sensitivity assessment report for the population health data set from the basic information, the information to be displayed, and the sensitive feature marking information.

A sensitivity processing system for population health data sets, including:

an acquisition unit, configured to obtain a population health data set to be evaluated;

an identification unit, configured to perform sensitive information identification on each feature of the population health data set and obtain the sensitive feature corresponding to each feature, where the features include metadata features, data item features, and data value features;

an analysis unit, configured to analyze each sensitive feature and obtain the analysis result corresponding to each sensitive feature;

a calculation unit, configured to perform a calculation based on the analysis result corresponding to each sensitive feature and obtain a comprehensive sensitivity assessment result for the population health data set;

a generation unit, configured to generate a sensitivity assessment report for the population health data set based on the comprehensive sensitivity assessment result.

Optionally, the identification unit includes:

a first acquisition subunit, configured to obtain the feature dimensions of the population health data, where the feature dimensions include metadata features, data item features, and data value features;

a first identification subunit, configured to perform sensitive information identification on the metadata features based on target determination rules and obtain metadata sensitive features, where the target determination rules are determined from annotated metadata;

a first determination subunit, configured to determine, based on a sensitive information type dictionary, whether the data item features include sensitive information type item words and, if so, obtain data item sensitive features;

a second identification subunit, configured to identify the data values of the data item sensitive features and, if an identified data value meets the identification condition for the corresponding sensitive information value, obtain data value sensitive features.

Optionally, the analysis unit includes:

a first analysis subunit, configured to analyze the data volume, time span, object characteristics, topic type, and number of data subjects of the metadata sensitive features and obtain a metadata sensitive feature analysis result;

a second analysis subunit, configured to analyze the sensitive information type characteristics and the sensitive information quantity characteristics of the data item sensitive features and obtain a data item sensitive feature analysis result;

a third analysis subunit, configured to analyze the value quantity characteristics, value distribution characteristics, and value precision characteristics of the data value sensitive features and obtain a data value sensitive feature analysis result.

Optionally, the calculation unit includes:

a first calculation subunit, configured to calculate a leakage loss degree value based on the metadata sensitive feature analysis result;

a second calculation subunit, configured to calculate an identification degree value based on the data item sensitive feature analysis result and the data value sensitive feature analysis result;

a third calculation subunit, configured to calculate the comprehensive sensitivity assessment result for the population health data set based on the leakage loss degree value and the identification degree value.

Optionally, the generation unit includes:

a second determination subunit, configured to determine basic information of the population health data set;

a third determination subunit, configured to determine, based on the comprehensive sensitivity assessment result, the information to be displayed for the sensitivity assessment result, where the information to be displayed includes the degree of data set expression, the data set leakage loss degree, the data set sensitivity reference value, and sensitive feature data;

a fourth determination subunit, configured to determine sensitive feature marking information based on the comprehensive sensitivity assessment result;

a generation subunit, configured to generate the sensitivity assessment report for the population health data set from the basic information, the information to be displayed, and the sensitive feature marking information.

Compared with the prior art, the present invention provides a sensitivity processing method and system for population health data sets, including: obtaining a population health data set to be evaluated; performing sensitive information identification on each feature of the population health data set to obtain the sensitive feature corresponding to each feature, where the features include metadata features, data item features, and data value features; analyzing each sensitive feature to obtain the analysis result corresponding to each sensitive feature; performing a calculation based on the analysis result corresponding to each sensitive feature to obtain a comprehensive sensitivity assessment result for the population health data set; and generating a sensitivity assessment report for the population health data set based on the comprehensive sensitivity assessment result. The invention supports the discovery, identification, analysis, and processing of sensitive information, meets the application requirements of sensitivity assessment for population health data sets through multidimensional analysis, and improves the efficiency and security of subsequent applications of population health data.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are only embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic flow chart of a population health data set sensitivity processing method provided by an embodiment of the present invention;

Figure 2 is an architecture diagram of a population health data sensitivity assessment method provided by an embodiment of the present invention;

Figure 3 is a schematic diagram of metadata feature information provided by an embodiment of the present invention;

Figure 4 is a schematic diagram of the data sensitivity assessment dimensions provided by an embodiment of the present invention;

Figure 5 is a schematic diagram of a data sensitivity calculation process provided by an embodiment of the present invention;

Figures 6(a) to 6(c) are schematic diagrams of a data sensitivity assessment report provided by an embodiment of the present invention;

Figure 7 is a schematic structural diagram of a population health data set sensitivity processing system provided by an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

The terms "first", "second", and so on in the description, the claims, and the above drawings are used to distinguish different objects, not to describe a specific order. The terms "including" and "having", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units and may include steps or units that are not listed.

An embodiment of the present invention provides a sensitivity processing method for population health data sets. It is a data set sensitivity assessment method designed for the needs of population health data sharing: it supports the discovery, identification, analysis, and processing of sensitive information in shared population health data, and uses computer assistance to implement tiered management and secure sharing of population health data effectively.

Figure 1 is a schematic flow chart of a population health data set sensitivity processing method provided by an embodiment of the present invention. The method includes:

S101. Obtain the population health data set to be evaluated.

The population health data set to be evaluated is the original population health data set, which has not undergone any processing. Population health data contains a large amount of highly sensitive information about individuals or groups, such as identity information, health information, and genetic information; if it leaks during data sharing, it exposes the country, society, or individuals to varying degrees of security risk and property loss. Sensitivity processing is therefore performed in this embodiment to ensure the security of the population health data set in subsequent applications such as data sharing.

S102. Perform sensitive information identification on each feature of the population health data set and obtain the sensitive feature corresponding to each feature.

S103. Analyze each sensitive feature and obtain the analysis result corresponding to each sensitive feature.

In this embodiment, sensitive information is identified and analyzed along multiple dimensions of the population health data set. These dimensions correspond to the features of the data set, which include metadata features, data item features, and data value features.

Identification of sensitive information in a population health data set is based mainly on the organizational structure of the data set: sensitive information features are identified at different levels, namely the metadata that describes the data, the data items, and the data values. In this embodiment, based on the already defined categories of sensitive information in population health data, the method detects whether the data set contains sensitive information, which types it contains, which sensitive features it has, and where the sensitive information is located. Different identification procedures are used for the metadata, the data items, and the data values of a data set.

In one embodiment of the present invention, 12 metadata features to be identified are defined. Identifying data item features means detecting which data items in the tabular data set corresponding to the population health data set belong to the sensitive information types defined by the invention; this detection mainly targets relational data tables, that is, tables in which no cell contains a long passage of text. Identifying data value features has two parts. For relational data tables, the data columns already identified as a "sensitive information type" are examined for the number of non-null values, the distribution of values, and, for certain special data types, the precision of the values. For non-relational data tables, which may contain unstructured text, the specific sensitive information values are identified within that text and then analyzed statistically. Note that the specific kinds and types of metadata features, and the identification of data items and data values, are determined according to actual needs and are not limited by the present invention.
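The data item and data value identification described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's actual rules: the dictionary contents, the regular expressions, and the function names `identify_sensitive_items` and `identify_sensitive_values` are hypothetical. A real sensitive information type dictionary and real identification conditions would be defined according to actual needs.

```python
import re

# Hypothetical sensitive information type dictionary:
# sensitive type -> item words that may appear in a column header.
SENSITIVE_TYPE_DICTIONARY = {
    "name": ["name"],
    "id_number": ["id number", "id card"],
    "phone": ["phone", "mobile"],
}

# Hypothetical identification conditions for sensitive values, one per type.
VALUE_PATTERNS = {
    "id_number": re.compile(r"^\d{17}[\dXx]$"),
    "phone": re.compile(r"^1\d{10}$"),
}


def identify_sensitive_items(headers):
    """Map each header that contains a dictionary item word to its type."""
    found = {}
    for header in headers:
        lowered = header.lower()
        for info_type, item_words in SENSITIVE_TYPE_DICTIONARY.items():
            if any(word in lowered for word in item_words):
                found[header] = info_type
                break
    return found


def identify_sensitive_values(rows, sensitive_items):
    """Return (row index, header) positions whose value meets the condition."""
    positions = []
    for row_index, row in enumerate(rows):
        for header, info_type in sensitive_items.items():
            value = row.get(header)
            if value is None or value == "":
                continue
            pattern = VALUE_PATTERNS.get(info_type)
            # Types without a pattern accept any non-empty cell.
            if pattern is None or pattern.match(str(value)):
                positions.append((row_index, header))
    return positions
```

Returning positions rather than only a yes/no flag keeps the location of each sensitive value available for later marking. Substring matching on headers is a deliberately simple choice here; it can produce false positives on headers that merely contain an item word.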

A sensitive feature in the present invention is a set of interrelated indicators obtained by making the factors that affect data sensitivity systematic, concrete, and fine-grained. These indicators lay the foundation for data sensitivity assessment and are used to describe and reveal the sensitive information in a data set, so that data submitters, data managers, and data users can understand the sensitivity level of a population health data set more objectively and fairly. The sensitive feature analysis starts from three dimensions (metadata, data items, and data values) and covers four main factors: sensitive information type, sensitive information value, data subject (subject type, number of subjects), and overall descriptive information (data volume, data time, data subject type, and so on). The specific processing is described in subsequent embodiments of the present invention.

S104. Perform a calculation based on the analysis result corresponding to each sensitive feature to obtain the comprehensive sensitivity assessment result for the population health data set.

The sensitivity of a data set is determined by two main factors: the identification degree and the leakage loss degree, where the leakage loss degree is determined by the data topic type and the data volume. Both factors are positively correlated with data sensitivity. The data sensitivity calculation takes the results of sensitive information identification and feature analysis as input and has three stages: identification degree assessment (determining the identification degree value), leakage loss degree assessment (determining the leakage loss degree value), and comprehensive sensitivity calculation. It yields a sensitivity reference value that characterizes the level of sensitive information in the data set, which is the comprehensive sensitivity assessment result for the population health data set.
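The three calculation stages can be sketched as follows. The text above does not give the formulas, so everything numeric here is an assumption chosen only to respect the stated relationships: the topic weights in `TOPIC_WEIGHTS`, the logarithmic scaling of data volume, the equal 0.5 weights, and the product used to combine the two degrees are all hypothetical, as are the function names. Inputs are assumed to be ratios in the range 0 to 1.

```python
import math

# Hypothetical weights for data topic types (higher means greater loss if leaked).
TOPIC_WEIGHTS = {"general": 0.2, "clinical": 0.6, "genetic": 1.0}


def leakage_loss_degree(topic_type, record_count, max_records=1_000_000):
    """Leakage loss grows with topic weight and with (log-scaled) data volume."""
    topic_weight = TOPIC_WEIGHTS.get(topic_type, 0.5)
    volume = min(1.0, math.log10(max(record_count, 1)) / math.log10(max_records))
    return 0.5 * topic_weight + 0.5 * volume


def identification_degree(sensitive_item_ratio, unique_value_ratio):
    """Identification grows with the share of sensitive items and of unique values."""
    return 0.5 * sensitive_item_ratio + 0.5 * unique_value_ratio


def comprehensive_sensitivity(identification, leakage_loss):
    """Combine both degrees; the product rises when either positive factor rises."""
    return identification * leakage_loss
```

With these assumed formulas, a genetic data set with one million records gets a leakage loss degree of 1.0, and raising either the identification degree or the leakage loss degree raises the resulting sensitivity reference value, which matches the positive correlation stated above.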

S105. Generate a sensitivity assessment report for the population health data set based on the comprehensive sensitivity assessment result.

A sensitivity assessment report usually includes a unique test number so that the authenticity of the report can be verified, and it also records the time of the sensitivity test. Its content may include the basic information of the population health data set, the data sensitivity assessment result, the analysis of the data set's sensitive features, markers for the locations of sensitive information, and so on. The specific content can be determined according to actual needs and is not limited by the present invention.
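A report with these parts could be assembled as in the following sketch. The field names, the function name `build_report`, the use of a random UUID as the unique test number, and the use of UTC ISO 8601 timestamps are assumptions for illustration; the text above does not prescribe a report format.

```python
import uuid
from datetime import datetime, timezone


def build_report(basic_info, assessment_result, feature_analysis, positions):
    """Assemble a sensitivity assessment report as a plain dictionary."""
    return {
        # Unique test number, here a random UUID (an assumed scheme).
        "report_id": uuid.uuid4().hex,
        # Time of the sensitivity test, recorded in UTC.
        "assessed_at": datetime.now(timezone.utc).isoformat(),
        "basic_info": basic_info,
        "assessment_result": assessment_result,
        "sensitive_feature_analysis": feature_analysis,
        "sensitive_positions": positions,
    }
```

A dictionary is used so the same report can be rendered for human readers or serialized for machines. Note that a random identifier only makes reports distinguishable; verifying authenticity would additionally require the issuer to keep a registry of issued numbers or to sign the report.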

An embodiment of the present invention provides a sensitivity processing method for a population health data set, comprising: obtaining the population health data set to be assessed; performing sensitive information identification on each feature of the population health data set to obtain the sensitive feature corresponding to each feature, the features including metadata features, data item features, and data value features; analyzing each sensitive feature to obtain the analysis result corresponding to each sensitive feature; performing calculation based on the analysis result corresponding to each sensitive feature to obtain a comprehensive sensitivity assessment result for the population health data set; and generating a sensitivity assessment report for the population health data set based on the comprehensive sensitivity assessment result. The present invention realizes the discovery, identification, analysis, and processing of sensitive information, meets the application requirements of sensitivity assessment for population health data sets through multi-dimensional analysis, and improves the efficiency and security of subsequent applications of population health data.

Referring to Figure 2, which is an architecture diagram of a population health data sensitivity assessment method provided by an embodiment of the present invention. The architecture mainly comprises four parts: sensitive information identification, multi-dimensional sensitive feature analysis, sensitivity assessment calculation, and assessment report generation. The input to the architecture is the population health data set to be assessed, and the output is the sensitivity assessment report for that data set.

Sensitive information identification for a population health data set is based mainly on the organizational structure of the data set. It covers sensitive information features at three levels: the metadata that describes the data, the data items, and the data values. Based on the already defined scope of sensitive information in population health data, the present invention detects whether the data set contains sensitive information, which types it contains, which sensitive features it has, and where the sensitive information is located. Different identification methods are used for the metadata, the data items, and the data values of a data set.

Correspondingly, in one implementation of the present invention, performing sensitive information identification on each feature of the population health data set to obtain the sensitive feature corresponding to each feature comprises:

obtaining the feature dimensions of the population health data, the feature dimensions including metadata features, data item features, and data value features;

performing sensitive information identification on the metadata features based on target determination rules to obtain the metadata sensitive features, the target determination rules being determined based on annotated metadata;

determining, based on a sensitive information type dictionary, whether the data item features include entry terms of a sensitive information type, and if so, obtaining the data item sensitive features;

identifying the data values of the data item sensitive features, and if an identified data value satisfies the identification condition for a sensitive information value, obtaining the data value sensitive feature.

Metadata features, data item features, and data value features are identified in different ways. Specifically, metadata features are identified through the target determination rules, that is, rules that need to be set in advance. Data item features can be identified through the sensitive information type dictionary. Once the sensitive data item features have been identified, their corresponding data values are examined to determine whether those values constitute sensitive information that could be leaked.

Specifically, the embodiment of the present invention defines 12 items of metadata feature information to be identified. Referring to Figure 3, which is a schematic diagram of the metadata feature information provided by the embodiment of the present invention, metadata feature identification is divided into two parts. The first part can be read directly from the metadata annotations of the data set: the four metadata features of record count, time span, temporal novelty, and whether the data set involves human genetic resources are obtained from the "number of data records", "time range", and "whether human genetic resources are involved" fields in the "basic information" of the population health data set.

The second part of metadata feature identification covers eight features that must be determined by preset rules: "age characteristics of the data subjects", "aggregated versus individual-level data", "whether biometric data is included", "whether clinical records are involved", "whether sensitive clinical records are involved", "whether the data is finance-related", "number of data subjects", and "number of sensitive information types". The rules draw mainly on the "data set name", "keywords", "data description", and related information. Of these features, the number of data subjects and the number of sensitive information types can only be calculated after the data items and data values have been identified.
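The two-part metadata identification described above can be sketched as follows. This is only a minimal illustration: the keyword lists in `RULES`, the field names of `basic_info`, the "YYYY-YYYY" format of the time range, and the inclusive counting of years are all assumptions made for the example and are not specified in this description.

```python
import re

# Hypothetical keyword rules (assumption); a real rule set would be curated.
RULES = {
    "involves_biometric_data": ("指纹", "虹膜", "基因", "fingerprint", "iris", "genom"),
    "involves_clinical_records": ("病历", "临床", "clinical", "medical record"),
    "finance_related": ("费用", "医保", "cost", "insurance"),
}


def metadata_features(basic_info):
    """Derive a few metadata features from a data set's basic information."""
    # Part 1: features read directly from the metadata annotations.
    years = [int(y) for y in re.findall(r"\d{4}", basic_info["time_range"])]
    features = {
        "record_count": basic_info["record_count"],
        "time_span_years": years[-1] - years[0] + 1,  # inclusive span (assumption)
        "temporal_novelty": years[-1],                # most recent year covered
    }
    # Part 2: features decided by rules over name, keywords, and description.
    text = " ".join(
        basic_info.get(key, "") for key in ("name", "keywords", "description")
    ).lower()
    for flag, words in RULES.items():
        features[flag] = any(word in text for word in words)
    return features
```

Features such as the number of data subjects are deliberately left out here, since the text states that they are computed only after data items and data values have been identified.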

Identification of data item features means detecting which data items in the tabular data set corresponding to the population health data set belong to the sensitive information types defined in this work. Detection targets relational data tables, that is, tables in which no cell contains a long passage of text. In view of the characteristics of population health data sets, 44 sensitive information types are provisionally selected for identification.

For relational tabular data, sensitive item identification considers both data item features and data value features, and works in two directions: top-down and bottom-up. In the top-down direction, the name, meaning or annotation, and data dictionary of each data item (field or variable) are examined first. Based on the "sensitive information type dictionary", it is determined whether an entry term of a sensitive information type appears; if so, the column is assigned to that sensitive information type. For example, the entry terms of the sensitive information type "name" include "本人姓名" (own name), "患者姓名" (patient name), "直系亲属姓名" (name of immediate family member), "家庭成员姓名" (name of family member), "单位联系人姓名" (name of employer contact), "病史陈述者" (medical history informant), "name", "patient", "participant", "XM", "XRXM", and so on. The first 50 instances of the column are then read and checked against the "sensitive information value dictionary" or the "regular expression rule base" to verify that they have the value characteristics of the inferred sensitive information type. If they match, processing continues; if not, the column is marked as "pending".

In the bottom-up direction, if no entry term appears in the field name or its annotation, the first 50 instances are read and the "sensitive information value dictionary" or the "regular expression rule base" is used to determine whether the instances are values of a sensitive information type, and of which one. The column is then generalized upward to that sensitive information type, and its field name is added to the "sensitive information type dictionary", thereby expanding the dictionary. For example, if the sensitive information type dictionary contains the sensitive words A, B, and C, then A, B, and C can be recognized as sensitive words. If the regular expression rule base reveals that D is also a sensitive word, D can be added to the dictionary, and the updated and expanded dictionary will recognize D as a sensitive word from then on.
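The top-down and bottom-up procedure can be illustrated with a short sketch. The dictionary entries, the two value patterns, the status labels, and the `min_share` agreement threshold are hypothetical choices made for this example; the description only states that the first 50 instances are checked against a value dictionary or a regular expression rule base.

```python
import re

# Hypothetical miniature "sensitive information type dictionary":
# type name -> entry terms matched against field names.
TYPE_DICTIONARY = {
    "name": {"patient name", "name", "xm"},
    "phone": {"phone", "mobile", "tel"},
}

# Hypothetical "regular expression rule base": type name -> value pattern.
VALUE_RULES = {
    "phone": re.compile(r"1[3-9]\d{9}"),          # assumed mobile number format
    "name": re.compile(r"[\u4e00-\u9fff]{2,4}"),  # assumed 2-4 Chinese characters
}

SAMPLE_SIZE = 50  # the description reads the first 50 instances


def share_matching(values, pattern):
    """Fraction of non-empty sampled values that fully match the pattern."""
    sample = [v for v in values[:SAMPLE_SIZE] if v]
    if not sample:
        return 0.0
    return sum(1 for v in sample if pattern.fullmatch(v)) / len(sample)


def classify_column(field_name, values, min_share=0.8):
    """Return (sensitive_type, status) for one column of a relational table."""
    key = field_name.strip().lower()
    # Top-down: look the field name up in the type dictionary first,
    # then verify the inferred type against the sampled values.
    for sens_type, terms in TYPE_DICTIONARY.items():
        if key in terms:
            ok = share_matching(values, VALUE_RULES[sens_type]) >= min_share
            return sens_type, "confirmed" if ok else "pending"
    # Bottom-up: no entry term matched, so infer the type from the values
    # and add the field name to the dictionary.
    for sens_type, pattern in VALUE_RULES.items():
        if share_matching(values, pattern) >= min_share:
            TYPE_DICTIONARY[sens_type].add(key)
            return sens_type, "inferred"
    return None, "not sensitive"
```

After a bottom-up hit, a later column with the same field name is resolved top-down, which mirrors the dictionary expansion described above.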

Identification of data value features has two aspects. On the one hand, for the columns of a relational data table that have already been identified as a "sensitive information type", it determines the number of non-null values, the distribution of values, and the precision of certain special data types. For example, it determines whether time information is precise to the year, month, or day; whether location information is precise to the province, city, county (district), or street; whether age information is an exact age such as 14 or only an age range such as 10-20; and, for disease information, whether the disease is a sensitive disease and which sensitive disease category it belongs to (mental disorders, sex and reproduction, genetic diseases, or socially sensitive infectious diseases). On the other hand, non-relational data tables may contain unstructured text, in which the specific sensitive information values must be identified before the subsequent statistical analysis is carried out.
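As a rough illustration of the value-level checks for a relational column, the sketch below counts non-null values and classifies the precision of dates and ages. The accepted string formats and the label names ("day", "month", "year", "exact", "range") are assumptions made for the example.

```python
import re
from collections import Counter


def date_precision(value):
    """Classify a date string as precise to the 'day', 'month', or 'year'."""
    if re.fullmatch(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}", value):
        return "day"
    if re.fullmatch(r"\d{4}[-/]\d{1,2}", value):
        return "month"
    if re.fullmatch(r"\d{4}", value):
        return "year"
    return None


def age_precision(value):
    """Distinguish an exact age such as '14' from a range such as '10-20'."""
    if re.fullmatch(r"\d{1,3}", value):
        return "exact"
    if re.fullmatch(r"\d{1,3}\s*-\s*\d{1,3}", value):
        return "range"
    return None


def date_column_profile(values):
    """Count non-null values and tally the precision level of each one."""
    non_null = [v for v in values if v not in (None, "")]
    precision = Counter(date_precision(v) for v in non_null)
    return {"non_null": len(non_null), "precision": dict(precision)}
```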

By data type, sensitive information falls into three broad categories: numeric types such as ID numbers, telephone numbers, and postal codes; date types such as dates of birth and dates of medical procedures; and named entity types such as names, addresses, and medical institutions. Regular expression rules are used to identify numeric and date type sensitive information; examples of the rules are given in Table 1. Although information such as ID numbers is easily matched by such regular expressions, it is also easily confused with other codes, for example the image codes of electrocardiograms. To preserve the precision of numeric type identification, the regular expressions are first used to produce a candidate set, and contextual semantic checks are then applied to screen the numeric type protected health information candidates: if words such as "image" or "drug" appear in the description of the field, the candidate is deleted from the candidate result set. Population health data may also contain relative times, such as "last month" or "last winter". Because an attacker cannot derive a specific date from a relative time, such information is not processed for the time being.

Table 1. Examples of regular expressions for sensitive information identification
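The candidate-then-filter approach for numeric and date types might look like the following sketch. The two patterns are illustrative assumptions and are not the entries of Table 1; the exclusion words follow the "image" and "drug" example given in the text.

```python
import re

# Assumed pattern for an 18-character ID number (illustrative only).
ID_NUMBER = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
# Assumed pattern for absolute dates such as 2021-07-28 or 2021/7/28.
DATE = re.compile(
    r"(?<!\d)(?:19|20)\d{2}[-/.](?:0?[1-9]|1[0-2])[-/.]"
    r"(?:0?[1-9]|[12]\d|3[01])(?!\d)"
)

# Words in a field description that point to image or drug codes.
EXCLUSION_WORDS = ("影像", "药品", "image", "drug")


def find_candidates(text):
    """Return (type, matched_text) pairs found by the regular expressions."""
    hits = [("id_number", m.group()) for m in ID_NUMBER.finditer(text)]
    hits += [("date", m.group()) for m in DATE.finditer(text)]
    return hits


def filter_by_context(field_description, candidates):
    """Drop numeric candidates when the field description mentions images or drugs."""
    description = field_description.lower()
    if any(word in description for word in EXCLUSION_WORDS):
        return [c for c in candidates if c[0] != "id_number"]
    return candidates
```

Relative times such as "last month" are not matched by either pattern, consistent with the decision above to leave them unprocessed.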

Named entity type data such as addresses and medical institution names is identified with a dictionary-based method, optimized with an Aho-Corasick (AC) automaton. For disease-related sensitive information, a sensitive information dictionary was built by reusing the Chinese edition of the International Classification of Diseases (ICD) and the Chinese Medical Subject Headings (CMesh). It covers four categories of sensitive diseases (mental disorders, sex and reproduction, genetic diseases, and socially sensitive infectious diseases), currently contains 405 subject headings with their corresponding entry terms, and can be extended later as applications require. To identify the names of data subjects, the "Table of Common Chinese Surnames" is applied. To identify ethnicity, a list of China's ethnic groups and a dictionary of their abbreviations were built. For marital status, occupation, religious belief, and educational background, separate dictionaries representing the corresponding states were built. To identify addresses, the "National Five-Level Administrative Division Table", which contains 747,748 records, is applied. To identify medical institution names, the Dictionary of Chinese Medical and Health Institutions is applied, which covers 49,582 medical and health institutions of all levels nationwide.
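To show why an Aho-Corasick automaton suits dictionary-based matching against large term lists, here is a minimal pure-Python sketch of the standard algorithm. It scans the text once regardless of how many terms the dictionary holds. The class and method names are illustrative, and the sample terms in the usage note are not taken from the dictionaries named above.

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton for multi-term dictionary matching."""

    def __init__(self, terms):
        # Node i has a transition table, a failure link, and output terms.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]
        for term in terms:
            self._insert(term)
        self._build_failure_links()

    def _insert(self, term):
        node = 0
        for ch in term:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(term)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                fallback = self.fail[node]
                while fallback and ch not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[child] = self.goto[fallback].get(ch, 0)
                # Inherit terms that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def find_all(self, text):
        """Return (start_index, term) for every dictionary term found in text."""
        node, matches = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for term in self.out[node]:
                matches.append((i - len(term) + 1, term))
        return matches
```

For example, `AhoCorasick(["北京", "协和医院"]).find_all("患者就诊于北京协和医院")` reports both terms together with their start offsets, which can also serve as location marks.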

In one implementation of the present invention, analyzing each sensitive feature to obtain the analysis result corresponding to each sensitive feature comprises:

analyzing the data quantity, time span, object characteristics, topic type, and number of subjects of the metadata sensitive features to obtain the metadata sensitive feature analysis result;

analyzing the sensitive information type features and sensitive information quantity features of the data item sensitive features to obtain the data item sensitive feature analysis result;

analyzing the value quantity features, value distribution features, and value precision features of the data value sensitive features to obtain the data value sensitive feature analysis result.

In the present invention, sensitive features are a set of interrelated indicators obtained by making the factors that affect data sensitivity systematic, concrete, and fine-grained. They lay the foundation for data sensitivity assessment and are used to describe and reveal the sensitive information in a data set, so that data submitters, data managers, and data users can understand the sensitivity level of a population health data set more objectively and fairly. Referring to Figure 4, which is a schematic diagram of the data sensitivity assessment dimensions provided by an embodiment of the present invention, sensitive feature analysis proceeds along three dimensions (metadata, data items, and data values) and covers four main factors: sensitive information types, sensitive information values, data subjects (subject type and number of subjects), and overall descriptive information (data volume, data time, data topic type, and so on).

The metadata dimension analyzes 12 metadata features: the number of records in the data set, the time span, the temporal novelty, the age characteristics of the data subjects, aggregated versus individual-level data, whether human genetic resources are involved, whether biometric data is involved, whether clinical records are involved, whether sensitive clinical records are involved, whether finance-related data is involved, the number of data subjects involved, and the number of sensitive information types. Data sensitivity is a measure of the amount of sensitive information in a data set, and quantity is a direct reflection of that amount; the quantity factors are the number of data records and the number of data subjects involved. Time is another factor that affects data sensitivity and includes the time span and the temporal novelty. The time span is analogous to the number of records: the number of records reflects the amount of sensitive information along the quantity dimension, while the time span reflects it along the time dimension. Data is time-sensitive, and data from different periods differs in sensitivity. In general, the older the data, the lower its sensitivity, and the more recent the data, the higher its sensitivity; some data sets, for example, may only be released after two or more years, once their sensitivity has decreased. The most recent year covered by the data set is taken as its temporal novelty, and temporal novelty is positively correlated with data sensitivity. Furthermore, sensitivity is defined with respect to people, so the data subjects involved in a data set are an important factor in its sensitivity. A data subject is a natural person whose personal information is disclosed on a network as personal data, whether explicitly or implicitly. The objects that data describes include individuals such as data subjects (individual-level data) as well as groups (aggregated data), and individual-level data is more sensitive than group-level data. In addition, different types of data subjects differ in sensitivity. For example, China's Personal Information Protection Law requires stronger protection of minors' information. Other special types of data subjects include the elderly and pregnant women; MIMIC, for instance, transforms the ages of elderly patients to reduce data sensitivity.

The data item dimension analyzes 9 data item features, including the number of direct identifier types, the number of indirect identifier types, the number of network identity types, the number of communication information types, the number of location information types, the number of time information types, the number of sensitive disease types, the number of medical insurance payment information types, and the number of health and medical record identifier types. Each feature in this dimension is a count of the form "number of types of a given kind of sensitive information". For example, the "number of direct identifier types" is how many of the specific direct identifiers (name, ID number, birth certificate number, social security number, passport number, and so on, as given in the list of sensitive information types) appear in the data set.

The data item dimension reflects the content framework of the data set, that is, which sensitive information types the data set contains at the content level. By content, the sensitive information in population health data can be divided into identity information (direct identity information, indirect identity information, and network identity information), biometric information, communication information, location information, time information, health and medical information, and so on. Whether a data set contains a given class of sensitive information, and which specific types within that class, is an important factor in measuring its sensitivity. Identity information is highly identifying and can link the various pieces of information in the data set to their data subjects. Biometric information includes genetic data, facial recognition features, fingerprints, palm prints, iris information, and the like. With the wide application of next-generation sequencing in medical research and the rapid development of clinical genomics, it has become possible to infer a person's physical appearance from genome sequencing results, and the genetic information of individuals and their families has become sensitive information that demands attention in research, medical care, and data sharing.

The data value dimension analyzes 14 data value features, including the number of direct identifier values, the number of indirect identifier values, the number of network identity values, the number of communication information values, the degree of masking of communication information, the number of location information values, the precision of location information, the number of birth time values, the precision of birth time, the number of other event time values, the precision of other event times, the number of sensitive disease values, the number of medical insurance payment information type values, and the number of medical insurance payment information type values. Here, "number of values" of a given kind of information is the number of specific values that appear for that information type, which can be calculated from the sensitive information identification results. "Precision" of a given kind of information refers, for example in the case of the date of birth, to whether it is precise to the year, month, or day.

Sensitive features are also analyzed in the data value dimension because two data sets may have the same quantity, time, subject, and type features at the metadata level and contain the same sensitive information types at the data item level, yet still differ in sensitivity because they are populated differently at the instance level. For example, the proportion of missing values under the same information type may differ, the precision of special information types such as dates and addresses may differ, and the number of data subjects covered by each sensitive information type may differ.

In another implementation of the present invention, performing calculation based on the analysis result corresponding to each sensitive feature to obtain the comprehensive sensitivity assessment result for the population health data set comprises:

calculating a leakage loss degree value based on the metadata sensitive feature analysis result;

calculating an identification degree value based on the data item sensitive feature analysis result and the data value sensitive feature analysis result;

calculating the comprehensive sensitivity assessment result for the population health data set based on the leakage loss degree value and the identification degree value.

The sensitivity of a data set is determined by two main factors: the identification degree and the leakage loss degree, where the leakage loss degree is determined by the topic type of the data and the data volume. Both the identification degree and the leakage loss degree are positively correlated with data sensitivity. The data sensitivity calculation takes the results of sensitive information identification and feature analysis as input and comprises three stages: identification degree assessment (determining the identification degree value), leakage loss degree assessment (determining the leakage loss degree value), and comprehensive sensitivity calculation. The final output is a sensitivity reference value that characterizes the level of sensitive information content in the data set.

Referring to Figure 5, which is a schematic diagram of a data sensitivity calculation flow provided by an embodiment of the present invention, the flow performs an identification degree calculation and a leakage loss degree calculation on the received sensitive feature analysis results. The identification degree calculation first determines whether the data is aggregated data. If it is, the identification level is 1. If it is not, the calculation determines whether the data contains direct identifiers. If it does, the identification level is 4. If it does not, the re-identification risk is calculated quantitatively and compared with a threshold: if the risk is greater than the threshold, the identification level is 3; otherwise, the identification level is 2. The identification degree score is obtained from the identification level. The leakage loss degree calculation determines whether the data is biometric data. If it is, the comprehensive quantity calculation is performed directly. If it is not, the calculation determines whether the data is finance-related data. If it is, the comprehensive quantity calculation is performed directly. If it is not, the calculation determines whether the data is clinical data. If it is not, the comprehensive quantity calculation is performed directly. If it is, the calculation determines whether the data is special clinical data and then performs the comprehensive quantity calculation. The leakage loss degree score is obtained from the result of the comprehensive quantity calculation. Finally, the comprehensive sensitivity score of the population health data set is calculated from the identification degree score and the leakage loss degree score.
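The two decision flows can be written out as a pair of small functions. This passage does not give the mapping from levels and topic categories to scores, nor the formula that combines the two scores, so the sketch stops at the level and the category. The parameter names, the category labels, and the idea of passing the re-identification risk and its threshold as plain numbers are assumptions made for the example.

```python
def identification_level(is_aggregated, has_direct_identifier,
                         reidentification_risk, risk_threshold):
    """Return the identification level (1 to 4) following the described flow."""
    if is_aggregated:
        return 1
    if has_direct_identifier:
        return 4
    # Neither aggregated nor directly identified: compare the quantified
    # re-identification risk with the threshold.
    return 3 if reidentification_risk > risk_threshold else 2


def topic_category(is_biometric, is_financial, is_clinical,
                   is_special_clinical=False):
    """Return the topic category that feeds the comprehensive quantity calculation."""
    if is_biometric:
        return "biometric"
    if is_financial:
        return "financial"
    if not is_clinical:
        return "other"
    return "special clinical" if is_special_clinical else "clinical"
```

The order of the checks matters: a data set that is both biometric and clinical is categorized as biometric, because the biometric test comes first in the flow.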

In one implementation of the embodiment of the present invention, generating the sensitivity assessment report for the population health data set based on the comprehensive sensitivity assessment result comprises:

determining the basic information of the population health data set;

determining, based on the comprehensive sensitivity assessment result, the information to be displayed for the sensitivity assessment result, the information to be displayed including the identification degree of the data set, the leakage loss degree of the data set, the sensitivity reference value of the data set, and the sensitive feature data;

determining sensitive feature marking information based on the comprehensive sensitivity assessment result;

generating the sensitivity assessment report for the population health data set from the basic information, the information to be displayed, and the sensitive feature marking information.

Specifically, the data sensitivity assessment report is designed to carry a unique detection number, so that its authenticity can be verified, and to state the detection time. The body of the report consists of four parts:

(1) Basic information of the assessed data set. This includes its Chinese and English names, scientific and technological resource identifier, the institution that created the data resource, and the creator of the data resource, together with a brief introduction to the main content and source of the data set.

(2) Data sensitivity assessment result. The sensitivity of a data set is determined mainly by its identification degree and its leakage loss degree. The report shows the identification level of the data set and the corresponding identification degree value, and shows the topic type and data volume of the data set and the corresponding leakage loss degree value. It also presents the complete assessment reference table, so that readers gain a more direct and comprehensive understanding of the data set's identification degree and leakage loss degree. Finally, the report shows the specific sensitivity reference value of the data set and states that the range of the reference value is [0.05, 1]. Readers can gauge the sensitivity of the data set from the position of the reference value within this range, in combination with manual review. In addition, the report shows the data submitter's original annotation of "whether the data set contains sensitive information", so that the correctness of that annotation can be verified.

(3) Analysis of sensitive features of the data set. The sensitive features cover three dimensions: metadata, data items, and data values. The metadata dimension comprises 12 items, such as the number of records, the time span, and the number of data subjects involved. The data item and data value dimensions are presented together and comprise 23 sensitive features, mainly the number of types of each kind of sensitive information, the number of values of each kind of sensitive information, and the precision of each kind of special sensitive information. By explaining these features, the report helps data submitters and data managers understand the sensitive information in the data set.

(4) Data information location marking. This part identifies the location, presentation form, recognized examples, and precision of the sensitive information in the data set, to serve as a reference for any subsequent desensitization of the data set.
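The report layout described above can be sketched as a simple record. This is a minimal illustration only: the function name `build_report`, the dictionary keys, and the use of a random UUID as the unique detection number are all assumptions, since the patent does not say how the number is generated or how the report is stored.

```python
import uuid
from datetime import datetime, timezone


def build_report(basic_info, assessment, feature_analysis, location_marks):
    """Assemble the four report parts plus a detection number and time.

    All key names are hypothetical; the UUID is only one possible way
    to obtain a unique detection number.
    """
    return {
        "detection_number": uuid.uuid4().hex,  # assumed unique identifier
        "detection_time": datetime.now(timezone.utc).isoformat(),
        "basic_information": basic_info,                 # part (1)
        "assessment_results": assessment,                # part (2)
        "sensitive_feature_analysis": feature_analysis,  # part (3)
        "location_marks": location_marks,                # part (4)
    }


report = build_report(
    basic_info={"name_en": "XXX Psychiatric Case Data Set"},
    assessment={"sensitivity_reference_value": 0.4},
    feature_analysis={"record_count": 270},
    location_marks=[],
)
other_report = build_report({}, {}, {}, [])
```

Each call produces a fresh detection number, so two reports for the same data set remain distinguishable.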

An embodiment of the present invention provides a sensitivity processing method for population health data sets. Designed for population health data sharing, the sensitivity assessment method measures, at the data set level, the types, features, distribution, and content of the sensitive information a data set contains. It can serve as a metric that guides the discovery, identification, detection, and graded management of sensitive information in shared data and supports subsequent processing. The invention comprises a complete data set sensitivity assessment method with four key stages: sensitive information identification, sensitive feature analysis, sensitivity calculation, and assessment report generation. A dictionary library and a rule library are designed for sensitive information identification; the data set is scanned automatically to produce location marks for its sensitive information and to carry out feature analysis; and the sensitivity is calculated and assessed from the feature scanning results. Together these provide computer-aided support for the detection, review, and graded management of sensitive data.

The embodiment is described below using a specific application example. The population health data set to be assessed is the "XXX Psychiatric Case Data Set". Because data of this kind cannot be made fully public, "XXX" stands in for the identifying part of the name for ease of description. The data set is not a typical relational data table: although it is tabular, its cells contain unstructured text, and it is used here to demonstrate the sensitivity assessment report for data sets of this kind. Because the sensitive information it contains should not be disclosed, no sample data from the data set is shown. The data set covers the chief complaints, history of present illness, past medical history, diagnoses, treatment plans, and other information of 270 psychiatric patients.

The data sensitivity assessment report for the "XXX Psychiatric Case Data Set" is shown in Figures 6(a) to 6(c); some information is masked in the report as presented. The data set contains no direct identifiers but does contain quasi-identifiers such as gender, age, marital status, and nationality. Its re-identification risk is below 0.5, so its identification degree is judged to be level 2, with an identification degree value of 0.3. The data subjects include minors and elderly people, and the subject matter is mental disorders. Given that the data volume is 270 records, fewer than 500, the leakage loss reference value is set to 0.5 according to the leakage loss degree assessment reference table. Finally, the identification degree value and the leakage loss degree value are added and normalized, giving a sensitivity reference value of 0.4. On average, a single record in this data set contains a fairly large amount of sensitive information, but the total number of records is small, so the sensitivity reference value, which characterizes the data set as a whole, is not very high.
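The arithmetic in this example can be checked with a short sketch. The passage says only that the two values are "added and normalized"; the sketch assumes that normalization means dividing the sum by 2, which reproduces the reported 0.4 but is not confirmed by the text. The function name `sensitivity_reference` is hypothetical.

```python
def sensitivity_reference(identification_value: float,
                          leakage_loss_value: float) -> float:
    """Add the two degree values and normalize the sum.

    Assumption: normalization divides the sum by the number of
    components (2). The patent text here does not spell this out.
    """
    for name, value in (("identification", identification_value),
                        ("leakage loss", leakage_loss_value)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} value must lie in [0, 1], got {value}")
    return round((identification_value + leakage_loss_value) / 2, 2)


# Values from the worked example: identification 0.3, leakage loss 0.5.
example_value = sensitivity_reference(0.3, 0.5)
print(example_value)
```

Under this assumption the psychiatric case data set scores (0.3 + 0.5) / 2 = 0.4, matching the reported reference value.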

What is distinctive about this kind of data set is the sensitive information location marking. Because the data set is not in a typical relational form, a single cell, located by its record row and column, may contain a long passage of text. When marking the location of a piece of sensitive information (for example, the discharge date 2019-12-14), it is therefore necessary to record the row number of the data record, the field name, and the start and end positions within the cell (for example, start: 468, end: 477).
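A minimal sketch of this kind of location marking follows. It assumes a single regular-expression rule for ISO-style dates and 1-based row numbers; the names `SensitiveSpan`, `mark_dates`, and `DATE_PATTERN` are hypothetical. Since the example span 468 to 477 covers exactly the ten characters of "2019-12-14", the sketch treats the end position as inclusive, which is an inference from that example rather than a stated convention.

```python
import re
from dataclasses import dataclass

# Hypothetical rule: dates written as YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


@dataclass
class SensitiveSpan:
    row: int      # 1-based record row number (assumed convention)
    field: str    # field (column) name
    start: int    # offset of the first character within the cell
    end: int      # offset of the last character (inclusive, assumed)
    text: str     # the matched sensitive value


def mark_dates(records):
    """Scan every text cell and record where each date occurs."""
    spans = []
    for row, record in enumerate(records, start=1):
        for field, cell in record.items():
            for match in DATE_PATTERN.finditer(cell):
                spans.append(SensitiveSpan(row, field, match.start(),
                                           match.end() - 1, match.group()))
    return spans


records = [{"current_history": "Discharged on 2019-12-14 in stable condition."}]
spans = mark_dates(records)
print(spans[0])
```

Storing the offsets alongside the row and field lets a later desensitization step replace exactly the marked characters without re-scanning the text.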

Based on the foregoing embodiments, an embodiment of the present invention further provides a sensitivity processing system for population health data sets. Referring to Figure 7, the system includes:

an acquisition unit 10, configured to acquire the population health data set to be assessed;

an identification unit 20, configured to perform sensitive information identification on each feature of the population health data set and obtain the sensitive feature corresponding to each feature, where the features include metadata features, data item features, and data value features;

an analysis unit 30, configured to analyze each sensitive feature and obtain the analysis result corresponding to each sensitive feature;

a calculation unit 40, configured to perform calculation based on the analysis result corresponding to each sensitive feature and obtain a comprehensive sensitivity assessment result for the population health data set;

a generating unit 50, configured to generate a sensitivity assessment report for the population health data set based on the comprehensive sensitivity assessment result.
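The five units form a linear pipeline, which can be sketched as below. This is an illustrative skeleton only: the class name `SensitivityPipeline`, the callable signatures, and the stub stages are assumptions and say nothing about how the patented system is actually implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SensitivityPipeline:
    acquire: Callable[[str], Any]    # acquisition unit 10
    identify: Callable[[Any], Any]   # identification unit 20
    analyze: Callable[[Any], Any]    # analysis unit 30
    compute: Callable[[Any], Any]    # calculation unit 40
    generate: Callable[[Any], Any]   # generating unit 50

    def run(self, source: str) -> Any:
        """Pass the output of each unit to the next one in order."""
        dataset = self.acquire(source)
        sensitive_features = self.identify(dataset)
        analysis_results = self.analyze(sensitive_features)
        assessment = self.compute(analysis_results)
        return self.generate(assessment)


# Stub stages with placeholder logic, used only to show the data flow.
pipeline = SensitivityPipeline(
    acquire=lambda source: [{"age": "34"}],
    identify=lambda dataset: ["age"],
    analyze=lambda features: {"sensitive_item_count": len(features)},
    compute=lambda analysis: {"sensitivity": 0.4},  # placeholder value
    generate=lambda result: f"sensitivity reference value: {result['sensitivity']}",
)
report_text = pipeline.run("dataset.csv")
print(report_text)
```

Keeping each unit behind a plain callable makes it possible to swap in a different identification or calculation strategy without touching the other stages.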

Further, the identification unit includes:

a first acquisition subunit, configured to acquire each feature dimension of the population health data, where the feature dimensions include metadata features, data item features, and data value features;

a first identification subunit, configured to perform sensitive information identification on the metadata features based on target determination rules and obtain the metadata sensitive features, where the target determination rules are determined based on annotated metadata;

a first determination subunit, configured to determine, based on a sensitive information type dictionary, whether the data item features include sensitive information type item words and, if so, to obtain the data item sensitive features;

a second identification subunit, configured to identify the data values of the data item sensitive features and, if an identified data value satisfies the identification condition corresponding to a sensitive information value, to obtain the data value sensitive features.
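The dictionary lookup on data items followed by rule-based checking of data values can be sketched as follows. The dictionary entries, the regular-expression rules, and the function names are illustrative assumptions; the actual dictionary library and rule library of the invention are not reproduced in this section.

```python
import re

# Hypothetical sensitive information type dictionary (field name -> type).
SENSITIVE_TYPE_DICTIONARY = {
    "name": "direct identifier",
    "age": "quasi-identifier",
    "gender": "quasi-identifier",
}

# Hypothetical identification conditions for the values of each item.
VALUE_RULES = {
    "name": re.compile(r"\S.*"),    # any non-empty value
    "age": re.compile(r"\d{1,3}"),  # one to three digits
}


def find_sensitive_items(field_names):
    """Return the field names that appear in the type dictionary."""
    return [n for n in field_names if n.lower() in SENSITIVE_TYPE_DICTIONARY]


def find_sensitive_values(rows, sensitive_items):
    """Return (row number, field, value) for values that satisfy a rule."""
    hits = []
    for row_number, row in enumerate(rows, start=1):
        for field in sensitive_items:
            rule = VALUE_RULES.get(field.lower())
            value = str(row.get(field, ""))
            if rule is not None and rule.fullmatch(value):
                hits.append((row_number, field, value))
    return hits


rows = [
    {"Name": "Patient A", "Age": "34", "Diagnosis": "text"},
    {"Name": "", "Age": "unknown", "Diagnosis": "text"},
]
items = find_sensitive_items(["Name", "Age", "Diagnosis"])
hits = find_sensitive_values(rows, items)
print(items, hits)
```

Only fields flagged by the dictionary are scanned at the value level, which mirrors the order of the first determination subunit and the second identification subunit.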

Further, the analysis unit includes:

a first analysis subunit, configured to analyze the data quantity, time span, object characteristics, subject type, and number of data subjects of the metadata sensitive features and obtain the metadata sensitive feature analysis result;

a second analysis subunit, configured to analyze the sensitive information type features and sensitive information quantity features of the data item sensitive features and obtain the data item sensitive feature analysis result;

a third analysis subunit, configured to analyze the value quantity features, value distribution features, and value precision features of the data value sensitive features and obtain the data value sensitive feature analysis result.
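As a rough illustration of the third analysis subunit, the sketch below computes a value count, a value distribution, and a precision breakdown for date values. The precision rule (year, month, or day, judged by the number of hyphen-separated parts) and the names `date_precision` and `analyze_values` are assumptions made for this example, not definitions from the patent.

```python
from collections import Counter


def date_precision(value: str) -> str:
    """Classify a date string by how many components it specifies."""
    parts = value.split("-")
    return {1: "year", 2: "month", 3: "day"}.get(len(parts), "unknown")


def analyze_values(values):
    """Summarize quantity, distribution, and precision of sensitive values."""
    return {
        "value_count": len(values),
        "distribution": dict(Counter(values)),
        "precision": dict(Counter(date_precision(v) for v in values)),
    }


analysis = analyze_values(["2019-12-14", "2019-12", "2019-12-14"])
print(analysis)
```

A more precise value, such as a full date rather than a year, generally narrows down an individual further, which is why precision is tracked as its own feature.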

Further, the calculation unit includes:

a first calculation subunit, configured to calculate the leakage loss degree value based on the metadata sensitive feature analysis result;

a second calculation subunit, configured to calculate the identification degree value based on the data item sensitive feature analysis result and the data value sensitive feature analysis result;

a third calculation subunit, configured to calculate the comprehensive sensitivity assessment result for the population health data set based on the leakage loss degree value and the identification degree value.

Further, the generating unit includes:

a second determination subunit, configured to determine the basic information of the population health data set;

a third determination subunit, configured to determine, based on the comprehensive sensitivity assessment result, the information to be displayed for the sensitivity assessment result, where the information to be displayed includes the data set expression degree, the data set leakage loss degree, the data set sensitivity reference value, and the sensitive feature data;

a fourth determination subunit, configured to determine the sensitive feature marking information based on the comprehensive sensitivity assessment result;

a generation subunit, configured to generate the sensitivity assessment report for the population health data set from the basic information, the information to be displayed, and the sensitive feature marking information.

An embodiment of the present invention provides a sensitivity processing system for population health data sets, in which: the acquisition unit acquires the population health data set to be assessed; the identification unit performs sensitive information identification on each feature of the population health data set and obtains the sensitive feature corresponding to each feature, where the features include metadata features, data item features, and data value features; the analysis unit analyzes each sensitive feature and obtains the corresponding analysis result; the calculation unit performs calculation based on the analysis result corresponding to each sensitive feature and obtains the comprehensive sensitivity assessment result for the population health data set; and the generating unit generates the sensitivity assessment report for the population health data set based on the comprehensive sensitivity assessment result. The invention thereby achieves the discovery, identification, analysis, and processing of sensitive information, meets the application requirements of population health data set sensitivity assessment through multi-dimensional analysis, and improves the efficiency and security of subsequent use of population health data.

Based on the foregoing embodiments, an embodiment of the present application provides a computer-readable storage medium storing one or more programs, where the one or more programs can be executed by one or more processors to implement the steps of the population health data set sensitivity processing method according to any of the above.

Based on the foregoing embodiments, an embodiment of the present invention further provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the population health data set sensitivity processing method when executing the program.

The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and the parts that are the same or similar across embodiments can be cross-referenced. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.

The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. The present invention is therefore not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A method for sensitivity processing of a population health dataset, comprising:
acquiring a population health data set to be evaluated;
carrying out sensitive information identification on each feature of the population health data set to obtain a sensitive feature corresponding to each feature, wherein the features comprise metadata features, data item features and data value features;
analyzing each sensitive feature to obtain an analysis result corresponding to each sensitive feature;
calculating based on the analysis result corresponding to each sensitive feature to obtain a sensitivity comprehensive evaluation result of the population health data set;
generating a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result;
wherein carrying out sensitive information identification on each feature of the population health data set to obtain a sensitive feature corresponding to each feature comprises:
acquiring each characteristic dimension of the population health data, wherein the characteristic dimensions comprise: metadata features, data item features, and data value features;
performing sensitive information identification on the metadata features based on target judgment rules to obtain metadata sensitive features, wherein the target judgment rules are determined based on marked metadata;
based on the sensitive information type dictionary, determining whether the data item features comprise sensitive information type item words, and if so, obtaining the data item sensitive features;
and identifying the data value of the data item sensitive characteristic, and if the data value obtained by identification meets the identification condition corresponding to the sensitive information value, obtaining the data value sensitive characteristic.
2. The method of claim 1, wherein analyzing each of the sensitive features to obtain an analysis result corresponding to each of the sensitive features comprises:
analyzing the data quantity, the time span, the object characteristics, the subject type and the number of data subjects of the metadata sensitive characteristics to obtain metadata sensitive characteristic analysis results;
analyzing the sensitive information type characteristics and the sensitive information quantity characteristics of the data item sensitive characteristics to obtain a data item sensitive characteristic analysis result;
and analyzing the value quantity characteristic, the value distribution characteristic and the value precision degree characteristic of the data value sensitive characteristic to obtain a data value sensitive characteristic analysis result.
3. The method of claim 2, wherein the calculating based on the analysis result corresponding to each sensitive feature to obtain the comprehensive sensitivity assessment result of the population health dataset comprises:
calculating and obtaining a leakage loss degree value based on the metadata sensitive characteristic analysis result;
calculating to obtain an identification degree value based on the data item sensitive characteristic analysis result and the data value sensitive characteristic analysis result;
and calculating to obtain a sensitivity comprehensive evaluation result of the population health data set based on the leakage loss degree value and the identification degree value.
4. The method of claim 1, wherein the generating a sensitivity assessment report of the population health dataset based on the sensitivity comprehensive assessment results comprises:
determining basic information of the population health data set;
determining information to be displayed of the sensitivity evaluation result based on the sensitivity comprehensive evaluation result, wherein the display information comprises a data set expression degree, a data set leakage loss degree, a data set sensitivity reference value and sensitive characteristic data;
determining sensitive characteristic mark information based on the sensitivity comprehensive evaluation result;
and generating a sensitivity evaluation report of the population health data set according to the basic information, the information to be displayed and the sensitive characteristic mark information.
5. A population health dataset sensitivity processing system, comprising:
an acquisition unit for acquiring a population health dataset to be evaluated;
the identification unit is used for carrying out sensitive information identification on each feature of the population health data set to obtain a sensitive feature corresponding to each feature, wherein the features comprise metadata features, data item features and data value features;
the analysis unit is used for analyzing each sensitive characteristic to obtain an analysis result corresponding to each sensitive characteristic;
the computing unit is used for computing based on the analysis result corresponding to each sensitive characteristic to obtain a sensitivity comprehensive evaluation result of the population health data set;
a generating unit, configured to generate a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result;
wherein the identification unit includes:
a first obtaining subunit, configured to obtain each feature dimension of the population health data, where the feature dimensions include: metadata features, data item features, and data value features;
the first identification subunit is used for carrying out sensitive information identification on the metadata characteristics based on target judgment rules to obtain metadata sensitive characteristics, wherein the target judgment rules are determined based on marked metadata;
a first determining subunit, configured to determine, based on a sensitive information type dictionary, whether the data item feature includes a sensitive information type item word, and if so, obtain a data item sensitive feature;
and the second identification subunit is used for identifying the data value of the data item sensitive characteristic, and if the data value obtained by identification meets the identification condition corresponding to the sensitive information value, the data value sensitive characteristic is obtained.
6. The system of claim 5, wherein the analysis unit comprises:
the first analysis subunit is used for analyzing the data quantity, the time span, the object characteristics, the subject type and the number of data subjects of the metadata sensitive characteristics to obtain metadata sensitive characteristic analysis results;
the second analysis subunit is used for analyzing the sensitive information type characteristics and the sensitive information quantity characteristics of the data item sensitive characteristics to obtain the data item sensitive characteristic analysis result;
and the third analysis subunit is used for analyzing the value quantity characteristic, the value distribution characteristic and the value precision degree characteristic of the data value sensitive characteristic to obtain a data value sensitive characteristic analysis result.
7. The system of claim 6, wherein the computing unit comprises:
the first calculating subunit is used for calculating and obtaining a leakage loss degree value based on the metadata sensitive characteristic analysis result;
the second calculating subunit is used for calculating to obtain an identification degree value based on the data item sensitive characteristic analysis result and the data value sensitive characteristic analysis result;
and the third calculation subunit is used for calculating and obtaining the comprehensive sensitivity evaluation result of the population health data set based on the leakage loss degree value and the identification degree value.
8. The system of claim 5, wherein the generating unit comprises:
a second determination subunit configured to determine basic information of the population health dataset;
the third determining subunit is used for determining information to be displayed of the sensitivity evaluation result based on the sensitivity comprehensive evaluation result, wherein the display information comprises a data set expression degree, a data set leakage loss degree, a data set sensitivity reference value and sensitive characteristic data;
a fourth determination subunit, configured to determine sensitive feature tag information based on the sensitivity comprehensive evaluation result;
and the generation subunit is used for generating a sensitivity evaluation report of the population health data set according to the basic information, the information to be displayed and the sensitive characteristic mark information.
CN202110856219.1A 2021-07-28 2021-07-28 Sensitivity processing method and system for population health data set Active CN113488127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110856219.1A CN113488127B (en) 2021-07-28 2021-07-28 Sensitivity processing method and system for population health data set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110856219.1A CN113488127B (en) 2021-07-28 2021-07-28 Sensitivity processing method and system for population health data set

Publications (2)

Publication Number Publication Date
CN113488127A CN113488127A (en) 2021-10-08
CN113488127B true CN113488127B (en) 2023-10-20

Family

ID=77943223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110856219.1A Active CN113488127B (en) 2021-07-28 2021-07-28 Sensitivity processing method and system for population health data set

Country Status (1)

Country Link
CN (1) CN113488127B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119513177A (en) * 2023-11-14 2025-02-25 大连理工大学 Information collection and analysis method and system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8239246B1 (en) * 2009-08-27 2012-08-07 Accenture Global Services Limited Health and life sciences payer high performance capability assessment
CA2846795A1 (en) * 2013-03-22 2014-09-22 F. Hoffmann-La Roche Ag Method and system ensuring sensitive data are not accessible
EP2942731A1 (en) * 2014-05-10 2015-11-11 Informatica Corporation Identifying and securing sensitive data at its source
CA2948513A1 (en) * 2014-06-04 2015-12-10 Microsoft Technology Licensing, Llc Dissolvable protection of candidate sensitive data items
WO2016034068A1 (en) * 2014-09-03 2016-03-10 阿里巴巴集团控股有限公司 Sensitive information processing method, device, server and security determination system
CN106408140A (en) * 2015-07-27 2017-02-15 广州西麦信息科技有限公司 Grading and classifying model method based on power grid enterprise data
US9729583B1 (en) * 2016-06-10 2017-08-08 OneTrust, LLC Data processing systems and methods for performing privacy assessments and monitoring of new versions of computer code for privacy compliance
CN107730128A (en) * 2017-10-23 2018-02-23 上海携程商务有限公司 Methods of risk assessment and system based on operation flow
CN110941956A (en) * 2019-10-26 2020-03-31 华为技术有限公司 Data classification method, device and related equipment
CN112733152A (en) * 2021-01-22 2021-04-30 湖北宸威玺链信息技术有限公司 Sensitive data processing method, system and device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060111943A1 (en) * 2004-11-15 2006-05-25 Wu Harry C Method and system to edit and analyze longitudinal personal health data using a web-based application
US20120101870A1 (en) * 2010-10-22 2012-04-26 International Business Machines Corporation Estimating the Sensitivity of Enterprise Data
US8510850B2 (en) * 2010-12-17 2013-08-13 Microsoft Corporation Functionality for providing de-identified data
US10204238B2 (en) * 2012-02-14 2019-02-12 Radar, Inc. Systems and methods for managing data incidents
US20130332194A1 (en) * 2012-06-07 2013-12-12 Iquartic Methods and systems for adaptive ehr data integration, query, analysis, reporting, and crowdsourced ehr application development
US9762603B2 (en) * 2014-05-10 2017-09-12 Informatica Llc Assessment type-variable enterprise security impact analysis
US10510265B2 (en) * 2014-11-14 2019-12-17 Hi.Q, Inc. System and method for determining and using knowledge about human health
US20180096102A1 (en) * 2016-10-03 2018-04-05 International Business Machines Corporation Redaction of Sensitive Patient Data
US10540521B2 (en) * 2017-08-24 2020-01-21 International Business Machines Corporation Selective enforcement of privacy and confidentiality for optimization of voice applications
EP3906564A4 (en) * 2018-12-31 2022-09-07 Tempus Labs, Inc. METHOD AND APPARATUS FOR PREDICTION AND ANALYSIS OF PATIENT COHORT RESPONSE, PROGRESSION AND SURVIVAL

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
人口健康科学数据中个人敏感信息分类研究;邬金鸣;《中华医学图书情报杂志》;第29卷(第11期);8-15 *
浙江省人口健康脆弱性评估及影响因素分析;童磊;郑珂;苏飞;汤青;曹轶蓉;郑艳艳;;地理科学(08);75-81页 *

Also Published As

Publication number Publication date
CN113488127A (en) 2021-10-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant