CN119312165A - Intelligent identification method of computer sensitive data based on artificial intelligence - Google Patents

Intelligent identification method of computer sensitive data based on artificial intelligence

Publication number
CN119312165A
Authority
CN
China
Prior art keywords
data
model
sensitive
sensitive data
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411461054.8A
Other languages
Chinese (zh)
Inventor
宋宁宁
李晗
孙明刚
邵家聪
邹鸿远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Muding Information Technology Co ltd
Original Assignee
Shandong Muding Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Muding Information Technology Co ltd
Priority to CN202411461054.8A
Publication of CN119312165A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses an intelligent identification method of computer sensitive data based on artificial intelligence, and relates to the technical field of sensitive data identification. The identification method includes: collecting data from different sources and integrating it based on multimodal data fusion technology; extracting features based on self-supervised learning technology; training a deep learning model and using it to detect sensitive data; introducing a dynamic model update mechanism to adjust and optimize the model online; and using the trained model to identify and classify new data under a multi-level classification system. The proposed multimodal data fusion integrates different types of data into a unified structure, ensuring that the information is complementary. The dynamic model update mechanism adjusts and optimizes the model online based on real-time data feedback, so that the model can adapt to data changes and emerging patterns, making data detection more accurate.

Description

Intelligent identification method of computer sensitive data based on artificial intelligence
Technical Field
The invention relates to the technical field of sensitive data identification, and in particular to an intelligent identification method of computer sensitive data based on artificial intelligence.
Background
In today's digital age, the volume of data generated and stored is growing exponentially, and industries rely on data for decision making, operations and services. As data usage increases, protecting sensitive data becomes increasingly important, especially where personal privacy, business confidentiality and compliance requirements are involved. Sensitive data identification technology has therefore received widespread attention as an important component of data security and privacy protection. It generally includes identifying stored sensitive data by scanning databases, file systems and applications with automated tools.
Although sensitive data identification technology has advanced to some extent, many challenges remain. Data is diverse and comes in different formats, so identification technology needs to be flexible and adaptable; and under heavy data traffic, sensitive data cannot be identified quickly and accurately.
Disclosure of Invention
In order to solve the above technical problems, an intelligent identification method of computer sensitive data based on artificial intelligence is provided.
To achieve the above purpose, the invention adopts the following technical solution:
The intelligent identification method of computer sensitive data based on artificial intelligence includes the following steps:
Collecting data from different sources, integrating the data from different sources based on multimodal data fusion technology, and preprocessing the integrated data;
Extracting keywords and context information from text and extracting high-level features from images based on self-supervised learning technology;
Performing model training based on a deep learning model, and detecting sensitive data with the model trained on a large-scale data set;
Introducing a dynamic model update mechanism, and adjusting and optimizing the model online according to real-time feedback;
Identifying and classifying new data with the trained model, implementing a multi-level classification system, dividing the data into different sensitivity levels, and setting different processing strategies.
Preferably, collecting the data from different sources, integrating it based on multimodal data fusion technology, and preprocessing the integrated data specifically includes:
Identifying the data sources, including structured data, unstructured data and semi-structured data, and acquiring the data;
Converting the data from different sources into a unified format to ensure data consistency, and aligning the data from different sources according to associated fields;
Standardizing the images, including resizing, normalization and enhancement;
Identifying and deleting duplicate data entries, deleting missing values, and encoding categorical variables.
Preferably, extracting keywords and context information from text and extracting high-level features from images based on self-supervised learning technology specifically includes:
Using BERT as the self-supervised learning model, training the model with unlabeled text data to obtain context information;
Outputting the embedding vectors of the trained model, and extracting keywords based on a cluster analysis algorithm;
Inputting different parts of the text into the model to obtain context embeddings, and then selecting the context information of a specific paragraph or sentence;
After clustering is completed, extracting keywords from each cluster and summarizing the extracted keywords;
Obtaining image features and combining them with the text features, wherein the feature combination formula is as follows:
F=α·T+β·I
wherein F is the fused feature vector, α is the weight coefficient of the text feature vector, T is the text feature vector, β is the weight coefficient of the image feature vector, and I is the image feature vector.
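As an illustration of the combination formula F=α·T+β·I, the following is a minimal Python sketch. It assumes that T and I have already been projected to the same dimension and that the weights α and β are chosen by the practitioner; the function name `fuse_features` and the example values are hypothetical and do not come from the patent text.

```python
from typing import List, Sequence


def fuse_features(text_vec: Sequence[float], image_vec: Sequence[float],
                  alpha: float, beta: float) -> List[float]:
    """Element-wise weighted sum F = alpha*T + beta*I."""
    # Assumption: both feature vectors share one dimension.
    if len(text_vec) != len(image_vec):
        raise ValueError("T and I must have the same dimension")
    return [alpha * t + beta * i for t, i in zip(text_vec, image_vec)]


fused = fuse_features([1.0, 0.0, 2.0], [0.0, 1.0, 2.0], alpha=0.5, beta=0.5)
print(fused)  # [0.5, 0.5, 2.0]
```

A weighted sum keeps the fused vector at the same dimension as its inputs, which is why the sketch rejects vectors of different lengths.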
Preferably, performing model training based on a deep learning model and detecting sensitive data with the model trained on a large-scale data set specifically includes:
Acquiring training data related to sensitive data, the training data comprising data samples labeled with sensitive information;
Labeling the collected data, the labels including the type and position of the sensitive data;
Establishing a detection model of sensitive data based on the fused feature vector;
Training the model on the training set using back propagation and an optimizer;
Deploying the trained model into a production environment, and performing real-time batch data processing;
Inputting the data to be detected, the model outputting the position and type of the sensitive information;
Post-processing the model output to filter out false alarms and false positives.
Preferably, establishing the detection model of sensitive data based on the fused feature vector specifically includes:
Based on an activation function, taking the fused feature vector as the model input and the prediction result of the model as the output, establishing the detection model of sensitive data, wherein the formula of the detection model is as follows:
$$h(F)=\sigma(z)=\frac{1}{1+e^{-z}}$$
wherein h(F) is the prediction result of the model, F is the fused feature vector, σ(z) is the Sigmoid function, z is the linear combination result, and e is the mathematical constant;
Outputting the sensitivity probability of the data based on the prediction result of the model.
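The detection model described above passes a linear combination z of the fused feature vector F through a Sigmoid function to obtain a sensitivity probability. The sketch below illustrates this in Python. The text only says that z is a linear combination result, so the weight vector `weights` and the offset `bias` are assumptions introduced for illustration, as are the function names.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Sigmoid 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_sensitivity(fused: Sequence[float], weights: Sequence[float],
                        bias: float = 0.0) -> float:
    """h(F) = sigmoid(z), where z = sum(w_j * F_j) + bias (assumed form of z)."""
    z = sum(w * f for w, f in zip(weights, fused)) + bias
    return sigmoid(z)


print(predict_sensitivity([0.0, 0.0], [1.5, -2.0]))  # 0.5
```

An input whose linear combination is zero maps to a probability of 0.5, and larger positive values of z push the output toward 1.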
Preferably, introducing the dynamic model update mechanism and adjusting and optimizing the model online according to real-time feedback specifically includes:
Collecting data generated in real time, and acquiring user feedback and the comparison between the system output and the target value;
Cleaning and preprocessing the collected feedback data, and calculating the model loss function;
Training the model using the training set, and adjusting the weights based on the model loss function so as to minimize it;
Setting the update frequency of the model, the options including updating on each new data point, on each time window, or at a fixed time interval;
Monitoring the performance indicators of the model in real time, and evaluating the performance of the updated model based on accuracy;
Using the real-time data as a validation set to test the performance of the updated model;
Setting a threshold to monitor whether model performance falls below expectations, issuing an early warning if the accuracy drops, and establishing model version control;
Further adjusting the model in combination with user feedback on the model results, and incorporating the user feedback into the decision process;
Recording the details of each model update, including the updated parameters, feedback sources and performance changes.
Preferably, cleaning and preprocessing the collected feedback data and calculating the model loss function specifically includes:
The model loss function is calculated as follows:
$$J=-\frac{1}{N}\sum_{i=1}^{N}\left[h_i\log\hat{h}_i+(1-h_i)\log\left(1-\hat{h}_i\right)\right]$$
wherein J is the value of the loss function, N is the total number of samples, $h_i$ is the true label of the i-th sample, and $\hat{h}_i$ is the predicted probability of the i-th sample.
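The loss is defined over N samples in terms of each sample's true label and predicted probability. A loss commonly paired with a Sigmoid output is binary cross-entropy, and the Python sketch below assumes that form. The function name, the clipping constant `eps` and the example values are illustrative assumptions rather than details taken from the patent.

```python
import math
from typing import Sequence


def binary_cross_entropy(labels: Sequence[int], probs: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy J over N samples (assumed loss form)."""
    if not labels or len(labels) != len(probs):
        raise ValueError("labels and probs must be non-empty and equally long")
    total = 0.0
    for h, p in zip(labels, probs):
        # Clip the probability so that log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        total += h * math.log(p) + (1 - h) * math.log(1.0 - p)
    return -total / len(labels)


# An uninformative prediction of 0.5 for every sample gives J = ln 2.
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

Confident correct predictions drive J toward 0, while confident wrong predictions are penalized heavily, which is the behavior a weight update that minimizes J relies on.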
Preferably, monitoring the performance indicators of the model in real time and evaluating the performance of the updated model based on accuracy specifically includes:
Collecting the latest input data and the corresponding true labels from the system, predicting on the newly collected data with the currently updated model, and calculating the accuracy, wherein the accuracy is calculated as follows:
$$A=\frac{TP+TN}{TP+TN+FP+FN}$$
wherein A is the accuracy, TP is the number of positive samples the model correctly predicts as positive, TN is the number of negative samples the model correctly predicts as negative, FP is the number of negative samples the model incorrectly predicts as positive, and FN is the number of positive samples the model incorrectly predicts as negative;
Periodically calculating and recording the accuracy to capture the performance of the model at different points in time;
Generating a report containing the latest accuracy, historical accuracy data, a timestamp and the number of data samples;
Analyzing the trend of the accuracy, and checking whether the performance of the model remains stable or improves after it receives new data;
Incorporating user feedback on the prediction results into subsequent model updates;
Displaying the accuracy and performance indicators in real time using a dashboard visualization tool.
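The accuracy computation and the threshold-based warning described above can be sketched in Python as follows. The accuracy is taken as the usual ratio (TP + TN) / (TP + TN + FP + FN) implied by the definitions of TP, TN, FP and FN. The function names, the label encoding (1 for sensitive, 0 for non-sensitive) and the threshold of 0.8 are assumptions made for illustration.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """A = (TP + TN) / (TP + TN + FP + FN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def needs_warning(history: List[float], threshold: float) -> bool:
    """True when the most recently recorded accuracy is below the threshold."""
    return bool(history) and history[-1] < threshold


history: List[float] = []
# TP=2, TN=1, FP=1, FN=1 for the batch below.
history.append(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
print(history[-1])                  # 0.6
print(needs_warning(history, 0.8))  # True
```

Keeping the accuracy values in a list mirrors the periodic recording step, so the same history could feed a report or a trend plot.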
Preferably, identifying and classifying new data with the trained model, implementing a multi-level classification system, dividing the data into different sensitivity levels, and setting different processing strategies according to specific service requirements specifically includes:
Loading the pre-trained classification model into the environment to predict on new data;
Inputting the new data into the model for identification and classification, obtaining a predicted label and the corresponding sensitivity level for each data sample;
Dividing the data samples into different sensitivity levels according to the prediction results output by the model;
Formulating corresponding processing strategies for the different sensitivity levels;
Automatically executing the corresponding processing strategies according to the sensitivity level, including data encryption, access control and audit logging;
For data with a high sensitivity level, implementing a manual review mechanism to ensure compliance with policies and regulations;
Monitoring the classification and processing in real time to ensure normal operation of the system;
Keeping detailed records of all processing, including data classification results, implementation of the processing strategies, and audit records.
Preferably, formulating the corresponding processing strategies for the different sensitivity levels specifically includes:
Low sensitivity level: public data, with conventional storage and access;
Medium sensitivity level: encrypted storage, restricted access rights, and regular auditing;
High sensitivity level: strict access control, with additional monitoring and auditing required.
Compared with the prior art, the invention has the following beneficial effects:
The invention proposes multimodal data fusion, which integrates different types of data into a unified structure and ensures that the information is complementary. It also introduces a dynamic model update mechanism that adjusts and optimizes the model online according to real-time data feedback, so that the model can adapt to changes in the data and to emerging patterns, making the detection of data more accurate.
Drawings
FIG. 1 is a block diagram of a process flow according to the present invention.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art.
Referring to FIG. 1, the intelligent identification method of computer sensitive data based on artificial intelligence comprises the following steps:
Collecting data from different sources, integrating the data from different sources based on multimodal data fusion technology, and preprocessing the integrated data;
Extracting keywords and context information from text and extracting high-level features from images based on self-supervised learning technology;
Performing model training based on a deep learning model, and detecting sensitive data with the model trained on a large-scale data set;
Introducing a dynamic model update mechanism, and adjusting and optimizing the model online according to real-time feedback;
Identifying and classifying new data with the trained model, implementing a multi-level classification system, dividing the data into different sensitivity levels, and setting different processing strategies.
Further, collecting the data from different sources, integrating it based on multimodal data fusion technology, and preprocessing the integrated data specifically includes:
Identifying the data sources, including structured data, unstructured data and semi-structured data, and acquiring the data;
Converting the data from different sources into a unified format to ensure data consistency, and aligning the data from different sources according to associated fields;
Standardizing the images, including resizing, normalization and enhancement;
Identifying and deleting duplicate data entries, deleting missing values, and encoding categorical variables.
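The last preprocessing step above (removing duplicate entries, dropping records with missing values, and encoding categorical variables) can be sketched with the Python standard library alone. The record layout, the field names `id` and `source`, and the choice of one-hot encoding are assumptions for illustration; the text does not prescribe a particular encoding.

```python
from typing import Dict, List, Optional


def preprocess(records: List[Dict[str, Optional[str]]],
               category_field: str) -> List[Dict[str, object]]:
    """Drop incomplete and duplicate records, then one-hot encode one field."""
    seen = set()
    cleaned: List[Dict[str, object]] = []
    for rec in records:
        # Delete records that contain a missing value.
        if any(v is None or v == "" for v in rec.values()):
            continue
        # Delete exact duplicates (same fields and values).
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(dict(rec))
    # One-hot encode the categorical field (assumed encoding scheme).
    categories = sorted({str(rec[category_field]) for rec in cleaned})
    for rec in cleaned:
        value = rec.pop(category_field)
        for cat in categories:
            rec[f"{category_field}={cat}"] = 1 if value == cat else 0
    return cleaned


rows = [
    {"id": "1", "source": "db"},
    {"id": "1", "source": "db"},    # duplicate entry
    {"id": "2", "source": None},    # missing value
    {"id": "3", "source": "file"},
]
print(preprocess(rows, "source"))
```

The sketch assumes every record carries the categorical field; records without it would need to be handled before encoding.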
Further, extracting keywords and context information from text and extracting high-level features from images based on self-supervised learning technology specifically includes:
Using BERT as the self-supervised learning model, training the model with unlabeled text data to obtain context information;
Outputting the embedding vectors of the trained model, and extracting keywords based on a cluster analysis algorithm;
Inputting different parts of the text into the model to obtain context embeddings, and then selecting the context information of a specific paragraph or sentence;
After clustering is completed, extracting keywords from each cluster and summarizing the extracted keywords;
Obtaining image features and combining them with the text features, wherein the feature combination formula is as follows:
F=α·T+β·I
wherein F is the fused feature vector, α is the weight coefficient of the text feature vector, T is the text feature vector, β is the weight coefficient of the image feature vector, and I is the image feature vector.
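The clustering-based keyword extraction in the steps above can be illustrated with a small Python sketch. The text does not name the cluster analysis algorithm, so k-means is assumed here, and the keyword of each cluster is taken to be the token closest to the cluster centroid, which is also an assumption. The two-dimensional vectors stand in for embeddings that would in practice come from the trained BERT model.

```python
import math
from typing import List, Sequence, Tuple


def kmeans(points: Sequence[Sequence[float]], k: int,
           iters: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids


def cluster_keywords(tokens: Sequence[str],
                     embeddings: Sequence[Sequence[float]], k: int) -> List[str]:
    """Return one keyword per cluster: the token nearest to the centroid."""
    assign, centroids = kmeans(embeddings, k)
    keywords = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            best = min(members, key=lambda i: math.dist(embeddings[i], centroids[c]))
            keywords.append(tokens[best])
    return keywords


tokens = ["password", "invoice", "passcode", "receipt", "secret", "payment"]
vectors = [(0.0, 0.0), (5.0, 5.0), (0.3, 0.0), (5.3, 5.0), (0.0, 0.3), (5.0, 5.3)]
print(cluster_keywords(tokens, vectors, k=2))  # ['password', 'invoice']
```

Seeding the centroids with the first k points keeps the example reproducible; a production system would more likely use a library implementation with a better initialization.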
Further, performing model training based on a deep learning model and detecting sensitive data with the model trained on a large-scale data set specifically includes:
Acquiring training data related to sensitive data, the training data comprising data samples labeled with sensitive information;
Labeling the collected data, the labels including the type and position of the sensitive data;
Establishing a detection model of sensitive data based on the fused feature vector;
Training the model on the training set using back propagation and an optimizer;
Deploying the trained model into a production environment, and performing real-time batch data processing;
Inputting the data to be detected, the model outputting the position and type of the sensitive information;
Post-processing the model output to filter out false alarms and false positives.
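The post-processing step above is not specified in detail. One plausible reading, sketched below in Python, is to discard detections whose confidence falls below a threshold and to keep only the highest-scoring detection among overlapping spans. The `Detection` structure, its field names and the threshold of 0.8 are hypothetical and serve only to make the idea concrete.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    start: int    # start offset of the sensitive span
    end: int      # end offset (exclusive)
    kind: str     # sensitive data type
    score: float  # model confidence


def postprocess(detections: List[Detection],
                min_score: float = 0.8) -> List[Detection]:
    """Drop low-confidence detections and resolve overlapping spans."""
    kept: List[Detection] = []
    # Visit detections from most to least confident.
    for det in sorted(detections, key=lambda d: d.score, reverse=True):
        if det.score < min_score:
            continue
        overlaps = any(det.start < k.end and k.start < det.end for k in kept)
        if not overlaps:
            kept.append(det)
    return sorted(kept, key=lambda d: d.start)


raw = [
    Detection(0, 11, "id_number", 0.95),
    Detection(5, 11, "phone", 0.85),    # overlaps the stronger first span
    Detection(20, 30, "email", 0.40),   # below the confidence threshold
    Detection(40, 51, "phone", 0.90),
]
for det in postprocess(raw):
    print(det.start, det.end, det.kind)
```

With the sample input, only the spans at offsets 0 to 11 and 40 to 51 survive, one as `id_number` and one as `phone`.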
Further, establishing the detection model of sensitive data based on the fused feature vector specifically includes:
Based on an activation function, taking the fused feature vector as the model input and the prediction result of the model as the output, establishing the detection model of sensitive data, wherein the formula of the detection model is as follows:
$$h(F)=\sigma(z)=\frac{1}{1+e^{-z}}$$
wherein h(F) is the prediction result of the model, F is the fused feature vector, σ(z) is the Sigmoid function, z is the linear combination result, and e is the mathematical constant;
Outputting the sensitivity probability of the data based on the prediction result of the model.
Further, introducing the dynamic model update mechanism and adjusting and optimizing the model online according to real-time feedback specifically includes:
Collecting data generated in real time, and acquiring user feedback and the comparison between the system output and the target value;
Cleaning and preprocessing the collected feedback data, and calculating the model loss function;
Training the model using the training set, and adjusting the weights based on the model loss function so as to minimize it;
Setting the update frequency of the model, the options including updating on each new data point, on each time window, or at a fixed time interval;
Monitoring the performance indicators of the model in real time, and evaluating the performance of the updated model based on accuracy;
Using the real-time data as a validation set to test the performance of the updated model;
Setting a threshold to monitor whether model performance falls below expectations, issuing an early warning if the accuracy drops, and establishing model version control;
Further adjusting the model in combination with user feedback on the model results, and incorporating the user feedback into the decision process;
Recording the details of each model update, including the updated parameters, feedback sources and performance changes.
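The update loop described above can be sketched in Python for the simplest update frequency, one update per new data point. The sketch assumes a Sigmoid output trained by gradient descent on a cross-entropy loss, with a version counter and an update log standing in for version control and record keeping. The class name `OnlineDetector`, the learning rate and the log fields are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def sigmoid(z: float) -> float:
    """Sigmoid 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class OnlineDetector:
    """Logistic detector updated once per feedback sample (assumed scheme)."""

    def __init__(self, dim: int, lr: float = 0.5) -> None:
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr
        self.version = 0
        self.log: List[Dict[str, object]] = []

    def predict_proba(self, x: Sequence[float]) -> float:
        return sigmoid(sum(w * f for w, f in zip(self.w, x)) + self.b)

    def update(self, x: Sequence[float], label: int,
               source: str = "user_feedback") -> None:
        # Gradient of the cross-entropy loss with respect to z is (p - label).
        err = self.predict_proba(x) - label
        self.w = [w - self.lr * err * f for w, f in zip(self.w, x)]
        self.b -= self.lr * err
        # Record the update: version number, feedback source, and error.
        self.version += 1
        self.log.append({"version": self.version, "source": source, "error": err})


model = OnlineDetector(dim=2)
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 50  # simulated feedback stream
for features, label in stream:
    model.update(features, label)
print(model.version)                                # 100
print(model.predict_proba([1.0, 0.0]) > 0.5)        # True
print(model.predict_proba([0.0, 1.0]) < 0.5)        # True
```

Updating per time window or at a fixed interval would follow the same pattern, with the gradient averaged over the samples collected in that window.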
Further, cleaning and preprocessing the collected feedback data and calculating the model loss function specifically includes:
The model loss function is calculated as follows:
$$J=-\frac{1}{N}\sum_{i=1}^{N}\left[h_i\log\hat{h}_i+(1-h_i)\log\left(1-\hat{h}_i\right)\right]$$
wherein J is the value of the loss function, N is the total number of samples, $h_i$ is the true label of the i-th sample, and $\hat{h}_i$ is the predicted probability of the i-th sample.
Further, monitoring the performance indicators of the model in real time and evaluating the performance of the updated model based on accuracy specifically includes:
Collecting the latest input data and the corresponding true labels from the system, predicting on the newly collected data with the currently updated model, and calculating the accuracy, wherein the accuracy is calculated as follows:
$$A=\frac{TP+TN}{TP+TN+FP+FN}$$
wherein A is the accuracy, TP is the number of positive samples the model correctly predicts as positive, TN is the number of negative samples the model correctly predicts as negative, FP is the number of negative samples the model incorrectly predicts as positive, and FN is the number of positive samples the model incorrectly predicts as negative;
Periodically calculating and recording the accuracy to capture the performance of the model at different points in time;
Generating a report containing the latest accuracy, historical accuracy data, a timestamp and the number of data samples;
Analyzing the trend of the accuracy, and checking whether the performance of the model remains stable or improves after it receives new data;
Incorporating user feedback on the prediction results into subsequent model updates;
Displaying the accuracy and performance indicators in real time using a dashboard visualization tool.
Further, identifying and classifying new data with the trained model, implementing a multi-level classification system, dividing the data into different sensitivity levels, and setting different processing strategies according to specific service requirements specifically includes:
Loading the pre-trained classification model into the environment to predict on new data;
Inputting the new data into the model for identification and classification, obtaining a predicted label and the corresponding sensitivity level for each data sample;
Dividing the data samples into different sensitivity levels according to the prediction results output by the model;
Formulating corresponding processing strategies for the different sensitivity levels;
Automatically executing the corresponding processing strategies according to the sensitivity level, including data encryption, access control and audit logging;
For data with a high sensitivity level, implementing a manual review mechanism to ensure compliance with policies and regulations;
Monitoring the classification and processing in real time to ensure normal operation of the system;
Keeping detailed records of all processing, including data classification results, implementation of the processing strategies, and audit records.
Further, formulating the corresponding processing strategies for the different sensitivity levels specifically includes:
Low sensitivity level: public data, with conventional storage and access;
Medium sensitivity level: encrypted storage, restricted access rights, and regular auditing;
High sensitivity level: strict access control, with additional monitoring and auditing required.
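The three sensitivity levels and their processing strategies can be expressed as a simple lookup, sketched below in Python. The strategies for each level follow the list above, with the manual review for high-sensitivity data taken from the preceding steps. The probability cut-offs of 0.3 and 0.7 used to map a model output to a level are assumptions for illustration, since the text does not give numeric boundaries.

```python
from enum import Enum
from typing import Dict, List


class Level(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


POLICIES: Dict[Level, List[str]] = {
    Level.LOW: ["conventional storage and access"],
    Level.MEDIUM: ["encrypted storage", "restricted access rights",
                   "regular auditing"],
    Level.HIGH: ["strict access control", "additional monitoring and auditing",
                 "manual review"],
}


def sensitivity_level(prob: float, medium_at: float = 0.3,
                      high_at: float = 0.7) -> Level:
    """Map a sensitivity probability to a level (cut-offs are assumed)."""
    if prob >= high_at:
        return Level.HIGH
    if prob >= medium_at:
        return Level.MEDIUM
    return Level.LOW


for p in (0.1, 0.5, 0.9):
    level = sensitivity_level(p)
    print(p, level.value, POLICIES[level])
```

Keeping the strategies in a table separate from the classifier makes it straightforward to adjust them to specific service requirements without retraining the model.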
The application process of the invention is as follows:
Identifying a data source, including structured data, unstructured data and semi-structured data, and acquiring data;
step two, converting the data from different sources into a unified format, ensuring the consistency of the data, and aligning the data from different sources according to the associated fields;
step three, carrying out standardization processing on the image, wherein the standardization processing comprises size adjustment, normalization and enhancement;
and fourthly, identifying and deleting repeated data items, deleting missing values, and encoding category variables.
Fifthly, based on BERT as a self-supervision learning model, performing model training by using unlabeled text data to obtain context information;
Step six, outputting the embedded vector of the trained model, and extracting keywords based on a cluster analysis algorithm;
Step seven, inputting different parts of the text through the model, obtaining context embedding, and further selecting context information of a specific paragraph or sentence;
Step eight, after clustering is completed, extracting keywords from each cluster, and summarizing the extracted keywords;
step nine, acquiring image features, and combining the image features with text features;
Step ten, training data related to the sensitive data is obtained, wherein the training data comprises data samples marked with the sensitive information;
labeling the collected data, wherein labeling contents comprise sensitive data types and positions;
Step twelve, based on the fused feature vector, a detection model of sensitive data is established;
a thirteenth step of training the model by using a training set by adopting a back propagation and optimizer;
fourteen, deploying the trained model into a production environment, and performing real-time batch data processing;
fifteen, inputting data to be detected, and outputting a sensitive information position and a type of the sensitive information position by a model;
sixthly, post-processing is carried out on the model output result, and false positive are filtered;
Step seventeen, collecting data generated in real time, obtaining user feedback, and comparing the result output by the system with the target value;
Step eighteen, cleaning and preprocessing the collected feedback data, and calculating the model loss function;
Step nineteen, training the model on the training set, and adjusting the weights to minimize the model's loss function;
Step twenty, setting the update frequency of the model, the update frequency options including every new data point, every time window, and a fixed time interval;
Step twenty-one, monitoring the performance indexes of the model in real time, and evaluating the performance of the updated model based on accuracy;
Step twenty-two, using real-time data as a validation set to test the performance of the updated model;
Step twenty-three, setting a threshold to monitor whether the model's performance falls below expectations, issuing an early warning if the accuracy decreases, and establishing model version control;
Step twenty-four, further adjusting the model in combination with user feedback on the model results, and incorporating the user feedback into the decision process;
Step twenty-five, recording the details of each model update, including the updated parameters, feedback sources and performance changes;
Step twenty-six, collecting the latest input data and the corresponding true labels from the system, predicting on the newly collected data using the currently updated model, and calculating the accuracy;
Step twenty-seven, periodically calculating and recording the accuracy to capture the model's performance at different points in time;
Step twenty-eight, generating a report containing the latest accuracy, historical accuracy data, a timestamp and the number of data samples;
Step twenty-nine, analyzing the trend of the accuracy, and checking whether the model's performance remains stable and improves after receiving new data;
Step thirty, incorporating user feedback on the prediction results into the subsequent model update process;
Step thirty-one, displaying the accuracy and performance indexes in real time using a dashboard visualization tool;
Step thirty-two, loading the pre-trained classification model into the environment to predict on new data;
Step thirty-three, inputting the new data into the model for identification and classification, obtaining a predicted label and the corresponding sensitivity level for each data sample;
Step thirty-four, dividing the data samples into different sensitivity levels according to the prediction results output by the model;
Step thirty-five, formulating corresponding processing strategies for the different sensitivity levels;
Step thirty-six, automatically executing the corresponding processing strategy according to the sensitivity level, including data encryption, access control and audit logging;
Step thirty-seven, implementing a manual review mechanism for data with a high sensitivity level, to ensure compliance with policies and regulations;
Step thirty-eight, monitoring the classification and processing in real time to ensure normal operation of the system;
Step thirty-nine, keeping detailed records of all processing, including the data classification results, the execution of the processing strategies, and the audit records.
In summary, the invention has the advantages that:
Data are collected from a variety of sources, ensuring the diversity and richness of the data, and the different types of data are integrated into a unified structure using a multi-modal data fusion technology, ensuring the complementarity of the information;
A self-supervised learning technology is applied to extract keywords and context information from text data, while high-level features such as edges, shapes and textures are extracted from image data, improving the extraction rate of the data;
Based on the extracted features, a deep learning model is constructed and trained, and the model is optimized on a large-scale data set, ensuring the efficiency of the model in sensitive data detection;
A dynamic model updating mechanism is introduced, and the model is adjusted and optimized online according to real-time data feedback, so as to adapt to changes in the data and to emerging patterns.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and descriptions merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.

Claims (10)

1. An intelligent identification method for computer sensitive data based on artificial intelligence, characterized by comprising the following steps:
collecting data from different sources, integrating the data from different sources based on a multi-modal data fusion technology, and preprocessing the integrated data;
extracting keywords and context information from text and extracting high-level features from images based on a self-supervised learning technology;
training a model based on a deep learning model, and detecting sensitive data using the model trained on a large-scale data set;
introducing a dynamic model updating mechanism, and adjusting and optimizing the model online according to real-time feedback;
identifying and classifying new data using the trained model, implementing a multi-level classification system that divides the data into different sensitivity levels, and setting different processing strategies.
2. The intelligent identification method for computer sensitive data based on artificial intelligence according to claim 1, wherein collecting data from different sources, integrating the data from different sources based on a multi-modal data fusion technology, and preprocessing the integrated data specifically comprises:
identifying the data sources, including structured data, unstructured data and semi-structured data, and acquiring the data;
converting the data from different sources into a unified format to ensure data consistency, and aligning the data from different sources according to associated fields;
standardizing the images, including resizing, normalization and enhancement;
identifying and deleting duplicate data entries, deleting missing values, and encoding categorical variables.
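As an illustrative, non-limiting sketch, the final preprocessing step (deleting duplicate entries, deleting missing values, and encoding a categorical variable) could be implemented as follows. The function name `preprocess`, the dictionary record layout, and the `_code` suffix are assumptions made for illustration and are not taken from the specification.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def preprocess(records: List[Record], category_field: str) -> Tuple[List[Record], Dict[str, int]]:
    """Drop rows with missing values and duplicate rows, then integer-encode one field."""
    seen = set()
    cleaned: List[Record] = []
    for rec in records:
        # Treat None and the empty string as missing values and drop the record.
        if any(v is None or v == "" for v in rec.values()):
            continue
        # A record is a duplicate when every field matches an earlier record.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(dict(rec))

    # Encode the categorical variable as integers in order of first appearance.
    codes: Dict[str, int] = {}
    for rec in cleaned:
        label = str(rec[category_field])
        rec[category_field + "_code"] = codes.setdefault(label, len(codes))
    return cleaned, codes
```

In practice a one-hot encoding may be preferred over integer codes when the categories have no natural order; the integer form is used here only to keep the sketch short.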
3. The intelligent identification method for computer sensitive data based on artificial intelligence according to claim 2, wherein extracting keywords and context information from text and extracting high-level features from images based on a self-supervised learning technology specifically comprises:
using BERT as the self-supervised learning model, training the model on unlabeled text data to obtain context information;
outputting the embedding vectors of the trained model, and extracting keywords based on a cluster analysis algorithm;
inputting different parts of the text into the model to obtain context embeddings, and then selecting the context information of a specific paragraph or sentence;
after clustering is completed, extracting keywords from each cluster and summarizing the extracted keywords;
acquiring image features and combining them with the text features, the feature combination formula being as follows:
F=α·T+β·I
where F is the fused feature vector, α is the weight coefficient of the text feature vector, T is the text feature vector, β is the weight coefficient of the image feature vector, and I is the image feature vector.
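By way of a non-limiting example, the feature combination F=α·T+β·I can be computed element-wise as sketched below. The sketch assumes that T and I have already been projected to the same dimension, which the formula requires but the text does not describe; the function name `fuse_features` is hypothetical.

```python
from typing import List, Sequence


def fuse_features(text_vec: Sequence[float], image_vec: Sequence[float],
                  alpha: float, beta: float) -> List[float]:
    """Return F = alpha * T + beta * I, computed element by element."""
    # Assumption: both vectors share one dimension; otherwise the sum is undefined.
    if len(text_vec) != len(image_vec):
        raise ValueError("T and I must have the same dimension")
    return [alpha * t + beta * i for t, i in zip(text_vec, image_vec)]
```

For instance, `fuse_features([1.0, 2.0], [3.0, 4.0], 0.5, 0.5)` weights both modalities equally.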
4. The intelligent identification method for computer sensitive data based on artificial intelligence according to claim 3, wherein training a model based on a deep learning model and detecting sensitive data using the model trained on a large-scale data set specifically comprises:
acquiring training data related to sensitive data, the training data comprising data samples labeled with sensitive information;
labeling the collected data, the labels comprising the type and position of the sensitive data;
establishing a sensitive data detection model based on the fused feature vector;
training the model on the training set using backpropagation and an optimizer;
deploying the trained model to a production environment and performing real-time batch data processing;
inputting the data to be detected, the model outputting the position and type of the sensitive information;
post-processing the model output to filter out false alarms and false positives.
5. The intelligent identification method for computer sensitive data based on artificial intelligence according to claim 4, wherein establishing the sensitive data detection model based on the fused feature vector specifically comprises:
establishing the sensitive data detection model based on an activation function, with the fused feature vector as the model input and the prediction result of the model as the output, the formula of the sensitive data detection model being as follows:
$$h(F)=\sigma(z)=\frac{1}{1+e^{-z}}$$
where h(F) is the prediction result of the model, F is the fused feature vector, σ(z) is the Sigmoid function, z is the linear combination result, and e is the mathematical constant;
outputting the sensitivity probability of the data based on the prediction result of the model.
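As an illustrative sketch only, the Sigmoid-based detection model can be evaluated as follows. The weight vector `weights` and the scalar `bias` used to form the linear combination z are assumptions; the text states only that z is a linear combination result and does not name its parameters.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Sigmoid function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_sensitivity(fused: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Return h(F): the probability that the sample described by F is sensitive."""
    if len(fused) != len(weights):
        raise ValueError("fused vector and weights must have the same dimension")
    # z is the linear combination of the fused features (hypothetical parameters).
    z = sum(w * f for w, f in zip(weights, fused)) + bias
    return sigmoid(z)
```

The output lies between 0 and 1 and can be read directly as the sensitivity probability of the data.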
6. The intelligent identification method for computer sensitive data based on artificial intelligence according to claim 5, wherein introducing a dynamic model updating mechanism and adjusting and optimizing the model online according to real-time feedback specifically comprises:
collecting data generated in real time, obtaining user feedback, and comparing the result output by the system with the target value;
cleaning and preprocessing the collected feedback data, and calculating the model loss function;
training the model on the training set, and adjusting the weights to minimize the model's loss function;
setting the update frequency of the model, the update frequency options including every new data point, every time window, and a fixed time interval;
monitoring the performance indexes of the model in real time, and evaluating the performance of the updated model based on accuracy;
using real-time data as a validation set to test the performance of the updated model;
setting a threshold to monitor whether the model's performance falls below expectations, issuing an early warning if the accuracy decreases, and establishing model version control;
further adjusting the model in combination with user feedback on the model results, and incorporating the user feedback into the decision process;
recording the details of each model update, including the updated parameters, feedback sources and performance changes.
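A minimal, non-limiting sketch of the threshold monitoring, early warning and version recording described above is given below. The class names `ModelVersion` and `UpdateMonitor`, and the rule that a warning is raised either when the accuracy falls below the threshold or when it decreases relative to the previous version, are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelVersion:
    version: int          # sequential version number for version control
    accuracy: float       # accuracy measured after this update
    feedback_source: str  # where the feedback driving the update came from


@dataclass
class UpdateMonitor:
    threshold: float
    history: List[ModelVersion] = field(default_factory=list)

    def record_update(self, accuracy: float, feedback_source: str) -> bool:
        """Log a model update; return True when an early warning should be issued."""
        previous = self.history[-1].accuracy if self.history else None
        self.history.append(ModelVersion(len(self.history) + 1, accuracy, feedback_source))
        below_threshold = accuracy < self.threshold
        dropped = previous is not None and accuracy < previous
        return below_threshold or dropped
```

Because every update is appended to `history`, an earlier version can be identified for rollback when a warning is raised.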
7. The intelligent identification method for computer sensitive data based on artificial intelligence according to claim 6, wherein cleaning and preprocessing the collected feedback data and calculating the model loss function specifically comprises:
the model loss function being calculated as follows:
$$J=-\frac{1}{N}\sum_{i=1}^{N}\left[h_i\log\hat{h}_i+\left(1-h_i\right)\log\left(1-\hat{h}_i\right)\right]$$
where J is the value of the loss function, N is the total number of samples, $h_i$ is the true label of the i-th sample, and $\hat{h}_i$ is the predicted probability of the i-th sample.
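As a non-limiting sketch, the loss calculation can be illustrated as follows, assuming that the loss takes the standard binary cross-entropy form that is commonly paired with a Sigmoid output. The function name `binary_cross_entropy` and the clipping constant `eps` are assumptions introduced for illustration.

```python
import math
from typing import Sequence


def binary_cross_entropy(labels: Sequence[int], probs: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Average binary cross-entropy J over N samples (assumed form of the loss)."""
    if not labels or len(labels) != len(probs):
        raise ValueError("labels and probs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip the probability so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

A confident correct prediction contributes a loss close to zero, while a confident wrong prediction is penalized heavily, which is what drives the weight adjustment during training.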
8. The intelligent identification method for computer sensitive data based on artificial intelligence according to claim 7, wherein monitoring the performance indexes of the model in real time and evaluating the performance of the updated model based on accuracy specifically comprises:
collecting the latest input data and the corresponding true labels from the system, predicting on the newly collected data using the currently updated model, and calculating the accuracy, the accuracy calculation formula being as follows:
$$A=\frac{TP+TN}{TP+TN+FP+FN}$$
where A is the accuracy, TP is the number of samples for which the model correctly predicts the positive class as positive, TN is the number of samples for which the model correctly predicts the negative class as negative, FP is the number of samples for which the model incorrectly predicts the negative class as positive, and FN is the number of samples for which the model incorrectly predicts the positive class as negative;
periodically calculating and recording the accuracy to capture the model's performance at different points in time;
generating a report containing the latest accuracy, historical accuracy data, a timestamp and the number of data samples;
analyzing the trend of the accuracy, and checking whether the model's performance remains stable and improves after receiving new data;
incorporating user feedback on the prediction results into the subsequent model update process;
displaying the accuracy and performance indexes in real time using a dashboard visualization tool.
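By way of illustration only, the accuracy can be computed from the true labels and the model predictions as sketched below. The function names `confusion_counts` and `accuracy`, and the convention that label 1 denotes the positive (sensitive) class, are assumptions.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN), treating label 1 as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total
```

Calling `accuracy(*confusion_counts(y_true, y_pred))` on each new batch of labeled data yields the values that are recorded periodically and included in the report.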
9. The intelligent identification method for computer sensitive data based on artificial intelligence according to claim 8, wherein identifying and classifying new data using the trained model, implementing a multi-level classification system that divides the data into different sensitivity levels, and setting different processing strategies according to specific business requirements specifically comprises:
loading the pre-trained classification model into the environment to predict on new data;
inputting the new data into the model for identification and classification, obtaining a predicted label and the corresponding sensitivity level for each data sample;
dividing the data samples into different sensitivity levels according to the prediction results output by the model;
formulating corresponding processing strategies for the different sensitivity levels;
automatically executing the corresponding processing strategy according to the sensitivity level, including data encryption, access control and audit logging;
implementing a manual review mechanism for data with a high sensitivity level, to ensure compliance with policies and regulations;
monitoring the classification and processing in real time to ensure normal operation of the system;
keeping detailed records of all processing, including the data classification results, the execution of the processing strategies, and the audit records.
10. The intelligent identification method for computer sensitive data based on artificial intelligence according to claim 9, wherein formulating corresponding processing strategies for the different sensitivity levels specifically comprises:
low sensitivity level: public data, with conventional storage and access;
medium sensitivity level: encrypted storage, restricted access rights, and periodic auditing;
high sensitivity level: strict access control, with additional monitoring and auditing.
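As a non-limiting sketch, the mapping from a predicted sensitivity probability to a sensitivity level and its processing strategy might look as follows. The cut-off values 0.3 and 0.7, the action names, and the function names `classify` and `actions_for` are hypothetical; the text does not specify how the levels are derived from the model output.

```python
from enum import Enum
from typing import Dict, Tuple


class Sensitivity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Processing strategy per level; action names are illustrative placeholders.
POLICIES: Dict[Sensitivity, Tuple[str, ...]] = {
    Sensitivity.LOW: ("conventional_storage", "open_access"),
    Sensitivity.MEDIUM: ("encrypted_storage", "restricted_access", "periodic_audit"),
    Sensitivity.HIGH: ("strict_access_control", "additional_monitoring", "audit", "manual_review"),
}


def classify(prob: float, low_cut: float = 0.3, high_cut: float = 0.7) -> Sensitivity:
    """Map a sensitivity probability to a level using assumed cut-off values."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must lie in [0, 1]")
    if prob < low_cut:
        return Sensitivity.LOW
    if prob < high_cut:
        return Sensitivity.MEDIUM
    return Sensitivity.HIGH


def actions_for(prob: float) -> Tuple[str, ...]:
    """Return the processing actions to execute for a given probability."""
    return POLICIES[classify(prob)]
```

Keeping the strategies in a single table makes it straightforward to adjust them to specific business requirements without touching the classification logic.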

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411461054.8A CN119312165A (en) 2024-10-18 2024-10-18 Intelligent identification method of computer sensitive data based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411461054.8A CN119312165A (en) 2024-10-18 2024-10-18 Intelligent identification method of computer sensitive data based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN119312165A true CN119312165A (en) 2025-01-14

Family

ID=94182325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411461054.8A Pending CN119312165A (en) 2024-10-18 2024-10-18 Intelligent identification method of computer sensitive data based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN119312165A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119670157A (en) * 2025-02-21 2025-03-21 贵州华谊联盛科技有限公司 Data desensitization processing method and system based on large model
CN119720281A (en) * 2025-02-27 2025-03-28 广东网安科技有限公司 A data sharing method for a technology finance platform
CN119720281B (en) * 2025-02-27 2025-07-08 广东网安科技有限公司 A data sharing method for a technology finance platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination