Disclosure of Invention
To solve the foregoing technical problems, the invention provides an artificial-intelligence-based intelligent identification method for computer sensitive data.
In order to achieve the above purpose, the invention adopts the following technical scheme:
An artificial-intelligence-based intelligent identification method for computer sensitive data comprises the following steps:
collecting data from different sources, integrating the data using a multimodal data fusion technique, and preprocessing the integrated data;
extracting keywords and context information from text, and extracting high-level features from images, using a self-supervised learning technique;
training a deep learning model on a large-scale data set, and detecting sensitive data with the trained model;
introducing a dynamic model updating mechanism, and adjusting and optimizing the model online according to real-time feedback;
identifying and classifying new data with the trained model, implementing a multi-level classification system that divides the data into different sensitivity levels, and setting a different processing strategy for each level.
Preferably, collecting data from different sources, integrating the data using a multimodal data fusion technique, and preprocessing the integrated data specifically includes:
identifying the data sources, which include structured, unstructured, and semi-structured data, and acquiring the data;
converting the data from the different sources into a unified format to ensure consistency, and aligning the data from the different sources on their associated fields;
standardizing the images, including resizing, normalization, and enhancement;
identifying and deleting duplicate data entries, deleting entries with missing values, and encoding categorical variables.
Preferably, extracting keywords and context information from text, and extracting high-level features from images, using a self-supervised learning technique specifically includes:
using BERT as the self-supervised learning model and training it on unlabeled text data to obtain context information;
outputting the embedding vectors of the trained model, and extracting keywords with a cluster analysis algorithm;
inputting different parts of the text into the model to obtain context embeddings, and then selecting the context information of a specific paragraph or sentence;
after clustering is complete, extracting keywords from each cluster and summarizing the extracted keywords;
acquiring image features and combining them with the text features, wherein the feature combination formula is as follows:
F=α·T+β·I
wherein F is the fused feature vector, α is the weight coefficient of the text feature vector T, T is the text feature vector, β is the weight coefficient of the image feature vector I, and I is the image feature vector.
Preferably, training a deep learning model on a large-scale data set and detecting sensitive data with the trained model specifically includes:
acquiring training data related to sensitive data, the training data comprising data samples labeled with sensitive information;
labeling the collected data, the labels comprising the type and position of the sensitive data;
establishing a sensitive data detection model based on the fused feature vector;
training the model on the training set using backpropagation and an optimizer;
deploying the trained model to a production environment for real-time or batch data processing;
inputting the data to be detected, the model outputting the position and type of the sensitive information;
post-processing the model output to filter out false alarms (false positives).
Preferably, establishing a sensitive data detection model based on the fused feature vector specifically includes:
taking the fused feature vector as the model input and the prediction result of the model as the output, and establishing the sensitive data detection model based on an activation function, wherein the formula of the sensitive data detection model is as follows:
h(F)=σ(z)=1/(1+e^(-z))
wherein h(F) is the prediction result of the model, F is the fused feature vector, σ(z) is the Sigmoid function, z is the linear combination result computed from F, and e is the base of the natural logarithm;
outputting the sensitivity probability of the data based on the prediction result of the model.
Preferably, introducing a dynamic model updating mechanism and adjusting and optimizing the model online according to real-time feedback specifically includes:
collecting data generated in real time, and acquiring user feedback together with a comparison between the results output by the system and the target values;
cleaning and preprocessing the collected feedback data, and calculating the model loss function;
training the model on the training set, adjusting the weights to minimize the model loss function;
setting the update frequency of the model, wherein the model is updated on each new data point, on each time window, or at a fixed time interval;
monitoring the performance indicators of the model in real time, and evaluating the performance of the updated model based on accuracy;
using the real-time data as a validation set to test the performance of the updated model;
setting a threshold to monitor whether the performance of the model falls below expectations, issuing an early warning if the accuracy drops, and establishing model version control;
further adjusting the model in combination with user feedback on the model results, and incorporating the user feedback into the decision process;
recording the details of each model update, including the updated parameters, feedback sources, and performance changes.
Preferably, cleaning and preprocessing the collected feedback data and calculating the model loss function specifically includes:
the model loss function calculation formula is as follows:
J=-(1/N)·Σ_{i=1}^{N}[y_i·log(ŷ_i)+(1-y_i)·log(1-ŷ_i)]
wherein J is the value of the loss function, N is the total number of samples, y_i is the true label of the i-th sample, and ŷ_i is the predicted probability of the i-th sample.
Preferably, monitoring the performance indicators of the model in real time and evaluating the performance of the updated model based on accuracy specifically includes:
collecting the latest input data and the corresponding true labels from the system, predicting on the newly collected data with the currently updated model, and calculating the accuracy, wherein the accuracy calculation formula is as follows:
A=(TP+TN)/(TP+TN+FP+FN)
wherein A is the accuracy, TP is the number of positive samples the model correctly predicts as positive, TN is the number of negative samples the model correctly predicts as negative, FP is the number of negative samples the model incorrectly predicts as positive, and FN is the number of positive samples the model incorrectly predicts as negative;
periodically calculating and recording the accuracy to capture the performance of the model at different points in time;
generating a report containing the latest accuracy, historical accuracy data, a timestamp, and the number of data samples;
analyzing the trend of the accuracy, and checking whether the performance of the model remains stable or improves after receiving new data;
incorporating user feedback on the prediction results into subsequent model updates;
displaying the accuracy and performance indicators in real time using a dashboard visualization tool.
Preferably, identifying and classifying new data with the trained model, implementing a multi-level classification system that divides the data into different sensitivity levels, and setting different processing strategies according to specific service requirements specifically includes:
loading the pre-trained classification model into the environment to predict on new data;
inputting the new data into the model for identification and classification, to obtain a predicted label and the corresponding sensitivity level for each data sample;
dividing the data samples into different sensitivity levels according to the prediction results output by the model;
formulating a corresponding processing strategy for each sensitivity level;
automatically executing the corresponding processing strategy according to the sensitivity level, including data encryption, access control, and audit logging;
implementing a manual review mechanism for data with a high sensitivity level, to ensure compliance with policies and regulations;
monitoring the classification and processing procedures in real time to ensure normal operation of the system;
keeping detailed records of all processing procedures, including the data classification results, the execution of the processing strategies, and the audit records.
Preferably, formulating a corresponding processing strategy for each sensitivity level specifically includes:
low sensitivity level: the data is public, with conventional storage and access;
medium sensitivity level: the data is stored encrypted, access rights are restricted, and audits are performed regularly;
high sensitivity level: strict access control is applied, with additional monitoring and auditing.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides multimodal data fusion: different types of data are integrated into a unified structure using a multimodal data fusion technique, which ensures that the information is complementary. The invention also introduces a dynamic model updating mechanism that adjusts and optimizes the model online according to real-time data feedback, so that the model adapts to changes and emerging patterns in the data and detects sensitive data more accurately.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art.
Referring to Fig. 1, the artificial-intelligence-based intelligent identification method for computer sensitive data comprises the following steps:
collecting data from different sources, integrating the data using a multimodal data fusion technique, and preprocessing the integrated data;
extracting keywords and context information from text, and extracting high-level features from images, using a self-supervised learning technique;
training a deep learning model on a large-scale data set, and detecting sensitive data with the trained model;
introducing a dynamic model updating mechanism, and adjusting and optimizing the model online according to real-time feedback;
identifying and classifying new data with the trained model, implementing a multi-level classification system that divides the data into different sensitivity levels, and setting a different processing strategy for each level.
Further, collecting data from different sources, integrating the data using a multimodal data fusion technique, and preprocessing the integrated data specifically includes:
identifying the data sources, which include structured, unstructured, and semi-structured data, and acquiring the data;
converting the data from the different sources into a unified format to ensure consistency, and aligning the data from the different sources on their associated fields;
standardizing the images, including resizing, normalization, and enhancement;
identifying and deleting duplicate data entries, deleting entries with missing values, and encoding categorical variables.
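By way of a non-limiting illustration, the cleaning part of the preprocessing described above might be sketched in Python as follows. The function name `preprocess`, the representation of records as dictionaries, and the choice to treat `None` and empty strings as missing values are assumptions made only for this example and are not prescribed by the method.

```python
from typing import Any, Dict, List


def preprocess(records: List[Dict[str, Any]], category_field: str) -> List[Dict[str, Any]]:
    """Drop rows with missing values, drop exact duplicates, integer-encode one field."""
    seen = set()
    cleaned: List[Dict[str, Any]] = []
    for record in records:
        # Assumption: a value of None or an empty string counts as missing.
        if any(value is None or value == "" for value in record.values()):
            continue
        key = tuple(sorted(record.items()))  # order-independent fingerprint of the row
        if key in seen:  # exact duplicate of an earlier entry
            continue
        seen.add(key)
        cleaned.append(dict(record))
    # Encode category values as integers, assigned in sorted order for determinism.
    categories = sorted({row[category_field] for row in cleaned})
    codes = {name: index for index, name in enumerate(categories)}
    for row in cleaned:
        row[category_field] = codes[row[category_field]]
    return cleaned
```

In this sketch the categorical field is replaced by an integer code; a one-hot encoding would serve equally well where the downstream model requires it.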
Further, extracting keywords and context information from text, and extracting high-level features from images, using a self-supervised learning technique specifically includes:
using BERT as the self-supervised learning model and training it on unlabeled text data to obtain context information;
outputting the embedding vectors of the trained model, and extracting keywords with a cluster analysis algorithm;
inputting different parts of the text into the model to obtain context embeddings, and then selecting the context information of a specific paragraph or sentence;
after clustering is complete, extracting keywords from each cluster and summarizing the extracted keywords;
acquiring image features and combining them with the text features, wherein the feature combination formula is as follows:
F=α·T+β·I
wherein F is the fused feature vector, α is the weight coefficient of the text feature vector T, T is the text feature vector, β is the weight coefficient of the image feature vector I, and I is the image feature vector.
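As a minimal sketch of the feature combination formula F=α·T+β·I, the weighted sum can be computed element by element. The function name `fuse_features` and the default weights of 0.5 are illustrative assumptions; the sketch also assumes that T and I have already been projected to the same dimension, which the weighted sum requires.

```python
from typing import List, Sequence


def fuse_features(text_vec: Sequence[float], image_vec: Sequence[float],
                  alpha: float = 0.5, beta: float = 0.5) -> List[float]:
    """Return F = alpha * T + beta * I, computed element by element."""
    if len(text_vec) != len(image_vec):
        raise ValueError("T and I must have the same dimension to be summed")
    return [alpha * t + beta * i for t, i in zip(text_vec, image_vec)]


fused = fuse_features([1.0, 2.0], [3.0, 4.0], alpha=0.5, beta=0.5)
print(fused)  # [2.0, 3.0]
```

Choosing α larger than β gives the text features more influence on the fused vector, and vice versa.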
Further, training a deep learning model on a large-scale data set and detecting sensitive data with the trained model specifically includes:
acquiring training data related to sensitive data, the training data comprising data samples labeled with sensitive information;
labeling the collected data, the labels comprising the type and position of the sensitive data;
establishing a sensitive data detection model based on the fused feature vector;
training the model on the training set using backpropagation and an optimizer;
deploying the trained model to a production environment for real-time or batch data processing;
inputting the data to be detected, the model outputting the position and type of the sensitive information;
post-processing the model output to filter out false alarms (false positives).
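One possible form of the post-processing step is a confidence threshold applied to the model output. The following Python sketch is only an assumption about how such filtering could look: the `Detection` record, its field names, and the threshold value of 0.8 are hypothetical and are not specified by the method.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    kind: str     # sensitive data type reported by the model
    start: int    # start offset of the detected span
    end: int      # end offset of the detected span (exclusive)
    score: float  # model confidence in [0, 1]


def filter_detections(detections: List[Detection], min_score: float = 0.8) -> List[Detection]:
    """Keep only well-formed detections whose confidence reaches the threshold."""
    return [d for d in detections if d.end > d.start and d.score >= min_score]
```

Raising `min_score` removes more false alarms at the cost of possibly discarding true detections, so the value would need to be tuned on labeled data.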
Further, establishing a sensitive data detection model based on the fused feature vector specifically includes:
taking the fused feature vector as the model input and the prediction result of the model as the output, and establishing the sensitive data detection model based on an activation function, wherein the formula of the sensitive data detection model is as follows:
h(F)=σ(z)=1/(1+e^(-z))
wherein h(F) is the prediction result of the model, F is the fused feature vector, σ(z) is the Sigmoid function, z is the linear combination result computed from F, and e is the base of the natural logarithm;
outputting the sensitivity probability of the data based on the prediction result of the model.
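The Sigmoid-based detection model above can be illustrated with a short Python sketch. The text states only that z is a linear combination result; the explicit `weights` and `bias` parameters below are therefore assumptions about how that combination is formed, and the function names are illustrative.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function: 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_sensitivity(fused: Sequence[float], weights: Sequence[float],
                        bias: float = 0.0) -> float:
    """Return h(F): the probability that the sample is sensitive."""
    # Assumption: z is the weighted sum of the fused features plus a bias term.
    z = sum(w * f for w, f in zip(weights, fused)) + bias
    return sigmoid(z)
```

With all-zero features and no bias, z is 0 and the sketch returns a probability of exactly 0.5, which is the decision boundary of the Sigmoid function.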
Further, introducing a dynamic model updating mechanism and adjusting and optimizing the model online according to real-time feedback specifically includes:
collecting data generated in real time, and acquiring user feedback together with a comparison between the results output by the system and the target values;
cleaning and preprocessing the collected feedback data, and calculating the model loss function;
training the model on the training set, adjusting the weights to minimize the model loss function;
setting the update frequency of the model, wherein the model is updated on each new data point, on each time window, or at a fixed time interval;
monitoring the performance indicators of the model in real time, and evaluating the performance of the updated model based on accuracy;
using the real-time data as a validation set to test the performance of the updated model;
setting a threshold to monitor whether the performance of the model falls below expectations, issuing an early warning if the accuracy drops, and establishing model version control;
further adjusting the model in combination with user feedback on the model results, and incorporating the user feedback into the decision process;
recording the details of each model update, including the updated parameters, feedback sources, and performance changes.
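For the case in which the model is updated on each new data point, the weight adjustment might look like the following Python sketch. It assumes a linear model with a Sigmoid output trained by plain gradient descent on a cross-entropy loss; the function names, the `learning_rate` parameter, and the use of a single feedback sample per step are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def online_update(weights: Sequence[float], bias: float,
                  features: Sequence[float], label: int,
                  learning_rate: float = 0.1) -> Tuple[List[float], float]:
    """Apply one gradient-descent step using a single feedback sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    # For a Sigmoid output with cross-entropy loss, dJ/dz = prediction - label.
    error = sigmoid(z) - label
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias
```

Each call moves the prediction for the given sample toward its feedback label; updating per time window or at a fixed interval would instead accumulate the gradient over all samples collected in that period.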
Further, cleaning and preprocessing the collected feedback data and calculating the model loss function specifically includes:
the model loss function calculation formula is as follows:
J=-(1/N)·Σ_{i=1}^{N}[y_i·log(ŷ_i)+(1-y_i)·log(1-ŷ_i)]
wherein J is the value of the loss function, N is the total number of samples, y_i is the true label of the i-th sample, and ŷ_i is the predicted probability of the i-th sample.
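The loss computation can be sketched in Python under the assumption that the loss is the binary cross-entropy commonly paired with a Sigmoid output, averaged over the N samples. The function name `binary_cross_entropy` and the clipping constant `eps`, which guards against taking the logarithm of zero, are assumptions introduced for this example.

```python
import math
from typing import Sequence


def binary_cross_entropy(labels: Sequence[int], probabilities: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Return the mean binary cross-entropy J over N samples."""
    if len(labels) == 0 or len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must be non-empty and equal in length")
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

A confident correct prediction contributes a loss close to zero, while a confident wrong prediction contributes a large loss, which is what drives the weight adjustment during updating.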
Further, monitoring the performance indicators of the model in real time and evaluating the performance of the updated model based on accuracy specifically includes:
collecting the latest input data and the corresponding true labels from the system, predicting on the newly collected data with the currently updated model, and calculating the accuracy, wherein the accuracy calculation formula is as follows:
A=(TP+TN)/(TP+TN+FP+FN)
wherein A is the accuracy, TP is the number of positive samples the model correctly predicts as positive, TN is the number of negative samples the model correctly predicts as negative, FP is the number of negative samples the model incorrectly predicts as positive, and FN is the number of positive samples the model incorrectly predicts as negative;
periodically calculating and recording the accuracy to capture the performance of the model at different points in time;
generating a report containing the latest accuracy, historical accuracy data, a timestamp, and the number of data samples;
analyzing the trend of the accuracy, and checking whether the performance of the model remains stable or improves after receiving new data;
incorporating user feedback on the prediction results into subsequent model updates;
displaying the accuracy and performance indicators in real time using a dashboard visualization tool.
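The accuracy calculation and the early-warning check can be sketched in Python from the definitions of TP, TN, FP, and FN given above. The function names and the `max_drop` tolerance of 0.05 are illustrative assumptions; the method itself does not fix a particular threshold.

```python
from typing import Sequence, Tuple


def confusion_counts(labels: Sequence[int], predictions: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for binary labels, where 1 is the positive class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return A = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined for zero samples")
    return (tp + tn) / total


def needs_alert(current: float, baseline: float, max_drop: float = 0.05) -> bool:
    """Signal an early warning when accuracy falls more than max_drop below baseline."""
    return baseline - current > max_drop
```

For example, 40 true positives, 50 true negatives, 5 false positives, and 5 false negatives give an accuracy of 0.9; if a later evaluation window yields 0.8, the drop of 0.1 exceeds the assumed tolerance and a warning would be raised.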
Further, identifying and classifying new data with the trained model, implementing a multi-level classification system that divides the data into different sensitivity levels, and setting different processing strategies according to specific service requirements specifically includes:
loading the pre-trained classification model into the environment to predict on new data;
inputting the new data into the model for identification and classification, to obtain a predicted label and the corresponding sensitivity level for each data sample;
dividing the data samples into different sensitivity levels according to the prediction results output by the model;
formulating a corresponding processing strategy for each sensitivity level;
automatically executing the corresponding processing strategy according to the sensitivity level, including data encryption, access control, and audit logging;
implementing a manual review mechanism for data with a high sensitivity level, to ensure compliance with policies and regulations;
monitoring the classification and processing procedures in real time to ensure normal operation of the system;
keeping detailed records of all processing procedures, including the data classification results, the execution of the processing strategies, and the audit records.
Further, formulating a corresponding processing strategy for each sensitivity level specifically includes:
low sensitivity level: the data is public, with conventional storage and access;
medium sensitivity level: the data is stored encrypted, access rights are restricted, and audits are performed regularly;
high sensitivity level: strict access control is applied, with additional monitoring and auditing.
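The mapping from sensitivity level to processing strategy can be illustrated with the following Python sketch. The three levels and their strategies follow the description above; the probability thresholds of 0.4 and 0.8 used to assign a level, the strategy identifiers, and all names in the code are hypothetical and would in practice be set according to the specific service requirements.

```python
from enum import Enum
from typing import Dict, Tuple


class SensitivityLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Strategy identifiers are illustrative labels for the measures described in the text.
POLICIES: Dict[SensitivityLevel, Tuple[str, ...]] = {
    SensitivityLevel.LOW: ("public", "conventional_storage_and_access"),
    SensitivityLevel.MEDIUM: ("encrypted_storage", "restricted_access", "periodic_audit"),
    SensitivityLevel.HIGH: ("strict_access_control", "additional_monitoring",
                            "audit", "manual_review"),
}


def classify_level(probability: float, medium_threshold: float = 0.4,
                   high_threshold: float = 0.8) -> SensitivityLevel:
    """Map a sensitivity probability to a level using assumed thresholds."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability >= high_threshold:
        return SensitivityLevel.HIGH
    if probability >= medium_threshold:
        return SensitivityLevel.MEDIUM
    return SensitivityLevel.LOW


def strategies_for(probability: float) -> Tuple[str, ...]:
    """Return the processing strategies to execute for a predicted probability."""
    return POLICIES[classify_level(probability)]
```

Keeping the policy table separate from the classification function lets the strategies for a level be changed without retraining or modifying the model.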
The application process of the invention is as follows:
Step one, identifying the data sources, which include structured, unstructured, and semi-structured data, and acquiring the data;
Step two, converting the data from the different sources into a unified format to ensure consistency, and aligning the data from the different sources on their associated fields;
Step three, standardizing the images, including resizing, normalization, and enhancement;
Step four, identifying and deleting duplicate data entries, deleting entries with missing values, and encoding categorical variables;
Step five, using BERT as the self-supervised learning model and training it on unlabeled text data to obtain context information;
Step six, outputting the embedding vectors of the trained model, and extracting keywords with a cluster analysis algorithm;
Step seven, inputting different parts of the text into the model to obtain context embeddings, and then selecting the context information of a specific paragraph or sentence;
Step eight, after clustering is complete, extracting keywords from each cluster and summarizing the extracted keywords;
Step nine, acquiring image features and combining them with the text features;
Step ten, acquiring training data related to sensitive data, the training data comprising data samples labeled with sensitive information;
Step eleven, labeling the collected data, the labels comprising the type and position of the sensitive data;
Step twelve, establishing a sensitive data detection model based on the fused feature vector;
Step thirteen, training the model on the training set using backpropagation and an optimizer;
Step fourteen, deploying the trained model to a production environment for real-time or batch data processing;
Step fifteen, inputting the data to be detected, the model outputting the position and type of the sensitive information;
Step sixteen, post-processing the model output to filter out false alarms (false positives);
Step seventeen, collecting data generated in real time, and acquiring user feedback together with a comparison between the results output by the system and the target values;
Step eighteen, cleaning and preprocessing the collected feedback data, and calculating the model loss function;
Step nineteen, training the model on the training set, adjusting the weights to minimize the model loss function;
Step twenty, setting the update frequency of the model, wherein the model is updated on each new data point, on each time window, or at a fixed time interval;
Step twenty-one, monitoring the performance indicators of the model in real time, and evaluating the performance of the updated model based on accuracy;
Step twenty-two, using the real-time data as a validation set to test the performance of the updated model;
Step twenty-three, setting a threshold to monitor whether the performance of the model falls below expectations, issuing an early warning if the accuracy drops, and establishing model version control;
Step twenty-four, further adjusting the model in combination with user feedback on the model results, and incorporating the user feedback into the decision process;
Step twenty-five, recording the details of each model update, including the updated parameters, feedback sources, and performance changes;
Step twenty-six, collecting the latest input data and the corresponding true labels from the system, predicting on the newly collected data with the currently updated model, and calculating the accuracy;
Step twenty-seven, periodically calculating and recording the accuracy to capture the performance of the model at different points in time;
Step twenty-eight, generating a report containing the latest accuracy, historical accuracy data, a timestamp, and the number of data samples;
Step twenty-nine, analyzing the trend of the accuracy, and checking whether the performance of the model remains stable or improves after receiving new data;
Step thirty, incorporating user feedback on the prediction results into subsequent model updates;
Step thirty-one, displaying the accuracy and performance indicators in real time using a dashboard visualization tool;
Step thirty-two, loading the pre-trained classification model into the environment to predict on new data;
Step thirty-three, inputting the new data into the model for identification and classification, to obtain a predicted label and the corresponding sensitivity level for each data sample;
Step thirty-four, dividing the data samples into different sensitivity levels according to the prediction results output by the model;
Step thirty-five, formulating a corresponding processing strategy for each sensitivity level;
Step thirty-six, automatically executing the corresponding processing strategy according to the sensitivity level, including data encryption, access control, and audit logging;
Step thirty-seven, implementing a manual review mechanism for data with a high sensitivity level, to ensure compliance with policies and regulations;
Step thirty-eight, monitoring the classification and processing procedures in real time to ensure normal operation of the system;
Step thirty-nine, keeping detailed records of all processing procedures, including the data classification results, the execution of the processing strategies, and the audit records.
In summary, the invention has the advantages that:
Data is collected from a variety of sources, which ensures its diversity and richness, and different types of data are integrated into a unified structure using a multimodal data fusion technique, which ensures that the information is complementary;
A self-supervised learning technique is applied to extract keywords and context information from text data, while high-level features such as edges, shapes, and textures are extracted from image data, which improves the data extraction rate;
Based on the extracted features, a deep learning model is constructed and trained, and the model is optimized on a large-scale data set, which ensures efficient detection of sensitive data;
A dynamic model updating mechanism is introduced, and the model is adjusted and optimized online according to real-time data feedback, so as to adapt to changes and emerging patterns in the data.
The foregoing has shown and described the basic principles, principal features, and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and descriptions merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.