
CN119961216A - Archives management method based on AI and encrypted storage - Google Patents


Info

Publication number
CN119961216A
CN119961216A (application CN202510435041.1A)
Authority
CN
China
Prior art keywords
data
archive
decryption
algorithm
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510435041.1A
Other languages
Chinese (zh)
Inventor
雷敏
曹洁
宋会侠
闫斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
INSTITUTE OF GEOLOGY CHINESE ACADEMY OF GEOLOGICAL SCIENCES
Original Assignee
INSTITUTE OF GEOLOGY CHINESE ACADEMY OF GEOLOGICAL SCIENCES
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by INSTITUTE OF GEOLOGY CHINESE ACADEMY OF GEOLOGICAL SCIENCES filed Critical INSTITUTE OF GEOLOGY CHINESE ACADEMY OF GEOLOGICAL SCIENCES
Priority to CN202510435041.1A priority Critical patent/CN119961216A/en
Publication of CN119961216A publication Critical patent/CN119961216A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/11 - File system administration, e.g. details of archiving or snapshots
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 - Indexing; Data structures therefor; Storage structures
    • G06F 16/316 - Indexing structures
    • G06F 16/319 - Inverted lists
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 - Indexing; Data structures therefor; Storage structures
    • G06F 16/316 - Indexing structures
    • G06F 16/322 - Trees
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 - Protecting data
    • G06F 21/602 - Providing cryptographic facilities or services
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 - Protecting data
    • G06F 21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 - Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Bioethics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of archive management and discloses an archive management method based on AI and encrypted storage. The method preprocesses original archive information with a natural language processing algorithm to generate structured archive data, and performs multi-level classification with a dynamic classification tree model. The data are encrypted with a hybrid encryption technique to build encrypted archive data blocks, which are stored in a distributed database. A multimodal retrieval model retrieves data in parallel, a dynamic decryption strategy decrypts the data according to user permissions and access scenes, and the data are finally transmitted over a secure channel. The system further provides archive data update and version control, access logging and anomaly detection, and cross-organization archive sharing. The invention improves the efficiency, security and intelligence of archive management and effectively addresses the shortcomings of traditional archive management.

Description

Archive management method based on AI and encrypted storage
Technical Field
The invention relates to the technical field of archive management, and in particular to an archive management method based on AI and encrypted storage.
Background
In the digital age, archive management faces unprecedented challenges and opportunities. With the rapid development of information technology, the volume of archives of all kinds is growing explosively, and the traditional archive management model can hardly meet modern demands for efficiency, security and intelligence. From a data-processing perspective, traditional archive management lacks intelligent preprocessing capabilities. Original archive information comes in many formats with complex content and contains large amounts of redundant information and noisy data. For example, in the digitization of some historical archives the scanned text suffers from disordered formatting and character-recognition errors; manual processing is not only inefficient but also prone to omissions. Moreover, faced with massive archive texts, traditional methods struggle to extract key information accurately and cannot quickly build effective data indexes, making subsequent retrieval and use extremely inconvenient.
The importance of archival data for storage security is self-evident: once the sensitive information it contains is leaked, individuals, enterprises or institutions may suffer severe losses. However, traditional storage mostly relies on a centralized database, which carries a single point of failure and leaves the data vulnerable to hacking, malicious tampering and natural disasters. For example, if a centralized database is breached, a large amount of archive data may be stolen or destroyed. In addition, encryption has in the past relied on a single algorithm, which makes it hard to balance encryption efficiency against security and cannot satisfy the differentiated encryption needs of different types of archive data.
Retrieval efficiency is another pain point of traditional archive management. When a user needs to find a specific archive, retrieval based on simple keyword matching often produces inaccurate and incomplete results because it cannot understand semantic associations. For example, when searching for documents on "applications of artificial intelligence in the medical field", entering only those keywords may miss important material that uses closely related wording. Moreover, as the number of archives grows, retrieval slows down and work efficiency suffers.
Cross-organization archive sharing faces many obstacles under the traditional model. Different institutions lack unified sharing standards and security mechanisms, and problems such as confused permission management and lack of trust often restrict data sharing. For example, in the medical industry, patient archives held by different hospitals are difficult to share, hindering the coordinated development of medical services; in government affairs, archive information flows poorly between departments, which affects the efficiency of handling public affairs.
Disclosure of Invention
The present invention aims to provide an archive management method based on AI and encrypted storage, so as to solve the problems set forth in the background above.
To achieve the above purpose, the invention provides the following technical solution. The archive management method based on AI and encrypted storage comprises the following steps:
Preprocessing original archive information with a natural language processing algorithm, the preprocessing including text word segmentation, semantic entity recognition and redundant-information elimination, to generate structured archive data; performing multi-level classification of the structured archive data based on a dynamic classification tree model, the dynamic classification tree model constructing classification nodes through a hierarchical clustering algorithm and dynamically adjusting classification levels based on archive topic similarity;
Encrypting the classified structured archive data with a hybrid encryption technique that combines a symmetric encryption algorithm and an asymmetric encryption algorithm, wherein archive metadata is encrypted with the AES algorithm and archive access-permission keys are encrypted with the RSA algorithm;
Storing the encrypted archive data blocks in a distributed database, wherein the distributed database uses a sharded storage mechanism, distributes the data blocks across shards based on archive classification labels and access frequencies, and introduces a blockchain-based metadata index table that realizes position mapping and integrity verification of the data blocks through smart contracts;
Performing parallel retrieval over the distributed database with a multimodal retrieval model according to the keyword information in a user access request, the multimodal retrieval model building a joint index that combines text keywords, semantic vectors and classification labels and matching the positions of target archive data blocks through an approximate nearest-neighbor search algorithm;
Decrypting the retrieved encrypted archive data blocks based on a dynamic decryption strategy, the dynamic decryption strategy adaptively selecting a decryption flow (single-layer, multi-layer or time-limited decryption) according to the user permission level and the access scene, generating decrypted archive data and transmitting it to the user terminal over a secure channel.
Preferably, preprocessing the original archive information by a natural language processing algorithm includes:
constructing a text cleaning model, wherein the text cleaning model adopts a mode of combining regular expression matching and deep learning to identify and remove irrelevant characters, repeated sections and format noise in an archive text;
carrying out semantic entity recognition with an attention-enhanced bidirectional long short-term memory network model, extracting key entities, timestamps and topic labels from the archive text, and generating an entity-relationship graph;
Redundant node pruning is carried out on the entity-relation map based on a graph convolution network, and low-importance nodes and edges are removed through calculating semantic similarity and connection weight among the nodes, so that simplified semantic structure data is generated;
And carrying out alignment mapping on the semantic structure data and the original archive text to generate the structured archive data containing metadata description.
Preferably, the multi-stage classification of the structured archive data based on the dynamic classification tree model includes:
initializing the classification tree root node, and constructing an archive feature representation with TF-IDF weighted word vectors and Doc2Vec document vectors;
generating an initial classification hierarchy with a hierarchical agglomerative clustering algorithm, optimizing the number of clusters based on the silhouette coefficient, and calculating the topic similarity between adjacent levels;
Introducing a dynamic splitting and merging mechanism, and triggering node splitting operation when the similarity of the levels is lower than a preset threshold value due to the newly added file data;
A unique classification code is generated for each classification node, the classification code comprising a hierarchical path, a topic identifier, and a version number, and the classification code is embedded into metadata of the structured archive data.
Preferably, the encrypting the classified structured archive data by using the hybrid encryption technology includes:
the file content is subjected to block processing, encryption priority is divided according to the size and the sensitivity of the data blocks, the high-priority data blocks are encrypted by using AES-256, and the low-priority data blocks are encrypted by using AES-128;
generating a random initialization vector and key derivation parameters for each data block, and generating a data block exclusive encryption key based on a PBKDF2 algorithm and a user master key;
Encrypting the file metadata by adopting an RSA-OAEP algorithm, generating an encrypted metadata signature, and embedding a public key hash value into the head of a data block;
And constructing a hash chain verification code, and performing iterative computation on the data block content and the adjacent data block hash value through an SHA-3 algorithm to generate a chain integrity verification code.
Preferably, storing the encrypted archive data block in the distributed database comprises:
Constructing a data slicing strategy based on the classification labels and the access frequencies, distributing high-frequency access data blocks to low-delay storage nodes, and distributing low-frequency access data blocks to high-capacity storage nodes;
Deploying metadata index intelligent contracts in a blockchain network, wherein the intelligent contracts record data block hash values, storage positions and access right strategies and realize privacy protection position inquiry through zero knowledge proof;
and performing redundancy coding on the data blocks by adopting an erasure coding technology, storing the coded data blocks into a plurality of geographically distributed storage nodes in a scattered manner, and maintaining the availability of the data through periodic heartbeat detection.
Preferably, the parallel search using the multimodal search model includes:
Constructing a joint index structure, wherein the joint index comprises an inverted index, a vector index and a graph index, and the text keywords, the semantic vectors and the classification labels are respectively corresponding to the joint index;
Carrying out semantic expansion on the user query keywords, and generating a synonym set and an associated concept set based on a Word2Vec model and a knowledge graph;
Carrying out quick search on the vector index by adopting a quantization approximate nearest neighbor algorithm, and constructing a multi-level search tree through product quantization and hierarchical clustering;
and integrating the accurate matching result of the inverted index, the similarity ordering result of the vector index and the associated path result of the graph index, generating a comprehensive retrieval score, and screening the target data block identifiers of the N top ranks.
Preferably, decrypting the encrypted archive data block based on the dynamic decryption policy includes:
The intermediate level authority user triggers the multi-layer decryption process and needs to verify the temporary token and the dynamically generated session key in sequence;
under the time-limited decryption scene, binding timeliness parameters for a decryption key, adopting an encryption scheme based on a time lock, and automatically invalidating the key after the preset time is exceeded;
And realizing distributed decryption through a secure multi-party computing protocol, dividing a decryption task into a plurality of participating nodes, and cooperatively completing decryption operation by each node based on a secret sharing protocol.
Preferably, the method further comprises the steps of archive data update and version control:
When the file data is modified, generating an increment update packet based on a difference coding algorithm, and re-encrypting an update part and reconstructing a hash chain;
recording historical versions with a version snapshot mechanism, wherein each snapshot comprises a timestamp, a modifier signature and a version difference summary, and realizing quick verification between versions through a Merkle tree structure;
Version synchronization protocols are deployed in the distributed database to coordinate data consistency among the multiple nodes based on a distributed consistency algorithm.
Preferably, the method further comprises the steps of accessing the log and anomaly detection:
Recording a detailed log of user access operation, including access time, request parameters, decryption operation and a data transmission path, and encrypting and storing the log into an independent audit database;
Constructing an anomaly detection model based on the combination of an isolation forest and a long short-term memory network, analyzing access frequency, permission-abuse patterns and decryption-failure events in the log data in real time, and generating anomaly scores;
When the anomaly score exceeds a preset threshold, an adaptive response mechanism is triggered, including temporarily freezing the account, enhancing the key rotation frequency and starting a data self-destruction protocol.
Preferably, the method further comprises the step of sharing across organization archives:
Constructing a consortium blockchain network, wherein each participating organization joins the network as a node, and defining data-sharing rules and permission policies through smart contracts;
fine granularity access control is realized by adopting an attribute-based encryption technology, user attributes are matched with a file access policy, and decryption credentials are dynamically generated;
And introducing homomorphic encryption technology in the sharing process, allowing a third party to carry out statistical analysis on the encrypted archive data under the condition of not decrypting, and returning to the requesting party after generating an aggregation result.
Compared with the prior art, the invention has the beneficial effects that:
In data processing, preprocessing the original archive information with a natural language processing algorithm greatly improves data quality. The text cleaning model accurately removes irrelevant characters, repeated paragraphs and format noise; for example, regular expressions combined with deep learning effectively handle garbled characters and typesetting errors in scanned archives. Semantic entity recognition with an attention-enhanced bidirectional long short-term memory model builds an entity-relationship graph and extracts key information accurately, laying a solid foundation for subsequent classification and retrieval. Meanwhile, redundant node pruning based on a graph convolutional network further optimizes the data structure and improves processing efficiency.
In storage security, the hybrid encryption technique provides a dual guarantee. A symmetric algorithm (such as AES) encrypts the archive content and metadata to keep the data confidential, while an asymmetric algorithm (such as RSA) encrypts the archive access-permission keys, strengthening key management. Encryption priority is assigned according to the sensitivity and size of each data block, and a stronger algorithm (such as AES-256) is used for highly sensitive data, balancing encryption efficiency against security. The hash chain verification codes and encrypted metadata signatures effectively prevent tampering and ensure data integrity. The distributed database, combined with blockchain technology, uses a sharded storage mechanism and a metadata index table to improve storage reliability, realizes position mapping and integrity verification of the data blocks through smart contracts, and reduces the risk of data loss and attack.
The retrieval efficiency is greatly improved. The multi-mode search model is combined with text keywords, semantic vectors and classification labels to construct a joint index, and the approximate nearest neighbor search algorithm is utilized to quickly and accurately match the positions of the data blocks of the target file. Through semantic expansion technology, a synonym set and an associated concept set are generated based on a Word2Vec model and a knowledge graph, so that the retrieval range is effectively enlarged, and the retrieval accuracy and the retrieval comprehensiveness are improved. For example, when searching for complex theme files, related data can be positioned more accurately, and a large amount of searching time is saved.
The dynamic decryption strategy adaptively selects the decryption flow according to the user permission level and the access scene, so that the flexibility and the security of access control are enhanced. The single-layer decryption process of the high-level authority user improves the working efficiency, and the multi-layer decryption process of the medium-level authority user meets the requirements of users at different levels on the premise of ensuring the safety. The time-limited decryption mechanism effectively prevents risks brought by long-term exposure of the secret key, and the distributed decryption realized by the secure multiparty computing protocol further ensures the security of the decryption process.
For archive data update and version control, the differential encoding algorithm generates incremental update packages, reducing the volume of data transferred and stored and improving update efficiency. The version snapshot mechanism, combined with a Merkle tree structure, allows version differences to be verified quickly and conveniently, enabling effective management and traceability of historical archive versions. The version synchronization protocol in the distributed database keeps data consistent across nodes based on a distributed consensus algorithm.
The access-log and anomaly-detection functions give the archive management system security monitoring and early-warning capabilities. User access operations are logged in detail and stored encrypted, which facilitates later audit and tracing. An anomaly detection model combining an isolation forest with a long short-term memory network analyzes access behaviour in real time, promptly spots abnormal conditions such as permission abuse and unusual access frequency, and triggers an adaptive response mechanism, including temporarily freezing the account, increasing the key rotation frequency and starting a data self-destruction protocol, so the safety of the archive data is effectively protected.
By constructing a consortium blockchain network and smart contracts, cross-organization archive sharing defines data-sharing rules and permission policies and solves the problems of trust and permission management. Attribute-based encryption realizes fine-grained access control, and homomorphic encryption allows a third party to perform statistical analysis without decrypting, which protects data privacy while promoting reasonable use of the data, fosters collaboration across institutions, and enables more efficient information sharing and cooperative work in the medical and government fields.
Drawings
FIG. 1 is a schematic diagram illustrating the operation of the archive management method according to the present invention;
FIG. 2 is a workflow diagram of archive information preprocessing;
FIG. 3 is a workflow diagram of a dynamic classification tree model multi-level classification;
FIG. 4 is a flowchart illustrating the operation of the encrypted archive data block storage.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-4, the present invention provides an AI-and encryption-storage-based archive management method, which is generally implemented as follows:
The original archive information is processed with a natural language processing algorithm, including text word segmentation, semantic entity recognition and redundant-information elimination, to generate structured archive data.
The structured archive data is classified in multiple levels by a dynamic classification tree model. The model builds classification nodes through a hierarchical clustering algorithm and dynamically adjusts the classification levels according to archive topic similarity.
A hybrid encryption technique, namely a combination of a symmetric and an asymmetric encryption algorithm, is used to encrypt the classified structured archive data: the archive metadata is encrypted with the AES algorithm, and the archive access-permission key is encrypted with the RSA algorithm. At the same time, encrypted archive data blocks are constructed, each comprising encrypted archive content, an encrypted metadata signature and a hash chain verification code.
The encrypted archive data blocks are stored in a distributed database. The database uses a sharded storage mechanism, distributes the data blocks across shards according to archive classification labels and access frequencies, and introduces a blockchain-based metadata index table that realizes position mapping and integrity verification of the data blocks through smart contracts.
According to the keyword information in a user access request, a multimodal retrieval model searches the distributed database in parallel. The model combines text keywords, semantic vectors and classification labels into a joint index and matches the positions of the target archive data blocks with an approximate nearest-neighbor search algorithm. The retrieved encrypted archive data blocks are then decrypted based on a dynamic decryption strategy, which adaptively selects the decryption flow (single-layer, multi-layer or time-limited decryption) according to the user permission level and the access scene. Finally, the decrypted archive data is transmitted to the user terminal over a secure channel.
The practice of the invention will be further described with reference to examples 1 to 6.
Example 1
The embodiment describes the specific process of preprocessing the file information in detail, and has the effect of improving the preprocessing precision and efficiency, so that the generated structured file data can reflect the file content more accurately, and a high-quality data basis is provided for the subsequent operations of classification, encryption, retrieval and the like.
When the text cleaning model is constructed, the combination of regular-expression matching and deep learning plays a key role. Regular expressions can quickly identify and remove irrelevant characters from the archive text; for example, when processing electronic documents that contain many special formatting marks, such as HTML tags and LaTeX commands, targeted regular expressions quickly strip the format characters that are irrelevant to the core content. A deep-learning model then handles the more complex cases, such as identifying repeated paragraphs and format noise. Taking a convolutional neural network (CNN) as an example, after training on a large number of archives containing different kinds of format noise, the CNN learns the characteristic patterns of the noise and can accurately detect and remove redundant blank lines, inconsistent indentation, typesetting errors and other format noise, keeping the archive text clean.
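As a minimal illustration of this cleaning step (not the patented model), the following Python sketch uses only regular expressions; the learned CNN component described above is assumed to run as a separate stage, and the patterns shown are examples rather than a complete rule set.

```python
import re

def clean_archive_text(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)          # strip HTML tags
    text = re.sub(r"\\[a-zA-Z]+\*?", " ", text)  # strip LaTeX command names
    text = re.sub(r"[{}]", "", text)             # drop leftover braces
    text = re.sub(r"[ \t]{2,}", " ", text)       # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)       # squeeze redundant blank lines
    seen, kept = set(), []
    for para in text.split("\n\n"):              # drop exact duplicate paragraphs
        p = para.strip()
        if p and p not in seen:
            seen.add(p)
            kept.append(p)
    return "\n\n".join(kept)

print(clean_archive_text("<p>Annual report</p>\n\n\n\\textbf{Annual report}\n\nBudget table"))
```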
Semantic entity recognition is carried out with an attention-enhanced bidirectional long short-term memory model (BiLSTM-Attention), which effectively extracts the key information in the archive text. Historical archives, for instance, involve complex information about many people, events and dates. Through its attention mechanism, the BiLSTM-Attention model focuses on the key parts and accurately extracts entities such as person names, event names and timestamps. At the same time, topic labels can be generated; for an archive about an ancient war, a label such as "history - war - ancient war" is produced, and an entity-relationship graph is constructed that clearly shows the associations between people and events and between events and time, for example which figure took part in which battle and when the battle took place, making the structure of the archive information much clearer.
Redundant node pruning is performed on the entity-relationship graph based on a graph convolutional network (GCN). Semantic similarity between nodes is computed with cosine similarity, sim(u, v) = (u · v) / (‖u‖ · ‖v‖), where u and v are the feature vectors of the two nodes. After the similarity between nodes has been calculated with this formula, low-importance nodes and edges are removed in combination with the connection weights. For example, in the entity-relationship graph of an enterprise archive, for nodes that are weakly described and have little relation to the core business processes, the similarity and connection weight to other nodes are calculated; if the nodes and their edges are found to contribute little to the information expressed by the whole graph, they are deleted. This produces simplified semantic structure data and improves the clarity of the graph and the efficiency of processing it.
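A small sketch of the pruning criterion follows, assuming node feature vectors have already been produced by a GCN encoder upstream; the scoring rule and the threshold value are illustrative choices, not values fixed by the invention.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # sim(u, v) = (u . v) / (||u|| * ||v||), with a small epsilon for safety
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def prune_graph(features: dict, edges: dict, threshold: float = 0.3) -> set:
    """features: node -> feature vector; edges: (a, b) -> connection weight.
    Returns the set of nodes kept after removing low-importance ones."""
    importance = {n: 0.0 for n in features}
    for (a, b), w in edges.items():
        score = w * cosine(features[a], features[b])   # weighted semantic similarity
        importance[a] = max(importance[a], score)
        importance[b] = max(importance[b], score)
    return {n for n, s in importance.items() if s >= threshold}
```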
And carrying out alignment mapping on the semantic structure data and the original archive text to generate the structured archive data containing metadata description. In this process, detailed metadata descriptions, such as information of types, sources, etc., of entities are added to each entity in the semantic structure data. Taking a medical file as an example, metadata describing 'patient basic information-identity identification' is added for 'patient name', and 'medical record-diagnosis time stamp' and the like are added for 'disease diagnosis time', so that the structured file data not only contains key information, but also has clear metadata labeling, and is convenient for subsequent management and utilization.
Example 2
The embodiment is developed around multistage classification based on the dynamic classification tree model, and has the effects of realizing intelligent and dynamic classification of the structured archive data, improving the accuracy and adaptability of archive classification, and facilitating quick searching and management of archives by users.
And when initializing the classification tree root node, constructing file characteristic representation by adopting the TF-IDF weighted word vector and the Doc2Vec document vector. For large amounts of news archive data, the TF-IDF algorithm may calculate the importance of each word in a document, highlighting words that frequently occur in a particular document but are not common in the entire document collection. For example, in news about scientific and technological achievements, the TF-IDF values of words such as "quantum computation", "artificial intelligent chip" and the like are high, which indicates that the words have important significance to the topic expression of the document. The Doc2Vec model maps the whole document into a vector with fixed dimension, and retains the whole semantic information of the document. By combining the two vectors, the characteristics of the file can be more comprehensively described, and richer data support is provided for subsequent classification.
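A hedged sketch of building the combined feature representation is shown below, assuming scikit-learn and gensim are available; the corpus, vector sizes and epoch counts are placeholders for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "quantum computation breakthrough reported by research institute",
    "artificial intelligent chip powers new smartphone line",
    "city council publishes annual budget archive",
]

# TF-IDF weighted term vectors (term-level importance)
tfidf = TfidfVectorizer().fit_transform(corpus).toarray()

# Doc2Vec document vectors (whole-document semantics)
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]
d2v = Doc2Vec(tagged, vector_size=32, min_count=1, epochs=40)
docvecs = np.vstack([d2v.infer_vector(doc.split()) for doc in corpus])

# Concatenate both views into one archive feature representation
features = np.hstack([tfidf, docvecs])
print(features.shape)   # (3, number_of_terms + 32)
```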
An initial classification hierarchy is generated by a hierarchical agglomerative clustering algorithm, and the number of clusters is optimized with the silhouette coefficient, computed as s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average distance from sample i to the other samples in the same cluster and b(i) is the minimum average distance from sample i to the samples of any other cluster. When a batch of educational archives is clustered, different cluster counts are tried in turn and the silhouette coefficient is calculated for each; the closer the coefficient is to 1, the better the clustering, and the corresponding count is taken as the optimal number of clusters. At the same time, the topic similarity between adjacent levels is calculated, and methods such as KL divergence can be used to measure the degree of topic difference between levels and keep the classification reasonable.
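A sketch of selecting the cluster count by silhouette score with hierarchical agglomerative clustering (scikit-learn); the candidate range is illustrative, and X is assumed to be the archive feature matrix built in the previous step.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def best_cluster_count(X: np.ndarray, k_range=range(2, 10)):
    best_k, best_s = None, -1.0
    for k in k_range:
        if k >= len(X):                       # silhouette needs k < number of samples
            break
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        s = silhouette_score(X, labels)       # closer to 1 means better-separated clusters
        if s > best_s:
            best_k, best_s = k, s
    return best_k, best_s
```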
A dynamic splitting and merging mechanism is introduced, and when archive data is newly added, the similarity between the archive data and the existing hierarchy is calculated. If the level similarity is lower than a preset threshold (e.g., set to 0.5), the node splitting operation is triggered. For example, in an e-commerce product archive, when a brand new type of product archive, such as a "virtual reality device", is newly added, no proper category is contained in the original classification hierarchy, at this time, a new child node "virtual reality device" is created under the related electronic product classification through node splitting operation, and the archive is reasonably classified. Otherwise, when the similarity between the layers is higher than a preset threshold, the node merging operation is triggered, the classification structure is optimized, redundancy is reduced, and the compactness and logic of classification are improved.
A unique classification code is generated for each classification node, comprising a hierarchical path, a topic identifier and a version number. For example, a hierarchical path of "1-2-3" means the node is the 3rd grandchild node under the 2nd child node of the 1st layer below the root; the topic identifier "electronic product - mobile phone - smartphone" defines the node's topic; and the version number "V1.1" records changes to the classification node. The classification codes are embedded into the metadata of the structured archive data so that an archive's category can be located and identified quickly and accurately during subsequent storage, retrieval and management.
Example 3
After the file content is segmented, the encryption priority is divided according to the size and the sensitivity of the data block. For example, when processing a customer profile of a financial institution, data blocks containing sensitive information such as a customer identification number, a bank card password, etc. are classified into high priority, and an AES-256 encryption algorithm is used, while some descriptive information such as a customer occupation, interest, etc. data blocks are classified into low priority, and an AES-128 encryption algorithm is used. Thus, the high security of the sensitive data can be ensured, and the encryption efficiency can be improved to a certain extent.
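A hedged sketch of this priority-based block encryption using AES-GCM from the Python cryptography package; the sensitivity decision itself is assumed to be made upstream, and the sample payloads are invented for illustration.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_block(block: bytes, high_priority: bool):
    # high-priority blocks get a 256-bit key, low-priority blocks a 128-bit key
    key = AESGCM.generate_key(bit_length=256 if high_priority else 128)
    nonce = os.urandom(12)                          # random 96-bit IV per block
    ciphertext = AESGCM(key).encrypt(nonce, block, None)
    return {"key": key, "nonce": nonce, "ciphertext": ciphertext}

sensitive = encrypt_block(b"id number and bank card PIN ...", high_priority=True)
ordinary = encrypt_block(b"customer hobby: photography", high_priority=False)
```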
A random initialization vector and key derivation parameters are generated for each data block, and a data block dedicated encryption key is generated based on a PBKDF2 algorithm in combination with a user master key. The PBKDF2 algorithm increases the security of the key through repeated iterative computation. In practical application, the user master key generates a data block exclusive encryption key with high strength and difficult cracking by combining a PBKDF2 algorithm with a randomly generated salt value and iteration times, and provides independent encryption protection for each data block.
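A sketch of per-block key derivation with PBKDF2 (HMAC-SHA256) from the cryptography package; the iteration count and salt length are illustrative choices rather than values specified by the invention.

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def derive_block_key(master_key: bytes, iterations: int = 200_000):
    salt = os.urandom(16)                       # random per-block salt, stored with the block
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32,
                     salt=salt, iterations=iterations)
    return kdf.derive(master_key), salt         # 256-bit block-exclusive key + its salt

block_key, salt = derive_block_key(b"user-master-key-material")
```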
And encrypting the file metadata by adopting an RSA-OAEP algorithm, generating an encrypted metadata signature, and embedding the public key hash value into the head of the data block. The RSA-OAEP algorithm enhances the security of encryption through a filling mechanism in the encryption process. For example, when metadata such as creator of encrypted file, creation time and the like are encrypted by using RSA-OAEP algorithm, encrypted metadata signature is generated, so that the integrity and the non-tamper resistance of the metadata are ensured. The public key hash value is embedded into the head of the data block, so that the validity of the public key can be conveniently and rapidly verified during decryption.
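A sketch of RSA-OAEP metadata encryption and of embedding a public-key fingerprint, using the cryptography package; the key size and metadata fields are illustrative.

```python
import hashlib
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa, padding

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

metadata = b'{"creator": "archive office", "created": "2025-01-08"}'
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ciphertext = public_key.encrypt(metadata, oaep)

# public-key fingerprint that would be embedded in the data block header
pub_der = public_key.public_bytes(serialization.Encoding.DER,
                                  serialization.PublicFormat.SubjectPublicKeyInfo)
pubkey_hash = hashlib.sha256(pub_der).hexdigest()

plaintext = private_key.decrypt(ciphertext, oaep)   # round-trip check
```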
A hash chain verification code is constructed by iteratively computing, with the SHA-3 algorithm, over each data block's content together with the hash of the adjacent block, producing a chained integrity verification code. Assuming data blocks B1, B2, ..., Bn, first H1 = SHA-3(B1) is calculated, then H2 = SHA-3(B2 || H1), and so on up to Hn = SHA-3(Bn || H(n-1)), which is the final hash chain verification code. In this way, changing the content of any one data block changes every subsequent verification code, so tampering during storage or transmission is effectively detected.
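A minimal sketch of the chained SHA-3 verification code described above; the block contents are placeholders.

```python
import hashlib

def hash_chain(blocks: list[bytes]) -> list[str]:
    codes, prev = [], b""
    for block in blocks:
        digest = hashlib.sha3_256(block + prev).hexdigest()  # H_i = SHA-3(B_i || H_{i-1})
        codes.append(digest)
        prev = bytes.fromhex(digest)
    return codes            # codes[-1] is the final chain verification code

codes = hash_chain([b"block-1", b"block-2", b"block-3"])
```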
Example 4
And constructing a data slicing strategy based on the classification labels and the access frequencies, distributing the high-frequency access data blocks to the low-delay storage nodes, and distributing the low-frequency access data blocks to the high-capacity storage nodes. For example, in a sales archive management system of an enterprise, the recent sales order data access frequency is high, the data blocks are distributed to low-delay storage nodes equipped with high-performance Solid State Disks (SSDs), so that users can quickly acquire the latest sales information, and the historical sales data access frequency is low before many years, and the data blocks are distributed to high-capacity mechanical hard disk storage nodes, so that storage resources are fully utilized, and storage cost is reduced.
And deploying a metadata index intelligent contract in the blockchain network, wherein the intelligent contract records the hash value of the data block, the storage position and the access authority strategy, and realizes the position inquiry of privacy protection through zero knowledge proof. Taking a medical archive sharing scenario as an example, different medical institutions serve as blockchain nodes, and hash values of medical archive data blocks of patients, storage positions and access right strategies of the medical institutions on the data are recorded in an intelligent contract. When a medical institution needs to inquire the archive storage position of a specific patient, the medical institution verifies whether the medical institution has inquiry authority or not under the condition of not revealing any other information by a zero knowledge proof technology, and acquires the storage position, so that the privacy and the data safety of the patient are protected.
And performing redundancy coding on the data blocks by adopting an erasure coding technology, storing the coded data blocks into a plurality of geographically distributed storage nodes in a scattered manner, and maintaining the availability of the data through periodic heartbeat detection. It is assumed that one data block is encoded into a plurality of redundant blocks, which are stored on storage nodes in different regions, respectively. When a certain node fails, other nodes can recover lost data through an erasure coding algorithm. Meanwhile, the regular heartbeat detection mechanism can discover node faults in time, inform the system of data restoration and node replacement, ensure that data is always in a usable state, and improve the fault tolerance of the system.
Example 5
The embodiment elaborates the specific process of parallel retrieval by using the multi-mode retrieval model, and has the effects of improving the accuracy and speed of retrieving the target archive data block in the distributed database, and meeting the requirement of users for rapidly acquiring the required archive information.
A joint index structure is constructed, comprising an inverted index, a vector index and a graph index, corresponding respectively to text keywords, semantic vectors and classification labels. In a system holding a large number of academic documents, the inverted index quickly locates documents containing specific keywords: given the query "artificial intelligence algorithm", it immediately finds all document data blocks containing those keywords. The vector index finds documents semantically similar to the query by computing similarity over semantic vectors, giving more comprehensive results for synonyms or related queries. The graph index builds association paths from the classification labels; in a document classification system, for example, it can reach related documents in different subcategories under the same topic, broadening the retrieval scope and improving completeness.
And carrying out semantic expansion on the user query keywords, and generating a synonym set and an associated concept set based on the Word2Vec model and the knowledge graph. When the user inputs the electric automobile to search, the Word2Vec model can find similar words, such as a new energy automobile, a pure electric automobile and the like, and the knowledge graph further provides related concepts, such as a battery technology, a charging pile and the like. The expanded vocabulary and concepts are added into the search condition, so that the related archive data blocks can be searched more comprehensively, and important information is prevented from being missed.
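A sketch of Word2Vec-based query expansion with gensim; the toy corpus only makes the snippet runnable (in practice the model would be trained on the archive corpus), and the knowledge-graph expansion mentioned above is assumed to be a separate component.

```python
from gensim.models import Word2Vec

sentences = [
    ["electric", "vehicle", "battery", "charging", "pile"],
    ["new", "energy", "vehicle", "battery", "technology"],
    ["pure", "electric", "vehicle", "charging", "station"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

def expand_query(term: str, topn: int = 3) -> list[str]:
    if term not in model.wv:
        return [term]
    # original term plus its nearest neighbours in the embedding space
    return [term] + [w for w, _ in model.wv.most_similar(term, topn=topn)]

print(expand_query("vehicle"))
```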
A quantized approximate nearest-neighbor algorithm performs fast search over the vector index, and a multi-level search tree is built through product quantization and hierarchical clustering. Product quantization decomposes a high-dimensional vector into several low-dimensional sub-vectors for quantization, reducing storage space and computation; for example, a 100-dimensional vector can be decomposed into ten 10-dimensional sub-vectors. Hierarchical clustering builds a multi-level search tree that is descended from the root node at query time, quickly narrowing the search range and improving efficiency. In this way, when handling large-scale vector indexes, the vectors most similar to the query vector can be found quickly, locating the target archive data blocks.
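A hedged sketch of one common way to realize this combination, inverted-file product quantization (IVF-PQ), assuming the faiss library is installed; the dimensions, cell counts and sub-vector counts are illustrative, and the random data stands in for archive semantic vectors.

```python
import numpy as np
import faiss   # assumed available

d, nlist, m = 128, 16, 8                         # vector dim, coarse cells, PQ sub-vectors
xb = np.random.rand(10000, d).astype("float32")  # stand-in for archive semantic vectors

quantizer = faiss.IndexFlatL2(d)                 # coarse quantizer (first search level)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per PQ code
index.train(xb)
index.add(xb)

index.nprobe = 4                                 # coarse cells visited per query
distances, ids = index.search(xb[:1], 5)         # top-5 approximate nearest neighbours
print(ids[0])
```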
And integrating the accurate matching result of the inverted index, the similarity ordering result of the vector index and the associated path result of the graph index, generating a comprehensive retrieval score, and screening the target data block identifiers of the N top ranks. For example, when retrieving technical documents, the inverted index finds documents containing keywords, the vector index sorts the documents according to semantic similarity, and the graph index finds documents under related topics. And setting weights according to the importance of different index results, calculating comprehensive retrieval scores, selecting the top N file data block identifiers with the highest scores, and preferentially presenting files which are most in line with the requirements of users to the users.
Example 6
The encrypted archive data blocks are decrypted based on the dynamic decryption strategy, with decryption authority divided by user permission level. In an enterprise's internal archive management system, high-level users such as senior management trigger a single-layer decryption flow that uses the master key directly, letting them obtain important information quickly; intermediate-level users such as department managers trigger a multi-layer decryption flow that verifies a temporary token and a dynamically generated session key in turn, increasing the security of decryption. In a time-limited decryption scene, timeliness parameters are bound to the decryption key with a time-lock encryption scheme; for example, the decryption key of a given archive data block is valid for 30 minutes and automatically expires after the preset time, preventing the security risk of a leaked key. Distributed decryption is realized through a secure multi-party computation protocol: the decryption task is split across several participating nodes, which cooperatively complete the decryption based on a secret-sharing protocol, ensuring data security while maintaining decryption efficiency.
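A plain-Python sketch of the flow selection and the time-limited key; the permission names, scene labels and the 30-minute validity are illustrative, and the underlying cipher operations are omitted.

```python
import time

def select_flow(permission: str, scene: str) -> str:
    if permission == "high":
        return "single-layer"          # master key, direct decryption
    if scene in ("external", "time-limited"):
        return "time-limited"          # key bound to a validity window
    return "multi-layer"               # temporary token + session key, verified in turn

class TimeLimitedKey:
    def __init__(self, key: bytes, ttl_seconds: int = 1800):
        self.key = key
        self.expires_at = time.time() + ttl_seconds
    def get(self) -> bytes:
        if time.time() > self.expires_at:
            raise PermissionError("decryption key expired")
        return self.key

flow = select_flow("intermediate", "internal")      # -> "multi-layer"
k = TimeLimitedKey(b"\x00" * 32, ttl_seconds=1800)  # 30-minute validity
```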
For archive data update and version control, when archive data is modified an incremental update package is generated with a differential encoding algorithm, and the updated part is re-encrypted and the hash chain rebuilt. For example, in the document archive management of a software development project, after each code update only the changed part is recorded through differential encoding, producing an incremental update package and reducing data transmission and storage. At the same time the updated part is re-encrypted to keep the data secure, and the hash chain verification code is reconstructed to preserve integrity. Historical versions are recorded with a version snapshot mechanism; each snapshot contains a timestamp, a modifier signature and a version difference summary, and fast verification between versions is achieved through a Merkle tree structure. For version synchronization, a protocol based on a distributed consensus algorithm (such as Paxos) is deployed in the distributed database to coordinate data consistency across nodes and keep the archive data versions on different nodes identical.
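A sketch of the Merkle-root comparison used to verify version snapshots quickly; the chunking and the SHA-3 hash follow the description above, while the sample sections are invented.

```python
import hashlib

def merkle_root(chunks: list[bytes]) -> str:
    level = [hashlib.sha3_256(c).hexdigest() for c in chunks] or [hashlib.sha3_256(b"").hexdigest()]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])                      # duplicate last node on odd levels
        level = [hashlib.sha3_256((a + b).encode()).hexdigest()
                 for a, b in zip(level[0::2], level[1::2])]
    return level[0]

v1 = [b"section-1", b"section-2", b"section-3"]
v2 = [b"section-1", b"section-2 (edited)", b"section-3"]
print(merkle_root(v1) == merkle_root(v2))   # False -> the versions differ
```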
For access logging and anomaly detection, a detailed log of user access operations is recorded, including access time, request parameters, decryption operations and data transmission paths, and the log is stored encrypted in an independent audit database. In a government archive management system, every access operation is recorded in detail, and encrypting the stored log keeps it from being tampered with. An anomaly detection model is built by combining an isolation forest with a long short-term memory network; it analyzes access frequency, permission-abuse patterns and decryption-failure events in the log data in real time and produces anomaly scores. When a score exceeds a preset threshold, an adaptive response mechanism is triggered: the account can be temporarily frozen to stop an illegitimate user from continuing to access data, the key rotation frequency is increased to harden the system, and in extreme cases a data self-destruction protocol is started to keep important data from being exposed.
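A sketch of the isolation-forest half of the anomaly detector (scikit-learn); the log features, synthetic data and threshold are illustrative, and the LSTM sequence model described above is assumed to run alongside it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# columns: accesses per hour, decryption failures, distinct archives touched
normal = np.random.RandomState(0).poisson([5, 0.2, 3], size=(500, 3))
suspect = np.array([[80, 6, 40], [120, 9, 55]])          # bursty, failure-heavy access

model = IsolationForest(contamination=0.02, random_state=0).fit(normal)
scores = -model.score_samples(np.vstack([normal[:3], suspect]))  # higher = more anomalous

THRESHOLD = 0.6                                           # illustrative threshold
for s in scores:
    if s > THRESHOLD:
        print(f"anomaly score {s:.2f} -> trigger adaptive response")
```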
For cross-organization archive sharing, a consortium blockchain network is constructed in which each participating organization joins as a node, and data-sharing rules and permission policies are defined through smart contracts. In a cross-hospital sharing scenario in the medical industry, different hospitals serve as consortium chain nodes, and a smart contract specifies which archives may be shared and each hospital's access rights to different types of archives. Fine-grained access control is realized with attribute-based encryption: user attributes are matched against the archive access policy and decryption credentials are generated dynamically. For a patient's medical record, for example, decryption credentials are produced from the doctor's attributes, such as hospital, department and title, together with the confidentiality level of the record, so only a doctor who meets the conditions can access that record. Homomorphic encryption is introduced into the sharing process so that a third party can run statistical analyses over the encrypted archive data without decrypting it; in medical big-data research, a third-party institution can compute disease-incidence statistics on encrypted medical records without obtaining patients' private information, and the aggregated result is returned to the requester, protecting patient privacy while still extracting value from the data.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. An archive management method based on AI and encrypted storage, characterized by comprising the following steps:
Preprocessing original archive information with a natural language processing algorithm, the preprocessing including text word segmentation, semantic entity recognition and redundant-information elimination, to generate structured archive data; performing multi-level classification of the structured archive data based on a dynamic classification tree model, the dynamic classification tree model constructing classification nodes through a hierarchical clustering algorithm and dynamically adjusting classification levels based on archive topic similarity;
Encrypting the classified structured archive data with a hybrid encryption technique that combines a symmetric encryption algorithm and an asymmetric encryption algorithm, wherein archive metadata is encrypted with the AES algorithm and archive access-permission keys are encrypted with the RSA algorithm;
Storing the encrypted archive data blocks in a distributed database, wherein the distributed database uses a sharded storage mechanism, distributes the data blocks across shards based on archive classification labels and access frequencies, and introduces a blockchain-based metadata index table that realizes position mapping and integrity verification of the data blocks through smart contracts;
Performing parallel retrieval over the distributed database with a multimodal retrieval model according to the keyword information in a user access request, the multimodal retrieval model building a joint index that combines text keywords, semantic vectors and classification labels and matching the positions of target archive data blocks through an approximate nearest-neighbor search algorithm;
Decrypting the retrieved encrypted archive data blocks based on a dynamic decryption strategy, the dynamic decryption strategy adaptively selecting a decryption flow (single-layer, multi-layer or time-limited decryption) according to the user permission level and the access scene, generating decrypted archive data and transmitting it to the user terminal over a secure channel.
2. The method of claim 1, wherein preprocessing the original archive information by a natural language processing algorithm comprises:
constructing a text cleaning model that combines regular expression matching and deep learning to identify and remove irrelevant characters, repeated passages and format noise in the archive text;
performing semantic entity recognition with an attention-enhanced bidirectional long short-term memory (BiLSTM) network model, extracting key entities, timestamps and topic labels from the archive text, and generating an entity-relation graph;
pruning redundant nodes of the entity-relation graph with a graph convolutional network, removing low-importance nodes and edges by calculating semantic similarity and connection weights between nodes, and generating simplified semantic structure data;
aligning and mapping the semantic structure data to the original archive text to generate the structured archive data containing metadata descriptions.
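A minimal illustrative sketch of the cleaning and extraction steps is given below; simple regular expressions stand in for the claimed BiLSTM entity recognizer and graph-convolution pruning, and all function names and patterns are assumptions of this sketch.

# Illustrative preprocessing sketch: regex-based cleaning plus a toy
# entity/timestamp extraction pass (stand-ins for the claimed models).
import re

def clean_text(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)           # strip markup noise
    text = re.sub(r"[^\w\s.,:;()/-]", " ", text)  # drop irrelevant characters
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace

def extract_entities(text: str) -> dict:
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
    # Stand-in for semantic entity recognition: capitalized tokens as candidates.
    candidates = re.findall(r"\b[A-Z][a-zA-Z]{2,}\b", text)
    return {"timestamps": dates, "entity_candidates": sorted(set(candidates))}

record = clean_text("<p>Survey   Report   2024-11-02  by  Field Team A</p>")
print(record, extract_entities(record))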
3. The method of claim 1, wherein classifying the structured archive data in multiple stages based on a dynamic classification tree model comprises:
initializing the classification tree root node, and constructing archive feature representations with TF-IDF weighted word vectors and Doc2Vec document vectors;
generating an initial classification hierarchy with a hierarchical agglomerative clustering algorithm, optimizing the number of clusters based on the silhouette coefficient, and calculating topic similarity between adjacent levels;
introducing a dynamic splitting and merging mechanism that triggers a node splitting operation when newly added archive data lowers the level similarity below a preset threshold;
generating a unique classification code for each classification node, the classification code comprising a hierarchical path, a topic identifier and a version number, and embedding the classification code into the metadata of the structured archive data.
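The clustering and silhouette-based selection can be illustrated with scikit-learn as follows; the toy documents, the cluster-count range and the flat clustering (rather than a full dynamic tree) are assumptions of this sketch.

# Illustrative sketch: TF-IDF features + agglomerative clustering, with the
# number of clusters chosen by silhouette score (a simplified stand-in for
# the dynamic classification tree construction).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

docs = ["drill core sample log", "drill core photo archive",
        "personnel transfer record", "personnel training record",
        "seismic survey report", "gravity survey report"]

X = TfidfVectorizer().fit_transform(docs).toarray()

best_k, best_labels, best_score = None, None, -1.0
for k in range(2, 5):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_labels, best_score = k, labels, score

print("chosen clusters:", best_k, "labels:", best_labels)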
4. The method of claim 1, wherein encrypting the classified structured archive data with the hybrid encryption technique comprises:
partitioning the archive content into blocks, assigning encryption priorities according to data block size and sensitivity, encrypting high-priority data blocks with AES-256 and low-priority data blocks with AES-128;
generating a random initialization vector and key derivation parameters for each data block, and deriving a block-specific encryption key from a user master key with the PBKDF2 algorithm;
encrypting the archive metadata with the RSA-OAEP algorithm, generating an encrypted-metadata signature, and embedding the public key hash value in the data block header;
constructing a hash chain verification code by iteratively computing, with the SHA-3 algorithm, over the data block content and the hash value of the adjacent data block to generate a chained integrity verification code.
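An illustrative sketch of per-block key derivation and the chained integrity code, using only Python's standard hashlib module, is shown below; iteration counts, salts and block contents are toy values assumed for this sketch.

# Illustrative sketch of per-block key derivation (PBKDF2) and a SHA-3 hash
# chain over adjacent blocks; parameters are toy values, not the claimed ones.
import hashlib, os

master_key = b"user-master-key-material"
blocks = [b"block-0 contents", b"block-1 contents", b"block-2 contents"]

def derive_block_key(master: bytes, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", master, salt, iterations=200_000, dklen=32)

chain = []
prev_digest = b"\x00" * 32
for block in blocks:
    salt = os.urandom(16)
    block_key = derive_block_key(master_key, salt)           # per-block encryption key
    digest = hashlib.sha3_256(prev_digest + block).digest()  # chained integrity code
    chain.append({"salt": salt, "key": block_key, "digest": digest})
    prev_digest = digest

print("chain tail:", chain[-1]["digest"].hex())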
5. The method of claim 1, wherein storing the encrypted archive data block to the distributed database comprises:
constructing a data sharding strategy based on the classification labels and access frequencies, assigning frequently accessed data blocks to low-latency storage nodes and rarely accessed data blocks to high-capacity storage nodes;
deploying a metadata-index smart contract in a blockchain network, wherein the smart contract records data block hash values, storage locations and access permission policies, and supports privacy-preserving location queries through zero-knowledge proofs;
applying erasure coding to the data blocks for redundancy, distributing the coded data blocks across a plurality of geographically dispersed storage nodes, and maintaining data availability through periodic heartbeat detection.
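A simplified placement routine illustrating frequency- and label-aware shard assignment is sketched below; the node names, the access-frequency threshold and the hash-based slot choice are assumptions of this sketch, and the blockchain index and erasure coding of the claim are not reproduced here.

# Illustrative shard-placement sketch: route blocks to node pools by access
# frequency and classification label.
import hashlib

LOW_LATENCY_NODES = ["ssd-node-1", "ssd-node-2"]
HIGH_CAPACITY_NODES = ["hdd-node-1", "hdd-node-2", "hdd-node-3"]

def place_block(block_id: str, label: str, accesses_per_day: float) -> str:
    pool = LOW_LATENCY_NODES if accesses_per_day >= 10 else HIGH_CAPACITY_NODES
    slot = int(hashlib.sha256(f"{label}:{block_id}".encode()).hexdigest(), 16)
    return pool[slot % len(pool)]

print(place_block("blk-0007", "geology/survey", accesses_per_day=42.0))
print(place_block("blk-0008", "personnel/history", accesses_per_day=0.3))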
6. The method of claim 1, wherein performing parallel retrieval using a multi-modal retrieval model comprises:
constructing a joint index structure, wherein the joint index comprises an inverted index, a vector index and a graph index corresponding respectively to the text keywords, the semantic vectors and the classification labels;
performing semantic expansion of the user query keywords, and generating a synonym set and an associated-concept set based on a Word2Vec model and a knowledge graph;
performing fast search over the vector index with a quantized approximate nearest neighbor algorithm, and constructing a multi-level search tree through product quantization and hierarchical clustering;
integrating the exact-match results of the inverted index, the similarity-ranked results of the vector index and the association-path results of the graph index to generate a comprehensive retrieval score, and selecting the top-N ranked target data block identifiers.
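Score fusion across index types can be illustrated as follows; the cosine-plus-keyword combination and the weighting parameter alpha are stand-ins chosen for this sketch, not the claimed joint index.

# Illustrative score-fusion sketch: combine an exact keyword match count with
# cosine similarity over document vectors (stand-ins for the inverted, vector
# and graph indexes) and return the top-N block identifiers.
import numpy as np

doc_vectors = {"blk-1": np.array([0.9, 0.1, 0.0]),
               "blk-2": np.array([0.2, 0.8, 0.1]),
               "blk-3": np.array([0.1, 0.1, 0.9])}
keyword_hits = {"blk-1": 2, "blk-2": 0, "blk-3": 1}   # inverted-index matches

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def search(query_vec, top_n=2, alpha=0.6):
    scores = {}
    for blk, vec in doc_vectors.items():
        scores[blk] = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_hits[blk]
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(search(np.array([1.0, 0.0, 0.2])))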
7. The method of claim 1, wherein decrypting the encrypted archive data block based on the dynamic decryption policy comprises:
an intermediate-permission-level user triggers the multi-layer decryption flow, which requires sequential verification of a temporary token and a dynamically generated session key;
in the time-limited decryption scenario, validity parameters are bound to the decryption key with a time-lock-based encryption scheme, and the key is automatically invalidated once the preset time is exceeded;
distributed decryption is realized through a secure multi-party computation protocol, wherein the decryption task is divided among a plurality of participating nodes that cooperatively complete the decryption operation based on a secret sharing protocol.
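The time-limited branch can be illustrated with a key object that refuses use after its expiry, as sketched below; this is a simplified stand-in for the time-lock encryption scheme named in the claim, and the key material and validity window are toy values.

# Illustrative time-limited decryption sketch: the session key is bound to an
# expiry timestamp and refused afterwards.
import time
from dataclasses import dataclass

@dataclass
class TimedKey:
    key: bytes
    expires_at: float

    def use(self) -> bytes:
        if time.time() > self.expires_at:
            raise PermissionError("decryption key expired")
        return self.key

token = TimedKey(key=b"\x01" * 32, expires_at=time.time() + 300)  # valid 5 minutes
session_key = token.use()   # succeeds while the validity window is open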
8. The method of claim 1, further comprising the steps of archive data update and version control:
generating, when the archive data is modified, an incremental update package based on a differential encoding algorithm, re-encrypting the updated portion and rebuilding the hash chain;
recording historical versions with a version snapshot mechanism, wherein each snapshot comprises a timestamp, a modifier signature and a version difference digest, and enabling quick cross-version verification through a Merkle tree structure;
deploying a version synchronization protocol in the distributed database to coordinate data consistency among the multiple nodes based on a distributed consensus algorithm.
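A minimal Merkle-root computation over version snapshots, illustrating the quick cross-version verification mentioned above, is sketched below; the snapshot encoding is an assumption of this sketch.

# Illustrative Merkle-tree sketch over version snapshots: any change to a
# snapshot changes the root, enabling quick cross-version verification.
import hashlib

def merkle_root(leaves: list[bytes]) -> bytes:
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

snapshots = [b"v1|2025-04-08|editor-A", b"v2|2025-04-09|editor-B"]
print("root:", merkle_root(snapshots).hex())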
9. The method of claim 1, further comprising the steps of accessing a log and anomaly detection:
recording a detailed log of user access operations, including access time, request parameters, decryption operations and the data transmission path, and storing the log, encrypted, in an independent audit database;
constructing an anomaly detection model combining an isolation forest with a long short-term memory network, analyzing access frequency, permission abuse patterns and decryption failure events in the log data in real time, and generating anomaly scores;
triggering, when the anomaly score exceeds a preset threshold, an adaptive response mechanism including temporarily freezing the account, increasing the key rotation frequency and starting a data self-destruction protocol.
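The isolation-forest component can be illustrated with scikit-learn as follows; the log features, toy data and response actions are assumptions of this sketch, and the LSTM part of the claimed model is omitted.

# Illustrative anomaly-scoring sketch with an isolation forest over simple
# access-log features (requests per hour, failed decryptions per hour).
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy training data: [requests_per_hour, failed_decryptions_per_hour]
normal_logs = np.array([[12, 0], [8, 1], [15, 0], [10, 0], [9, 1], [11, 0]])
model = IsolationForest(contamination=0.1, random_state=0).fit(normal_logs)

suspicious = np.array([[240, 35]])             # burst of failed decryptions
score = -model.score_samples(suspicious)[0]    # higher = more anomalous
is_anomaly = model.predict(suspicious)[0] == -1
print(f"anomaly score {score:.2f}, flagged: {is_anomaly}")
if is_anomaly:
    print("trigger adaptive response: freeze account, rotate keys")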
10. The method of claim 1, further comprising the step of sharing across organization archives:
constructing a consortium blockchain network in which each participating organization joins as a node, and defining data sharing rules and permission policies through smart contracts;
realizing fine-grained access control with attribute-based encryption, matching user attributes against the archive access policy, and dynamically generating decryption credentials;
introducing homomorphic encryption into the sharing process, allowing a third party to perform statistical analysis on the encrypted archive data without decryption, and returning the generated aggregation result to the requesting party.
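A simplified, non-cryptographic stand-in for attribute-based credential issuance is sketched below: a credential is returned only when the user's attributes satisfy the archive policy. A real deployment would enforce this with CP-ABE rather than a plain policy check, and all attribute names and values are illustrative.

# Illustrative stand-in for attribute-based access control: a decryption
# credential is issued only when the requesting user's attributes satisfy the
# archive's policy (real CP-ABE would enforce this cryptographically).
import secrets

policy = {"hospital": {"HospitalA", "HospitalB"},
          "department": {"cardiology"},
          "min_title_rank": 2}
user = {"hospital": "HospitalA", "department": "cardiology", "title_rank": 3}

def issue_credential(user_attrs: dict, access_policy: dict) -> str | None:
    ok = (user_attrs["hospital"] in access_policy["hospital"]
          and user_attrs["department"] in access_policy["department"]
          and user_attrs["title_rank"] >= access_policy["min_title_rank"])
    return secrets.token_hex(16) if ok else None

print("credential:", issue_credential(user, policy))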
CN202510435041.1A 2025-04-08 2025-04-08 Archives management method based on AI and encrypted storage Pending CN119961216A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510435041.1A CN119961216A (en) 2025-04-08 2025-04-08 Archives management method based on AI and encrypted storage

Publications (1)

Publication Number Publication Date
CN119961216A true CN119961216A (en) 2025-05-09

Family

ID=95588004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510435041.1A Pending CN119961216A (en) 2025-04-08 2025-04-08 Archives management method based on AI and encrypted storage

Country Status (1)

Country Link
CN (1) CN119961216A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120371789A (en) * 2025-06-24 2025-07-25 杭州越中档案信息技术有限公司 File classification and identification method and system based on artificial intelligence
CN120579197A (en) * 2025-07-31 2025-09-02 山东第二医科大学 A security management system and method for electronic archive data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117786756A (en) * 2024-02-23 2024-03-29 四川大学华西医院 Method and system for secure sharing of user patient data based on skin database
EP4361872A1 (en) * 2022-10-27 2024-05-01 Genetec Inc. Systems for mandatory access control of secured hierarchical documents and related methods
CN118114301A (en) * 2024-03-07 2024-05-31 浙江焕华档案管理有限公司 File processing method and system based on digital information security
CN118427157A (en) * 2024-05-20 2024-08-02 汶上县融媒体中心(汶上县广播电视台) File management system and working method thereof

Similar Documents

Publication Publication Date Title
Gkoulalas-Divanis et al. Modern privacy-preserving record linkage techniques: An overview
CN118350047B (en) Digital file system based on block chain
CN107480549B (en) A kind of sensitive information desensitization method and system that data-oriented is shared
US10235335B1 (en) Systems and methods for cryptographically-secure queries using filters generated by multiple parties
US8312023B2 (en) Automated forensic document signatures
Schnell Privacy‐preserving record linkage
CN119961216A (en) Archives management method based on AI and encrypted storage
Vatsalan et al. Efficient two-party private blocking based on sorted nearest neighborhood clustering
CN119920488B (en) Cloud computing-based tumor early screening data sharing platform construction method and system
CN112883403B (en) A Verifiable Encrypted Image Retrieval Privacy Protection Method
Karakasidis et al. Scalable blocking for privacy preserving record linkage
CN119720256A (en) A distributed storage method and system for data security
CN120257327A (en) A Security Integrated Management System for Archives Data
Yao et al. SNN-PPRL: A secure record matching scheme based on siamese neural network
Song et al. Privacy-preserving method for face recognition based on homomorphic encryption
Christen et al. Privacy-preserving record linkage using autoencoders
US20230195928A1 (en) Detection and protection of personal data in audio/video calls
Kambire et al. An improved framework for tamper detection in databases
CN117272353B (en) Data encryption storage protection system and method
Veloso Automated support tool for forensics investigation on hard disk images
Karakasidis et al. Phonetics-based parallel privacy preserving record linkage
CN115439118B (en) Digital certificate storage management method based on blockchain
Morabito et al. Content-based Obfuscation for Structured Documents using Secret Sharing at the Edge
Christen et al. A tutorial on privacy-preserving record linkage
Palit et al. Securing Bigdata with HADOOP in Enterprise Information Systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination