CN119989418A

CN119989418A - Cross-unit data management method

Info

Publication number: CN119989418A
Application number: CN202510468954.3A
Authority: CN
Inventors: 谭海东; 吴如富; 张玉波; 刘鹏
Original assignee: Guizhou Huizhi Electronic Technology Co ltd
Current assignee: Guizhou Huizhi Electronic Technology Co ltd
Priority date: 2025-04-15
Filing date: 2025-04-15
Publication date: 2025-05-13

Abstract

The invention relates to the technical field of data processing, and in particular to a cross-unit data management method. The method first standardizes the format of the data in each institutional unit's data pool, converting it into a unified standard format, extracts semantic features to construct a semantic association network that records the semantic relations among data items, and adopts a dynamic authorization and permission mapping mechanism. Query authorization is then generated dynamically, and a permission mapping table is established, according to the querier's identity information, unit attributes, and service requirements and the data sensitivity level. When a query condition is received, it is parsed with natural language processing, and a query route is generated and the query statement optimized by combining the semantic association network with the permission mapping table. Finally, data is retrieved from each data pool along the route, fused, and displayed to the querier in real time, and the query log is recorded.

Description

Cross-unit data management method

Technical Field

The invention belongs to the technical field of data processing, and relates to a cross-unit data management method.

Background

With the continuous growth of government informatization and of the demand for data sharing, collaborative data management across institutional units has become an important means of optimizing public services and improving governance efficiency. At present, institutions such as environmental protection, business, and tax departments commonly maintain independent data pools that store large amounts of structured, semi-structured, and unstructured data. However, cross-unit data integration and querying face significant challenges because of non-uniform data formats, differing semantic definitions, and fragmented permission management. For example, the environmental protection department's "enterprise pollutant discharge data" and the business department's "enterprise registration information" may use different field names and unit standards, making the data difficult to associate directly.

At present, the data formats of the units are not uniform, for example in text encoding and numerical precision, so format conversion consumes a great deal of manual effort; data semantics are isolated and lack business logic associations. Permission management cannot follow changes in the data: for example, when data held by a unit becomes sensitive because of a policy adjustment, the system cannot automatically tighten the related query permissions, which creates a risk of data leakage. The mapping between permissions and data is also coarse, making fine-grained control difficult. Existing query methods depend on preset routing policies and cannot dynamically optimize paths according to data pool load, semantic associations, and the like; for example, a complex query must traverse the data pools of multiple units, so response times are long. They also lack natural language processing support and require users to write structured query statements, which raises the threshold for use and is error-prone.

Several existing patents address these problems. CN119377582A, "Management method and system for multi-terminal engineering internal data", constructs an engineering information data set by fusing data on spatio-temporal features, achieving physical integration of multi-terminal data and improving the efficiency of engineering data management; however, it considers only spatio-temporal features, does not construct semantic associations among data items, and cannot support complex business logic queries. CN119378532A, "General data management method and system based on identification analysis technology", uses identification analysis to standardize data formats and aggregate industry data, solving format heterogeneity and supporting cross-system data exchange; however, it relies on predefined identification rules, cannot dynamically extend semantic associations, and its permission management is still based on fixed roles. CN119396848A, "Hybrid analysis middleware and query method based on cross-type databases", provides uniform access to multiple database types through middleware, supports cross-database queries, and reduces application development complexity; however, its routing policy is preset and static, so query paths cannot be optimized according to data semantics and real-time state. CN118838968A, "Data processing method and system", achieves data synchronization through message middleware, supports distributed cluster data integration, improves synchronization efficiency, and supports high-concurrency scenarios; however, it does not resolve semantic conflicts and cannot meet the fine-grained control requirements of sensitive data.

The existing methods have made some progress in data integration, permission management, and query efficiency, but share common defects: they lack analysis of the business logic associations among data items and cannot support compound semantic queries; their authorization mechanisms and query routing policies depend on manual configuration and cannot respond in real time to data changes and scenario requirements; and their conflict resolution capability is weak, since no effective scheme for resolving cross-unit data semantic conflicts is provided, so data consistency is difficult to guarantee.

Disclosure of Invention

The invention provides a cross-unit data management method, which aims to solve the problems of data semantic isolation, static authority management, low query efficiency and insufficient cross-unit data conflict resolution capability in the existing cross-unit data management.

In order to solve the problems, the invention adopts the following technical scheme:

A method of cross-unit data management, comprising the steps of:

S01, performing format standardization on the data in each institutional unit's data pool, converting data in different formats into a unified standard format, extracting semantic features for each data item, constructing a semantic association network, and recording the semantic association relations among different data items through the semantic association network, wherein a dynamic authorization and permission mapping mechanism is adopted;

S02, dynamically generating query authorization based on the querier's identity information, unit attributes, and service requirements and on the data sensitivity level, and at the same time establishing a permission mapping table that maps the querier's permissions to data nodes in the semantic association network;

S03, when a query condition is received, parsing it with natural language processing: converting the natural language query into a structured query tree using dependency syntax analysis and an intention recognition model; combining the semantic association network and the permission mapping table to select an optimal data node path based on reinforcement learning, thereby generating a query route; and introducing materialized views and query rewriting, decomposing the complex query into sub-queries and executing them in parallel to optimize the query statement. The reward function of the reinforcement learning (formula not reproduced here) is built from three quantities: the permission compliance, with value range [0,1]; the data processing delay, in milliseconds; and the conflict probability prediction value. The degree of parallelism is calculated from a formula (not reproduced here) whose inputs are the total amount of data to be processed, in bytes; NodeCapacity, the upper limit of the amount of data a data node can bear, in bytes; and Priority, the priority of the query task, with value range [1,10];

S04, retrieving the corresponding data from the data pool of each institutional unit according to the query route, performing fusion processing on the retrieved data, and finally displaying the fused data to the querier in real time and recording the query log.
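The parallel execution of sub-queries in S03 can be illustrated with a minimal Python sketch. It does not compute the degree of parallelism, because that formula is not available in this text; the value is simply passed in. The names `run_subqueries` and `fake_fetch`, and the shape of a sub-query as a `(pool, condition)` pair, are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subqueries(subqueries, fetch, parallelism):
    """Run every sub-query with at most `parallelism` workers, keeping input order."""
    workers = max(1, min(parallelism, len(subqueries)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, subqueries))


def fake_fetch(subquery):
    """Hypothetical stand-in for a call to one unit's data pool."""
    pool_name, condition = subquery
    return {"pool": pool_name, "condition": condition, "rows": []}


results = run_subqueries(
    [("environment", "emissions > limit"), ("business", "status = 'active'")],
    fake_fetch,
    parallelism=4,
)
```

Because `pool.map` preserves input order, the fusion step in S04 can match each result to the sub-query that produced it.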

The principle and the advantages of the scheme are as follows:

In the data preprocessing stage, the data in each institutional unit's data pool is standardized in format, semantic features are extracted, and a semantic association network is constructed to record the semantic association relations among data items. Query authorization is generated dynamically from several kinds of information about the querier, and a permission mapping table is established that associates permissions with nodes of the semantic network, ensuring accurate control of the querier's permissions. When a query condition is received, it is parsed with natural language processing, a query route is generated by combining the semantic association network and the permission mapping table, and the query statement is optimized so that the required data can be located efficiently. Finally, the data is retrieved along the query route, fused, and displayed to the querier, and the query log is recorded for subsequent system optimization.

Compared with the prior art, the scheme has clear inventiveness and advantages. In data integration, the prior art focuses on unifying data formats, whereas this scheme not only standardizes formats but also mines the semantic information of the data in depth to construct a semantic association network. For example, when integrating data from the environmental protection and meteorological departments, the environmental protection department's "pollutant emission data" and the meteorological department's "air quality data" can be logically connected through the semantic association network, so that originally isolated data form an organic whole and support complex data analysis and decision-making.

In permission management, the prior art generally adopts static authorization and cannot adapt to dynamic changes. This scheme generates authorization dynamically based on the querier's identity, the business requirement, and the data sensitivity level, achieving fine-grained permission control. If a querier needs temporary access to sensitive data during a specific period for work, the system can adjust the authorization according to the business requirement, meeting the need while keeping the data secure. Identity authentication can be implemented in various ways, such as username/password combinations, digital certificates, or biometric technologies such as fingerprint and face recognition; it establishes the querier's basic identity information, such as name, department, and position. When the querier submits a query request, the system analyzes the business requirement behind it, which may involve natural language processing and semantic understanding of the query conditions. For example, from the query "acquire last month's sales department performance data for a market analysis report", the system may identify that the business requirement is market analysis, the data involved is sales department performance data, and the time range is last month. The system can then determine the data resources and operation types to be accessed according to predefined business rules and data associations. Dynamic authorization can also set a validity period for a permission. For example, when a querier needs access to certain sensitive data for a temporary task, the system grants access only for the duration of the task, for example one week; once the validity period has passed, the permission lapses automatically, preventing abuse.

In query processing, the query routing and statement optimization capabilities of the prior art are limited. This scheme parses the user's natural language query condition directly with natural language processing, generates an optimal query route by combining the semantic association network and the permission mapping table, and optimizes the query statement. For example, when the user enters "query the penalties imposed on heavily polluting enterprises in a certain region in a certain month", the system can quickly and accurately retrieve the relevant data from the data pools of several institutional units, improving query efficiency and accuracy. In addition, the data fusion processing of the scheme draws on the semantic association network, so data conflicts and redundancy can be resolved effectively and the data displayed to the querier is accurate, complete, and consistent. Recording and analyzing the query log also helps to continuously optimize system performance and the permission management strategy, further improving the quality and efficiency of cross-unit data management.
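The time-limited authorization described above (access granted for the duration of a task, then lapsing automatically) can be sketched in Python as follows. The `Grant` record and the `issue_grant` and `is_allowed` helpers are hypothetical names introduced for illustration, and the user and resource identifiers are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Grant:
    user: str
    resource: str
    operation: str
    expires_at: datetime


def issue_grant(user, resource, operation, now, valid_for):
    """Create a grant that expires `valid_for` after `now`."""
    return Grant(user, resource, operation, now + valid_for)


def is_allowed(grant, user, resource, operation, now):
    """A grant applies only to its own user, resource and operation, and only before expiry."""
    return (
        grant.user == user
        and grant.resource == resource
        and grant.operation == operation
        and now < grant.expires_at
    )


start = datetime(2025, 4, 15, 9, 0)
grant = issue_grant("analyst_01", "sales_performance", "read", start, timedelta(weeks=1))
```

Checking expiry at access time means no separate revocation job is needed for the permission to lapse.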

Further, in S01, semantic feature extraction uses BERT to vectorize the data items and combines named entity recognition with relation extraction to construct the semantic association network: the language understanding capability of the pre-trained language model converts the data items into vectors, named entity recognition identifies the key entities in the data, and relation extraction mines the semantic relations among the entities, so that a semantic association network is constructed that records the semantic association relations among different data items;

establishing metadata mapping relations across data pools based on cosine similarity, wherein the semantic similarity Sim(A, B) of data A and data B is computed (formula not reproduced here) from the vectors generated by BERT for A and B and from the association strength obtained by relation extraction; when Sim(A, B) is greater than a set threshold θ, a metadata mapping relation is established between data A and data B;

introducing an attribute-based access control model that dynamically adjusts user permissions according to timestamp, geographic position, and data update frequency, wherein, in the permission model (formula not reproduced here), U is a user and D is data; P(U, D) represents the final access permission of user U to data D; the attribute weights comprise a sensitivity level weight of data D and a trust weight of the unit to which data D belongs, each with value range [0,1]; and the attribute values relating user U and data D comprise a timestamp-related attribute value, a geographic-position-related attribute value, and a data-update-frequency-related attribute value, each with value range [0,1].
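The similarity-threshold mapping can be sketched in Python as below. The cosine term is standard; how the patent combines it with the association strength is not recoverable from this text, so the sketch assumes a simple product, and that choice is an assumption rather than the patent's formula. The function names and the toy two-dimensional vectors, which stand in for BERT embeddings, are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_similarity(u, v, relation_strength):
    # Assumption: the association strength scales the cosine term multiplicatively.
    return cosine_similarity(u, v) * relation_strength


def build_mappings(vectors, strengths, theta):
    """Return the pairs (a, b), with a < b, whose similarity exceeds theta."""
    names = sorted(vectors)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            strength = strengths.get((a, b), 0.0)
            if semantic_similarity(vectors[a], vectors[b], strength) > theta:
                pairs.append((a, b))
    return pairs


vectors = {"emission": [1.0, 0.0], "air_quality": [0.8, 0.6], "tax": [0.0, 1.0]}
strengths = {("air_quality", "emission"): 0.9, ("air_quality", "tax"): 0.5}
mappings = build_mappings(vectors, strengths, theta=0.6)
```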

In this scheme, BERT serves as the pre-trained language model. Having been trained on a large-scale corpus, it has strong language understanding capability and can capture deep semantic information in data items, so the vector form represents the semantic content of the data more accurately. Named entity recognition identifies key entities in the data, such as person names, place names, and organization names, and relation extraction mines the semantic relations among those entities. The resulting semantic association network clearly records the semantic association relations among data items, so the context and associations of data can be better understood in cross-unit data management; for example, in business data involving several units, the network can reveal the data interaction relations among the units and provide more comprehensive information for analysis and decision-making. Because the semantic similarity takes into account both the similarity between vectors and the association strength between entities, it avoids the mismatches that can occur in traditional keyword-based querying. For example, a query for "financial statements of a company" can find not only data containing the keywords "company" and "financial statement" but also, through the semantic association network, other data related to that company's finances, such as financial analysis reports.

The permission model considers multiple attributes, such as sensitivity level and unit trust, so permissions can be allocated individually according to the characteristics of different users and data. Different data may have different sensitivity levels, and the units to which users belong may have different degrees of trust; by weighting these attributes, the most appropriate permission can be assigned to each combination of user and data. For example, data with a high sensitivity level can be accessed only by users with the corresponding permission from high-trust units. User permissions can also be adjusted in real time as factors such as time, geographic position, and data update frequency change. For example, the system may automatically raise the operating frequency limit of users who hold data update permission as the frequency of data updates increases, and may automatically restrict a user's access to certain data when the user leaves a particular geographic location. This real-time adjustment mechanism adapts to a dynamically changing business environment and ensures both the security and the usability of the data.
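The attribute-weighted permission decision can be illustrated with a short Python sketch. The patent's formula for P(U, D) is not reproduced in this text, so the sketch assumes a normalised weighted sum of attribute values, with one weight per attribute; both the form and the names `permission_score` and `can_access` are assumptions.

```python
def permission_score(weights, attributes):
    """Weighted average of attribute values; weights and values must lie in [0, 1]."""
    if weights.keys() != attributes.keys():
        raise ValueError("weights and attributes must cover the same names")
    for name in weights:
        if not (0.0 <= weights[name] <= 1.0 and 0.0 <= attributes[name] <= 1.0):
            raise ValueError(f"{name} is outside [0, 1]")
    total = sum(weights.values())
    if total == 0:
        return 0.0
    # Assumption: P(U, D) is a weighted sum, normalised so the result stays in [0, 1].
    return sum(weights[n] * attributes[n] for n in weights) / total


def can_access(weights, attributes, required):
    """Grant access when the score reaches the level required for the data."""
    return permission_score(weights, attributes) >= required


weights = {"time": 0.5, "location": 0.3, "update_frequency": 0.2}
on_site = {"time": 1.0, "location": 1.0, "update_frequency": 0.5}
off_site = {"time": 1.0, "location": 0.0, "update_frequency": 0.5}
```

With a required level of 0.8, the on-site request passes and the off-site request does not, which mirrors the location-based restriction described above.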

In S02, the permission mapping table uses Neo4j to store the mapping relations between permission nodes and data nodes, with edge weights representing permission granularity, so that the mapping between permissions and data is represented by a graph database. In the mapping algorithm (formula not reproduced here), P denotes the permissions, a sensitivity change threshold determines when an update is triggered, and a recursive update function then updates the associated permissions recursively.

In cross-unit data management, the mapping between permissions and data is often complicated: different users, roles, data resources, and permissions combine into a complex network structure. Storing the mapping between permission nodes and data nodes in Neo4j displays these relations intuitively as a graph. Permission nodes and data nodes are the vertices, the edges between them represent the mapping relations, and the edge weights express the permission granularity, for example distinct operation permissions such as read, write, and modify. This visual representation helps administrators understand and manage the permission hierarchy quickly. The permission hierarchy may also change over time. When new permissions or data nodes must be added, or mapping relations modified, only the corresponding vertices and edges need to be added or changed in the graph, which is simple and has little effect on the existing data structure. For example, when a new business department is added, a new permission node can be created for it and mapped to the relevant data nodes without large-scale adjustment of the overall database structure.
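The threshold-triggered propagation of permission updates can be sketched with an in-memory graph in Python. This is a stand-in for the graph database, not Neo4j code, and the update rule (multiplying edge weights by a fixed `factor`) is an assumption, since the patent's mapping formula is not available in this text. The class and method names are hypothetical.

```python
from collections import defaultdict


class PermissionGraph:
    def __init__(self):
        self.grants = {}                  # (permission, data) -> edge weight
        self.related = defaultdict(set)   # data -> semantically related data

    def grant(self, permission, data, weight):
        self.grants[(permission, data)] = weight

    def relate(self, a, b):
        self.related[a].add(b)
        self.related[b].add(a)

    def on_sensitivity_change(self, data, delta, threshold, factor=0.5):
        """If delta exceeds threshold, scale edge weights of `data` and all reachable data."""
        if delta <= threshold:
            return set()
        visited, stack = set(), [data]
        while stack:                      # explicit stack in place of recursion
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            for key in self.grants:
                if key[1] == node:
                    self.grants[key] *= factor
            stack.extend(self.related[node] - visited)
        return visited


graph = PermissionGraph()
graph.grant("env_reader", "emissions", 1.0)
graph.grant("env_reader", "penalties", 0.8)
graph.grant("tax_reader", "tax", 1.0)
graph.relate("emissions", "penalties")
touched = graph.on_sensitivity_change("emissions", delta=0.4, threshold=0.3)
```

Here the change to `emissions` also tightens the permission on the related `penalties` node, while the unrelated `tax` node is left alone.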

Further, in S04, the corresponding data is retrieved from the data pool of each institutional unit; differences among heterogeneous data are eliminated using ontology alignment and data cleaning; and the confidence of multi-source data is fused based on evidence theory according to a fusion formula (not reproduced here), whose inputs are the basic probability assignment of the i-th data source and the credibility of that data source, and in which Bel denotes the confidence after data fusion;

a conflict detection rule base is constructed, and conflicts are resolved with a game-theoretic negotiation model (formula not reproduced here), in which Resolution denotes the conflict resolution function, Maximize denotes the maximizing function, and the inputs are the unit priority and the utility value of the conflict resolution scheme.

In the above scheme, the same data may have several data sources in cross-unit data management, and the reliability and accuracy of each source may differ. Evidence theory takes the information of multiple data sources into account together, using the basic probability assignment and the credibility of each data source to calculate the confidence of the data. This makes full use of the complementarity of multi-source data and improves the accuracy and reliability of data fusion. For example, different data sources may give different estimates of the probability of an event, and fusing them with evidence theory yields a more reasonable combined estimate.

In cross-unit data management, different units may also have different interests and priorities regarding the data. The game-theoretic negotiation model treats these units as participants in a game and searches for the optimal conflict resolution scheme by considering the unit priorities and the utility value of each conflict resolution scheme. Because it takes the interests of all parties into account, the result is fairer and more reasonable, and all parties are more likely to accept the data fusion result. For example, when handling a conflict over data ownership, the model can find a solution acceptable to all parties according to the importance of each unit and its level of need for the data. The model is also dynamic: the conflict resolution scheme can be adjusted continuously according to the actual situation, and when a unit's priority changes or a new relationship of interests appears, the model recalculates the optimal scheme so that conflicts continue to be resolved effectively. This allows the conflict resolution mechanism to adapt to changing business environments and data conditions.
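The evidence-theory fusion step can be illustrated with Dempster's rule of combination, with each source's masses first discounted by its credibility (Shafer discounting). This is the textbook construction and may differ from the patent's own fusion formula, which is not reproduced in this text. The hypothesis labels and numeric masses are invented for illustration.

```python
from itertools import product


def discount(mass, reliability, frame):
    """Scale every mass by the source reliability and move the remainder to the full frame."""
    out = {focal: reliability * m for focal, m in mass.items()}
    out[frame] = out.get(frame, 0.0) + (1.0 - reliability)
    return out


def combine(m1, m2):
    """Dempster's rule for two mass functions keyed by frozenset focal elements."""
    joint, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            joint[common] = joint.get(common, 0.0) + x * y
        else:
            conflict += x * y
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {focal: v / (1.0 - conflict) for focal, v in joint.items()}


def belief(mass, hypothesis):
    """Bel(H): total mass of the focal elements contained in H."""
    return sum(v for focal, v in mass.items() if focal <= hypothesis)


FRAME = frozenset({"compliant", "violating"})
env = discount({frozenset({"violating"}): 0.9, FRAME: 0.1}, 0.9, FRAME)
biz = discount({frozenset({"compliant"}): 0.6, FRAME: 0.4}, 0.5, FRAME)
fused = combine(env, biz)
```

In this example the more credible source dominates the fused belief, while part of the mass stays on the full frame to reflect the remaining uncertainty.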

Further, the format standardization process comprises unifying the encoding format of text data in different formats and unifying the measurement units and precision of numerical data. The data sources of the institutional units are diverse: text data may use various encodings such as UTF-8 and GBK, and the measurement units and precision of numerical data may differ greatly. Once the encoding format, units, and precision are unified, these format differences are eliminated and data from different units can be integrated smoothly. For example, when integrating data from the environmental protection department and the meteorological department, the monitoring report text of the environmental protection department may use GBK encoding while the data text of the meteorological department uses UTF-8; after the encoding is unified, the data of both departments can be combined in one system to build a comprehensive environmental information database. Unifying encoding formats, measurement units, and precision also lowers the threshold for data sharing, making each unit's data easier for other units to understand and use. For example, on an information sharing platform between government departments, data in a uniform format can be called and analyzed directly by other departments, improving the efficiency and effectiveness of data sharing.
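A minimal Python sketch of the two standardization steps, re-encoding text to UTF-8 and converting numerical readings to a shared unit and precision, is given below. The choice of kilograms as the unified unit, the conversion table, and the function names are assumptions made for illustration.

```python
def to_utf8(raw, source_encoding):
    """Decode bytes from the source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical conversion factors into the unified unit (kilograms).
TO_KILOGRAMS = {"kg": 1.0, "t": 1000.0, "g": 0.001}


def to_standard_mass(value, unit, precision=3):
    """Convert a mass reading to kilograms, rounded to a shared precision."""
    return round(value * TO_KILOGRAMS[unit], precision)


gbk_report = "排污数据".encode("gbk")      # text as a GBK-encoded source might deliver it
utf8_report = to_utf8(gbk_report, "gbk")
```

An unknown unit raises `KeyError` rather than being passed through silently, so unconverted values cannot enter the unified pool unnoticed.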

Further, in the process of dynamically generating query authorization, a log analysis system collects statistics on the querier's query frequency, the types of data queried, and the distribution of query times over a given period. Based on these statistics, if the querier frequently queries certain data, the dynamically generated authorization gives the querier more query permissions for that data; if an abnormal query appears that differs markedly from the historical query pattern, the system automatically subjects the query authorization to strict review or restriction. Analyzing the querier's recent query behavior and historical query records gives a deeper understanding of the querier's actual needs. For example, for a querier who has long followed the atmospheric pollution data of the environmental protection department, the system can provide more query permissions related to atmospheric pollution, including detailed data for specific areas and time periods, so that the authorization better matches the querier's actual work. A querier's historical query behavior usually shows a certain regularity, so abnormal behavior that does not match the historical pattern may indicate a data security risk. For example, if a querier who has only ever queried public data suddenly and frequently requests highly sensitive data, the system can detect the anomaly by comparison with the querier's historical records and strictly review or restrict the authorization, effectively preventing security problems such as data leakage and illegal access.
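The log-based adjustment can be sketched in Python as a simple classifier over past query records. The rules used here (a request more sensitive than anything in the history is sent for review; a category that makes up at least half of past queries earns extended permissions) and the 0.5 share are illustrative assumptions, not values from the patent.

```python
from collections import Counter


def query_profile(log):
    """Share of queries per data category in a log of (category, sensitivity_level) records."""
    counts = Counter(category for category, _ in log)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def review_request(history, category, sensitivity, frequent_share=0.5):
    """Classify a new request as 'review', 'extend' or 'normal' from past behaviour."""
    max_seen = max((level for _, level in history), default=0)
    if sensitivity > max_seen:
        return "review"   # more sensitive than anything this querier has queried before
    if query_profile(history).get(category, 0.0) >= frequent_share:
        return "extend"   # frequently queried category
    return "normal"


history = [("air_pollution", 1)] * 8 + [("water", 1)] * 2
```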

Further, in step S03, changes to the data are monitored in real time. Once the sensitivity of data changes, the permission settings of the corresponding data node in the permission mapping table are adjusted according to preset sensitivity level rules; when a querier's permissions change, the mapping between that querier and the data nodes is modified directly in the permission mapping table. The permission mapping table stores the mapping between permission nodes and data nodes in a Neo4j graph database, represents permission granularity with edge weights, and relies on the characteristics of the graph database for fast updates. When a data update turns previously public data into sensitive data, updating the permission mapping table in real time restricts the querier's access promptly. For example, in the medical industry, part of a patient's examination data may originally be open to specific departments within a hospital; if the data is later found to contain new sensitive information, updating the permission mapping table in real time prevents unauthorized persons from continuing to access it and so prevents data leakage. If the permission mapping table cannot be updated in real time, the querier's permissions lag behind the actual situation: the querier may be blocked for lack of permission when access to certain data is needed, or may still be able to access data after the permission has been withdrawn, which affects work efficiency. Real-time updating eliminates this lag, so the querier can obtain the required data smoothly and the workflow runs more smoothly.

Further, the natural language processing technology comprises word segmentation, part-of-speech tagging, named entity recognition, and semantic understanding. Through semantic understanding of the query statement, the system obtains more context information and a clearer picture of the user's intention, and can therefore make more intelligent decisions. For example, when a user queries "the environmental compliance status of enterprises in a certain industry in this city", the system can not only provide the relevant compliance data but also analyze it, for example by assessing the environmental situation of the whole industry or comparing it with other industries, giving the user more valuable decision support.

In step S03, a monitoring system obtains in real time the load information of each data pool (CPU utilization and memory occupancy) and the storage location of the data. When a query route is generated, a data pool that is closer to the query initiating end or has a shorter data transmission path is selected preferentially, and when a data pool is under high load, the query request is directed to a data pool with lower load. The storage location of data affects the distance and time of data transmission, so taking it into account allows a nearer data pool to be preferred. For example, in cross-regional unit data management, if the querier is located in area A and the relevant data is stored in data pool B, which is close to area A, the generated query route is directed preferentially to data pool B, shortening the transmission time over the network and speeding up the query response. When a data pool is under high load, its query processing speed drops markedly, so the system also considers the load of each data pool. For example, if at a given moment the CPU utilization and memory occupancy of data pool C are high while data pool D is lightly loaded, the system directs the query route to data pool D, avoiding the delay of waiting for the heavily loaded pool.
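The location- and load-aware pool selection can be sketched in Python with a simple cost function. The cost form (transmission distance scaled by load, plus a large penalty above a load limit), the 0.8 limit, and the field names are assumptions chosen for illustration; the patent's actual selection uses reinforcement learning with a reward function that is not reproduced in this text.

```python
def pool_cost(pool, load_limit=0.8, overload_penalty=1000.0):
    """Lower is better: nearer and lightly loaded pools win, overloaded pools are penalised."""
    load = max(pool["cpu"], pool["memory"])
    penalty = overload_penalty if load >= load_limit else 0.0
    return pool["distance_ms"] * (1.0 + load) + penalty


def choose_pool(pools):
    """Pick the name of the candidate pool with the lowest cost."""
    return min(pools, key=pool_cost)["name"]


# Pool C is nearer but heavily loaded; pool D is farther but lightly loaded.
loaded_case = [
    {"name": "C", "cpu": 0.95, "memory": 0.90, "distance_ms": 5.0},
    {"name": "D", "cpu": 0.20, "memory": 0.30, "distance_ms": 12.0},
]
# Both pools lightly loaded: the nearer one should be chosen.
distance_case = [
    {"name": "A", "cpu": 0.10, "memory": 0.10, "distance_ms": 20.0},
    {"name": "B", "cpu": 0.10, "memory": 0.10, "distance_ms": 5.0},
]
```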

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

Embodiment 1, as shown in fig. 1, a cross-unit data management method includes the following steps:


In S01, semantic feature extraction uses BERT to vectorize the data items and combines named entity recognition with relation extraction to construct the semantic association network. The language understanding capability of the pre-trained language model converts each data item into vector form, named entity recognition determines the key entities in the data, and relation extraction mines the semantic relationships between those entities. The resulting semantic association network records the semantic association relationships between different data items.

In S04, the corresponding data is retrieved from the data pool of each institution, ontology alignment and data cleaning are used to eliminate the differences between heterogeneous data, and the confidences of the multi-source data are fused on the basis of evidence theory according to the fusion formula [formula not recoverable from the extracted text].

The format standardization processing comprises unifying the encoding format of text data in different formats and unifying the measurement units and precision of numerical data. Institutional data comes from a wide range of sources: text data may use various encodings, such as UTF-8 and GBK, and numerical data may differ greatly in measurement units and precision. Unifying the encoding format, the units and the precision eliminates these format differences, so that data from different units can be integrated smoothly. For example, when integrating the data of the environmental protection department and the meteorological department, the monitoring report text of the environmental protection department may use GBK encoding while the data text of the meteorological department uses UTF-8; after the encoding is unified, the data of both departments can be combined in one system to build a comprehensive environmental information database. A unified encoding format, measurement unit and precision also lowers the threshold for data sharing, making the data of each unit easier for other units to understand and use. For example, on an information sharing platform between government departments, data in a uniform format can be directly retrieved and analyzed by other departments, improving the efficiency and effect of data sharing.

In the process of dynamically generating the query authorization, a log analysis system counts the inquirer's query frequency, queried data types and query time distribution within a certain period. Based on these statistics, if the inquirer frequently queries a certain type of data, more query permissions for that type of data are provided when the authorization is dynamically generated; if an abnormal query occurs that differs greatly from the historical query pattern, the system automatically subjects the query authorization to strict review or restriction. Analyzing the inquirer's recent query behavior and historical query records reveals the inquirer's actual needs. For example, for an inquirer who has long followed the air pollution data of the environmental protection department, the system can provide more query permissions related to air pollution, including detailed data for specific areas and time periods, so that the authorization better matches the inquirer's actual work. An inquirer's historical query behavior usually shows a certain regularity, so abnormal query behavior that does not match the historical pattern may indicate a data security risk. For example, if an inquirer who normally queries only public data suddenly and frequently requests sensitive data, the system detects the anomaly by comparison with the inquirer's historical query records and strictly reviews or restricts the query authorization, effectively preventing security problems such as data leakage and illegal access.

In S03, changes in the data are monitored in real time. Once the sensitivity of data changes, the permission setting of the corresponding data node in the permission mapping table is adjusted according to preset sensitivity level rules; when the permission of an inquirer changes, the mapping relationship between that inquirer and the data node is modified directly in the permission mapping table. The permission mapping table uses the Neo4j graph database to store the mapping relationships between permission nodes and data nodes and uses edge weights to represent permission granularity, so that fast updates are achieved by means of the characteristics of the graph database. When a data update turns originally public data into sensitive data, updating the permission mapping table in real time restricts the inquirer's access in time. For example, in the medical industry, some patient examination data may originally be open to specific departments of a hospital; if the data is later found to contain new sensitive information, updating the permission mapping table in real time prevents unauthorized persons from continuing to access it, effectively preventing data leakage. If the permission mapping table cannot be updated in real time, the inquirer's permissions lag behind the actual situation: the inquirer may be blocked for insufficient permission when access to certain data is needed, or may still be able to access data after the permission has been withdrawn, which affects working efficiency. Real-time updating eliminates this lag, so that the inquirer can obtain the required data smoothly and the workflow proceeds without interruption.

The natural language processing technology comprises word segmentation, part-of-speech tagging, named entity recognition and semantic understanding. Through semantic understanding of the query statement, the system obtains more context information and a clearer picture of the user's intention, and can therefore make more intelligent decisions. For example, when a user queries "the environmental compliance status of enterprises in a certain industry in this city", the system can not only provide the relevant compliance data but also analyze it, for instance by assessing the environmental protection situation of the whole industry or comparing it with other industries, thereby providing more valuable decision support for the user.

In actual use, the method proceeds as follows:

1. Data preprocessing

For the data in the data pools of the different institutions, format standardization is performed first. For text data, a text encoding detection tool such as the chardet library is used to identify the encoding format; if the monitoring report text of the environmental protection department uses GBK encoding and the data text of the meteorological department uses UTF-8 encoding, all text is converted to UTF-8 to ensure compatibility in subsequent processing. For numerical data, the original measurement unit and precision are determined by analyzing the metadata information and business rules of the data. For example, when the data of the environmental protection and meteorological departments are integrated and it is found that the pollutant concentration data recorded by the environmental protection department is in ppm while the related data of the meteorological department is in mg/m³, all numerical data is converted to mg/m³ according to scientific unit conversion rules and the precision is unified to two decimal places. This eliminates the differences in measurement unit and precision and lays the foundation for data integration.
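As a minimal sketch of the two normalization steps just described, the following Python fragment re-encodes text as UTF-8 and converts a gas concentration from ppm to mg/m³. Several things here are assumptions that are not in the source: the function names are hypothetical, encoding detection is approximated by trying a fixed list of candidate encodings from the standard library instead of calling chardet, and the unit conversion uses the common ideal-gas relation mg/m³ = ppm × molar mass / 24.45, which holds only for gas-phase volume ppm at 25 °C and 1 atm.

```python
# Hypothetical helpers for format standardization; not the patent's own code.

# Assumption: only these encodings occur. UTF-8 is tried first because GBK
# bytes for Chinese text are almost never valid UTF-8, but not vice versa.
CANDIDATE_ENCODINGS = ("utf-8", "gbk")

# Molar volume of an ideal gas at 25 degrees C and 1 atm, in litres per mole.
MOLAR_VOLUME_L_PER_MOL = 24.45


def to_utf8(raw: bytes) -> bytes:
    """Decode with the first candidate encoding that works, re-encode as UTF-8."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    raise ValueError("text is not in any of the candidate encodings")


def ppm_to_mg_per_m3(ppm: float, molar_mass_g_per_mol: float) -> float:
    """Convert a gas concentration to mg/m3, rounded to two decimal places."""
    return round(ppm * molar_mass_g_per_mol / MOLAR_VOLUME_L_PER_MOL, 2)
```

For instance, under these assumptions 1 ppm of sulfur dioxide (molar mass about 64.066 g/mol) would be stored as 2.62 mg/m³. A production system would more likely take the conversion conditions from the metadata of each data pool rather than hard-code them.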

Semantic features are then extracted from the standardized data using a deep-learning-based BERT model. Taking the monitoring data of the environmental protection department as an example, the data is input into a pre-trained BERT model, which analyzes the data content, its context and the related business rules. At the same time, the semantic association network is constructed by combining named entity recognition (NER) with relation extraction. NER identifies the key entities in the data, such as person names, place names, organization names and pollutant names, and relation extraction mines the semantic relationships between entities, such as the causal relationship between pollutant emission and air quality, or the ownership relationship between an enterprise and its pollutant emission data. The extracted semantic features serve as nodes and the semantic association relationships serve as edges of the semantic association network, which records the semantic associations between different data items and facilitates subsequent data analysis and query processing.
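The node-and-edge structure described above can be illustrated with a small in-memory graph. This is only a sketch: it assumes that entity and relation extraction have already produced (head, relation, tail) triples, it does not run BERT or NER, and the function names and the example triples are hypothetical.

```python
from collections import defaultdict, deque


def build_network(triples):
    """Build an undirected adjacency map from (head, relation, tail) triples."""
    network = defaultdict(set)
    for head, relation, tail in triples:
        network[head].add((relation, tail))
        network[tail].add((relation, head))
    return network


def related_items(network, start, max_hops):
    """Return every node reachable from start within max_hops edges (BFS)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for _relation, neighbour in network.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return seen - {start}


# Illustrative triples only; real ones would come from relation extraction.
example_triples = [
    ("Enterprise X", "emits", "SO2"),
    ("SO2", "affects", "air quality index"),
    ("Enterprise X", "registered_in", "Region A"),
]
```

With these triples, a two-hop lookup from "Enterprise X" reaches the air quality index through the pollutant node, which is the kind of cross-department link the semantic association network is meant to record.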

A cross-data-pool metadata mapping relationship is established based on cosine similarity. For data A and data B, the semantic similarity Sim(A, B) is computed from the vectors generated by BERT for A and B together with the association strength obtained from relation extraction [formula not recoverable from the extracted text]. A threshold θ, for example θ = 0.75, is set on the basis of extensive experiments and business experience. When Sim(A, B) is greater than the threshold θ, a metadata mapping relationship between data A and data B is established.

An attribute-based access control model is introduced. In the permission model [formula not recoverable from the extracted text], U is the user, D is the data, and P(U, D) represents the final access permission of user U to data D. The attribute weights comprise the sensitivity level weight of data D and the trust weight of the unit to which the data belongs, each with a value range of [0, 1]. The attribute values relating user U and data D comprise a timestamp-related attribute value, a geographic-location-related attribute value and a data-update-frequency-related attribute value, each with a value range of [0, 1]. User permissions are thus adjusted dynamically through the timestamp, the geographic location and the data update frequency. For example, data of a high sensitivity level can be accessed only by users who come from a unit with a high trust level and hold the corresponding permission; when the data update frequency increases, the system automatically raises the operating frequency limit of users with data update permission; and when a user leaves a specific geographic location, access to certain data is automatically restricted.
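The thresholding step can be sketched as follows. The exact similarity formula of the source is not legible in this text, so the sketch assumes plain cosine similarity between embedding vectors and leaves out the relation-strength term; the vectors are toy values standing in for BERT outputs, and the function names are hypothetical.

```python
import math

THETA = 0.75  # example threshold quoted in the description


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_metadata_mappings(vectors, theta=THETA):
    """Return the pairs of data items whose similarity exceeds theta."""
    names = sorted(vectors)
    mappings = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if cosine_similarity(vectors[first], vectors[second]) > theta:
                mappings.append((first, second))
    return mappings


# Toy embeddings (assumption); real ones would be BERT vectors of the data items.
toy_vectors = {
    "env.pollutant_emission": [1.0, 0.0, 1.0],
    "met.air_quality": [1.0, 0.2, 0.9],
    "tax.invoice": [0.0, 1.0, 0.0],
}
```

Here only the environmental and meteorological items exceed the threshold, so a single mapping is created between them. Comparing every pair is quadratic in the number of items; a real deployment would probably use an approximate nearest-neighbour index instead.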

2. Dynamically generating query grants and rights mappings

The inquirer's identity information, such as name, affiliated unit and position, is collected with the inquirer's consent; the business requirement is obtained from the business requirement form that the user fills in within the system; and the sensitivity level of the data is set by the data owner or manager according to the data content and the relevant regulations. At the same time, a log analysis system counts the inquirer's query frequency, queried data types and query time periods over a past period, such as the last three months. If an inquirer is found to have followed the air pollution data of the environmental protection department for a long time, more query permissions related to air pollution are provided when the query authorization is generated, including detailed data for specific areas and time periods, so that the authorization better matches the inquirer's actual work. If the inquirer shows abnormal query behavior that is inconsistent with the historical pattern, for example an inquirer who normally queries only public data suddenly and frequently requests sensitive data, the system detects the anomaly by comparison with the inquirer's historical query records and strictly reviews or restricts the query authorization, effectively preventing security problems such as data leakage and illegal access.
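One simple way to compare recent behavior with the historical pattern is to measure how far the distribution of queried data types has shifted. The sketch below is an illustration under stated assumptions, not the source's method: it uses the total variation distance between the two distributions, and both the function names and the 0.5 review threshold are hypothetical.

```python
from collections import Counter


def category_shares(queries):
    """Map each queried data type to its share of all queries."""
    counts = Counter(queries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def review_needed(history, recent, max_shift=0.5):
    """Flag the request when recent queries diverge strongly from history.

    The shift is the total variation distance between the two distributions,
    which is 0 for identical behavior and 1 for completely disjoint behavior.
    """
    past = category_shares(history)
    now = category_shares(recent)
    categories = set(past) | set(now)
    shift = 0.5 * sum(abs(past.get(c, 0.0) - now.get(c, 0.0)) for c in categories)
    return shift > max_shift
```

An inquirer whose history is almost entirely public data and whose recent requests are mostly sensitive data would be flagged for review, while an unchanged pattern would not. The same statistics could also feed the opposite decision described above, widening permissions for data types that the inquirer uses regularly.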

Neo4j is used to store the mapping relationships between permission nodes and data nodes, and the permission mapping table is established on this basis. The permission nodes and the data nodes are the vertices of the graph, the edges between them represent the mapping relationships, and the edge weights represent the permission granularity, for example a weight of 0.3 for read permission, 0.5 for write permission and 0.7 for modification permission. The mapping algorithm of the permission mapping table is given by a formula [formula not recoverable from the extracted text] in which one symbol is the sensitivity change threshold governing the recursive update of the related permissions, P denotes the permissions, and a further symbol denotes the function that recursively updates permissions. For example, when a new business department is added, a new permission node is created for it in the Neo4j graph database and mapping relationships are established between the new permission node and the relevant data nodes; the operation is simple and has little impact on the existing data structure. The permission mapping table is updated in real time as the data is updated and as inquirers' permissions change. When a data update turns originally public data into sensitive data, the system, which monitors data changes in real time, adjusts the permission setting of the corresponding data node in the permission mapping table according to the preset sensitivity level rules; when the permission of an inquirer changes, the mapping relationship between that inquirer and the data node is modified directly in the permission mapping table.
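The following sketch shows the idea of weighted permission edges and a threshold-triggered recursive update. It is an in-memory stand-in, not Neo4j code, and because the source's mapping formula is not legible here, the update rule is an assumption: when a sensitivity change exceeds the threshold, the weights of all edges into the affected data node are capped, and the same cap is propagated recursively to data nodes that depend on it. The class and method names are hypothetical.

```python
READ, WRITE, MODIFY = 0.3, 0.5, 0.7  # example edge weights from the description


class PermissionGraph:
    """Toy permission mapping table: weighted edges between permission and data nodes."""

    def __init__(self):
        self.edges = {}    # (permission_node, data_node) -> edge weight
        self.related = {}  # data_node -> set of dependent data nodes

    def grant(self, permission_node, data_node, weight):
        self.edges[(permission_node, data_node)] = weight

    def link(self, data_node, dependent_node):
        self.related.setdefault(data_node, set()).add(dependent_node)

    def on_sensitivity_change(self, data_node, delta, threshold, cap, _seen=None):
        """Cap edge weights on data_node and its dependents if delta > threshold."""
        seen = set() if _seen is None else _seen
        if delta <= threshold or data_node in seen:
            return
        seen.add(data_node)  # guards against cycles between data nodes
        for (permission_node, node), weight in list(self.edges.items()):
            if node == data_node:
                self.edges[(permission_node, node)] = min(weight, cap)
        for dependent in self.related.get(data_node, ()):
            self.on_sensitivity_change(dependent, delta, threshold, cap, seen)
```

In a graph database the same update would typically be a single traversal query over the relationship properties; the dictionary version only makes the recursion explicit.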

3. Query processing

When query conditions are received, they are parsed using natural language processing techniques, which cover word segmentation, part-of-speech tagging, named entity recognition and semantic understanding. Take the query statement "query the penalties imposed on highly polluting enterprises in a certain area in the past month" as an example. A word segmentation tool such as jieba splits the statement into words such as "query", "the past month", "a certain area", "highly polluting enterprises" and "penalties". Each word is then tagged with its part of speech, for example "query" (verb), "the past month" (time phrase), "a certain area" (place noun), "highly polluting enterprises" (noun phrase) and "penalties" (noun phrase). A deep-learning-based named entity recognition model identifies the entities in the statement, such as "a certain area" (place entity) and "highly polluting enterprises" (organization entity). Finally, a model based on the Transformer architecture performs overall semantic understanding of the statement to obtain the user's query intention.

The natural language query is converted into a structured query tree using dependency syntax analysis and an intent recognition model, in combination with the semantic association network and the permission mapping table. The optimal data node path is selected based on reinforcement learning to generate the query route. The reward function of the reinforcement learning [formula not recoverable from the extracted text] has three terms: the permission compliance, with a value range of [0, 1]; the data processing delay, in milliseconds; and the predicted value of the conflict probability. Through this reward function, permission compliance, data processing delay and conflict probability are considered together when the optimal data node path is selected.
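To make the trade-off between the three terms concrete, the sketch below scores candidate paths with a reward and picks the highest-scoring one. It is heavily hedged: the source's reward formula is not legible in this text, so a linear combination with hypothetical coefficients alpha, beta and gamma is assumed, and the selection is a plain greedy maximum over precomputed estimates rather than a trained reinforcement learning policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathEstimate:
    """Estimated properties of one candidate data node path (hypothetical type)."""
    name: str
    compliance: float     # permission compliance, in [0, 1]
    latency_ms: float     # data processing delay, in milliseconds
    conflict_prob: float  # predicted conflict probability, in [0, 1]


def reward(path, alpha=1.0, beta=0.001, gamma=1.0):
    """Assumed linear reward: compliance is rewarded, delay and conflict are penalized."""
    return alpha * path.compliance - beta * path.latency_ms - gamma * path.conflict_prob


def best_path(paths):
    """Greedy stand-in for the learned policy: take the path with the highest reward."""
    return max(paths, key=reward)


candidates = [
    PathEstimate("via_env_pool", compliance=1.0, latency_ms=200.0, conflict_prob=0.10),
    PathEstimate("via_cache", compliance=0.6, latency_ms=50.0, conflict_prob=0.05),
    PathEstimate("via_remote_pool", compliance=1.0, latency_ms=900.0, conflict_prob=0.0),
]
```

With these illustrative numbers the fully compliant, moderately fast path wins over both the faster but less compliant path and the slow remote path. In a learned setting the same reward would be used as the training signal, with the coefficients tuned to business priorities.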

(II) query statement optimization

Materialized views and query rewriting are introduced to decompose a complex query into sub-queries that are executed in parallel, thereby optimizing the query statement. The degree of parallelism is calculated by a formula [formula not recoverable from the extracted text] from the following quantities: the total amount of data to be processed, in bytes; Nodecapacity, the upper limit of the amount of data that a data node can carry, in bytes; and Priority, the priority of the query task, with a value range of [1, 10]. For example, for a complex query involving a large amount of data, the resources for parallel execution are allocated according to the data volume, the carrying capacity of the data nodes and the priority of the query task, which improves query efficiency.
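Parallel execution of the sub-queries can be sketched with a thread pool. The degree of parallelism is taken as an input here, because the source's formula for it is not legible in this text; the function name and the stand-in executor in the usage note are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subqueries(subqueries, execute, parallelism):
    """Run each sub-query with execute(), using at most `parallelism` workers.

    Results are returned in the same order as the sub-queries, which makes the
    later merge step deterministic.
    """
    workers = max(1, int(parallelism))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(execute, subqueries))
```

A caller would pass one sub-query per data pool and a function that sends it to that pool, for example `run_subqueries(parts, send_to_pool, parallelism=4)`, where `send_to_pool` is whatever client call the deployment provides. Threads suit this case because the work is mostly waiting on remote data pools rather than local computation.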

(III) data pool selection

The monitoring system obtains in real time the load information of each data pool, such as CPU utilization and memory occupancy, and the storage location information of the data, such as IP addresses and geographic locations. When the query route is generated, a data pool that is closer to the query initiator or has a shorter data transmission path is preferentially selected, and when some data pools are in a high-load state, the query request is directed to a data pool with a lower load. For example, in cross-regional unit data management, if the inquirer is located in area A and the relevant data is stored in data pool B, which is close to area A and lightly loaded, the system preferentially routes the query to data pool B, shortening the transmission time of the data in the network and speeding up the query response. If the CPU utilization and memory occupancy of data pool C are high while data pool D is lightly loaded, the system routes the query to data pool D, avoiding the query delay caused by waiting for the heavily loaded data pool.
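The selection rule can be sketched as a two-stage choice: first exclude pools under high load, then prefer the shortest transmission path. The data class, the single `distance` cost and the 0.8 high-load threshold are assumptions for illustration; the source does not give numeric thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolStatus:
    """Snapshot reported by the monitoring system (hypothetical structure)."""
    name: str
    cpu: float       # CPU utilization, 0..1
    memory: float    # memory occupancy, 0..1
    distance: float  # transmission path cost to the inquirer, e.g. hops or km


def pool_load(pool):
    """Treat the busier of CPU and memory as the pool's load."""
    return max(pool.cpu, pool.memory)


def select_pool(pools, high_load=0.8):
    """Pick the nearest pool that is not under high load.

    If every pool is above the threshold, fall back to considering all of them.
    Ties on distance are broken in favor of the lower load.
    """
    candidates = [p for p in pools if pool_load(p) < high_load] or list(pools)
    return min(candidates, key=lambda p: (p.distance, pool_load(p)))


pools = [
    PoolStatus("B", cpu=0.30, memory=0.40, distance=10.0),
    PoolStatus("C", cpu=0.95, memory=0.90, distance=5.0),
    PoolStatus("D", cpu=0.20, memory=0.20, distance=40.0),
]
```

With these sample values pool C is nearest but overloaded, so the route goes to pool B; if only C and D hold the data, the route goes to the lightly loaded pool D, matching the example in the text.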

4. Data retrieval and fusion processing

According to the generated query route, the corresponding data is retrieved from the data pool of each institution. For example, data on the penalties imposed on highly polluting enterprises in a certain area is obtained from the data pool of the environmental protection department and the data pools of the relevant law enforcement departments, as directed by the query route.

Ontology alignment and data cleaning are used to eliminate the differences between heterogeneous data. Ontology alignment unifies the data at the semantic level by establishing mapping relationships between the ontologies of the different data sources, and data cleaning removes noise, duplicate data and erroneous data. On the basis of evidence theory, the confidences of the multi-source data are fused according to a fusion formula [formula not recoverable from the extracted text] whose terms are the basic probability assignment of the i-th data source and the credibility of that data source.

For example, for environmental protection data of an enterprise, there may be a plurality of data sources such as environmental protection departments, enterprise own monitoring systems, etc., and information of each data source is comprehensively considered through evidence theory, so that accuracy and reliability of data fusion are improved.
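A common way to fuse sources of differing credibility in evidence theory is to discount each source's basic probability assignment by its credibility and then apply Dempster's rule of combination. The sketch below uses that standard construction as a stand-in, since the source's own fusion formula is not legible in this text; the function names and the example masses are hypothetical.

```python
from itertools import product


def discount(mass, credibility, frame):
    """Shafer discounting: scale each mass and move the remainder to the full frame."""
    discounted = {s: credibility * v for s, v in mass.items() if s != frame}
    discounted[frame] = 1.0 - credibility + credibility * mass.get(frame, 0.0)
    return discounted


def combine(m1, m2):
    """Dempster's rule: multiply masses, keep non-empty intersections, renormalize."""
    fused, conflict = {}, 0.0
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            fused[intersection] = fused.get(intersection, 0.0) + va * vb
        else:
            conflict += va * vb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {s: v / (1.0 - conflict) for s, v in fused.items()}


def belief(mass, hypothesis):
    """Bel(hypothesis): total mass of all focal sets contained in the hypothesis."""
    return sum(v for s, v in mass.items() if s <= hypothesis)


COMPLIANT, VIOLATING = frozenset({"compliant"}), frozenset({"violating"})
FRAME = COMPLIANT | VIOLATING

# Illustrative inputs: department data (credibility 0.9), self-monitoring (0.6).
department = discount({COMPLIANT: 0.8, VIOLATING: 0.2}, 0.9, FRAME)
enterprise = discount({COMPLIANT: 0.3, VIOLATING: 0.7}, 0.6, FRAME)
fused = combine(department, enterprise)
```

In this example the more credible department source dominates, so the fused belief in "compliant" is higher than the belief in "violating" even though the enterprise's own monitoring leans the other way. Discounting before combining also keeps Dempster's rule from overreacting when two sources contradict each other.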

(III) conflict resolution

A conflict detection rule base is constructed, and a game theory negotiation model is used to resolve conflicts [formula not recoverable from the extracted text]. In the model, the terms are the unit priority and the utility value of the conflict resolution scheme; Bel denotes the confidence after data fusion, Resolve denotes the conflict resolution function, and Maxmize denotes the maximization function. When a data ownership conflict is handled, the game theory negotiation model finds a solution acceptable to all parties according to the importance of each unit and its degree of need for the data. For example, when the environmental protection department and an enterprise dispute the ownership of certain environmental protection data, the model determines the final ownership and manner of use of the data by considering the priorities of the two parties and the utility values of the different solutions.

(IV) data presentation and logging

The fused data is displayed to the inquirer in real time in the form of intuitive charts, reports and the like, so that it is easy to understand and use. For example, the penalties imposed on highly polluting enterprises in a certain area are displayed as a table containing the enterprise name, the time of the penalty, the reason for the penalty, the result of the penalty and so on. At the same time a query log is recorded, covering the inquirer's identity information, the query time, the query conditions, the data retrieved and the query results. With the query log, once a data security event occurs, the inquirer and operation time involved can be located quickly for investigation and accountability. Analyzing the query log also reveals users' habits and preferences in using the data, so that system functions can be improved and extended in a targeted manner.
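A query log entry with the fields listed above could be serialized as one JSON object per line. This is a sketch only: the field names and the function name are hypothetical, and a real system would write the line to durable, access-controlled storage rather than return it.

```python
import json
from datetime import datetime, timezone


def make_log_entry(inquirer_id, conditions, pools, result_rows, when=None):
    """Build one JSON log line describing a completed query."""
    timestamp = when if when is not None else datetime.now(timezone.utc)
    record = {
        "inquirer_id": inquirer_id,           # who queried
        "query_time": timestamp.isoformat(),  # when, in ISO 8601
        "query_conditions": conditions,       # the original query text
        "data_pools": list(pools),            # which pools were accessed
        "result_rows": result_rows,           # size of the returned result
    }
    # ensure_ascii=False keeps Chinese query text readable in the log.
    return json.dumps(record, ensure_ascii=False, sort_keys=True)
```

Keeping each entry on a single line with sorted keys makes the log easy to search by inquirer or time window during an investigation, and easy to aggregate for the usage analysis described above.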

The foregoing is merely an embodiment of the present application. Specific structures and features that are common knowledge in the art are not described in detail here. A person of ordinary skill in the art knows all the general technical knowledge in the technical field of the invention before the filing date or the priority date, has access to all the prior art in that field, and is able to apply the conventional experimental means available before that date; such a person can therefore, with the teaching given in the present application and in combination with their own abilities, complete and implement this scheme, and some typical known structures or known methods should not become an obstacle to implementing the present application. It should be noted that those skilled in the art can make modifications and improvements without departing from the structure of the present application; these should also be regarded as falling within the protection scope of the present application and do not affect the effect of its implementation or the utility of the patent. The protection scope of the present application is subject to the content of the claims, and the specific embodiments and other descriptions in the specification may be used to interpret the content of the claims.

Claims

1. A cross-unit data management method, characterized in that it includes the following steps:

S01 standardizes the format of data in the data pool of each agency, converts data of different formats into a unified standard format, extracts semantic features for each data item, builds a semantic association network, records the semantic association relationship between different data items through the semantic association network, and implements a dynamic authorization and permission mapping mechanism;

S02 dynamically generates query authorization based on the inquirer's identity information, unit attributes, business requirements, and data sensitivity level, and establishes a permission mapping table to map the inquirer's permissions with data nodes in the semantic association network;

S03 parses the query conditions, when they are received, using natural language processing technology; converts the natural language query into a structured query tree using dependency syntax analysis and an intent recognition model; in combination with the semantic association network and the permission mapping table, selects the optimal data node path based on reinforcement learning to generate the query route; and at the same time introduces materialized views and query rewriting technology to decompose the query into sub-queries for parallel execution so as to optimize the query statement. The reward function of the reinforcement learning is [formula not recoverable from the extracted text], whose terms are the permission compliance, with a value range of [0, 1], the data processing delay, in milliseconds, and the predicted value of the conflict probability. The degree of parallelism is calculated according to a formula [formula not recoverable from the extracted text] in which the total amount of data to be processed is in bytes; Nodecapacity is the upper limit of the amount of data that a data node can carry, in bytes; and Priority is the priority of the query task, with a value range of [1, 10];

S04 retrieves corresponding data from the data pool of each agency unit according to the query route, integrates the retrieved data, and finally displays the integrated data to the inquirer in real time and records the query log.

2. A cross-unit data management method according to claim 1, characterized in that, in said S01, semantic feature extraction uses BERT to vectorize data items, and at the same time combines named entity recognition and relationship extraction technology to build a semantic association network, uses the language understanding ability of the pre-trained language model to convert data items into vector form, determines key entities in the data through named entity recognition, and uses relationship extraction technology to mine semantic relationships between entities, thereby building a semantic association network, and recording the semantic association relationships between different data items through the network;

Based on cosine similarity, a metadata mapping relationship across data pools is established: for data A and data B, the semantic similarity Sim(A, B) [formula not recoverable from the extracted text] is computed from the vectors generated by BERT for A and B and the association strength obtained from relation extraction;

When Sim(A, B) is greater than the set threshold θ, the metadata mapping relationship between data A and data B is established, and an attribute-based access control model is introduced, in whose permission model [formula not recoverable from the extracted text] U is the user, D is the data, and the attribute weights include the sensitivity level and the unit trust level; user permissions are dynamically adjusted through the timestamp, the geographic location and the data update frequency;

Wherein P(U, D) represents the final access permission of user U to data D; the attribute weights include the sensitivity level weight of data D and the trust weight of the unit to which it belongs, with a value range of [0, 1]; and the attribute values related to user U and data D include timestamp-related attribute values, geographic-location-related attribute values and data-update-frequency-related attribute values, with a value range of [0, 1].

3. The cross-unit data management method according to claim 1, characterized in that, in said S02, the permission mapping table uses Neo4j to store the mapping relationships between permission nodes and data nodes, the permission granularity is represented by edge weights, and the complex mapping relationships between permissions and data are represented by the graph database; the mapping algorithm is [formula not recoverable from the extracted text], in which one symbol is the sensitivity change threshold governing the recursive update of the related permissions, P denotes the permissions, and a further symbol denotes the function that recursively updates permissions.

4. A cross-unit data management method according to claim 1, characterized in that, in said S04, corresponding data is retrieved from the data pool of each agency unit, the differences between heterogeneous data are eliminated using ontology alignment and data cleaning technology, the confidences of the multi-source data are fused based on evidence theory, and data fusion is performed according to the fusion formula [formula not recoverable from the extracted text], whose terms are the basic probability assignment of the i-th data source and the credibility of that data source;

A conflict detection rule base is built and a game theory negotiation model is used to resolve conflicts [formula not recoverable from the extracted text], in which the terms are the unit priority and the utility value of the conflict resolution scheme, Bel represents the confidence after data fusion, Resolve represents the conflict resolution function, and Maxmize represents the maximization function.

5. A cross-unit data management method according to claim 1, characterized in that the format standardization processing includes unifying the encoding format of text data in different formats and unifying the measurement unit and precision of numerical data.
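A minimal sketch of the two standardization steps named in this claim is given below, assuming UTF-8 as the unified text encoding and kilograms with three decimal places as the unified unit and precision. These targets, the conversion table, and the function names are illustrative assumptions, not values stated in the claim.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical conversion table: factors from each source unit to kilograms.
TO_KG = {"kg": Decimal("1"), "g": Decimal("0.001"), "t": Decimal("1000")}


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode text from a known source encoding into UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def normalize_mass(value, unit: str, places: int = 3) -> Decimal:
    """Convert a mass to kilograms and round it to a fixed number of places."""
    quantum = Decimal(1).scaleb(-places)  # 10 ** -places, e.g. 0.001
    kilograms = Decimal(str(value)) * TO_KG[unit]
    return kilograms.quantize(quantum, rounding=ROUND_HALF_UP)


gbk_bytes = "企业".encode("gbk")       # text as stored by one unit
utf8_bytes = to_utf8(gbk_bytes, "gbk")  # same text in the unified encoding
discharge = normalize_mass(2500, "g")   # 2500 g expressed in kilograms
```

Using `Decimal` rather than binary floats keeps the declared precision exact, which matters when values from different units are later compared or summed.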

6. A cross-unit data management method according to claim 1, characterized in that, in the process of dynamically generating query authorization in S02, the query frequency, the queried data types, and the query time distribution of the inquirer within a certain time period are collected through a log analysis system; based on these statistics, if the inquirer frequently queries a certain type of data, more query permissions for that type of data are provided when the query authorization is dynamically generated; and if an abnormal query that differs significantly from the historical query pattern occurs, the system automatically subjects the query authorization to strict review or restriction.
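The log-driven adjustment can be sketched as a small decision function. The thresholds, the three outcome labels, and the rule that a query is "abnormal" when neither its data type nor its hour appears in the history are assumptions chosen for illustration; the claim does not define them.

```python
from collections import Counter


def decide_authorization(log, query, frequent_share=0.5, min_history=5):
    """Return 'extended', 'review', or 'default' for a new query.

    log:   list of past (data_type, hour_of_day) records for one inquirer
    query: the (data_type, hour_of_day) of the incoming query
    """
    data_type, hour = query
    if len(log) < min_history:
        return "default"  # too little history to judge either way
    types = Counter(t for t, _ in log)
    hours = Counter(h for _, h in log)
    # Abnormal: neither the data type nor the time of day was seen before.
    if types[data_type] == 0 and hours[hour] == 0:
        return "review"
    # Frequent: this data type makes up a large share of past queries.
    if types[data_type] / len(log) >= frequent_share:
        return "extended"
    return "default"


history = [("tax", 9)] * 4 + [("environment", 10)] * 2
```

With this history, a tax query at 9:00 is granted extended permissions, while a query for an unseen data type at 3:00 is sent for review.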

7. A cross-unit data management method according to claim 1, characterized in that, in S03, changes in the data are monitored in real time; once the sensitivity of data changes, the permission settings of the corresponding data nodes in the permission mapping table are adjusted according to the preset sensitivity level rules; when the inquirer's permission changes, the mapping relationship between the corresponding inquirer and the data node is modified directly in the permission mapping table; and the permission mapping table uses the Neo4j graph database to store the mapping relationship between permission nodes and data nodes, uses edge weights to represent the permission granularity, and uses the characteristics of the graph database to achieve rapid updates.

8. A cross-unit data management method according to claim 1, characterized in that the natural language processing technology includes word segmentation, part-of-speech tagging, named entity recognition and semantic understanding.
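As a toy illustration of the first three steps, the sketch below segments a Chinese query with forward maximum matching and then tags each token from a small lexicon. The lexicon, the tag names, and the choice of forward maximum matching are assumptions; the claim does not name a specific segmentation or recognition algorithm, and a production system would normally rely on a trained NLP library.

```python
# Hypothetical lexicon: word -> part-of-speech or entity tag.
LEXICON = {
    "查询": "verb",   # query
    "贵州": "LOC",    # Guizhou (named entity: location)
    "企业": "noun",   # enterprise
    "排污": "noun",   # pollutant discharge
    "数据": "noun",   # data
}


def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest
    lexicon word, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def tag(tokens, lexicon):
    """Attach the lexicon tag to each token ('unknown' if absent)."""
    return [(t, lexicon.get(t, "unknown")) for t in tokens]


def entities(tagged, entity_tags=("LOC",)):
    """Keep only the tokens whose tag marks a named entity."""
    return [t for t, label in tagged if label in entity_tags]


tokens = segment("查询贵州企业排污数据", LEXICON)
tagged = tag(tokens, LEXICON)
```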

9. A cross-unit data management method according to claim 1, characterized in that, in S03, the load information of each data pool, namely its CPU usage rate and memory occupancy rate, as well as the storage location information of the data, are obtained in real time through a monitoring system; when generating a query route, a data pool that is closer to the query initiator or has a shorter data transmission path is preferentially selected; and when a data pool is in a high-load state, the query request is directed to a data pool with a lower load.
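The routing preference in this claim can be sketched as a simple selection rule. The high-load threshold of 0.8, the use of hop count as the measure of transmission path length, and the definition of load as the larger of CPU and memory usage are assumptions for illustration; the `Pool` type and `choose_pool` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    cpu: float     # CPU usage rate in [0, 1]
    memory: float  # memory occupancy rate in [0, 1]
    hops: int      # transmission path length to the query initiator

    @property
    def load(self) -> float:
        # Treat the busier of the two resources as the pool's load.
        return max(self.cpu, self.memory)


def choose_pool(pools, high_load=0.8):
    """Prefer the nearest pool that is not in a high-load state;
    if every pool is overloaded, fall back to the least loaded one."""
    if not pools:
        raise ValueError("no data pool holds the requested data")
    healthy = [p for p in pools if p.load < high_load]
    if healthy:
        return min(healthy, key=lambda p: (p.hops, p.load))
    return min(pools, key=lambda p: p.load)


candidates = [
    Pool("near", cpu=0.9, memory=0.5, hops=1),  # closest but overloaded
    Pool("far", cpu=0.3, memory=0.4, hops=3),
    Pool("mid", cpu=0.5, memory=0.6, hops=2),
]
selected = choose_pool(candidates)
```

In this example the nearest pool is skipped because its CPU usage exceeds the threshold, so the query is routed to "mid", the closest pool that is not overloaded.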