
CN119989418A - Cross-unit data management method - Google Patents

Publication number
CN119989418A
Authority
CN
China
Prior art keywords
data
query
unit
semantic
permission
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510468954.3A
Other languages
Chinese (zh)
Inventor
谭海东
吴如富
张玉波
刘鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Huizhi Electronic Technology Co ltd
Original Assignee
Guizhou Huizhi Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Huizhi Electronic Technology Co ltd
Priority to CN202510468954.3A
Publication of CN119989418A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data processing, and in particular to a cross-unit data management method. First, the data in each institutional unit's data pool is standardized into a unified format, semantic features are extracted to construct a semantic association network that records the semantic relations between data items, and a dynamic authorization and permission mapping mechanism is adopted. Next, query authorization is generated dynamically, and a permission mapping table is established, according to the querier's identity information, unit attributes, business requirements, and the data sensitivity level. When a query condition is received, it is parsed with natural language processing, and a query route is generated and the query statement optimized using the semantic association network and the permission mapping table. Finally, the data is retrieved from each data pool along the route, fused, and displayed to the querier in real time, and a query log is recorded.

Description

Cross-unit data management method
Technical Field
The invention belongs to the technical field of data processing, and relates to a cross-unit data management method.
Background
As government informatization advances and the demand for data sharing grows, collaborative data management across institutional units has become an important means of optimizing public services and improving governance efficiency. At present, institutions (for example, environmental protection, business, and tax departments) commonly maintain independent data pools that store large amounts of structured, semi-structured, and unstructured data. However, cross-unit data integration and querying face significant challenges because of non-uniform data formats, differing semantic definitions, and decentralized permission management. For example, the environmental protection department's "enterprise pollutant discharge data" and the business department's "enterprise registration information" may use different field names and unit standards, making the data difficult to associate directly.
At present, data formats differ between units, for example in text encoding and numerical precision, so format conversion consumes a great deal of manual effort. Data semantics are isolated and lack business-logic association. Permission management is static: for example, when a unit's data becomes sensitive because of a policy adjustment, the system cannot automatically tighten the related query permissions, which creates a risk of data leakage. The mapping between permissions and data is also coarse, making fine-grained control difficult. Existing query methods rely on preset routing policies and cannot dynamically optimize paths according to data pool load, semantic association, and similar factors; for example, a complex query must traverse the data pools of several units, so response times are long. These methods also lack natural language processing support and require users to write structured query statements, which raises the barrier to use and invites errors.
Several existing patents address these problems. CN119377582A, "A management method and system for multi-terminal engineering internal data", constructs an engineering information data set from data fused by spatio-temporal features, achieving physical integration of multi-terminal data and improving the efficiency of engineering data management; however, it considers only spatio-temporal features, builds no semantic association between data items, and cannot support queries involving complex business logic. CN119378532A, "A general data management method and system based on identification analysis technology", uses identification analysis to standardize data formats and aggregate industry data, solving the problem of heterogeneous data formats and supporting cross-system data exchange; however, it relies on predefined identification rules, cannot dynamically extend semantic associations, and its permission management is still based on fixed roles. CN119396848A, "A hybrid analysis middleware and query method based on cross-type databases", provides unified access to multiple database types through middleware and supports cross-database queries, reducing the complexity of application development; however, its routing policy is preset and static, so query paths cannot be optimized according to data semantics and real-time state. CN118838968A, "A data processing method and system", synchronizes data through message middleware, supports data integration for distributed clusters, improves synchronization efficiency, and supports high-concurrency scenarios; however, it does not resolve semantic conflicts and cannot meet the fine-grained control requirements of sensitive data.
The existing methods have made some progress in data integration, permission management, and query efficiency, but they share common shortcomings: they lack business-logic association analysis between data items and cannot support compound semantic queries; their authorization mechanisms and query routing policies depend on manual configuration and cannot respond in real time to data changes and scenario requirements; and their conflict resolution capability is weak, offering no effective scheme for resolving cross-unit semantic conflicts, so data consistency is difficult to guarantee.
Disclosure of Invention
The invention provides a cross-unit data management method, which aims to solve the problems of data semantic isolation, static authority management, low query efficiency and insufficient cross-unit data conflict resolution capability in the existing cross-unit data management.
In order to solve the problems, the invention adopts the following technical scheme:
A method of cross-unit data management, comprising the steps of:
S01, carrying out format standardization processing on data in each institution unit data pool, converting the data in different formats into a unified standard format, extracting semantic features for each data item, constructing a semantic association network, and recording semantic association relations among different data items through the semantic association network, wherein a dynamic authorization and permission mapping mechanism is adopted;
S02, dynamically generating query authorization based on identity information, unit attributes, service requirements and data sensitivity level of the inquirer, and simultaneously establishing a permission mapping table to map permissions of the inquirer with data nodes in a semantic association network;
S03, when a query condition is received, parsing the query condition by using natural language processing technology: converting the natural language query into a structured query tree by using dependency syntax analysis and an intent recognition model; combining the semantic association network and the permission mapping table, selecting an optimal data node path based on reinforcement learning to generate a query route; and introducing materialized views and query rewriting, decomposing a complex query into sub-queries that are executed in parallel, to optimize the query statement; wherein the reward function of the reinforcement learning is defined over three quantities: the permission compliance, with value range [0,1]; the data processing delay, in milliseconds; and the predicted conflict probability; and wherein the degree of parallelism is calculated from the total amount of data to be processed, in bytes; NodeCapacity, the upper limit of the amount of data that a data node can bear, in bytes; and Priority, the priority of the query task, with value range [1,10];
S04, retrieving the corresponding data from the data pool of each institutional unit according to the query route, performing fusion processing on the retrieved data, and finally displaying the fused data to the querier in real time and recording a query log.
The principle and the advantages of the scheme are as follows:
In the data preprocessing stage, the data in each institutional unit's data pool is format-standardized, semantic features are extracted, and a semantic association network is constructed to record the semantic relations between data items. Query authorization is generated dynamically from several kinds of information about the querier, and a permission mapping table associates permissions with nodes of the semantic network, ensuring precise control of the querier's permissions. When a query condition is received, it is parsed with natural language processing, a query route is generated from the semantic association network and the permission mapping table, and the query statement is optimized so that the required data can be located efficiently. Finally, the data is retrieved along the query route, fused, and displayed to the querier, and a query log is recorded for subsequent system optimization.
Compared with the prior art, the scheme has clear inventiveness and advantages. In data integration, the prior art focuses on unifying data formats, whereas this scheme not only standardizes formats but also mines the semantic information of the data to construct a semantic association network. For example, when integrating data from the environmental protection and meteorological departments, the environmental protection department's "pollutant emission data" and the meteorological department's "air quality data" can be logically connected through the semantic association network, so that originally isolated data forms an organic whole that supports complex data analysis and decision-making. In permission management, the prior art generally uses static authorization and cannot adapt to dynamic change. This scheme generates authorization dynamically from the querier's identity, business requirements, and the data sensitivity level, achieving fine-grained permission control. If a querier needs temporary access to sensitive data during a specific period for work, the system can adjust the authorization according to that business requirement, meeting the need while keeping the data secure. The querier's identity is first established through identity authentication, which can be implemented in various ways, such as username/password combinations, digital certificates, or biometric technologies such as fingerprint and face recognition. Authentication determines the querier's basic identity information, such as name, department, and position. When the querier submits a query request, the system analyzes the business requirement behind the request, which may involve natural language processing and semantic understanding of the query condition.
For example, from the query "obtain last month's sales department performance data for a market analysis report", the system can identify that the business requirement is market analysis, the data involved is sales department performance data, and the time range is last month. The system can then determine the data resources and operation types to be accessed according to predefined business rules and data associations. Dynamic authorization can also set a validity period for a permission. For example, when a querier needs access to certain sensitive data for a temporary task, the system grants access only for the duration of the task, for example one week. Once the validity period has passed, the permission lapses automatically, preventing abuse. In query processing, the prior art has limited query routing and statement optimization capability, whereas this scheme parses the user's natural language query condition directly, generates an optimal query route from the semantic association network and the permission mapping table, and optimizes the query statement. For example, when the user enters "query the penalties imposed on heavily polluting enterprises in a given region in a given month", the system can quickly and accurately retrieve the relevant data from several institutional data pools, improving query efficiency and accuracy. In addition, the data fusion processing uses the semantic association network, which helps resolve data conflicts and redundancy so that the data shown to the querier is accurate, complete, and consistent. Recording and analyzing the query log also helps to continuously optimize system performance and the permission management policy, further improving the quality and efficiency of cross-unit data management.
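The time-limited authorization described above can be sketched as follows. This is a minimal illustration only: the `Grant` class, the `issue_grant` helper, and the user and resource names are hypothetical and do not come from the patent text, and the one-week default simply mirrors the example in the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Grant:
    """A temporary permission; all field names are illustrative assumptions."""
    user: str
    resource: str
    operation: str        # e.g. "read"
    expires_at: datetime  # the grant lapses automatically after this instant

    def allows(self, user: str, resource: str, operation: str, now: datetime) -> bool:
        # Access is permitted only for the exact user/resource/operation
        # and only while the validity period has not passed.
        return (
            self.user == user
            and self.resource == resource
            and self.operation == operation
            and now < self.expires_at
        )


def issue_grant(user, resource, operation, start, valid_for=timedelta(weeks=1)):
    """Create a grant that is valid for `valid_for` starting at `start`."""
    return Grant(user, resource, operation, start + valid_for)


start = datetime(2025, 1, 6, 9, 0)
grant = issue_grant("analyst_01", "sales_performance", "read", start)
during_task = grant.allows("analyst_01", "sales_performance", "read", start + timedelta(days=3))
after_task = grant.allows("analyst_01", "sales_performance", "read", start + timedelta(days=8))
print(during_task, after_task)
```

Because expiry is checked at access time, no separate revocation job is needed in this sketch; a production system would likely also persist and audit the grants.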
Further, in S01, semantic feature extraction uses BERT to vectorize the data items and combines named entity recognition and relation extraction to construct the semantic association network: the language understanding capability of the pre-trained language model converts each data item into vector form, named entity recognition determines the key entities in the data, and relation extraction mines the semantic relations between the entities, so that a semantic association network is constructed that records the semantic relations between different data items;
a metadata mapping relation across data pools is established based on cosine similarity: for data A and data B, a semantic similarity Sim(A, B) is calculated from the vectors that BERT generates for A and B together with the association strength obtained by relation extraction, and when Sim(A, B) is greater than a set threshold θ, a metadata mapping relation is established between data A and data B;
an attribute-based access control model is introduced, in which user permissions are dynamically adjusted according to timestamp, geographic location, and data update frequency; in the permission model, U is a user, D is data, and P(U, D) represents the final access permission of user U to data D; the attribute weights comprise a sensitivity-level weight of data D and a trust-level weight of the unit to which data D belongs, each with value range [0,1]; and the attribute values relating user U and data D comprise a timestamp-related attribute value, a geographic-location-related attribute value, and a data-update-frequency-related attribute value, each with value range [0,1].
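As a rough illustration of the similarity test used to establish metadata mappings, the sketch below computes the cosine similarity of two vectors and compares a combined score with the threshold θ. The text does not legibly state how the cosine term and the relation-extraction association strength are combined, so multiplying them is purely an assumption made here; the function names and the default threshold of 0.8 are also hypothetical. The short lists stand in for BERT vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction, treat as unrelated
    return dot / (norm_a * norm_b)


def should_map(vec_a, vec_b, relation_strength=1.0, theta=0.8):
    """Return (decision, score).

    ASSUMPTION: the score is cosine similarity scaled by the association
    strength; the source does not show the actual combination.
    """
    score = cosine_similarity(vec_a, vec_b) * relation_strength
    return score > theta, score


# Toy vectors standing in for BERT embeddings of two data items.
mapped, score = should_map([3.0, 4.0], [3.0, 4.0], relation_strength=0.9)
unmapped, low_score = should_map([1.0, 0.0], [0.0, 1.0])
print(mapped, unmapped)
```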
BERT, as a pre-trained language model trained on a large corpus, has strong language understanding capability and can capture deep semantic information in data items; once data items are converted to vector form, their semantic content is represented more accurately. Named entity recognition identifies the key entities in the data, such as person names, place names, and organization names, and relation extraction mines the semantic relations between them. By constructing the semantic association network, the semantic relations between different data items are recorded explicitly, so the context and associations of data can be better understood in cross-unit data management. For example, in business data involving several units, the data interaction relations between the units can be found through the semantic association network, providing more comprehensive information for data analysis and decision-making. Because the semantic similarity considers both the similarity between vectors and the association strength between entities, it avoids the mismatches that can occur with traditional keyword-based queries. For example, a query for "the financial statements of a company" finds not only data containing the keywords "company" and "financial statement" but also, through the semantic association network, other data related to the company's finances, such as financial analysis reports. The permission model considers multiple attributes, such as sensitivity level and unit trust level, so permissions can be assigned individually according to the characteristics of different users and data.
Different data may have different sensitivity levels, and the units to which users belong may have different trust levels; by weighting these attributes, the most appropriate permission can be assigned to each combination of user and data. For example, data with a high sensitivity level can be accessed only by users with the corresponding permission from high-trust units. User permissions can also be adjusted in real time as factors such as time, geographic location, and data update frequency change. For example, the system may automatically raise the operation frequency limit of users who hold data update permission as the data update frequency increases, and may automatically restrict a user's access to certain data when the user leaves a particular geographic location. This real-time adjustment mechanism adapts to a dynamically changing business environment and ensures both the security and the usability of the data.
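The attribute-weighted permission model can be illustrated with a small sketch. The exact form of P(U, D) is not legible in the text, so the code assumes a normalized weighted sum of attribute values, with one weight per attribute; that form, the attribute names, and the function name are assumptions made only for illustration. Both weights and attribute values are restricted to [0, 1], as the description states.

```python
def permission_score(weights, attributes):
    """Weighted average of attribute values, all in [0, 1].

    ASSUMPTION: P(U, D) is modeled as sum(w_i * a_i) / sum(w_i); the source
    only states that weights and attribute values lie in [0, 1].
    """
    for name, value in list(weights.items()) + list(attributes.items()):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    total_weight = sum(weights.values())
    if total_weight == 0.0:
        return 0.0
    return sum(weights[k] * attributes[k] for k in weights) / total_weight


# Hypothetical attribute names: each value says how favourable that
# attribute currently is for granting access (1.0 = fully favourable).
weights = {"timestamp": 0.5, "location": 0.25, "update_frequency": 0.25}
on_site = {"timestamp": 1.0, "location": 1.0, "update_frequency": 1.0}
off_site = {"timestamp": 1.0, "location": 0.0, "update_frequency": 1.0}

print(permission_score(weights, on_site), permission_score(weights, off_site))
```

In this sketch, leaving the permitted location lowers the score from 1.0 to 0.75; a caller would compare the score with a policy threshold to decide whether access is still allowed.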
In S02, the permission mapping table uses Neo4j to store the mapping relation between permission nodes and data nodes, so that the mapping between permissions and data is represented in a graph database, with edge weights representing the permission granularity. In the mapping algorithm, P denotes a permission, and when the change in sensitivity reaches the sensitivity change threshold, the associated permissions are recursively updated by a recursive permission update function.
In cross-unit data management, the mapping between permissions and data is often complicated: different users, roles, data resources, and permissions combine to form a complex network structure. Storing the mapping relation between permission nodes and data nodes in Neo4j allows the relation to be shown intuitively as a graph, with permission nodes and data nodes as vertices and the edges between them representing the mapping relation; the edge weights clearly represent the permission granularity, such as the different operation permissions of reading, writing, and modifying. This visual representation helps administrators quickly understand and manage the permission hierarchy. In cross-unit data management the permission system may change over time. When new permissions or data nodes must be added or mapping relations modified, storing the permission mapping in Neo4j means that only the corresponding vertices and edges need to be added or modified in the graph, which is simple and has little effect on the existing data structure. For example, when a new business department is added, a new permission node can be created for it and mapped to the relevant data nodes without large-scale adjustment of the overall database structure.
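The graph-based permission mapping and its threshold-triggered update can be sketched with a small in-memory structure. This is a stand-in for a graph database, not Neo4j code: the `PermissionGraph` class, its method names, the default threshold of 0.2, and the halving of edge weights are all hypothetical choices, and a breadth-first traversal is used here in place of the recursive update function named in the text.

```python
from collections import deque


class PermissionGraph:
    """In-memory stand-in for a graph store; every name here is hypothetical."""

    def __init__(self):
        self.edge_weight = {}  # (permission_node, data_node) -> weight in [0, 1]
        self.related = {}      # data_node -> set of semantically related data nodes

    def grant(self, permission_node, data_node, weight):
        self.edge_weight[(permission_node, data_node)] = weight
        self.related.setdefault(data_node, set())

    def relate(self, a, b):
        self.related.setdefault(a, set()).add(b)
        self.related.setdefault(b, set()).add(a)

    def on_sensitivity_change(self, data_node, delta, threshold=0.2, factor=0.5):
        """Tighten permissions on data_node and every node reachable from it.

        ASSUMPTION: tightening means multiplying edge weights by `factor`
        once the sensitivity change `delta` exceeds `threshold`.
        """
        if delta <= threshold:
            return set()
        seen, queue = {data_node}, deque([data_node])
        while queue:
            for neighbour in self.related.get(queue.popleft(), ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        for (perm, node), weight in list(self.edge_weight.items()):
            if node in seen:
                self.edge_weight[(perm, node)] = weight * factor
        return seen


graph = PermissionGraph()
graph.grant("analyst", "discharge_data", 1.0)
graph.grant("analyst", "penalty_records", 0.8)
graph.grant("analyst", "weather_data", 0.6)
graph.relate("discharge_data", "penalty_records")
touched = graph.on_sensitivity_change("discharge_data", delta=0.5)
print(sorted(touched))
```

Only the changed node and its related node are tightened; the unrelated `weather_data` edge keeps its weight. A real deployment would probably bound the traversal depth so that one change does not ripple across the whole network.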
Further, in S04, the corresponding data is retrieved from the data pool of each institutional unit, differences between heterogeneous data are eliminated using ontology alignment and data cleaning, and the confidence of the multi-source data is fused based on evidence theory, where the fusion formula takes as inputs the basic probability assignment of the i-th data source and the reliability of that data source, and Bel denotes the confidence after data fusion;
a conflict detection rule base is constructed, and a game-theoretic negotiation model is used to resolve conflicts, where the model takes as inputs the priority of each unit and the utility value of the conflict resolution, Resolution denotes the conflict resolution function, and Maximize denotes the maximizing function.
In cross-unit data management, the same data may have several data sources whose reliability and accuracy differ. Evidence theory combines the information from several data sources, using each source's basic probability assignment and reliability to calculate the confidence of the data.
This makes full use of the complementarity of multi-source data and improves the accuracy and reliability of data fusion. For example, different data sources may give different estimates of the probability of an event, and fusing them with evidence theory yields a more reasonable combined estimate. In cross-unit data management, different units may also have different interests and priorities regarding the data. The game-theoretic negotiation model treats these units as players in a game and considers the unit priorities and the utility values of conflict resolutions to find the optimal conflict resolution. This takes the interests of all parties into account, making the conflict resolution result fairer and more reasonable and increasing each party's acceptance of the data fusion result. For example, when handling a conflict over data ownership, the model can find a solution acceptable to all parties according to each unit's importance and its degree of need for the data. The model is also dynamic: the conflict resolution can be adjusted continuously as circumstances change, and when a unit's priority changes or a new interest relation appears, the model recalculates the optimal solution so that conflicts continue to be resolved effectively. The conflict resolution mechanism can therefore adapt to changing business environments and data conditions.
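The evidence-theory fusion described above can be illustrated with Dempster's rule of combination, preceded by reliability discounting of each source. The patent's own fusion formula is not legible in the text, so this is a sketch of the standard textbook rule under that assumption, not the patent's formula; the hypothesis names and the source values are invented for the example.

```python
from itertools import product


def discount(mass, reliability, frame):
    """Shafer discounting: move (1 - reliability) of the mass to the full frame."""
    out = {s: reliability * v for s, v in mass.items() if s != frame}
    out[frame] = 1.0 - reliability + reliability * mass.get(frame, 0.0)
    return out


def combine(m1, m2):
    """Dempster's rule: returns (fused mass function, conflict K)."""
    fused, conflict = {}, 0.0
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + va * vb
        else:
            conflict += va * vb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict and cannot be combined")
    return {s: v / (1.0 - conflict) for s, v in fused.items()}, conflict


def belief(mass, hypothesis):
    """Bel(H): total mass of all focal sets contained in H."""
    return sum(v for s, v in mass.items() if s <= hypothesis)


# Hypothetical two-hypothesis frame for one enterprise's emission status.
EXCEEDS = frozenset({"exceeds_limit"})
WITHIN = frozenset({"within_limit"})
FRAME = EXCEEDS | WITHIN

source_1 = discount({EXCEEDS: 0.9, WITHIN: 0.1}, reliability=1.0, frame=FRAME)
source_2 = discount({EXCEEDS: 0.5, WITHIN: 0.5}, reliability=1.0, frame=FRAME)
fused, conflict = combine(source_1, source_2)
print(round(belief(fused, EXCEEDS), 3), round(conflict, 3))
```

Lowering a source's `reliability` shifts its mass toward the whole frame, so an unreliable source influences the fused belief less.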
Further, the format standardization process includes unifying the encoding format of text data in different formats and unifying the measurement units and precision of numerical data. The data of the institutional units comes from a wide range of sources: text data may use various encodings, such as UTF-8 and GBK, and the measurement units and precision of numerical data may differ greatly. Unifying the encoding format, units, and precision eliminates these format differences so that data from different units can be integrated smoothly. For example, when integrating data from the environmental protection department and the meteorological department, the environmental protection department's monitoring report text may be GBK-encoded while the meteorological department's data text is UTF-8-encoded; after the encoding is unified, the two departments' data can be merged in one system to build a comprehensive environmental information database. Unified encoding formats, measurement units, and precision also lower the barrier to data sharing, making each unit's data easier for other units to understand and use. For example, on an information sharing platform between government departments, data in a uniform format can be called and analyzed directly by other departments, improving the efficiency and effectiveness of data sharing.
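A minimal sketch of the two standardization steps, assuming UTF-8 as the target encoding and micrograms per cubic metre at two decimal places as the target unit and precision. The target choices, the `UNIT_FACTORS` table, and the function names are illustrative assumptions; the patent does not specify them.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical conversion factors into the assumed target unit (ug/m3).
UNIT_FACTORS = {"ug/m3": Decimal("1"), "mg/m3": Decimal("1000")}


def normalize_text(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode text from its declared source encoding into UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def normalize_value(value: str, unit: str, places: str = "0.01") -> Decimal:
    """Convert a reading into the target unit and round to a fixed precision."""
    scaled = Decimal(value) * UNIT_FACTORS[unit]
    return scaled.quantize(Decimal(places), rounding=ROUND_HALF_UP)


gbk_report = "监测报告".encode("gbk")        # text as one department might store it
utf8_report = normalize_text(gbk_report, "gbk")
reading_a = normalize_value("0.0354", "mg/m3")  # reported in mg/m3
reading_b = normalize_value("35.404", "ug/m3")  # reported in ug/m3, extra digits
print(utf8_report.decode("utf-8"), reading_a, reading_b)
```

`Decimal` is used instead of `float` so that rounding to the shared precision is exact and reproducible across units.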
Further, in the process of dynamically generating the query authorization, a log analysis system collects statistics on the querier's query frequency, queried data types, and query time distribution over a given period. Based on these statistics, if the querier frequently queries a certain kind of data, the dynamically generated authorization grants the querier more query permissions for that data; if an abnormal query appears that differs greatly from the historical query pattern, the system automatically subjects the query authorization to strict review or restriction. Analyzing the querier's recent query behavior and historical query records gives a deeper understanding of the querier's actual needs. For example, for a querier who has long followed the environmental protection department's atmospheric pollution data, the system can provide more query permissions related to atmospheric pollution, including detailed data for specific areas and periods, so that the authorization better matches the querier's actual work. A querier's historical query behavior usually shows a certain regularity, so abnormal behavior that does not match the historical pattern may indicate a data security risk. For example, if a querier who has only ever queried public data suddenly and frequently requests level-one sensitive data, the system can detect the anomaly by comparison with the querier's history and strictly review or restrict the query authorization, effectively preventing security problems such as data leakage and illegal access.
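One simple way to compare recent behavior with the historical pattern is to measure how far the distribution of queried data types has shifted. The sketch below uses the total variation distance between the two distributions; this metric, the 0.5 threshold, the decision labels, and the category names are assumptions for illustration, since the patent does not name a specific anomaly test.

```python
from collections import Counter


def category_shares(queries):
    """Fraction of queries that fall into each data category."""
    total = len(queries)
    if total == 0:
        return {}
    return {cat: n / total for cat, n in Counter(queries).items()}


def review_authorization(history, recent, max_shift=0.5):
    """Return (decision, shift), where shift is the total variation distance.

    ASSUMPTION: a large shift in the mix of queried categories is treated
    as abnormal and leads to "restrict"; otherwise access may be extended.
    """
    h, r = category_shares(history), category_shares(recent)
    shift = 0.5 * sum(abs(h.get(c, 0.0) - r.get(c, 0.0)) for c in set(h) | set(r))
    return ("restrict" if shift > max_shift else "extend", shift)


steady = review_authorization(["air"] * 18 + ["water"] * 2, ["air"] * 9 + ["water"])
sudden = review_authorization(["public"] * 20, ["sensitive_level_1"] * 5)
print(steady[0], sudden[0])
```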
Further, in step S03, changes to the data are monitored in real time. Once the sensitivity of data changes, the permission settings of the corresponding data node in the permission mapping table are adjusted according to preset sensitivity-level rules; when a querier's permissions change, the mapping relation between that querier and the data nodes is modified directly in the permission mapping table. The permission mapping table stores the mapping relation between permission nodes and data nodes in a Neo4j graph database, with edge weights representing the permission granularity, and relies on the characteristics of the graph database to update quickly. When a data update turns previously public data into sensitive data, updating the permission mapping table in real time restricts queriers' access promptly. For example, in the medical industry, part of a patient's examination data may originally be open to specific hospital departments; if the data is later found to contain new sensitive information, updating the permission mapping table in real time prevents unauthorized persons from continuing to access it, effectively preventing data leakage. If the permission mapping table cannot be updated in real time, a querier's recorded permissions may lag behind the actual situation. The querier may then be blocked for insufficient permissions when access to certain data is needed, or may still be able to access data after the permission has been withdrawn, which affects working efficiency. Real-time updating removes this lag, so that queriers can obtain the data they need and the workflow runs more smoothly.
Further, the natural language processing technology includes word segmentation, part-of-speech tagging, named entity recognition, and semantic understanding. Through semantic understanding of the query statement, the system obtains more context information and a clearer picture of the user's intent, and can therefore make more intelligent decisions. For example, when a user queries "the environmental compliance of enterprises in a certain industry in this city", the system can not only provide the relevant compliance data but also analyze it, for example by assessing the environmental protection situation of the whole industry or comparing it with other industries, giving the user more valuable decision support.
In step S03, a monitoring system obtains, in real time, the load information (CPU utilization and memory occupancy) of each data pool and the storage location of the data. When a query route is generated, a data pool that is closer to the query initiator or has a shorter data transmission path is preferred, and when a data pool is under high load, the query request is directed to a data pool with lower load. The storage location of data affects the distance and time of data transmission, so taking it into account when generating the query route allows a nearer data pool, or one with a shorter transmission path, to be preferred. For example, in cross-regional data management, if the querier is located in area A and the relevant data is stored in data pool B, which is close to area A, the generated query route is directed preferentially to data pool B, shortening network transmission time and speeding up the query response. When a data pool is under high load, its query processing speed drops significantly, so the system takes data pool load into account and directs query requests to a data pool with lower load. For example, if at some moment the CPU utilization and memory occupancy of data pool C are high while data pool D is lightly loaded, the system directs the query route to data pool D, avoiding the delay of waiting for the heavily loaded pool.
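The load- and location-aware choice of data pool can be sketched as a two-stage selection: first exclude pools in a high-load state, then prefer the shortest transmission path. The `PoolStatus` fields, the 0.8 high-load cut-off, and the fallback when every pool is busy are assumptions for illustration; the patent's reinforcement-learning route selection is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class PoolStatus:
    """Monitoring snapshot of one data pool; field names are hypothetical."""
    name: str
    cpu: float       # CPU utilization, 0.0 to 1.0
    memory: float    # memory occupancy, 0.0 to 1.0
    distance: float  # transmission cost to the querier, e.g. network hops


def choose_pool(pools, high_load=0.8):
    """Pick the nearest pool that is not in a high-load state.

    If every pool is heavily loaded, fall back to considering all of them.
    Ties on distance are broken by the lower peak load.
    """
    def load(p):
        return max(p.cpu, p.memory)

    candidates = [p for p in pools if load(p) < high_load] or list(pools)
    return min(candidates, key=lambda p: (p.distance, load(p)))


pools = [
    PoolStatus("B", cpu=0.30, memory=0.40, distance=2.0),
    PoolStatus("C", cpu=0.95, memory=0.90, distance=1.0),  # nearest but overloaded
    PoolStatus("D", cpu=0.10, memory=0.20, distance=5.0),
]
chosen = choose_pool(pools)
print(chosen.name)
```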
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
Embodiment 1, as shown in fig. 1, a cross-unit data management method includes the following steps:
S01, carrying out format standardization processing on data in each institution unit data pool, converting the data in different formats into a unified standard format, extracting semantic features for each data item, constructing a semantic association network, and recording semantic association relations among different data items through the semantic association network, wherein a dynamic authorization and permission mapping mechanism is adopted;
S02, dynamically generating query authorization based on identity information, unit attributes, service requirements and data sensitivity level of the inquirer, and simultaneously establishing a permission mapping table to map permissions of the inquirer with data nodes in a semantic association network;
S03, when a query condition is received, parsing the query condition using natural language processing; converting the natural language query into a structured query tree using dependency syntax analysis and an intent recognition model; combining the semantic association network and the permission mapping table to select an optimal data node path based on reinforcement learning, thereby generating a query route; and introducing materialized views and query rewriting to decompose a complex query into sub-queries that are executed in parallel so as to optimize the query statement. The reward function of the reinforcement learning [formula not reproduced] is defined over three quantities: the permission compliance, with value range [0, 1]; the data processing delay, in milliseconds; and the predicted conflict probability. The parallelism is calculated by a formula [formula not reproduced] from the total amount of data to be processed, in bytes; nodecapacity, the upper limit of the data volume that a data node can bear, in bytes; and Priority, the priority of the query task, with value range [1, 10];
S04, retrieving the corresponding data from the data pools of each institution unit according to the query route, performing fusion processing on the retrieved data, and finally displaying the fused data to the querier in real time and recording a query log.
In the data preprocessing stage, the data in each institution unit data pool is format-standardized, semantic features are extracted, and a semantic association network is constructed to record the semantic association relations among data items. Query authorization is then dynamically generated from multiple aspects of the querier's information, and a permission mapping table is established that associates permissions with nodes of the semantic network, ensuring accurate control of the querier's permissions. When a query condition is received, it is parsed using natural language processing, a query route is generated by combining the semantic association network and the permission mapping table, and the query statement is optimized so that the required data can be located efficiently. Finally, the data is retrieved according to the query route, fused, and displayed to the querier, and a query log is recorded for subsequent system optimization.
In terms of data integration, the prior art focuses on unifying data formats, whereas this scheme not only standardizes formats but also mines the semantic information of the data to construct a semantic association network. For example, when integrating data from environmental protection and meteorological departments, the "pollutant emission data" of the environmental protection department and the "air quality data" of the meteorological department can be logically linked through the semantic association network, so that originally isolated data forms an organic whole and supports complex data analysis and decision-making. In terms of permission management, the prior art generally adopts static authorization and cannot adapt to dynamic changes. This scheme dynamically generates authorization based on the querier's identity, business requirement, and data sensitivity level, achieving fine-grained permission control. If a querier needs temporary access to sensitive data during a specific period for work reasons, the system can dynamically adjust the authorization according to that business requirement, meeting the need while keeping the data secure. Identity authentication can be implemented in various ways, such as username/password combinations, digital certificates, or biometric technologies such as fingerprint recognition and face recognition. Authentication establishes the querier's basic identity information, such as name, department, and position. When the querier submits a query request, the system analyzes the business requirement behind the request, which may involve natural language processing and semantic understanding of the query conditions.
For example, from the query "acquire last month's sales department performance data for a market analysis report", the system can identify that the business requirement is market analysis, that the data involved is sales department performance data, and that the time range is the last month. The system can then determine the data resources and operation types to be accessed according to predefined business rules and data associations. Dynamic authorization can also set a validity period for a permission. For example, when a querier needs access to certain sensitive data for a temporary task, the system grants access only for the duration of the task, for example one week; once the validity period has passed, the permission lapses automatically, preventing abuse. In terms of query processing, the query routing and statement optimization capabilities of the prior art are limited, whereas this scheme parses the user's natural language query conditions directly using natural language processing, generates an optimal query route by combining the semantic association network and the permission mapping table, and intelligently optimizes the query statement. For example, when a user enters "query the penalties imposed on high-pollution enterprises in a certain region over the past month", the system can quickly and accurately retrieve the relevant data from multiple institution data pools, improving query efficiency and accuracy. In addition, the data fusion processing of this scheme draws on the semantic association network, which effectively resolves data conflicts and redundancy and ensures that the data displayed to the querier is accurate, complete, and consistent. Recording and analyzing query logs also helps to continuously optimize system performance and the permission management strategy, further improving the quality and efficiency of cross-unit data management.
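The time-limited authorization described above can be sketched as follows. This is a minimal illustration only: the `Grant` class, its field names, and the `issue_grant` helper are hypothetical and do not come from the method itself; the one-week validity period is the example value given in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Grant:
    """A temporary authorization (hypothetical structure for illustration)."""
    user: str
    resource: str
    operation: str
    expires_at: datetime

    def is_valid(self, now: datetime) -> bool:
        # The grant lapses automatically once the validity period has passed.
        return now < self.expires_at


def issue_grant(user, resource, operation, start, valid_for=timedelta(weeks=1)):
    """Issue a grant that is valid for `valid_for`, counted from `start`."""
    return Grant(user, resource, operation, start + valid_for)


start = datetime(2025, 1, 6, 9, 0)
grant = issue_grant("analyst_01", "sales_performance", "read", start)
print(grant.is_valid(start + timedelta(days=3)))  # during the task period
print(grant.is_valid(start + timedelta(days=8)))  # after the validity period
```

Because validity is checked against the expiry time on every access, no separate revocation step is needed when the task ends.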
In S01, semantic feature extraction uses BERT to vectorize the data items and combines named entity recognition and relation extraction to construct the semantic association network: the data items are converted into vectors using the language understanding capability of the pre-trained language model, key entities in the data are identified by named entity recognition, and the semantic relations among the entities are mined by relation extraction, so that the resulting network records the semantic association relations among different data items;
A cross-data-pool metadata mapping relation is established based on cosine similarity. For data A and data B, the semantic similarity Sim(A, B) is calculated [formula not reproduced] from the vectors generated by BERT for A and B and the association strength obtained by relation extraction. When Sim(A, B) is greater than a set threshold θ, a metadata mapping relation between data A and data B is established. An attribute-based access control model is introduced, in which permissions are adjusted in combination with timestamp, geographic location, and data update frequency. The permission model [formula not reproduced] yields P(U, D), the final access permission of user U to data D, where U is a user and D is data. Its attribute weights comprise a sensitivity-level weight of data D and a trust-level weight of the unit to which data D belongs, each with value range [0, 1]. Its attribute values, which relate user U to data D, comprise a timestamp-related value, a geographic-location-related value, and a data-update-frequency-related value, each with range [0, 1]; through them the user's permissions are adjusted dynamically.
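The similarity test for establishing a metadata mapping can be sketched as below. The exact formula for Sim(A, B) is not available in this text, so the sketch assumes, purely for illustration, a convex combination of the cosine similarity of the two vectors and the relation-based association strength; the mixing weight `alpha` and the function names are hypothetical. The threshold 0.75 is the example value of θ given in the description.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_similarity(vec_a, vec_b, relation_strength, alpha=0.7):
    # ASSUMPTION: Sim(A, B) is modelled as a weighted mix of vector similarity
    # and association strength; the actual formula may differ.
    return alpha * cosine(vec_a, vec_b) + (1 - alpha) * relation_strength


def should_map(vec_a, vec_b, relation_strength, theta=0.75):
    """Establish a metadata mapping only when Sim(A, B) exceeds theta."""
    return semantic_similarity(vec_a, vec_b, relation_strength) > theta


# Toy 3-dimensional vectors stand in for BERT embeddings.
emission = [0.20, 0.80, 0.10]
air_quality = [0.25, 0.75, 0.05]
print(should_map(emission, air_quality, relation_strength=0.6))
```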
In this scheme, BERT, as a pre-trained language model trained on large-scale corpora, has strong language understanding capability and can capture deep semantic information in data items; converting the data items into vectors represents their semantic content more accurately. Named entity recognition accurately identifies key entities in the data, such as person names, place names, and organization names, and relation extraction mines the semantic relations among those entities. The resulting semantic association network clearly records the semantic association relations among different data items, so that the context and associations of data are better understood in cross-unit data management. For example, in business data involving several units, the data interaction relations among the units can be discovered through the semantic association network, providing more comprehensive information for data analysis and decision-making. Because the semantic similarity takes into account both the similarity between vectors and the association strength between entities, it avoids the mismatches that can occur in traditional keyword-based queries. For example, a query for "the financial statements of a company" finds not only data containing the keywords "company" and "financial statements" but also, through the semantic association network, other data related to the company's finances, such as financial analysis reports. The permission model considers multiple attributes, such as sensitivity level and unit trust level, so that permissions can be allocated individually according to the characteristics of different users and data.
Different data may have different sensitivity levels, and the units to which users belong may have different trust levels; by weighting these attributes, the most appropriate permission can be assigned to each combination of user and data. For example, data with a high sensitivity level can be accessed only by users who come from high-trust units and hold the corresponding permission. User permissions can also be adjusted in real time as time, geographic location, and data update frequency change. For example, the system may automatically raise the operation frequency limit of users with data update permission as the data update frequency increases, and may automatically restrict a user's access to certain data when the user leaves a specific geographic location. This real-time adjustment mechanism adapts to a dynamically changing business environment and ensures both the security and the usability of the data.
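A weighted attribute score of this kind can be sketched as follows. The formula for P(U, D) is not available in this text, so the sketch assumes a normalized weighted sum with one weight per attribute value; that pairing, the attribute names, and the numbers are all hypothetical. Only the [0, 1] ranges for weights and attribute values are taken from the description.

```python
def permission_score(weights, attributes):
    """Return a score in [0, 1] for a (user, data) pair.

    ASSUMPTION: P(U, D) is modelled as sum(w_i * a_i) / sum(w_i); the actual
    permission model may combine weights and attributes differently.
    """
    if weights.keys() != attributes.keys():
        raise ValueError("weights and attributes must cover the same names")
    for value in list(weights.values()) + list(attributes.values()):
        if not 0.0 <= value <= 1.0:
            raise ValueError("weights and attribute values must lie in [0, 1]")
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    return sum(weights[k] * attributes[k] for k in weights) / total_weight


weights = {"timestamp": 0.5, "location": 0.3, "update_frequency": 0.2}
on_site = {"timestamp": 1.0, "location": 1.0, "update_frequency": 0.5}
off_site = dict(on_site, location=0.0)  # the user has left the permitted area

print(permission_score(weights, on_site))
print(permission_score(weights, off_site))
```

Recomputing the score whenever an attribute value changes is what makes the adjustment dynamic: in the example, leaving the permitted location lowers the score without any change to the stored weights.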
In S02, the permission mapping table uses Neo4j to store the mapping relation between permission nodes and data nodes; the mapping between permissions and data is represented in the graph database, and permission granularity is represented by edge weights. The mapping algorithm [formula not reproduced] involves a sensitivity change threshold, the permission P, and a function that recursively updates the associated permissions.
In cross-unit data management, the mapping between permissions and data is often complicated: different users, roles, data resources, and permissions combine into a complex network structure. Storing the mapping between permission nodes and data nodes in Neo4j allows it to be displayed intuitively as a graph, in which permission nodes and data nodes are vertices, the edges between them represent mapping relations, and the edge weights express permission granularity, such as the different operation permissions of reading, writing, and modifying. This visual representation helps administrators quickly understand and manage the permission hierarchy. The hierarchy may also change over time; when permissions or data nodes must be added or mapping relations modified, it suffices to add or modify the corresponding vertices and edges in the graph, which is simple and has little impact on the existing data structure. For example, when a new business department is added, a new permission node can be created for it and mapped to the relevant data nodes without large-scale adjustment of the overall database structure.
In S04, the corresponding data is retrieved from the data pool of each institution unit, differences among heterogeneous data are eliminated using ontology alignment and data cleaning, and the confidence of the multi-source data is fused based on evidence theory. The fusion formula [formula not reproduced] is defined in terms of the basic probability assignment of the i-th data source and the credibility of that data source, and Bel denotes the confidence after data fusion. A conflict detection rule base is constructed, and conflicts are resolved using a game-theoretic negotiation model [formula not reproduced] defined in terms of the unit priority and the utility value of a conflict resolution, where Resolution denotes the conflict resolution function and Maximize denotes the maximizing function.
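The evidence-theory fusion in S04 can be illustrated with the standard Dempster-Shafer machinery. The fusion formula itself is not available in this text, so the sketch below assumes Shafer's reliability discounting followed by Dempster's rule of combination, which is one common way to combine basic probability assignments with source credibility; the hypotheses, masses, and credibility values are invented for the example.

```python
from itertools import product


def discount(mass, reliability, frame):
    """Shafer discounting: move (1 - reliability) of the mass to the whole frame."""
    out = {focal: reliability * m for focal, m in mass.items() if focal != frame}
    out[frame] = 1.0 - reliability + reliability * mass.get(frame, 0.0)
    return out


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions."""
    combined, conflict = {}, 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            conflict += mb * mc  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {focal: m / (1.0 - conflict) for focal, m in combined.items()}


def belief(mass, hypothesis):
    """Bel(H): total mass of focal sets contained in H."""
    return sum(m for focal, m in mass.items() if focal <= hypothesis)


EXCEEDS, COMPLIES = frozenset({"exceeds"}), frozenset({"complies"})
FRAME = EXCEEDS | COMPLIES

# Two sources report on the same enterprise; credibility 0.9 and 0.7 (made up).
source_1 = discount({EXCEEDS: 0.8, COMPLIES: 0.2}, 0.9, FRAME)
source_2 = discount({EXCEEDS: 0.6, COMPLIES: 0.4}, 0.7, FRAME)
fused = combine(source_1, source_2)
print(belief(fused, EXCEEDS), belief(fused, COMPLIES))
```

Discounting before combination means that a less credible source contributes more of its mass to "unknown" (the whole frame), so it pulls the fused belief less strongly.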
In the above scheme, the same data may have multiple data sources in cross-unit data management, and the reliability and accuracy of each source may differ. Evidence theory considers the information from multiple sources together, calculating the confidence of the data from the basic probability assignment and the credibility of each data source. This makes full use of the complementarity of multi-source data and improves the accuracy and reliability of data fusion. For example, different data sources may give different estimates of the probability that an event occurs, and fusing them through evidence theory yields a more reasonable combined estimate. In cross-unit data management, different units may also have different interests and priorities regarding the data. The game-theoretic negotiation model treats these units as participants in a game and seeks the optimal conflict resolution by considering the unit priorities and the utility value of each conflict resolution. Because the interests of all parties are taken into account, the outcome is fairer and more reasonable, which improves acceptance of the data fusion result. For example, when handling a conflict over data ownership, the model can find a solution acceptable to all parties according to the importance of each unit and its degree of need for the data. The model is also dynamic: when unit priorities change or new interest relations appear, it recalculates the optimal solution, so that the conflict resolution mechanism adapts to changing business environments and data conditions.
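The role of unit priority and utility in conflict resolution can be sketched as below. The negotiation model's formula is not available in this text, so a simple priority-weighted utility maximization stands in for it here; this is a deliberate simplification rather than a full game-theoretic negotiation, and the unit names, options, and numbers are hypothetical.

```python
def resolve_conflict(priorities, utilities):
    """Pick the option with the highest priority-weighted total utility.

    priorities: {unit: priority}
    utilities:  {option: {unit: utility of that option for the unit}}
    ASSUMPTION: the objective is sum(priority * utility) over all units.
    """
    def weighted_utility(option):
        return sum(priorities[u] * utilities[option][u] for u in priorities)

    return max(utilities, key=weighted_utility)


utilities = {
    "keep_env_value": {"env": 0.9, "met": 0.3},
    "keep_met_value": {"env": 0.2, "met": 0.9},
    "use_average": {"env": 0.6, "met": 0.6},
}

print(resolve_conflict({"env": 0.6, "met": 0.4}, utilities))
# Re-running with changed priorities models the dynamic recalculation.
print(resolve_conflict({"env": 0.3, "met": 0.7}, utilities))
```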
The format standardization processing comprises unifying the encoding format of text data in different formats and unifying the measurement units and precision of numerical data. Institution data comes from a wide range of sources: text data may use various encodings, such as UTF-8 and GBK, and the measurement units and precision of numerical data may differ considerably. Once encodings, units, and precision have been unified, these format differences are eliminated and data from different units can be integrated smoothly. For example, when integrating data from environmental protection and meteorological departments, the monitoring report text of the environmental protection department may use GBK while the data text of the meteorological department uses UTF-8; after the encoding is unified, the data of both departments can be combined in one system to build a comprehensive environmental information database. Unified encodings, measurement units, and precision also lower the threshold for data sharing, making each unit's data easier for other units to understand and use. For example, on an information sharing platform between government departments, data in a uniform format can be directly retrieved and analyzed by other departments, improving the efficiency and effectiveness of data sharing.
In the process of dynamically generating query authorization, a log analysis system collects statistics on the querier's query frequency, queried data types, and query time distribution over a certain period. Based on these statistics, if the querier frequently queries certain data, more query permissions for that data are provided when the query authorization is generated; if an abnormal query that differs substantially from the historical query pattern occurs, the system automatically subjects the query authorization to strict review or restriction. Analyzing the querier's recent query behavior and historical query records gives a detailed picture of the querier's actual needs. For example, for a querier who has long followed the air pollution data of the environmental protection department, the system can provide more query permissions related to air pollution, including detailed data for specific areas and time periods, so that the authorization better matches the querier's actual work. A querier's historical query behavior usually shows a certain regularity, so abnormal behavior that does not match the historical pattern may indicate a data security risk. For example, if a querier who normally queries only public data suddenly and frequently requests sensitive data, the system can detect the anomaly promptly by comparison with the querier's historical query records and strictly review or restrict the query authorization, effectively preventing security problems such as data leakage and illegal access.
In S03, changes in the data are monitored in real time. Once the sensitivity of data changes, the permission setting of the corresponding data node in the permission mapping table is adjusted according to preset sensitivity level rules; when a querier's permission changes, the mapping relation between that querier and the data node is modified directly in the permission mapping table. The permission mapping table stores the mapping between permission nodes and data nodes in a Neo4j graph database and represents permission granularity by edge weights, so the characteristics of the graph database allow fast updates. When a data update turns originally public data into sensitive data, updating the permission mapping table in real time promptly restricts queriers' access. For example, in the medical industry, part of a patient's examination data may originally be open to specific hospital departments; if the data is later found to contain new sensitive information, updating the permission mapping table in real time prevents unauthorized persons from continuing to access it, effectively preventing data leakage. If the permission mapping table could not be updated in real time, a querier's permissions could lag behind the actual situation: the querier might be blocked for insufficient permission when access is needed, or might still be able to access data after the permission has been withdrawn, which affects work efficiency. Real-time updating eliminates this lag, so that queriers can obtain the data they need and the workflow runs smoothly.
The natural language processing technology comprises word segmentation, part-of-speech tagging, named entity recognition, and semantic understanding. Through semantic understanding of the query statement, the system obtains more context information and a clearer picture of the user's intent, and can therefore make more intelligent decisions. For example, when a user queries "the environmental compliance status of enterprises in a certain industry in this city", the system can not only provide the relevant compliance data but also analyze it, for example by assessing the environmental protection situation of the whole industry or comparing it with other industries, thereby providing more valuable decision support for the user.
In step S03, load information (CPU utilization and memory occupancy) of each data pool and the storage location of the data are obtained in real time through a monitoring system. When a query route is generated, a data pool that is closer to the query-initiating end or has a shorter data transmission path is preferentially selected, and when a data pool is in a high-load state, the query request is directed to a data pool with lower load. The storage location of the data affects the distance and time of data transmission, so taking it into account when generating the query route shortens the transmission path. For example, in cross-regional unit data management, if the querier is located in area A and the relevant data is stored in data pool B, which is close to area A, the generated query route is preferentially directed to data pool B, which shortens the transmission time of the data in the network and speeds up the query response. Likewise, query processing slows markedly when a data pool is heavily loaded, so the system also takes the load of each data pool into account. For example, if at a certain moment the CPU utilization and memory occupancy of data pool C are high while data pool D is lightly loaded, the system directs the query route to data pool D, avoiding the query delay caused by waiting for the heavily loaded pool.
In actual use, the method is carried out as follows:
1. Data preprocessing
For data in the data pools of different institution units, format standardization is performed first. For text data, a text encoding detection tool, such as the chardet library, is used to identify the encoding; if the monitoring report text of the environmental protection department uses GBK and the data text of the meteorological department uses UTF-8, all text is converted to UTF-8 to ensure compatibility in subsequent processing. For numerical data, the original measurement unit and precision are determined by analyzing the metadata and business rules of the data. For example, when integrating data from the environmental protection and meteorological departments, if the pollutant concentration data recorded by the environmental protection department is in ppm while the corresponding data of the meteorological department is in mg/m³, all numerical data is converted to mg/m³ according to scientific unit conversion rules and the precision is unified to two decimal places. This eliminates differences in measurement units and precision and lays the foundation for data integration.
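The two standardization steps above can be sketched as follows. The sketch assumes the source encoding has already been identified (for example by a detection tool) and is passed in as a parameter. For gases, the ppm to mg/m³ conversion depends on the molar mass of the pollutant and on temperature and pressure; the sketch assumes 25 °C and 1 atm (molar volume 24.45 L/mol), which is an assumption not stated in the description. Function names are illustrative.

```python
MOLAR_VOLUME_L_PER_MOL = 24.45  # ideal gas at 25 °C and 1 atm (assumed conditions)


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode text bytes from a known source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def ppm_to_mg_per_m3(ppm: float, molar_mass_g_per_mol: float) -> float:
    """Convert a gas concentration from ppm to mg/m3, rounded to two decimals."""
    return round(ppm * molar_mass_g_per_mol / MOLAR_VOLUME_L_PER_MOL, 2)


# A GBK-encoded report title is normalized to UTF-8.
gbk_text = "监测报告".encode("gbk")
print(to_utf8(gbk_text, "gbk").decode("utf-8"))

# 1 ppm of sulfur dioxide (molar mass about 64.07 g/mol).
print(ppm_to_mg_per_m3(1.0, 64.07))  # 2.62
```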
Semantic features are then extracted from the standardized data using a deep-learning-based BERT model. Taking the monitoring data of the environmental protection department as an example, the data is input into a pre-trained BERT model, which analyzes the data content, context, and related business rules. Named entity recognition (NER) and relation extraction are combined to construct the semantic association network: NER identifies key entities in the data, such as person names, place names, organization names, and pollutant names, and relation extraction mines the semantic relations between entities, such as the causal relation between pollutant emissions and air quality or the ownership relation between an enterprise and its pollutant emission data. With the extracted semantic features as nodes and the semantic association relations as edges, the semantic association network clearly records the semantic association relations among different data items, which facilitates subsequent data analysis and query processing.
A cross-data-pool metadata mapping relation is then established based on cosine similarity. For data A and data B, the semantic similarity Sim(A, B) is calculated [formula not reproduced] from the vectors generated by BERT for A and B and the association strength obtained by relation extraction. A threshold θ, for example θ = 0.75, is set on the basis of extensive experiments and business experience; when Sim(A, B) is greater than θ, a metadata mapping relation between data A and data B is established. An attribute-based access control model is introduced. The permission model [formula not reproduced] yields P(U, D), the final access permission of user U to data D, where U is a user and D is data. Its attribute weights comprise a sensitivity-level weight of data D and a trust-level weight of the unit to which data D belongs, each with value range [0, 1]. Its attribute values, which relate user U to data D, comprise a timestamp-related value, a geographic-location-related value, and a data-update-frequency-related value, each with range [0, 1]; through them the user's permissions are adjusted dynamically. For example, data with a high sensitivity level can be accessed only by users who come from high-trust units and hold the corresponding permission; when the data update frequency increases, the system automatically raises the operation frequency limit of users with data update permission; and when a user leaves a specific geographic location, the system automatically restricts that user's access to certain data.
2. Dynamically generating query grants and rights mappings
The querier's identity information, such as name, unit, and position, is collected with the querier's consent; the business requirement is obtained from the business requirement form that the user fills in within the system; and the sensitivity level of the data is set by the data owner or administrator according to the data content and relevant regulations. Meanwhile, a log analysis system collects statistics on the querier's query frequency, queried data types, and query time periods over a past period, such as the last three months. If a querier is found to have long followed the air pollution data of the environmental protection department, more query permissions related to air pollution, including detailed data for specific areas and time periods, are provided when the query authorization is generated, so that the authorization better matches the querier's actual work. If the querier shows abnormal query behavior inconsistent with the historical pattern, for example a querier who normally queries only public data suddenly and frequently requests sensitive data, the system detects the anomaly promptly by comparison with the querier's historical query records and strictly reviews or restricts the query authorization, effectively preventing security problems such as data leakage and illegal access.
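The log-based checks described above can be sketched as below. The record layout (data type, sensitivity label), the share thresholds, and the function names are hypothetical; the sketch only illustrates the two decisions in the text: which data types a querier follows consistently, and whether recent behavior departs sharply from the historical pattern.

```python
from collections import Counter


def frequent_data_types(history, min_share=0.5):
    """Data types that make up at least `min_share` of the historical queries."""
    counts = Counter(data_type for data_type, _ in history)
    total = sum(counts.values())
    return {t for t, n in counts.items() if total and n / total >= min_share}


def needs_review(history, recent, tolerance=0.3):
    """Flag the querier when the share of sensitive queries jumps by more than `tolerance`."""
    def sensitive_share(queries):
        if not queries:
            return 0.0
        return sum(1 for _, level in queries if level == "sensitive") / len(queries)

    return sensitive_share(recent) - sensitive_share(history) > tolerance


# Three months of history: mostly public air pollution queries (made-up data).
history = [("air_pollution", "public")] * 80 + [("water_quality", "public")] * 20
recent = [("enterprise_penalties", "sensitive")] * 8 + [("air_pollution", "public")] * 2

print(frequent_data_types(history))   # candidate for broader air pollution permissions
print(needs_review(history, recent))  # sudden burst of sensitive requests
```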
Neo4j is used to store the mapping relation between permission nodes and data nodes, establishing the permission mapping table. Permission nodes and data nodes are vertices in the graph, the edges between them represent mapping relations, and the edge weights express permission granularity, for example a weight of 0.3 for read permission, 0.5 for write permission, and 0.7 for modify permission. The mapping algorithm of the permission mapping table [formula not reproduced] involves a sensitivity change threshold, the permission P, and a function that recursively updates the associated permissions. For example, when a new business department is added, a new permission node is created for it in the Neo4j graph database and mapped to the relevant data nodes, which is simple and has little impact on the existing data structure. The permission mapping table is updated in real time as data is updated and queriers' permissions change. When a data update turns originally public data into sensitive data, the system, which monitors data changes in real time, adjusts the permission setting of the corresponding data node in the permission mapping table according to preset sensitivity level rules; when a querier's permission changes, the mapping relation between that querier and the data node is modified directly in the permission mapping table.
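The weighted permission graph and its recursive update can be sketched in plain Python, with dictionaries standing in for the Neo4j graph. The mapping algorithm's formula is not available in this text, so the update rule below is an assumption: when the sensitivity change of a data node exceeds the threshold, every permission edge on that node is capped at the read level (0.3), and the same rule is applied recursively to semantically linked data nodes. Node names and the cap are hypothetical; the edge weights 0.3, 0.5, and 0.7 follow the example in the text.

```python
def update_permissions(edges, links, node, change, threshold=0.2, cap=0.3, seen=None):
    """Recursively tighten permissions after a sensitivity change.

    edges: {(permission_node, data_node): weight}, modified in place
    links: {data_node: [semantically associated data nodes]}
    ASSUMPTION: the change propagates undamped and caps weights at `cap`.
    """
    seen = set() if seen is None else seen
    if change <= threshold or node in seen:
        return edges
    seen.add(node)
    for (perm, data), weight in list(edges.items()):
        if data == node:
            edges[(perm, data)] = min(weight, cap)
    for neighbour in links.get(node, []):
        update_permissions(edges, links, neighbour, change, threshold, cap, seen)
    return edges


edges = {
    ("dept_a", "exam_results"): 0.7,   # modify
    ("dept_b", "exam_results"): 0.5,   # write
    ("dept_a", "patient_index"): 0.7,
    ("dept_a", "public_stats"): 0.5,
}
links = {"exam_results": ["patient_index"], "patient_index": ["exam_results"]}

update_permissions(edges, links, "exam_results", change=0.5)
print(edges)
```

The `seen` set prevents infinite recursion when data nodes reference each other, as the two linked nodes in the example do.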
3. Query processing
When a query condition is received, it is parsed using natural language processing, which covers word segmentation, part-of-speech tagging, named entity recognition, and semantic understanding. Take the query statement "query the penalties imposed on high-pollution enterprises in a certain area over the past month" as an example. The "bargain" word segmentation tool splits the statement into words such as "query", "past month", "certain area", "high-pollution enterprises", and "penalties". Each word is then tagged with its part of speech: "query" (verb), "past month" (time phrase), "certain area" (place noun), "high-pollution enterprises" (noun phrase), and "penalties" (noun phrase). A deep-learning-based named entity recognition model identifies the entities in the statement, such as "certain area" (place entity) and "high-pollution enterprises" (organization entity), and a model based on the Transformer architecture performs overall semantic understanding of the statement to obtain the user's query intent.
The natural language query is converted into a structured query tree using dependency parsing and an intent recognition model, in combination with the semantic association network and the permission mapping table. An optimal data node path is then selected by reinforcement learning to generate the query route. The reward function of the reinforcement learning is defined over three quantities: the permission compliance, with value range [0, 1]; the data processing delay, in milliseconds; and the predicted conflict probability. Through this reward function, permission compliance, data processing delay and conflict probability are considered together when the optimal data node path is selected.
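How the three quantities might trade off can be shown with a toy scoring function. The reward formula itself is not reproduced in this text, so the weighted linear form, the weights `w_compliance`, `w_delay` and `w_conflict`, and the candidate paths below are assumptions; a plain argmax over candidates also stands in for the learned reinforcement-learning policy.

```python
from dataclasses import dataclass


@dataclass
class PathCandidate:
    nodes: tuple
    compliance: float     # permission compliance, in [0, 1]
    delay_ms: float       # data processing delay, in milliseconds
    conflict_prob: float  # predicted conflict probability


def reward(path, w_compliance=1.0, w_delay=0.001, w_conflict=1.0):
    """Assumed reward: reward compliance, penalize delay and conflict risk."""
    return (w_compliance * path.compliance
            - w_delay * path.delay_ms
            - w_conflict * path.conflict_prob)


def best_path(candidates):
    """Greedy choice of the highest-reward path (stand-in for the RL policy)."""
    return max(candidates, key=reward)


candidates = [
    PathCandidate(("env_pool", "law_pool"), 1.0, 200.0, 0.10),
    PathCandidate(("archive_pool",), 0.6, 50.0, 0.05),
    PathCandidate(("env_pool", "remote_pool"), 1.0, 800.0, 0.00),
]
chosen = best_path(candidates)
```

With these illustrative weights the fully compliant, moderately fast path wins over both the faster but less compliant path and the compliant but slow one.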
(II) query statement optimization
Materialized views and query rewriting are introduced, and a complex query is decomposed into sub-queries that execute in parallel, in order to optimize the query statement. The degree of parallelism is calculated from three quantities: the total amount of data to be processed, in bytes; Nodecapacity, the upper limit of the amount of data that a data node can carry, in bytes; and Priority, the priority of the query task, with value range [1, 10]. For example, for a complex query involving a large amount of data, the resources for parallel execution are allocated according to the data volume, the carrying capacity of the data nodes and the priority of the query task, which improves query efficiency.
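A worked example makes the roles of the three inputs concrete. The parallelism formula is not reproduced in this text, so the rule used below (number of nodes needed to hold the data, scaled by priority over the maximum priority, never less than one) is purely an assumption, as are the function names.

```python
def parallelism(total_bytes, node_capacity, priority, max_priority=10):
    """Assumed rule: ceil(total / capacity) scaled by priority / max_priority."""
    if node_capacity <= 0:
        raise ValueError("node_capacity must be positive")
    if not 1 <= priority <= max_priority:
        raise ValueError("priority must lie in [1, max_priority]")
    nodes_needed = -(-total_bytes // node_capacity)  # integer ceiling division
    return max(1, -(-(nodes_needed * priority) // max_priority))


def split(items, degree):
    """Split a work list into at most `degree` contiguous sub-query batches."""
    if not items:
        return []
    size = -(-len(items) // degree)
    return [items[i:i + size] for i in range(0, len(items), size)]


GIB = 2 ** 30
degree = parallelism(total_bytes=10 * GIB, node_capacity=GIB, priority=5)
batches = split(list(range(10)), degree)
```

Under this assumed rule a 10 GiB query over 1 GiB nodes at priority 5 runs as five sub-queries, and the same query at priority 10 would run as ten.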
(III) data pool selection
A monitoring system obtains, in real time, the load information of each data pool, such as CPU utilization and memory occupancy, and the storage location information of the data, such as IP address and geographic location. When the query route is generated, a data pool that is closer to the query initiator or has a shorter data transmission path is preferred, and when a data pool is under high load, the query request is directed to a data pool with lower load. For example, in cross-regional unit data management, if the inquirer is located in area A, the relevant data is stored in data pool B close to area A, and the load of data pool B is low, the generated query route is preferentially directed to data pool B, which shortens the transmission time of the data over the network and speeds up the query response. If the CPU utilization and memory occupancy of data pool C are high while data pool D is lightly loaded, the system directs the query route to data pool D, which avoids the query delay caused by waiting for the heavily loaded pool.
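The selection rule described above can be sketched as a filter followed by a sort. The `Pool` fields, the `load_limit` of 0.8 and the sample figures are assumptions for illustration; the text does not state a numeric load threshold.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    cpu: float          # CPU utilization, 0..1
    mem: float          # memory occupancy, 0..1
    distance_km: float  # distance from the query initiator

    @property
    def load(self):
        return max(self.cpu, self.mem)


def select_pool(pools, load_limit=0.8):
    """Prefer the nearest pool that is not under high load; if every pool is
    overloaded, fall back to the least loaded one."""
    available = [p for p in pools if p.load < load_limit]
    if not available:
        return min(pools, key=lambda p: p.load)
    return min(available, key=lambda p: (p.distance_km, p.load))


pools = [
    Pool("B", cpu=0.30, mem=0.40, distance_km=20.0),
    Pool("C", cpu=0.95, mem=0.90, distance_km=5.0),
    Pool("D", cpu=0.20, mem=0.30, distance_km=150.0),
]
selected = select_pool(pools)
```

Pool C is nearest but overloaded, so the route goes to B, the nearest lightly loaded pool.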
4. Data retrieval and fusion processing
According to the generated query route, the corresponding data is retrieved from the data pool of each agency unit. For example, following the query route, data on the penalties imposed on high-pollution enterprises in a certain area is obtained from the data pool of the environmental protection department and the data pools of the related law enforcement departments.
Ontology alignment and data cleaning are used to eliminate the differences between heterogeneous data. Ontology alignment unifies the data at the semantic level by establishing mapping relations between the ontologies of different data sources, and data cleaning removes noise, duplicate data and erroneous data. The confidence of the multi-source data is then fused on the basis of evidence theory. The fusion formula takes as inputs the basic probability assignment of the i-th data source and the credibility of that data source.
For example, the environmental protection data of an enterprise may come from several data sources, such as the environmental protection department and the enterprise's own monitoring system. Evidence theory weighs the information from each source together, which improves the accuracy and reliability of the data fusion.
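One standard way to fuse basic probability assignments while accounting for source credibility is Dempster's rule of combination preceded by Shafer discounting. The sketch below uses that pairing as an assumption; the fusion formula of this method is not reproduced in the text, and the hypotheses, masses and reliabilities are invented for illustration.

```python
from itertools import product


def discount(mass, reliability, frame):
    """Shafer discounting: scale every mass by the source reliability and
    move the remaining mass onto the whole frame (total ignorance)."""
    out = {focal: reliability * m for focal, m in mass.items() if focal != frame}
    out[frame] = 1.0 - reliability + reliability * mass.get(frame, 0.0)
    return out


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions keyed by frozenset."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + ma * mb
        else:
            conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {focal: m / (1.0 - conflict) for focal, m in combined.items()}


VIOLATING = frozenset({"violating"})
COMPLIANT = frozenset({"compliant"})
FRAME = VIOLATING | COMPLIANT

# Hypothetical sources: the department's records and the enterprise's own monitoring.
department = discount({VIOLATING: 0.8, FRAME: 0.2}, reliability=0.9, frame=FRAME)
enterprise = discount({COMPLIANT: 0.6, FRAME: 0.4}, reliability=0.5, frame=FRAME)
fused = combine(department, enterprise)
```

Because the department source is given the higher reliability, the fused belief leans toward its reading while still keeping some mass on the uncommitted frame.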
(III) conflict resolution
A conflict detection rule base is constructed, and a game-theoretic negotiation model is used to resolve conflicts. In the model, each unit has a priority and each conflict resolution has a utility value; Bel denotes the confidence after data fusion, Resolve denotes the conflict resolution function, and Maximize denotes the maximization function. When a data ownership conflict is handled, the negotiation model finds a solution acceptable to all parties according to the importance of each unit and its degree of need for the data. For example, when an environmental protection department and an enterprise dispute the ownership of certain environmental protection data, the model determines the final ownership and mode of use of the data by considering the priorities of the two parties and the utility values of the different solutions.
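The selection step can be illustrated with a small maximization. The exact Resolve expression is not reproduced in this text, so the score used here (fused confidence multiplied by the priority-weighted sum of utilities) is an assumption, and a single maximization stands in for the negotiation process; all names and numbers are hypothetical.

```python
def resolve(resolutions, priorities):
    """Pick the resolution with the highest assumed score:
    Bel * sum(priority_i * utility_i) over the units involved."""
    def score(item):
        _, (utilities, bel) = item
        return bel * sum(priorities[unit] * u for unit, u in utilities.items())
    name, _ = max(resolutions.items(), key=score)
    return name


priorities = {"env_dept": 0.6, "enterprise": 0.4}

# resolution name -> (utility per unit, confidence after data fusion)
resolutions = {
    "env_dept_owns": ({"env_dept": 0.9, "enterprise": 0.3}, 0.8),
    "shared_access": ({"env_dept": 0.7, "enterprise": 0.8}, 0.8),
    "enterprise_owns": ({"env_dept": 0.2, "enterprise": 0.9}, 0.5),
}
best = resolve(resolutions, priorities)
```

With these figures the shared arrangement scores highest, since it gives both parties substantial utility.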
(IV) data presentation and logging
The fused data is displayed to the inquirer in real time in intuitive forms such as charts and reports, so that it is easy to understand and use. For example, the penalties imposed on high-pollution enterprises in a certain area are displayed as a table containing the enterprise name, penalty time, penalty reason, penalty result and so on. At the same time a query log is recorded, capturing the inquirer's identity information, the query time, the query conditions, the data retrieved and the query results. If a data security incident occurs, the query log makes it possible to locate the inquirers and operation times involved and so to investigate the incident and trace responsibility quickly. Analyzing the query log also reveals users' usage habits and demand preferences for the data, so that system functions can be improved and extended in a targeted way.
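A query log entry with the fields listed above could be serialized as one JSON line per query. The field names, the identifiers and the helper functions in this sketch are assumptions; the text specifies only which kinds of information are recorded.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class QueryLogEntry:
    inquirer_id: str
    unit: str
    conditions: str
    data_pools: list
    result_count: int
    queried_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


def to_log_line(entry):
    """Serialize one entry as a single JSON line for an append-only log."""
    return json.dumps(asdict(entry), ensure_ascii=False, sort_keys=True)


def queries_by(entries, inquirer_id):
    """Support incident investigation: all queries issued by one inquirer."""
    return [e for e in entries if e.inquirer_id == inquirer_id]


log = [
    QueryLogEntry("u-001", "env_dept", "penalties, certain area, past month",
                  ["env_pool", "law_pool"], 12),
    QueryLogEntry("u-002", "law_dept", "enterprise registrations",
                  ["law_pool"], 3),
]
line = to_log_line(log[0])
```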
The foregoing is merely an embodiment of the present application, and common knowledge such as well-known specific structures and features is not described in detail here. Those skilled in the art know all the ordinary technical knowledge of the field before the filing date or priority date, have access to all the prior art in the field, and are able to apply the routine experimental means available before that date. With the teaching given in this application they can complete and implement the scheme in combination with their own abilities, and some typical known structures or known methods should not become an obstacle to their implementing the present application. It should be noted that those skilled in the art can make modifications and improvements without departing from the structure of the present application; these should also be considered within the protection scope of the present application, and they affect neither the effect of implementing the present application nor the utility of the patent. The protection scope of the present application is subject to the content of the claims, and the description of the specific embodiments and the like in the specification can be used to interpret the content of the claims.

Claims (9)

1. A cross-unit data management method, characterized in that it comprises the following steps: S01: performing format standardization on the data in the data pool of each agency unit, converting data of different formats into a unified standard format, extracting semantic features for each data item, constructing a semantic association network, and recording through the semantic association network the semantic association relationships between different data items, together with a dynamic authorization and permission mapping mechanism; S02: dynamically generating a query authorization based on the inquirer's identity information, unit attributes, business requirements and the sensitivity level of the data, and establishing a permission mapping table that maps the inquirer's permissions to the data nodes in the semantic association network; S03: when query conditions are received, parsing the query conditions using natural language processing technology, converting the natural language query into a structured query tree using dependency parsing and an intent recognition model in combination with the semantic association network and the permission mapping table, selecting an optimal data node path based on reinforcement learning to generate a query route, and introducing materialized views and query rewriting to decompose the query into sub-queries executed in parallel so as to optimize the query statement, wherein the reward function of the reinforcement learning is defined over the permission compliance, with value range [0, 1], the data processing delay, in milliseconds, and the predicted conflict probability, and the degree of parallelism is calculated from the total amount of data to be processed, in bytes, Nodecapacity, the upper limit of the amount of data that a data node can carry, in bytes, and Priority, the priority of the query task, with value range [1, 10]; S04: retrieving the corresponding data from the data pool of each agency unit according to the query route, fusing the retrieved data, displaying the fused data to the inquirer in real time, and recording a query log.
2. The cross-unit data management method according to claim 1, characterized in that, in S01, semantic feature extraction uses BERT to vectorize the data items and combines named entity recognition and relation extraction to construct the semantic association network: the language understanding ability of the pre-trained language model converts the data items into vector form, named entity recognition determines the key entities in the data, and relation extraction mines the semantic relationships between entities, thereby constructing the semantic association network, through which the semantic association relationships between different data items are recorded; a cross-data-pool metadata mapping relationship is established based on cosine similarity: for data A and data B, the semantic similarity Sim(A,B) is computed from the vectors generated by BERT for A and B and the association strength obtained from the relations, and when Sim(A,B) is greater than the set threshold θ, the metadata mapping relationship between data A and data B is established; an attribute-based access control model is introduced, in whose permission model U is the user, D is the data, and the attribute weights include the sensitivity level and the unit trust; user permissions are dynamically adjusted through timestamp, geographic location and data update frequency, wherein P(U,D) denotes the final access permission of user U to data D, the attribute weights, comprising the sensitivity level weight of data D and the trust weight of the unit to which it belongs, have value range [0, 1], and the attribute values related to user U and data D, comprising timestamp-related, geographic-location-related and data-update-frequency-related attribute values, have value range [0, 1].
3. The cross-unit data management method according to claim 1, characterized in that, in S02, the permission mapping table uses Neo4j to store the mapping relationship between permission nodes and data nodes, the permission granularity is represented by edge weights, and the graph database represents the complex mapping relationship between permissions and data; the mapping algorithm recursively updates the associated permissions with respect to a sensitivity-change threshold, wherein P denotes the permissions and the algorithm includes a function that recursively updates permissions.
4. The cross-unit data management method according to claim 1, characterized in that, in S04, the corresponding data is retrieved from the data pool of each agency unit, ontology alignment and data cleaning are used to eliminate the differences between heterogeneous data, the confidence of the multi-source data is fused based on evidence theory, and data fusion is performed according to a fusion formula whose inputs are the basic probability assignment of the i-th data source and the credibility of the data source; a conflict detection rule base is constructed and a game-theoretic negotiation model is used to resolve conflicts, in which each unit has a priority and each conflict resolution has a utility value, Bel denotes the confidence after data fusion, Resolve denotes the conflict resolution function, and Maximize denotes the maximization function.
5. The cross-unit data management method according to claim 1, characterized in that the format standardization includes unifying the encoding format of text data in different formats and unifying the measurement units and precision of numerical data.
6. The cross-unit data management method according to claim 1, characterized in that, in the process of dynamically generating the query authorization in S02, a log analysis system collects statistics on the inquirer's query frequency, queried data types and query time distribution within a given time period; based on these statistics, if the inquirer frequently queries a certain type of data, more query permissions for that type of data are provided when the query authorization is dynamically generated, and if an abnormal query that differs significantly from the historical query pattern occurs, the system automatically subjects the query authorization to strict review or restriction.
7. The cross-unit data management method according to claim 1, characterized in that, in S03, data changes are monitored in real time; once the sensitivity of data changes, the permission settings of the corresponding data nodes in the permission mapping table are adjusted according to preset sensitivity level rules, and when the inquirer's permissions change, the mapping relationship between the corresponding inquirer and the data nodes is modified directly in the permission mapping table; the permission mapping table uses the Neo4j graph database to store the mapping relationship between permission nodes and data nodes, uses edge weights to represent permission granularity, and relies on the characteristics of the graph database to achieve rapid updates.
8. The cross-unit data management method according to claim 1, characterized in that the natural language processing technology includes word segmentation, part-of-speech tagging, named entity recognition and semantic understanding.
9. The cross-unit data management method according to claim 1, characterized in that, in S03, a monitoring system obtains in real time the load information of each data pool, namely CPU utilization and memory occupancy, and the storage location information of the data; when the query route is generated, a data pool that is closer to the query initiator or has a shorter data transmission path is preferentially selected, and when a data pool is under high load, the query request is directed to a data pool with a lower load.
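The threshold test on semantic similarity described in claim 2 can be sketched with plain cosine similarity. The three-dimensional vectors below are hand-made placeholders for BERT embeddings, the item names and the value of `theta` are hypothetical, and the association-strength term that claim 2 also mentions is left out because its formula is not reproduced in this text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_mappings(vectors, theta):
    """Return the pairs of data items whose similarity exceeds the threshold."""
    names = sorted(vectors)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if cosine(vectors[a], vectors[b]) > theta]


# Placeholder embeddings for data items held in different data pools.
vectors = {
    "pollutant_emission": [1.0, 0.0, 1.0],
    "emission_penalty": [1.0, 0.1, 0.9],
    "payroll": [0.0, 1.0, 0.0],
}
mappings = build_mappings(vectors, theta=0.8)
```

Only the two emission-related items clear the threshold, so only that pair receives a metadata mapping.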
CN202510468954.3A 2025-04-15 2025-04-15 Cross-unit data management method Pending CN119989418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510468954.3A CN119989418A (en) 2025-04-15 2025-04-15 Cross-unit data management method

Publications (1)

Publication Number Publication Date
CN119989418A (en) 2025-05-13

Family

ID=95630461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510468954.3A Pending CN119989418A (en) 2025-04-15 2025-04-15 Cross-unit data management method

Country Status (1)

Country Link
CN (1) CN119989418A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120256451A (en) * 2025-06-06 2025-07-04 南方电网科学研究院有限责任公司 Intelligent query system and method for power transmission and distribution production data based on natural language interaction
CN120449213A (en) * 2025-07-10 2025-08-08 北京电子数智科技有限责任公司 Data anonymization adjustment method and system based on dynamic association risk analysis
CN120541825A (en) * 2025-07-28 2025-08-26 湖南百菲特信息技术有限公司 Dynamic authority management system and method based on multi-source salary data integration

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11893994B1 (en) * 2019-12-12 2024-02-06 Amazon Technologies, Inc. Processing optimization using machine learning
CN119276632A (en) * 2024-11-26 2025-01-07 珠海晞曼科技有限公司 A method for combing and strengthening the attack surface of a firewall
CN119336831A (en) * 2024-12-18 2025-01-21 国网信通亿力科技有限责任公司 A method for intelligent search of power data based on resource map
CN119537424A (en) * 2025-01-21 2025-02-28 北京科杰科技有限公司 Visual table management method and system for different types of databases
CN119622645A (en) * 2025-02-11 2025-03-14 江苏禾冠信息技术有限公司 An automatic hierarchical data management method based on artificial intelligence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MOHAMED RAMADAN: "Towards online training for RL-based query optimizer", INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS, 10 September 2024 (2024-09-10), pages 1 - 10 *
CAI WEIHONG: "Fine-grained RBAC delegation authorization model based on a mapping mechanism", ACTA ELECTRONICA SINICA, vol. 38, no. 8, 31 August 2010 (2010-08-31) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination