CN105303123A

CN105303123A - Blocking confusion based dynamic data privacy protection system and method

Info

Publication number: CN105303123A
Application number: CN201510734401.4A
Authority: CN
Inventors: 史玉良; 张宏磊; 周中民; 吕梁; 管永明; 张晖
Original assignee: Dareway Software Co ltd; Shandong University
Current assignee: Dareway Software Co ltd; Shandong University
Priority date: 2015-11-02
Filing date: 2015-11-02
Publication date: 2016-02-03

Abstract

The invention discloses a dynamic data privacy protection system and method based on block obfuscation. Newly inserted and modified data is cached at a trusted third party and is grouped and stored once the grouping conditions are met; for delete operations, key fragments are retained to protect the privacy of both the deleted data and the remaining data; and a fake-data recycling mechanism reduces storage consumption and improves application performance. Experiments show that the proposed dynamic data privacy protection mechanism is feasible and practical.

Description

A dynamic data privacy protection system and method based on block obfuscation

Technical field

The invention relates to a dynamic data privacy protection system and method based on block obfuscation.

Background

With the rapid development of cloud computing, multi-tenant SaaS applications are being adopted by more and more enterprises and service providers, owing to their low-cost business model with economies of scale and their single-instance, customize-on-demand mode of software delivery.

In SaaS applications, tenants meet their own business needs through on-demand leasing and personalized customization, while avoiding the high cost of infrastructure and of subsequent upgrades, maintenance, and management. In addition, by signing a service level agreement (SLA) with the SaaS provider, the quality of service of the application is guaranteed and the interests of both tenant and provider are protected.

However, as SaaS applications become widely promoted and used, the security of tenants' private data in the cloud is attracting growing attention. In a multi-tenant application, to support tenants' business operations on their data, sensitive tenant data usually has to be stored and processed in plaintext at a service provider that is not fully trusted. The data is thus outside the tenant's direct control and may be maliciously leaked by the provider. For example, driven by profit, a provider could resell the product pricing information and customer relationships of a company that leases its application to that company's competitors, damaging the company's economic interests.

To address the tenant privacy leakage problem faced by SaaS applications, work [1 Zhang Kun, Li Qingzhong, Shi Yuliang. Research on Data Combination Privacy Preservation Mechanism for SaaS [J]. Chinese Journal of Computers, 2010, 33(11): 2044-2054 (in Chinese)] proposes a data combination privacy protection mechanism based on block obfuscation. First, according to the privacy constraints customized by the tenant, the combined privacy attributes are split into different blocks and the associations between slices in different blocks are obfuscated. Then, to address the privacy leakage caused by unbalanced data distribution within a block, a balancing mechanism based on fake data is proposed, which adds fake data until the distribution in each block is balanced. Finally, a reconstruction mechanism for the obfuscated data is built through interaction with a trusted third party, ensuring that the tenant's private data remains usable.

Works [2 Shi Y, Jiang Z, Zhang K. Policy-Based Customized Privacy Preserving Mechanism for SaaS Applications [C] // Grid and Pervasive Computing. Berlin Heidelberg: Springer, 2013: 491-500] and [3 Shao Y, Shi Y, Li H. A Novel Cloud Data Fragmentation Cluster-based Privacy Preserving Mechanism [J]. International Journal of Grid & Distributed Computing, 2014, 7(4): 21-32] supplement and optimize this mechanism in different respects. Work [2] proposes a policy-based personalized privacy protection mechanism driven by tenants' personalized privacy protection and transaction processing requirements. Work [3] clusters attributes with the bond energy algorithm so that highly correlated attributes are placed in the same block wherever possible, improving application performance by reducing the number of joins between blocks.

In a cloud computing environment, however, as a multi-tenant application keeps running, tenants' insert, delete, and modify operations continuously change the underlying data storage, and the distribution of each block changes accordingly. Once the data distribution is no longer uniform, the associations between slices are at high risk of being leaked. Moreover, if an attacker can obtain the operation logs and data snapshots of each block over some period, the private information contained in that part of the data can still be inferred by comparative analysis. The privacy protection mechanism must therefore adapt to such changes so that tenants' privacy requirements continue to be met.

For privacy protection, data encryption and data obfuscation are currently the two relatively mature approaches. Encrypted data, however, often loses its operability, so improving ciphertext retrieval speed and processing efficiency is a research focus of encryption-based privacy protection. Reference [4 Q. Liu, G. Wang, J. Wu. An Efficient Privacy Preserving Keyword Search Scheme in Cloud Computing [C] // Proceedings of the 2009 International Conference on Computational Science and Engineering. Piscataway, NJ: IEEE, 2009: 715-720] analyzes the characteristics of cloud computing and proposes a privacy-preserving keyword search scheme for the cloud; it lets the service provider take part in partial decryption to reduce the client's burden, and performs keyword search over encrypted data to protect both tenant data privacy and user query privacy. Reference [5 C. Gentry. Fully homomorphic encryption using ideal lattices [C] // Proceedings of the 41st Annual ACM Symposium on Theory of Computing. Bethesda: ACM, 2009: 169-178] uses a mathematical object called an "ideal lattice" to construct a privacy homomorphism algorithm that allows encrypted data to be fully operated on. Because encryption has a considerable impact on data processing performance, researchers have proposed other ways of preventing privacy leakage. Reference [6 Raymond Chi-Wing Wong, Li J, Ada Wai-Chee Fu, Wang K. (α, k)-anonymity: An enhanced k-anonymity model for privacy-preserving data publishing [C] // Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2006: 754-759] proposes the (α, k)-anonymity principle, which, in addition to requiring that the data table satisfy k-anonymity, requires that within each equivalence class the percentage of records associated with any sensitive attribute value be no higher than α, preventing attackers from using homogeneity attacks and background-knowledge attacks to link sensitive data to an individual's identity. Reference [7 Machanavajjhala A, Kifer D, Gehrke J, et al. l-diversity: Privacy beyond k-anonymity [J]. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007, 1(1)] proposes the l-diversity principle, which requires the sensitive attribute of each equivalence class to take at least l distinct values, so that an attacker can confirm an individual's sensitive information with probability at most 1/l. Reference [8 Ciriani V, Di Vimercati S D C, Foresti S, et al. Fragmentation and encryption to enforce privacy in data storage [C] // Computer Security – ESORICS 2007. Berlin Heidelberg: Springer, 2007: 171-186] combines information fragmentation with data encryption. It introduces the concept of privacy constraints to describe both the data attributes that must be protected by encryption and the combinations of attributes whose joint appearance would leak privacy; fragmenting the data according to these constraints yields a block schema that satisfies the requirements, with the associations between data blocks stored on the client. Reference [9 Ouyang Jia, Yin Jian, Liu Shaopeng, et al. An Effective Differential Privacy Transaction Data Publication Strategy [J]. Journal of Computer Research and Development, 2014, 51(10): 2195-2205 (in Chinese)], addressing the insufficient security and utility of privacy-preserving transactional database publication, proposes TDPS, an effective transaction data publication strategy that satisfies differential privacy constraints. Based on the item set I, the strategy builds a complete trie of the transaction database D, uses compressed sensing to add noise satisfying the differential privacy constraints and obtain a noisy trie, and finally runs frequent itemset mining on that trie, protecting data privacy well. However, these studies mainly target privacy protection in data publishing and rarely involve insert, delete, and modify operations on private data.

For the privacy protection of dynamic data, a number of different methods have been proposed, but most of them also target dynamic data privacy protection in data publishing and data mining, and are not suited to privacy protection in SaaS applications.

J. Byun et al. were the first to study privacy protection for continuously growing datasets, in reference [10 Byun J W, Sohn Y, Bertino E, et al. Secure anonymization for incremental datasets [C] // Secure Data Management. Berlin Heidelberg: Springer, 2006: 48-63]. Their idea is that new records may be inserted only if two conditions are met: the number of records waiting to be inserted must not be below a certain threshold, and the set of records to be inserted must satisfy l-diversity. If either condition is not met, the records cannot be inserted. With this method, data updates are not timely and are limited to insertion. Similarly, the methods proposed in references [11 Pei J, Xu J, Wang Z, et al. Maintaining k-anonymity against incremental updates [C] // Scientific and Statistical Database Management, 2007. SSDBM'07. 19th International Conference on. Piscataway, NJ: IEEE, 2007: 5-5] and [12 Byun J W, Li T, Bertino E, et al. Privacy-preserving incremental data dissemination [J]. Journal of Computer Security, 2009, 17(1): 43-68] protect privacy only for insert operations, whereas tenants of SaaS applications frequently insert, delete, and modify data, so these methods are not suited to SaaS applications. Reference [13 Xiao X, Tao Y. M-invariance: towards privacy preserving re-publication of dynamic datasets [C] // Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2007: 689-700] proposes the m-invariance anonymization mechanism, whose core idea is that in every snapshot of the dataset a given record may only be placed in a group with a fixed set of sensitive attribute values; this handles value-association attacks in data publishing well. Reference [14 He Y, Barman S, Naughton J F. Preventing equivalence attacks in updated, anonymized data [C] // Data Engineering (ICDE), 2011 IEEE 27th International Conference on. Piscataway, NJ: IEEE, 2011: 529-540], addressing value-equivalence attacks in data publishing, builds on reference [13] and uses a minimum-cut algorithm to propose a graph-based anonymization algorithm that protects against both value-association and value-equivalence attacks. Reference [15 Nergiz A E, Clifton C, Malluhi Q M. Updating outsourced anatomized private databases [C] // Proceedings of the 16th International Conference on Extending Database Technology. New York: ACM, 2013: 179-190], addressing privacy protection for dynamically changing outsourced databases, builds on data decomposition and proposes encrypting the data most recently inserted or modified by the user before storing it in the outsourced database, keeping the encryption key on the client, and deleting only the part of the data that contains identifying information when data is deleted. This method requires the client to keep the encryption key, requires identifying information and sensitive information to be clearly distinguished, and requires frequent encryption and decryption during business operations. In SaaS applications, by contrast, tenants are not expected to store any information or have any computing capability, and they must be able to customize their privacy requirements; the privacy constraints they specify often do not clearly separate identifying information from sensitive information. The method of reference [15] is therefore not fully suited to dynamic privacy protection in SaaS applications.

Although the methods above deal well with various dynamic privacy protection problems in data publishing and data mining, SaaS applications have characteristics quite different from data publishing, so none of these methods is fully suited to SaaS applications. Work [1] (Zhang Kun, Li Qingzhong, Shi Yuliang, Chinese Journal of Computers, 2010) proposes a block-based privacy protection mechanism for SaaS applications: according to the privacy constraints specified by the tenant, combined privacy in tenant data is decomposed into different data blocks and the associations between slices are hidden, achieving SaaS data privacy protection in plaintext; fake data is inserted to keep the data distribution in each block balanced, preventing the service operator from leaking the tenant's data combination privacy. However, that work does not address how to protect private data when tenants insert, delete, and modify data, so the associations between data blocks are at risk of being leaked.

With the rapid development of information technology, collecting, storing, and analyzing data has become more convenient and faster, and the technical means more advanced. In practice, because enterprise management and permission assignment are imperfect, it often happens in cloud computing services that service providers or database administrators (the invention refers collectively to anyone who maliciously leaks tenant privacy as an attacker) use various technical means to maliciously obtain user data and its change records, and leak user privacy after analyzing them. Although the associations between slices have been hidden by block obfuscation before tenant data is stored in the cloud, an attacker can still obtain some tenants' private information by analyzing either the uneven distribution of the blocks after data changes, or the data change log together with the corresponding partial data snapshots. Using the example shown in Figure 1, the following analyzes the privacy leakage threats and other shortcomings that the mechanism of work [1] (Zhang Kun, Li Qingzhong, Shi Yuliang, Chinese Journal of Computers, 2010) faces when tenants perform business operations.

In Figure 1, Snapshot(T1) is the physical storage schema of the data objects obtained, before any business operation, by partitioning the logical view of the tenant data into blocks with the method of work [1] (Zhang Kun, Li Qingzhong, Shi Yuliang, Chinese Journal of Computers, 2010). Snapshot(T2) is the physical storage schema obtained after one insert, one delete, and one modify operation have been applied to Snapshot(T1). Table 1 is the log of the tenant's business operations on the database during this period, obtained by some tool or technique.

1) Leaking tenant privacy through operation logs and data snapshots

Once an attacker has obtained the data change log (Table 1) and the snapshots before and after the change, Snapshot(T1) and Snapshot(T2), the log reveals that slices were inserted into and deleted from the three blocks at the same time. By comparing Snapshot(T1) with Snapshot(T2), the attacker can infer that a record about Bob was deleted and that Bob had Measles, and that a record about Greg was added and that Greg has Flu. Because the value of one slice was modified in Chunk2 and in Chunk3 at the same time, the attacker can also infer that the 43-year-old patient whose zip code was originally 62000 has had their disease information changed from Measles to Flu, and that their zip code has changed as well.
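The linking step of this attack can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Table 1 is not reproduced in this text, so the log layout, the timestamps, Bob's and Greg's Chunk2 values, and the new zip code 64000 are all invented placeholders; only the names, the diseases, and the 43-year-old patient with zip code 62000 come from the description.

```python
from collections import defaultdict

# Hypothetical log entries: (timestamp, chunk, operation, old_value, new_value).
# The layout and most values are assumptions made for illustration only.
LOG = [
    (1, "Chunk1", "DELETE", "Bob", None),
    (1, "Chunk2", "DELETE", "(35, 61000)", None),   # invented placeholder
    (1, "Chunk3", "DELETE", "Measles", None),
    (2, "Chunk1", "INSERT", None, "Greg"),
    (2, "Chunk2", "INSERT", None, "(29, 63000)"),   # invented placeholder
    (2, "Chunk3", "INSERT", None, "Flu"),
    (3, "Chunk2", "UPDATE", "(43, 62000)", "(43, 64000)"),  # new zip invented
    (3, "Chunk3", "UPDATE", "Measles", "Flu"),
]


def link_slices(log):
    """Group log entries that share a timestamp and an operation type.

    Slices touched by the same operation at the same time very likely belong
    to one logical record, which is the association that block obfuscation
    was meant to hide.
    """
    groups = defaultdict(dict)
    for ts, chunk, op, old, new in log:
        groups[(ts, op)][chunk] = (old, new)
    return dict(groups)


linked = link_slices(LOG)
for (ts, op), slices in sorted(linked.items()):
    print(ts, op, slices)
```

Each resulting group places slices from different blocks side by side, so the attacker recovers whole records without ever knowing the obfuscated association itself.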

2) Tenant privacy leakage caused by unevenly distributed block data

Suppose the SLA signed by the tenant stipulates that an attacker may associate slices from different blocks with probability at most 1/3. The distribution of disease information in Snapshot(T1) shows that, before the update, the block holds three disease values, Cancer, Measles, and Flu, each occurring with probability no greater than 1/3. This meets the privacy protection level specified by the tenant: for any slice taken from another block, the probability of correctly associating it with its disease information is at most 1/3. In the updated distribution Snapshot(T2), only two disease values remain, Cancer and Flu, and the probability of Flu has risen to 2/3. If the attacker now guesses that a given patient has Flu, the guess is correct with probability 2/3, so the data distribution clearly violates the tenant's privacy protection requirement.
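The check described here amounts to comparing the frequency of the most common value in a block with the agreed level. A minimal Python sketch follows; the function names are hypothetical, and the slice counts (six slices, two per disease before the update) are an assumption chosen to match the proportions stated in this description, since Figure 1 is not reproduced here.

```python
from collections import Counter
from fractions import Fraction


def max_link_probability(values):
    """Best chance of guessing a slice's value: the share of the most common value."""
    counts = Counter(values)
    return Fraction(max(counts.values()), len(values))


def satisfies_level(values, level):
    """True if no value in the block is more frequent than the agreed level."""
    return max_link_probability(values) <= level


LEVEL = Fraction(1, 3)  # privacy protection level from the SLA example

# Assumed contents of the disease block before and after the update.
chunk3_t1 = ["Cancer", "Cancer", "Measles", "Measles", "Flu", "Flu"]
chunk3_t2 = ["Cancer", "Cancer", "Flu", "Flu", "Flu", "Flu"]

print(max_link_probability(chunk3_t1), satisfies_level(chunk3_t1, LEVEL))
print(max_link_probability(chunk3_t2), satisfies_level(chunk3_t2, LEVEL))
```

Exact fractions are used rather than floats so that the boundary case, a value occurring with probability exactly 1/3, is classified reliably.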

Table 1. Database business operation log

3) Fake data surge caused by block balancing

In Figure 1, after the business operations are complete, the slices in Chunk3 take only two values, Cancer and Flu, and Flu accounts for 2/3 of them. To meet the tenant-specified privacy protection level of 1/3, Chunk3 must hold at least 12 slices (with 4 of its 6 slices holding Flu, 4/N ≤ 1/3 requires N ≥ 12), so at least 6 fake slices must be inserted into Chunk3 for balancing. Fake data then makes up 1/2 of the total data. In a cloud computing environment, as the volume of user data grows sharply, a high proportion of fake data not only wastes a large amount of storage space but also causes a sharp drop in application performance. Furthermore, the balancing strategy of work [1] (Zhang Kun, Li Qingzhong, Shi Yuliang, Chinese Journal of Computers, 2010) works by continually inserting fake slices, so the amount of fake data in the system keeps growing as the application runs; once the data volume is large, achieving balance by deleting fake data instead would incur a huge computational cost.
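The arithmetic behind the figures 12, 6, and 1/2 can be reproduced with a short Python sketch. The helper name is hypothetical, and the block contents (4 Flu and 2 Cancer slices) are inferred from the proportions given in the text rather than taken from Figure 1. The function returns a lower bound only: it assumes the fake slices can be spread over other values without any of them exceeding the level.

```python
import math
from collections import Counter
from fractions import Fraction


def min_fake_slices(values, level):
    """Lower bound on the fake slices needed so the most common value
    occurs with probability at most `level`."""
    most_common = max(Counter(values).values())
    # most_common / total <= level  implies  total >= most_common / level
    required_total = math.ceil(Fraction(most_common) / level)
    return max(0, required_total - len(values))


# Assumed contents of Chunk3 after the business operations.
chunk3 = ["Cancer", "Cancer", "Flu", "Flu", "Flu", "Flu"]
level = Fraction(1, 3)

fake = min_fake_slices(chunk3, level)
total = len(chunk3) + fake
fake_ratio = Fraction(fake, total)
print(fake, total, fake_ratio)
```

Because the required total grows with the count of the most common value, a single skewed value in a large block can force a proportionally large amount of fake data, which is the surge this item describes.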

The example analysis above shows that applying tenants' business operations directly to the cloud data not only directly leaks the private information in the records being operated on, but also exposes other private data in the table because the data distribution within the blocks becomes uneven. In addition, an unreasonable balancing mechanism makes the system pay a high storage and performance cost for the large proportion of fake data.

Summary of the invention

The purpose of the invention is to solve the above problems by providing a dynamic data privacy protection system and method based on block obfuscation. Data newly inserted or modified by a tenant is cached at a trusted third party; once the blocking conditions are met, the data is partitioned into blocks group by group, balanced, and written to the storage layer. For delete operations, the data in key blocks and low-weight blocks is retained rather than deleted, preventing an attacker from reconstructing tenant privacy from the observed data operations. A fake-data recycling mechanism reclaims redundant fake data from the storage layer, reducing the proportion of fake data in the total data volume. The invention thus supports tenants' business operations effectively and keeps tenants' private information secure while the data changes dynamically.
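The caching idea can be illustrated with a small Python sketch. It is not the patented procedure: the description does not define the blocking conditions at this point, so the condition used below (the buffered group already meets the tenant's level on one sensitive attribute) is a stand-in assumption, and the class and method names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


class TrustedBuffer:
    """Holds newly inserted or modified records at the trusted third party
    until they can be released to the storage layer as one group."""

    def __init__(self, level, sensitive_key):
        self.level = level                  # e.g. Fraction(1, 3)
        self.sensitive_key = sensitive_key  # attribute whose distribution is checked
        self.pending = []

    def add(self, record):
        self.pending.append(record)

    def ready(self):
        # Assumed blocking condition: within the buffered group, no
        # sensitive value is more frequent than the agreed level.
        if not self.pending:
            return False
        counts = Counter(r[self.sensitive_key] for r in self.pending)
        return Fraction(max(counts.values()), len(self.pending)) <= self.level

    def flush(self):
        """Return the buffered group and clear the buffer, or None if not ready."""
        if not self.ready():
            return None
        group, self.pending = self.pending, []
        return group


buffer = TrustedBuffer(Fraction(1, 3), "disease")
buffer.add({"name": "Greg", "disease": "Flu"})
print(buffer.flush())  # a single record is not released on its own
```

Releasing records only in groups means the storage layer never observes a single-record change, which is what made the log and snapshot comparison informative in the first place.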

To achieve the above purpose, the present invention adopts the following technical solution:

A dynamic data privacy protection system based on block obfuscation, comprising:

an application layer, which manages the application components that tenants lease and customize, receives the data business processing requests submitted by tenants, and forwards those requests to the privacy protection layer for processing; the application layer can deploy multiple different application components, and each application component can deploy multiple instances for load balancing;

a privacy protection layer, which is the core of the dynamic data privacy protection framework, receives and processes the data business processing requests forwarded by the application layer, and returns the processing results to the application layer; a data business processing request is a data insertion request, a data deletion request, a data query request, or a data modification request;

a storage layer, which stores tenant data and is the final executor of all data processing operations; the storage layer deploys multiple database instances, which may use the same database or different databases;

a trusted third party, which manages the tenants' personalized privacy protection policies and the most recently inserted or modified data, identifies the tenant to whom data belongs, and prevents attackers from maliciously accessing or stealing the privacy protection policies and temporary data.

For a data insertion request, the newly inserted data is wrapped and stored temporarily at the trusted third party.

For a data deletion request, the corresponding shards in the non-key blocks and high-weight blocks are deleted from the storage layer; a non-key block is a block that contains no key attribute.

For a data query request, the reconstruction module of the privacy protection layer returns the result of reconstructing the original data.

For a data modification request, the corresponding original data in the storage layer is deleted first, and the modified data is then wrapped and stored temporarily at the trusted third party.

The privacy protection layer comprises an update processing module, a block processing module, a balancing module, a reconstruction module, and a forged data recovery module.

The update processing module receives the business operation requests passed down by the application layer, analyzes the type of each request, executes the operation process that matches the type, and returns the result to the application layer once the operation completes. The operation request types are insert, delete, modify, and query. The update processing module also receives data from the reconstruction module, stores operation results in the storage layer, stores authentication information in the authentication module, and stores temporary data in the temporary data management module.

The block processing module performs vertical blocking on the new groups produced by the temporary data management module. When a new group is passed to the block processing module, the module requests the corresponding data blocking policy from the privacy policy management module, divides the data vertically according to that policy, and stores the blocked data in the storage layer.

The balancing module obtains the data distribution state of each block from the block processing module, generates forged data that satisfies the α, β, and γ balance, and applies the α, β, and γ balance to each block. The data distribution state is the frequency with which each shard value occurs.

The forged data recovery module periodically checks the state of the groups and decides, from the proportion of valid data in each group, whether the forged data of that group needs to be reclaimed. For a group whose forged data needs to be reclaimed, the remaining valid data in the group is moved to the temporary data management module for caching, and all data of the group is then deleted from every block.

The reconstruction module joins the original private data as it was before blocking, according to the query, delete, or modify condition in the business operation submitted by the tenant, and returns the global IDs of the matching original data. The reconstruction module reads the blocked data from the storage layer, reconstructs it, and feeds the result back to the update processing module.

The trusted third party comprises a privacy policy management module, an authentication module, and a temporary data management module.

The authentication module receives the tenant's identity authentication information and identifies the tenant. Based on the result, it blocks malicious attackers from accessing the corresponding privacy protection policy and temporary data, while allowing legitimate tenants to access them.

The privacy policy management module stores the tenants' personalized privacy protection policies and controls access to them, and passes the privacy protection policy of the tenant identified by the tenant ID to the block processing module.

The temporary data management module temporarily stores data newly inserted or modified by tenants and groups the temporary data horizontally according to the grouping conditions. After a new group is produced, the group data is passed to the block processing module. The grouping conditions include the size of the group and the distribution of the data in the group.

A dynamic data privacy protection method based on block obfuscation, in which newly inserted and modified data is cached by a trusted third party and is grouped and stored once the conditions are met; the privacy of both the deleted data and the remaining data in a delete operation is protected by retaining key shards; and a forged data recovery algorithm reduces storage consumption and improves application performance.

A dynamic data privacy protection method based on block obfuscation, comprising the following steps:

Step (1): an application component of the application layer receives the business operation submitted by the tenant and forwards it to the update processing module of the privacy protection layer;

Step (2): the update processing module analyzes the operation type; for an insert operation, continue with step (3); otherwise, go to step (7);

Step (3): the update processing module submits the tenant's identity information to the authentication module of the trusted third party for identification; once authentication succeeds, the tenant is allowed to access the temporary data management module;

Step (4): the temporary data management module caches the newly inserted data in plaintext and periodically checks whether the cached data meets the grouping conditions; if so, it calls the group generation and blocking algorithm to produce a new group and uploads the group to the block processing module; if not, it does nothing;

Step (5): when a new group is uploaded to the block processing module, the block processing module requests the tenant's blocking policy from the privacy policy management module of the trusted third party and divides the group data vertically according to that policy;

Step (6): the balancing module generates forged data according to the data distribution of each block so that the distribution of every block satisfies the α, β, and γ balance, and then stores the block data in the storage layer; the block data comprises the forged data and the private data stored by the user;

Step (7): the reconstruction module reconstructs the private data with the private data reconstruction algorithm and determines whether the operation is a modification; if so, the original data is modified and assembled into new data, the original data is deleted, and the process goes to step (3); otherwise, go to step (8);

Step (8): determine whether the operation is a deletion; if so, execute the data deletion algorithm to delete the corresponding shards in the non-key blocks and high-weight blocks of the storage layer, and after the deletion succeeds, have the forged data recovery module call the forged data recovery algorithm to check whether any group has forged data to reclaim; otherwise, the operation is a query, and step (9) is executed;

Step (9): filter the query results according to the global IDs computed by the reconstruction module and the query condition, and return the results to the tenant.
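The control flow of steps (1) to (9) can be summarized as a small routing table. The following Python sketch is only a reading aid: the function name `route` and the string labels for operation types are assumptions, and steps (5) and (6) run only once the cached data actually meets the grouping conditions.

```python
def route(op_type):
    """Return the ordered step numbers that one operation type passes through."""
    if op_type == "insert":
        return [1, 2, 3, 4, 5, 6]
    if op_type == "modify":
        # Reconstruct and delete the original in step (7), then re-enter at step (3).
        return [1, 2, 7, 3, 4, 5, 6]
    if op_type == "delete":
        # Step (8) runs the deletion algorithm and then the recovery check.
        return [1, 2, 7, 8]
    if op_type == "query":
        # Step (8) only rules out deletion before handing over to step (9).
        return [1, 2, 7, 8, 9]
    raise ValueError(f"unknown operation type: {op_type}")

for op in ("insert", "modify", "delete", "query"):
    print(op, route(op))
```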

The insert operation adds a global ID and a timestamp to the newly inserted data and stores it temporarily in the temporary update table. When the amount of data in the temporary update table exceeds a set threshold, the group generation and blocking algorithm is called to group the data in the table horizontally; the data is then blocked and balanced one horizontal group at a time and finally written to the storage layer.

During deletion, for each set of incompatible blocks, the shards of the deleted data in the key blocks and low-weight blocks are retained.

The data deletion algorithm is as follows:

Input: data object physical view DOPV, temporary update table U, privacy blocking policy PPS, deletion condition RC

Step (1-1): delete the data matching RC from U;

Step (1-2): obtain the numbers of all blocks other than key blocks and low-weight blocks, and store them in the array chunk;

Step (1-3): use the private data reconstruction algorithm to obtain resultSet, the set of global IDs of the original data in DOPV that match RC;

Step (1-4): for each id in resultSet do

    for i = 0 to chunk.length-1 do

        compute the DSID of id in chunk[i] and delete the corresponding shard from chunk[i];
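The deletion steps above can be sketched in Python under some simplifying assumptions: the storage layer is modeled as a dictionary from block number to shards keyed by DSID, `result_set` stands in for the output of the private data reconstruction algorithm, and `dsid_of` is a hypothetical placeholder, since the text does not give the DSID formula.

```python
import hashlib

def dsid_of(global_id, block_no):
    """Hypothetical DSID derivation; the source does not give the real formula."""
    return hashlib.sha256(f"{global_id}:{block_no}".encode()).hexdigest()[:16]

def delete_data(storage, update_table, result_set, condition, retained_blocks):
    """storage: block_no -> {dsid: shard}; update_table: global_id -> record."""
    # Step (1-1): delete matching records from the temporary update table U.
    for gid in [g for g, rec in update_table.items() if condition(rec)]:
        del update_table[gid]
    # Step (1-2): every block except key blocks and low-weight blocks.
    chunk = [b for b in storage if b not in retained_blocks]
    # Steps (1-3) and (1-4): result_set stands in for the reconstruction output.
    for gid in result_set:
        for block_no in chunk:
            storage[block_no].pop(dsid_of(gid, block_no), None)

# Three blocks hold shards of records 7 and 8; block 3 is key, block 2 is low-weight.
storage = {b: {dsid_of(g, b): f"shard-{g}-{b}" for g in (7, 8)} for b in (1, 2, 3)}
update_table = {9: {"disease": "Flu"}, 10: {"disease": "Cancer"}}
delete_data(storage, update_table, {7}, lambda rec: rec["disease"] == "Flu", {2, 3})
print(len(storage[1]), len(storage[2]), len(storage[3]), sorted(update_table))  # 1 2 2 [10]
```

Record 7 disappears only from block 1, so an observer of the storage layer cannot tell which shards in blocks 2 and 3 belonged to it.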

The forged data recovery algorithm is as follows:

When the proportion of remaining data in a horizontal group falls below the set lower limit T_remain, the remaining data is inserted into the temporary update table U and all data of that group is deleted; the data in table U is then regrouped and stored again. The algorithm is described as follows:

Input: group information table group_info, remaining data ratio threshold T_remain

Step (2-1): for each gid in group_info do

Step (2-2): if remain/total < T_remain

Step (2-3): call the private data reconstruction algorithm to obtain resultSet, the set of global IDs of the remaining data in group gid;

Step (2-4): for each id in resultSet

Step (2-5): insert the original data corresponding to id into U;

Step (2-6): call the data deletion algorithm to delete all data of group gid from the storage layer
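A minimal Python sketch of the recovery loop follows. It assumes `group_info` maps each group ID to a `(remain, total)` pair, and it takes two hypothetical callbacks: `fetch_remaining` stands in for reconstruction of the group's remaining valid records, and `drop_group` stands in for the call to the data deletion algorithm.

```python
def recycle_forged_data(group_info, t_remain, fetch_remaining, drop_group, update_table):
    """Move the valid records of sparse groups back to U and drop those groups."""
    recycled = []
    for gid, (remain, total) in group_info.items():                  # step (2-1)
        if total > 0 and remain / total < t_remain:                  # step (2-2)
            for global_id, record in fetch_remaining(gid).items():   # steps (2-3), (2-4)
                update_table[global_id] = record                     # step (2-5)
            drop_group(gid)                                          # step (2-6)
            recycled.append(gid)
    return recycled

remaining = {"g1": {101: ("Bob", "Flu"), 102: ("Eve", "Cancer")}, "g2": {}}
dropped, U = [], {}
done = recycle_forged_data({"g1": (2, 10), "g2": (8, 10)}, 0.3,
                           remaining.get, dropped.append, U)
print(done, dropped, sorted(U))  # ['g1'] ['g1'] [101, 102]
```

Group g1 keeps only 20% valid data, below the 30% threshold used here, so its two valid records return to U and wait to be regrouped.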

Algorithm 3. Group generation and blocking algorithm

Input: temporary data table U, minimum group size N, minimum block entropy H₁, and minimum conditional entropy H₂

Output: group g

Step (3-1): create an empty group g (the elements of g are the unique identifiers of records in U);

Step (3-2): sort the values of the key attribute in table U by frequency of occurrence, in ascending order;

Step (3-3): while g.length < N do

    following the order from step (3-2), select the key attribute values one by one and add the IDs of the data containing each value to g;

Step (3-4): select each data item t from the remaining data in table U in turn; if the entropy of every corresponding block is still no less than H₁ after t is added to g, add t.ID to g;

Step (3-5): call the blocking policy to divide the data in g into blocks;

Step (3-6): construct forged data according to H₁ and H₂ to balance each block;

Step (3-7): store the data of each block in the storage layer;

Step (3-8): return g;
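Steps (3-1) to (3-4) of Algorithm 3 can be sketched in Python as below. This is an illustration under stated assumptions, not the patented implementation: there is a single key attribute, the blocking policy is passed in as a list of attribute tuples, ties in frequency are broken by value name, and blocking, balancing, and storage (steps (3-5) to (3-7)) are left out.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of values."""
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())

def generate_group(table, key_attr, blocks, n_min, h1):
    """table: id -> record dict; blocks: attribute tuples of the blocking policy."""
    group = []                                                   # step (3-1)
    freq = Counter(rec[key_attr] for rec in table.values())
    ordered = sorted(freq, key=lambda v: (freq[v], str(v)))      # step (3-2)
    for value in ordered:                                        # step (3-3)
        if len(group) >= n_min:
            break
        group += [i for i, rec in table.items() if rec[key_attr] == value]

    def min_block_entropy(ids):
        return min(entropy([tuple(table[i][a] for a in attrs) for i in ids])
                   for attrs in blocks)

    for i in table:                                              # step (3-4)
        if i not in group and min_block_entropy(group + [i]) >= h1:
            group.append(i)
    return group

U = {1: {"zip": "A", "disease": "Flu"}, 2: {"zip": "B", "disease": "Flu"},
     3: {"zip": "A", "disease": "Flu"}, 4: {"zip": "A", "disease": "Cancer"},
     5: {"zip": "B", "disease": "Cancer"}, 6: {"zip": "C", "disease": "HIV"}}
print(generate_group(U, "disease", [("zip",), ("disease",)], n_min=3, h1=2.0))  # [6, 4, 5]
```

Rare key attribute values enter the group first, which keeps them from being left behind in U where they would be hard to balance later.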

Algorithm 4. Private data reconstruction algorithm

Input: data object physical view DOPV, privacy blocking policy PPS, reconstruction condition RC

Output: set of global IDs of the original data, resultSet

Step (4-1): split the reconstruction condition RC according to PPS;

Step (4-2): for i = 1 to k do

Step (4-3): query the DSIDs in block Chunk_i that satisfy rc_i, and put their corresponding global IDs into the set IDSet_i

Step (4-4): intersect all the IDSets to obtain resultSet = IDSet_1 ∩ IDSet_2 ∩ ... ∩ IDSet_k

Step (4-5): filter out of resultSet the global IDs that correspond to forged data;

Step (4-6): store RC and resultSet in the cache;

Step (4-7): return resultSet.
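The core of Algorithm 4 is a per-block filter followed by a set intersection. The Python sketch below simplifies the data model: shards are keyed directly by global ID instead of by DSID, each sub-condition rc_i is a predicate (or `None` when the condition does not touch that block), forged IDs are supplied as a set, and the caching of step (4-6) is omitted.

```python
def reconstruct(blocks, conditions, forged_ids):
    """blocks: list of {global_id: shard dict}; conditions: one predicate or None per block."""
    id_sets = []
    for shards, rc in zip(blocks, conditions):                # steps (4-2), (4-3)
        if rc is not None:
            id_sets.append({gid for gid, shard in shards.items() if rc(shard)})
    result = set.intersection(*id_sets) if id_sets else set()  # step (4-4)
    return result - set(forged_ids)                           # step (4-5)

chunk1 = {1: {"age": 30}, 2: {"age": 40}, 3: {"age": 30}}
chunk2 = {1: {"disease": "Flu"}, 2: {"disease": "Flu"}, 3: {"disease": "Flu"}}
ids = reconstruct([chunk1, chunk2],
                  [lambda s: s["age"] == 30, lambda s: s["disease"] == "Flu"],
                  forged_ids={3})
print(ids)  # {1}
```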

Beneficial effects of the present invention:

1. The present invention proposes a dynamic data privacy protection mechanism based on block obfuscation. The mechanism processes tenant business operations one horizontal group at a time. Newly inserted and modified data is cached by a trusted third party and, once the conditions are met, grouped and uploaded to the storage layer. Retaining key shards protects the privacy of both the deleted data and the remaining data in a delete operation. A forged data recovery mechanism reduces storage consumption and improves application performance. Experimental verification and performance evaluation were carried out, and the results show that the mechanism protects tenant privacy effectively during business processing while also delivering good processing performance.

2. To address the privacy leakage and block balancing problems caused by multi-tenant business operations on data in SaaS applications, the present invention first groups the data horizontally and protects tenant data at the finer granularity of the horizontal group. It introduces block information entropy to evaluate how well balanced a block is, and protects the private data as a whole by keeping each horizontal group secure during tenant business operations. Taking the balance state of a group into account when the group is formed, and later merging groups and reclaiming forged data, reduces the impact of forged data on application performance.

Description of the Drawings

Figure 1 is a schematic diagram of data changes;

Figure 2 is a schematic diagram of the system structure of the present invention;

Figure 3 is a schematic diagram of how business operations change as the data volume increases;

Figure 4 is a schematic diagram of business operation overhead under different methods;

Figure 5 is a schematic diagram of how the proportion of forged data changes over time;

Figure 6 is a schematic diagram of how the proportion of forged data changes with group size;

Figure 7 is a flow chart of the method of the present invention.

Detailed Description

The present invention is further described below with reference to the drawings and embodiments.

1 Dynamic data privacy protection framework

To address the privacy leakage problem and the other shortcomings that arise during tenant business operations, the present invention proposes a dynamic data privacy protection framework based on block obfuscation (Figure 2) and a dynamic data privacy protection method (Figure 7). The framework comprises four parts: an application layer, a privacy protection layer, a storage layer, and a trusted third party. The application layer manages the applications leased by tenants and responds to tenant business requests. The privacy protection layer processes the business requests submitted by tenants and protects tenant privacy during processing according to the privacy protection requirements each tenant has customized. The storage layer stores the tenant data after blocking. The trusted third party manages the tenants' most recently updated data and their privacy protection policies, and uses an identity authentication mechanism to prevent attackers from impersonating tenants to gain unauthorized access to the data and policies.

In the framework shown in Figure 2, after the application layer receives a tenant's business request, it forwards the request to the update processing module in step ①. The update processing module decides, from the request type, which operations must be performed on the storage layer and on the trusted third party, and sends the corresponding instructions in step ②. The temporary data management module groups the most recently updated data and notifies the block processing module when grouping is complete. In step ③ the block processing module retrieves the blocking policy from the trusted third party and divides the temporary data group into blocks. After blocking, the blocks are balanced in step ④ and the result is written to the storage layer in step ⑤. In step ⑥ the forged data recovery module monitors the amount of remaining data of each group in the storage layer and periodically reclaims surplus forged data.

2 Implementation of the dynamic data privacy protection mechanism

The preceding examples analyzed the privacy leakage and block balancing problems caused by tenant business operations while a SaaS application keeps running, and a dynamic data privacy protection framework based on block obfuscation was proposed to address them. Building on the previous section, this section describes the concepts behind the framework and its implementation in detail, and uses formulas and theorems to show that performing business operations on data through the framework effectively prevents tenant privacy leakage. Several definitions are given first.

2.1 Related definitions

Definition 1. Horizontal group G. A horizontal group is a subset of the tenant data logical view T. The logical view is the union of its horizontal groups, $T = \bigcup_{i=1}^{m} G_i$, and any two different horizontal groups G_i and G_j do not overlap, i.e. G_i ∩ G_j = φ (1 ≤ i < j ≤ m).

Definition 2. Privacy blocking policy PPS. According to the privacy constraints specified by the tenant, the attribute set Attrs = {A₁, A₂, ..., A_n} of the logical view T is divided into several subsets, i.e. Attrs = ∪ subAttrs(i) (i = 1, 2, ..., k), where k is the number of blocks, such that no subset violates the privacy constraints and no two attribute subsets overlap, i.e. subAttrs(i) ∩ subAttrs(j) = φ (1 ≤ i < j ≤ k).

Definition 3. Block information entropy H(X). Given a data block X whose shards take values in {v₁, v₂, ..., v_n}, with each value occurring with probability {p(v₁), p(v₂), ..., p(v_n)}, the information entropy of the block is:

$H(X) = -\sum_{j=1}^{n} p(v_j) \log_b p(v_j) \qquad (1)$

Here b is the base of the logarithm, usually 2, in which case entropy is measured in bits. The more evenly the shard values in a data block are distributed, the larger its information entropy. The entropy reaches its maximum when all shard values are equally likely, and its minimum of 0 when only one shard value occurs.
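Formula (1) translates directly into a few lines of Python. In this sketch the probabilities p(v_j) are estimated as relative frequencies of the shard values in the block; the function name `block_entropy` and the sample values are illustrative assumptions.

```python
from collections import Counter
from math import log

def block_entropy(shards, base=2):
    """Formula (1): H(X) = -sum_j p(v_j) * log_b p(v_j), with p from value frequencies."""
    n = len(shards)
    return sum(-(c / n) * log(c / n, base) for c in Counter(shards).values())

uniform = block_entropy(["Flu", "Cancer", "HIV", "Cold"])     # four equally likely values
skewed = block_entropy(["Flu"] * 4 + ["Cancer"] * 2)          # Flu share 2/3
single = block_entropy(["Flu"] * 5)                           # only one value occurs
print(round(uniform, 4), round(skewed, 4), round(single, 4))
```

The uniform block reaches the maximum of log₂ 4 = 2 bits, the skewed block falls below 1 bit, and the single-valued block has entropy 0.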

Definition 4. Shard dependency p(x/y), expressed as a conditional probability, where x is a value of attribute X in block A and y is a value of attribute Y in block B. The shard in block A whose X value is x is said to depend, with probability p(x/y), on the shard in block B whose Y value is y.

Shard dependency is common in practice, for example in the relationship between an employee's salary level and position. Suppose that in some company ordinary employees generally earn 3000-5000 and department managers earn 5000-8000, but about 15% of those in the 5000-8000 band are ordinary employees with outstanding performance, and 5% of those in the 3000-5000 band are department managers with poor performance. Given a salary of 6500, one can then infer with 85% probability that the person is a department manager; that is, the shard with salary 6500 depends with probability 85% on the shard whose position is "department manager". In this case, even if the two blocks containing the salary and position attributes have each been balanced, shards can still be linked with high probability.

To describe how well balanced the data is when attributes in different blocks depend on one another, the present invention introduces the concept of inter-block conditional entropy:

Definition 5. Inter-block conditional entropy H(Y/X). This is the information entropy that remains in random variable Y once the value of random variable X is known; it measures the uncertainty of Y given X. It is expressed as:

$\begin{matrix} H(Y/X) = \sum_{x \in X} p(x) H(Y/X = x) \\ = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y/x) \log p(y/x) \\ = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(y/x) \\ = -\sum_{x, y} p(x, y) \log p(y/x) \end{matrix} \qquad (2)$

Formula (2) shows that when the value of Y is completely determined by X, H(Y/X) takes its minimum value of 0; conversely, H(Y/X) takes its maximum value, H(Y), if and only if X and Y are independent.
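The two boundary cases of formula (2) can be checked numerically. The Python sketch below estimates H(Y/X) from a list of observed (x, y) pairs, using the joint frequency for p(x, y) and the ratio of joint to marginal counts for p(y/x); the function name and the sample pairs are illustrative assumptions.

```python
from collections import Counter
from math import log2

def conditional_entropy(pairs):
    """Formula (2): H(Y/X) = -sum_{x,y} p(x, y) * log2 p(y/x), from (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    return sum(-(c / n) * log2(c / marginal_x[x]) for (x, _), c in joint.items())

# Y fully determined by X: position fixes the salary band.
determined = [("manager", "high"), ("staff", "low")] * 3
# X and Y independent: every combination is equally frequent.
independent = [(x, y) for x in ("manager", "staff") for y in ("high", "low")]
print(round(conditional_entropy(determined), 4), round(conditional_entropy(independent), 4))
```

The first case yields 0 bits and the second yields 1 bit, which equals H(Y) for two equally likely salary bands.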

Definition 6. Temporary update table U. This table resides at the trusted third party and stores the data most recently updated by tenants. Its form is (ID, A₁, A₂, ..., A_n, Date), where ID is the globally unique identifier of the record, used to generate the DSID of each shard during blocking, and Date is a timestamp that records when the data was inserted or modified.

2.2 Implementation of dynamic data privacy protection

To prevent attackers from maliciously obtaining tenants' privacy protection policies, works [1] and [2] propose having a trusted third party manage the policies and using an authentication mechanism to block malicious requests. The present invention adds a temporary data management module to the trusted third party. The module temporarily stores the data most recently updated by tenants and, after horizontal groups have been generated, divides each group vertically into blocks and stores them in the storage layer. The identity authentication mechanism of the trusted third party prevents temporary data from being obtained maliciously, so temporary data can safely be stored in plaintext, which keeps response times short when tenants request data.

2.2.1 Inserting data

Suppose tenant Tenant_i submits the following insert request at time t₁:

INSERT INTO T VALUES (a₁, a₂, ..., a_n) ①

According to the analysis in Section 2, if the data is written directly into the blocks of the cloud database, every block gains the same number of records at the same time. An attacker can then infer the association between the newly added shards from the tenant's operations, which puts this part of the tenant's privacy at serious risk of leakage.

The mechanism therefore converts request ① into the following request:

INSERT INTO U VALUES (id, a₁, a₂, ..., a_n, t₁) ②

Request ② adds a global ID and the timestamp t₁ to the newly inserted data and stores it temporarily in the temporary table U. When the amount of data in table U exceeds the set threshold N, the group generation algorithm (Algorithm 3) is called to group the data in table U horizontally; the data is then blocked and balanced one horizontal group at a time and finally written to the storage layer. Operations on table U take place at the trusted third party. Although writing to the storage layer exposes the grouping relationship to an attacker, the blocks within each group have already been balanced, so the group as a whole remains secure and the insertion process is reliable.
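The buffering behavior of request ② can be sketched as a small Python class. Everything here is an illustrative assumption: the class name, the sequential ID generator, and the `on_group` callback, which stands in for Algorithm 3 and is expected to return the IDs it has grouped and written to the storage layer.

```python
import itertools
import time

class TemporaryUpdateTable:
    """In-memory stand-in for table U at the trusted third party."""

    def __init__(self, threshold, on_group):
        self.threshold = threshold        # the threshold N from the text
        self.on_group = on_group          # hypothetical hook standing in for Algorithm 3
        self.rows = {}
        self._ids = itertools.count(1)

    def insert(self, values, now=None):
        gid = next(self._ids)
        stamp = time.time() if now is None else now
        self.rows[gid] = (gid, *values, stamp)     # request 2: (id, a1..an, t1)
        if len(self.rows) > self.threshold:
            # Rows chosen for the new group leave U once they reach the storage layer.
            for grouped_id in self.on_group(self.rows):
                self.rows.pop(grouped_id, None)
        return gid

flushed = []
def take_all(rows):
    flushed.append(dict(rows))
    return list(rows)

u = TemporaryUpdateTable(threshold=2, on_group=take_all)
for values in [("Bob", "Flu"), ("Eve", "Cancer"), ("Dan", "Flu")]:
    u.insert(values, now=0)
print(len(flushed), len(flushed[0]), len(u.rows))  # 1 3 0
```

The first two inserts only touch U; the third pushes the table past the threshold and triggers grouping, so the storage layer never sees one record arrive at a time.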

2.2.2删除数据2.2.2 Delete data

对于临时更新表U中的数据，删除操作发生在可信第三方，直接将对应数据删除即可。而对应于存储层各分块中的数据，同时执行删除操作会引起隐私泄露，因此本发明提出在删除数据时对部分分块中的数据进行保留，为此首先引入以下概念：For the data in the temporary update table U, the deletion operation occurs in a trusted third party, and the corresponding data can be directly deleted. Corresponding to the data in each block of the storage layer, performing the deletion operation at the same time will cause privacy leakage. Therefore, the present invention proposes to retain the data in some blocks when deleting data. For this reason, the following concepts are firstly introduced:

Definition 7. Key attribute (KA): an attribute in a privacy constraint that is highly sensitive, as specified by the tenant in its privacy protection requirements. For example, in the incompatibility constraint {Owner, Age, Zip, Disease}, patients are more sensitive about the Disease attribute: even if a patient's identity is leaked, the patient still expects the disease information to remain protected. Securing the block that contains the Disease attribute is therefore more important.

Definition 8. Key block (KC): a block that contains a key attribute.

Definition 9. Low-weight block (LWC): the block that, apart from the key blocks, is involved least frequently in business processing. The amount of data in this block has little effect on business processing efficiency.

Definition 10. Incompatible blocks (MEC): all the blocks involved in the same incompatibility privacy constraint are called a group of incompatible blocks (for example, Chunk1, Chunk2 and Chunk3 in Figure 1 form one such group). When the privacy requirements contain several incompatibility constraints, the resulting partitioning contains several groups of incompatible blocks.

When a deletion is processed, for each group of incompatible blocks the fragments of the deleted data that lie in the key block and the low-weight block are retained. An attacker can then reconstruct at most some of the fragments of the same record and cannot violate the privacy constraint, which keeps the tenant's private data secure. The deletion algorithm is described as follows:

Algorithm 1. Data deletion algorithm

① Delete the data that satisfies RC from U;

② Obtain the numbers of all blocks other than the key blocks and the low-weight blocks, and store them in the array chunk;

③ Use Algorithm 4 to obtain resultSet, the global identifiers of the original data in DOPV that satisfy condition RC;

④ for each id in resultSet do

for i = 0 to chunk.length - 1 do
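The listing above breaks off before the body of the inner loop. The sketch below is one possible reading, assuming that the body removes the fragment of each matching record from every block in `chunk`, which is what the preceding paragraph describes. The storage layout (a dict of blocks keyed by global ID), the `matches` predicate and the `reconstruct` callable standing in for Algorithm 4 are hypothetical.

```python
def delete_records(storage, u_rows, key_blocks, low_weight_blocks, matches, reconstruct):
    """Sketch of Algorithm 1 under the assumptions stated above.

    storage:     dict block_no -> dict global_id -> fragment (hypothetical layout)
    u_rows:      dict global_id -> record, standing in for table U
    matches:     predicate playing the role of the delete condition RC
    reconstruct: callable standing in for Algorithm 4; returns matching global IDs
    """
    # Step 1: rows still cached in U are simply removed.
    for gid in [g for g, rec in u_rows.items() if matches(rec)]:
        del u_rows[gid]
    # Step 2: every block except the key and low-weight blocks is eligible.
    retained = set(key_blocks) | set(low_weight_blocks)
    chunk = [b for b in sorted(storage) if b not in retained]
    # Step 3: global IDs of stored records that satisfy RC.
    result_set = reconstruct(matches)
    # Step 4 (loop body assumed): drop the fragment from each eligible block.
    for gid in result_set:
        for block_no in chunk:
            storage[block_no].pop(gid, None)
    return chunk
```

The fragments in the key block and the low-weight block are left untouched, so the stored data never loses a complete record in one step.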

Two auxiliary theorems are given below to show that Algorithm 1 does not cause privacy leakage during deletion:

Theorem 1. For a group of incompatible blocks, if the data in one block is kept unchanged and each deletion removes data only from the other blocks, then the information entropy of each block after the deletion is greater than or equal to its information entropy before the deletion.

Proof: Suppose a group of incompatible blocks consists of two blocks C1 and C2, and that a delete request removes only the fragments in C2. Because the data distribution of C1 never changes, the information entropy of C1 also stays the same. For block C2, let the data distribution before the fragments are deleted be {p(v₁), p(v₂), ..., p(v_n)}. By the principle of maximum entropy, when only part of the knowledge is available, the most reasonable inference about the unknown part is the most random one that is consistent with what is known. We can therefore always find a way of filling in the missing data such that the completed distribution is {p₁(v₁), p₁(v₂), ..., p₁(v_n)}, with:

$-\sum_{j=1}^{n} p(v_j) \log_b p(v_j) \leq -\sum_{j=1}^{n} p_1(v_j) \log_b p_1(v_j)$

That is, the current entropy of block C2 as inferred by the attacker can only be greater than or equal to its initial entropy. In other words, without background knowledge of the data distribution in the block, the attacker can only guess at the private data by assuming that C2 is distributed as uniformly as possible.
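The entropy in the inequality above can be computed directly from the values stored in a block. The helper below is an illustrative sketch (the function name `shannon_entropy` is an assumption); it shows only the standard fact the proof relies on, namely that a more uniform distribution has higher entropy than a skewed one over the same values.

```python
import math
from collections import Counter


def shannon_entropy(values, base=2):
    """Shannon entropy -sum p(v) log_b p(v) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())


uniform = shannon_entropy(["a", "a", "b", "b"])   # two values, equally likely
skewed = shannon_entropy(["a", "a", "a", "b"])    # same values, uneven
```

Here `uniform` is 1 bit, while `skewed` is lower, at roughly 0.81 bits.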

Theorem 2. For a group of incompatible blocks, if the data in one block is kept unchanged and each deletion removes data only from the other blocks, then the conditional entropy of each block after the deletion is greater than or equal to its conditional entropy before the deletion.

Proof: Let the events X and Y denote taking a value from blocks C1 and C2 respectively. Here we are concerned only with the information entropy needed to guess Y = y_j given X = x_i, or to guess X = x_i given Y = y_j, so it suffices to prove that H(Y|X = x) and H(X|Y = y) do not decrease. In

$H(X \mid Y = y) = -\sum_{x \in X} p(x \mid y) \log p(x \mid y)$

the values taken by X do not change, and for given x and y the probability p(x|y) stays the same, so H(X|Y = y) keeps its initial value. As for H(Y|X = x),

$H(Y \mid X = x) = -\sum_{y \in Y} p(y \mid x) \log p(y \mid x)$

deleting fragments from C2 may reduce the number of distinct values taken by Y. Without background knowledge, however, the attacker must assume that Y can take values other than those that remain; the number of possible values of Y is therefore at least the initial number, so H(Y|X = x) is greater than or equal to its initial value.

By auxiliary theorems 1 and 2, when only some of the fragments are deleted, the entropy of each block and the conditional entropy between blocks never decrease. Deletion therefore does not leak tenant privacy through a reduction in entropy.

As a SaaS application keeps running, the following situation may gradually arise in the cloud database: after many deletions and modifications, a horizontal group is left with only one or a few records. To keep the tenant's private data secure on deletion, the data in the key block and the low-weight block must be retained in full, along with the forged data added during balancing. Storing these few records then carries a relatively high storage cost, and as the number of such groups grows, the large proportion of forged data also has a considerable effect on query efficiency.

The present invention therefore adds a forged data recovery mechanism to the algorithm design. When the proportion of remaining data in a horizontal group falls below the configured lower limit T_remain, the remaining data is inserted into the temporary update table U and all data of that group is deleted; the data in U is then regrouped and stored again. The algorithm is described as follows:

Algorithm 2. Forged data recovery

① for each gid in group_info do

② if remain/total < T_remain

③ Call Algorithm 4 to obtain resultSet, the set of global IDs of the data remaining in group gid;

④ for each id in resultSet

⑤ Insert the original data corresponding to id into U;

⑥ Call Algorithm 1 to delete all data of group gid from the storage layer
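The recovery steps above can be sketched as follows. The layout of `group_info` (a dict of remaining and total counts per group), the `remaining_records` callable standing in for Algorithm 4 plus the fetch of each original record, and the `delete_group` callable standing in for the call to Algorithm 1 are all assumptions. Removing the recycled group from `group_info` is also an assumption, since the source does not say how the bookkeeping is updated.

```python
def recycle_forged_data(group_info, t_remain, remaining_records, temp_table, delete_group):
    """Sketch of Algorithm 2 under the assumptions stated above.

    group_info:        dict gid -> (remain, total) record counts (hypothetical layout)
    remaining_records: callable gid -> dict global_id -> original record
    temp_table:        dict standing in for table U
    delete_group:      callable gid -> None, standing in for Algorithm 1
    """
    recycled = []
    for gid, (remain, total) in list(group_info.items()):
        # Step 2: recycle only groups whose share of real data fell below T_remain.
        if total and remain / total < t_remain:
            # Steps 3-5: move the surviving original records back into U.
            temp_table.update(remaining_records(gid))
            # Step 6: drop the whole group, forged data included, from storage.
            delete_group(gid)
            del group_info[gid]  # assumed bookkeeping, not stated in the source
            recycled.append(gid)
    return recycled
```

The records moved back into U are regrouped the next time the insertion threshold is reached, so the forged data of the old group is not carried over.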

2.2.3 Modifying data

For data in the temporary update table U, the modify operation takes place at the trusted third party, so the matching data can simply be modified. For data held in the blocks of the storage layer, modifying all blocks at the same time would leak private information. The present invention therefore inserts the modified data directly into the temporary update table U and, at the same time, deletes the original data from the blocks of the storage layer. Section 2.2.2 showed that the deletion process of Algorithm 1 is secure, and the insert operation takes place at the trusted third party, so the modification process as a whole is secure.
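Modification as described here is a delete followed by an insert. A minimal sketch, assuming three hypothetical callables: `fetch_original` returns the reconstructed record, `insert_into_u` caches a record in table U and returns its new global ID, and `delete_from_storage` runs the deletion of Algorithm 1 for one record. Whether the modified record receives a new global ID is not stated in the source and is assumed here.

```python
def modify_record(gid, changes, fetch_original, insert_into_u, delete_from_storage):
    """Apply changes to a stored record by re-inserting it into U and deleting the original."""
    # Build the modified record without mutating the stored original.
    updated = {**fetch_original(gid), **changes}
    # The new version is cached at the trusted third party like any fresh insert.
    new_id = insert_into_u(updated)
    # The old version is removed from the storage-layer blocks.
    delete_from_storage(gid)
    return new_id
```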

2.2.4 Group generation and data blocking

Output: group g

① Create an empty group g (the elements of g are the unique identifiers of records in U);

② Sort the values of the key attribute in table U by frequency of occurrence, in ascending order;

③ while g.length < N do

Following the order from ②, take the key attribute values in turn and add the IDs of the records containing each value to g;

④ Take each record t in turn from the remaining data in table U; if, after adding it to g, the entropy of every corresponding block is still no less than H1, add t.ID to g;

⑤ Call the blocking strategy to partition the data in g into blocks;

⑥ Construct forged data according to H1 and H2 to balance the blocks;

⑦ Store the data of each block in the storage layer;

⑧ return g;

Algorithm 3 groups the data on the basis of the key attribute, so that the key block is as balanced as possible in the resulting group. Step ② sorts the key attribute values by frequency of occurrence, and step ③ takes the least frequent values in turn and adds the corresponding records to the group. This gives the key attribute more distinct values within the group and a more uniform distribution. Step ④ is greedy: it tries each of the remaining records in turn and accepts a record into g if the entropy of every block stays no less than H1 after the record is added. Once the group g has been produced, steps ⑤ to ⑦ call the blocking strategy to partition the group into blocks and balance each block using the balancing method of work [1].
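Steps ① to ④ of the group generation can be sketched as below. This is an illustration under assumptions, not the patented procedure: records are held in a dict keyed by ID, each block is given as a tuple of attribute names, entropy is measured in bits over the tuples of values a block would hold, and blocking, balancing and storage (steps ⑤ to ⑦) are left out.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of the empirical distribution of values."""
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in Counter(values).values())


def generate_group(rows, key_attr, blocks, n, h1):
    """rows: dict id -> dict attr -> value; blocks: list of attribute-name tuples."""
    # Step 2: key attribute values ordered from rarest to most frequent.
    freq = Counter(r[key_attr] for r in rows.values())
    g = []
    # Step 3: add all records of each rare value until g holds at least n IDs.
    for value, _ in sorted(freq.items(), key=lambda kv: kv[1]):
        if len(g) >= n:
            break
        g.extend(rid for rid, r in rows.items() if r[key_attr] == value)
    # Step 4: greedily accept a remaining record if every block keeps entropy >= h1.
    chosen = set(g)
    for rid in rows:
        if rid in chosen:
            continue
        candidate = g + [rid]
        if all(
            entropy([tuple(rows[i][a] for a in attrs) for i in candidate]) >= h1
            for attrs in blocks
        ):
            g = candidate
            chosen.add(rid)
    return g
```

Taking the rarest key values first means the group starts from the most diverse part of the buffer, which is why the later greedy step rejects records that would only repeat values already present.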

2.2.5 Private data reconstruction

Once private data has been partitioned into blocks and obfuscated, it is no longer directly usable, yet in a cloud computing environment tenants need to access, delete and modify the original data frequently. How to reconstruct private data quickly has therefore become a key problem for privacy protection in SaaS applications.

When private data is partitioned into blocks, a unique fragment identifier (DSID) is generated for each fragment from the global ID of the original record. This identifier is used mainly to reconstruct the original data and to filter out forged data. Algorithm 4 is the original-data reconstruction algorithm, which runs at the trusted third party. Its inputs are the data object physical view DOPV, the privacy blocking policy PPS and the reconstruction condition RC; its output is the set of global IDs of the original records that satisfy the condition. The algorithm first uses the blocking policy to split RC into RC = {rc₁, rc₂, ..., rc_i, ..., rc_k}, where k is the number of blocks and rc_i is the reconstruction condition on block Chunk_i. It then queries each block Chunk_i for the DSIDs that satisfy rc_i, converts them to global IDs and puts them into the set IDSet_i. Next it intersects all the IDSets to obtain resultSet, the set of global IDs of the original data that satisfy RC, and filters out the global IDs that correspond to forged data. Finally it stores RC and resultSet in the cache so that the same reconstruction request executes faster next time. The algorithm is described as follows:

① Split the reconstruction condition RC according to PPS;

② for i = 1 to k do

③ Query block Chunk_i for the DSIDs that satisfy rc_i, and put the corresponding global IDs into the set IDSet_i

④ Intersect all IDSets to obtain resultSet;

⑤ Filter out of resultSet the global IDs that correspond to forged data;

⑥ Store RC and resultSet in the cache;

⑦ return resultSet;
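The reconstruction steps can be sketched as follows. The sketch assumes that RC has already been split into one predicate per block (step ①), that each block is a dict from DSID to fragment value, and that a hypothetical `to_global_id` function maps a DSID back to its global ID; the source does not say how DSIDs are derived, so that mapping is purely illustrative. The per-block queries run sequentially here, whereas the text describes them as running in parallel.

```python
def reconstruct(chunks, rc, to_global_id, forged_ids, cache, cache_key):
    """Sketch of Algorithm 4 under the assumptions stated above.

    chunks:     list of dicts, DSID -> fragment value, one per block
    rc:         list of predicates rc_i, one per block
    forged_ids: global IDs known to belong to forged records
    cache:      dict used to remember earlier results, keyed by cache_key
    """
    if cache_key in cache:
        return cache[cache_key]
    # Steps 2-3: collect the global IDs matching rc_i in each block.
    id_sets = [
        {to_global_id(dsid) for dsid, fragment in chunk.items() if rc_i(fragment)}
        for chunk, rc_i in zip(chunks, rc)
    ]
    # Step 4: a record qualifies only if it matches in every block.
    result_set = set.intersection(*id_sets) if id_sets else set()
    # Step 5: remove IDs that belong to forged records.
    result_set -= set(forged_ids)
    # Step 6: remember the answer for identical requests.
    cache[cache_key] = result_set
    return result_set
```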

In the algorithm above, the queries for matching DSIDs in the individual blocks run in parallel, and the query time is proportional to the data volume. Let N be the maximum length of any IDSet; if the intersection is computed by two-way merging, its time complexity is N² · log₂ k. In general the number N of matching records is not very large, so this time complexity is acceptable.

3 Experimental evaluation

3.1 Experimental environment

For the experiments, a virtual machine requested from our laboratory's cloud computing platform serves as the trusted third party; it is configured with 8 cores, 16 GB of memory and a 500 GB hard disk. MongoDB 3.0.0 stores the tenants' privacy protection policies, and MySQL 5.6.22 stores the data of the temporary update table.

An Inspur blade server simulates the deployment of the multi-tenant application; it is configured with a 12-core CPU (3.10 GHz), 32 GB of memory and a 2 TB hard disk. The storage layer also uses MySQL 5.6.22 as its database, and the multi-tenant application is simulated by a program written in Java 1.8.

The data set is taken from the basic medical registration table for insured persons in the internal test data set of our laboratory's social security project, and contains about 3 million records. Twenty attributes were selected for the experiments, including surname (anonymized), medical personnel category, gender, age, trust level, widowed or solitary status, organization, medical account and other attributes. The incompatibility privacy constraint is (surname, gender, age, organization).

3.2 Business processing overhead experiment

Figure 3 shows the processing time of insert, delete and modify operations at different data volumes under the dynamic data privacy protection mechanism proposed by the present invention. The horizontal axis is the data volume and the vertical axis is the operation time (in ms).

As Figure 3 shows, newly inserted data is always stored in the temporary table, and table U holds only a small and fairly stable amount of data, so inserts are processed very quickly, in only a few tens of milliseconds. Delete and modify operations must first reconstruct the data being updated, so their processing time is much longer and grows linearly with the data volume.

Figure 4 compares the method proposed by the present invention with three other methods in terms of business processing efficiency. Method A applies no privacy protection. Method B is the dynamic privacy protection mechanism proposed by the present invention, using the blocking strategy of work [1]. Method C is the privacy protection mechanism proposed in work [1]. Method D is the privacy protection method of work [3], which clusters the data attributes first. The data set contains about two million records. For each of the three types of business operation (query, insert and modify), the number of attributes involved is set at five levels (2, 4, 6, 8 and 10), and the reported result is the average processing time over all levels. As the figure shows, processing is fastest when no privacy protection is applied. For methods B and C, queries, deletions and modifications all depend on the data volume, but for the same data volume the method proposed by the present invention holds relatively less forged data, so overall method B performs better than method C of work [1]. Because method D clusters the data attributes by business operation, it is more efficient than method C, but it is still less efficient than method B because of the forged data.

3.3 Forged data proportion experiment

Figure 5 compares how the proportion of forged data in methods B, C and D changes over time when the minimum information entropy of a block is set to 3. As the figure shows, with methods C and D the proportion of forged data rises gradually as the application runs, and it rises more steeply at the start. The reason is that over a short period the distribution of surnames and ages in the data set does not fully match the overall distribution and is rather uneven. As the data volume grows over time, the distribution moves closer to the overall distribution, but as the number of operations on the data keeps increasing, more forged data has to be added for balancing. In method B, by contrast, groups are formed as far as possible from data that makes the distribution within the group uniform, and the forged data recovery module recycles forged data, so the proportion of forged data stays fairly stable throughout.

Figure 6 compares how the proportion of forged data changes with group size. When groups are small, each group contains relatively few distinct values, so the data within a group is unevenly distributed and a large amount of forged data has to be added to balance it. As the group size increases, the number of distinct fragment values increases and the distribution within each block becomes more uniform. When the group size continues to grow, the distribution within a group gradually approaches the overall distribution; because surname and age are not uniformly distributed overall, the groups tend towards an uneven state again and the proportion of forged data rises once more.

These experiments show that, compared with method C of work [1], the privacy protection mechanism proposed by the present invention is more efficient in operation, and that as the application keeps running its proportion of forged data is lower and remains stable.

For SaaS applications, existing privacy protection mechanisms lack effective support for tenants' business operations and can hardly keep tenants' private information secure while the data changes dynamically. To address the privacy leakage caused by tenants' business operations in the SaaS model, and the high proportion of forged data, the present invention proposes a dynamic data privacy protection mechanism based on block obfuscation. The mechanism caches the data most recently inserted or modified by a tenant at a trusted third party; once the blocking conditions are met, the data is partitioned into blocks and balanced one group at a time and then written to the storage layer. For delete operations, the data in the key block and the low-weight block is retained, which prevents an attacker from reconstructing the tenant's private data from the data operations. Finally, a forged data recovery mechanism recycles redundant forged data in the storage layer and lowers the proportion of forged data in the total data volume. The experimental results show that the proposed mechanism protects tenant privacy while also improving application performance. Future work will mainly study how different tenant data placement strategies affect application performance while the data changes dynamically.

Although specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the present invention. Those skilled in the art should understand that any modification or variation that can be made on the basis of the technical solution of the present invention without creative effort still falls within the scope of protection of the present invention.

Claims

1. A dynamic data privacy protection system based on block obfuscation, characterized in that it comprises:

an application layer, responsible for managing the application components that tenants lease and customize, receiving the data business processing requests submitted by tenants and forwarding them to the privacy protection layer for processing; the application layer can deploy a variety of different application components, and each application component can deploy multiple instances for load balancing;

a privacy protection layer, which is the core of the dynamic data privacy protection framework and is responsible for receiving and processing the data business processing requests forwarded by the application layer and returning the processing results to the application layer; a data business processing request is a data insertion request, a data deletion request, a data query request or a data modification request;

The storage layer is responsible for storing tenant data and is the final executor of all data processing operations; the storage layer deploys multiple database instances, and the database instances use the same or different databases for deployment;

a trusted third party, responsible for managing the tenants' personalized privacy protection policies and for managing the most recently inserted or modified data; it can verify the identity of the tenant that owns the data and prevent attackers from maliciously accessing and stealing the privacy protection policies and temporary data.

2. The dynamic data privacy protection system based on block obfuscation as claimed in claim 1, characterized in that,

For data insertion requests, the newly inserted data is packaged and temporarily stored in a trusted third party;

For data deletion requests, the corresponding fragments in the non-key blocks and high-weight blocks are deleted in the storage layer; a non-key block is a block that does not contain a key attribute;

For data query requests, the reconstruction module of the privacy protection layer will return the reconstruction results of the original data;

For data modification requests, the corresponding original data in the storage layer is deleted first, and then the modified data is packaged and temporarily stored in a trusted third party.

3. The dynamic data privacy protection system based on block obfuscation as claimed in claim 1, wherein the privacy protection layer comprises: an update processing module, a block processing module, an equalization module, a reconstruction module and a forged data recovery module;

The update processing module is responsible for receiving the business operation requests delivered by the application layer, analyzing the type of each request, executing the operation flow that corresponds to that type, and returning the result to the application layer once the operation completes; the operation request types include insert, delete, modify and query; the update processing module also receives data from the reconstruction module, stores operation results in the storage layer, passes authentication information to the authentication module, and stores temporary data in the temporary data management module;

The block processing module is responsible for vertically partitioning into blocks each new group generated by the temporary data management module; when a new group is delivered to the block processing module, it sends a request to the privacy policy management module to obtain the corresponding data blocking strategy, partitions the data vertically according to that strategy and stores the block data in the storage layer;

The equalization module obtains the data distribution state of each block from the block processing module, generates forged data satisfying α, β and γ balance, and applies α, β and γ balancing to each block; the data distribution state refers to the frequency of each fragment value;

The forged data recovery module is responsible for periodically checking the status of the groups and judging, from the proportion of valid data in each group, whether the forged data of that group needs to be recycled; for a group whose forged data needs to be recycled, it transfers the remaining valid data of the group to the temporary data management module for caching and then deletes all data of that group from each block;

The reconstruction module joins the fragments back into the original private data as it was before blocking, according to the query, delete or modify conditions submitted by the tenant in the business operation, and returns the global IDs corresponding to the original data; the reconstruction module reads the block data from the storage layer, reconstructs it, and feeds the reconstruction result back to the update processing module.

4. The dynamic data privacy protection system based on block obfuscation as claimed in claim 1, characterized in that,

The trusted third party includes: a privacy policy management module, an authentication module and a temporary data management module;

The authentication module is responsible for receiving the identity authentication information of the tenant, identifying the identity of the tenant, and preventing malicious attackers from accessing the corresponding privacy protection policy and temporary data according to the identification result, while allowing legitimate tenants to access the corresponding privacy protection policy and temporary data ;

The privacy policy management module is responsible for storing and accessing the tenant's personalized privacy protection policy, and transferring the corresponding tenant's privacy protection policy to the block processing module according to the tenant ID;

The temporary data management module is responsible for temporarily storing the newly inserted or modified data of the tenant, and horizontally grouping the temporary data according to the grouping conditions, and passing the grouped data to the block processing module after generating new groups.

5. A dynamic data privacy protection method based on block obfuscation, characterized in that the method caches newly inserted and modified data at a trusted third party, and groups and stores the data when the conditions are met; retains key fragments to ensure the privacy and security of both the deleted data and the remaining data in delete operations; and reduces storage resource consumption and optimizes application performance through a forged data recovery algorithm.

6. The method of claim 5, comprising the steps of:

Step (1): The application component of the application layer is responsible for receiving the business operation submitted by the tenant, and forwarding the operation to the update processing module of the privacy protection layer;

Step (2): The update processing module analyzes the operation type, if it is an insert operation, then proceed to step (3), otherwise, execute step (7);

Step (3): The update processing module submits the tenant's identity information to the authentication module of the trusted third party for identification, and allows the tenant to access the temporary data management module after passing the authentication;

Step (4): The temporary data management module caches the newly inserted data in the form of plain text, and periodically checks whether the cached data meets the grouping conditions. If so, calls the group generation and block algorithm to generate a new group and uploads it to the block Processing module; if not reached, no processing will be done;

Step (5): If a new group is uploaded to the block processing module, the block processing module requests the block policy of the corresponding tenant from the privacy policy management module of the trusted third party, and vertically partitions the group data into blocks according to the block policy;

Step (6): The equalization module generates forged data according to the data distribution of each block, so that the data distribution of each block satisfies α, β and γ balance, and then stores the block data in the storage layer; the block data includes the forged data and the private data stored by users;

Step (7): The reconstruction module reconstructs the private data through the privacy data reconstruction algorithm and judges whether the operation is a modification. If so, the original data is modified and assembled into new data, the original data is deleted, and the process goes to step (3); otherwise, go to step (8);

Step (8): Determine whether the operation is a delete operation. If so, execute the data deletion algorithm to delete the corresponding fragments in the non-key blocks and high-weight blocks of the storage layer; after the deletion succeeds, call the forged data recovery algorithm through the forged data recovery module to check whether any group has forged data that needs to be recycled. Otherwise, the operation is determined to be a query operation, and step (9) is performed;

Step (9): Filter the query results according to the global identifier and query conditions calculated by the reconstruction module, and return the results to the tenant.
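The branching among steps (2), (7), (8) and (9) can be sketched as a small router that lists the steps each operation type passes through. This is only an illustration of the control flow: the names `Op` and `route` are hypothetical and do not appear in the claims, and the modules themselves are not modeled.

```python
from enum import Enum


class Op(Enum):
    INSERT = "insert"
    MODIFY = "modify"
    DELETE = "delete"
    QUERY = "query"


def route(op: Op) -> list[str]:
    """Return the ordered step labels an operation passes through.

    Illustrative only: the labels mirror the step numbers in the claim.
    Steps (4)-(6) are listed unconditionally, although grouping happens
    only once the cached data meets the grouping conditions.
    """
    insert_path = ["(3)", "(4)", "(5)", "(6)"]
    if op is Op.INSERT:                  # step (2): an insert goes to step (3)
        return ["(1)", "(2)"] + insert_path
    path = ["(1)", "(2)", "(7)"]         # all other operations start at step (7)
    if op is Op.MODIFY:                  # step (7): modified data re-enters at (3)
        return path + insert_path
    path.append("(8)")
    if op is Op.DELETE:                  # step (8): delete, then check recycling
        return path
    return path + ["(9)"]                # otherwise a query: filter and return
```

For example, `route(Op.QUERY)` ends at step (9), while `route(Op.MODIFY)` passes through step (7) and then follows the insert path.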

7. The method according to claim 6, wherein the data deletion algorithm is as follows:

Input: Data Object Physical View DOPV, Temporary Update Table U, Privacy Blocking Policy PPS, Delete Condition RC

Step (1-1): Delete the data conforming to RC from U;

Step (1-2): Obtain all block numbers except key blocks and low-weight blocks, and store them in the array chunk;

Step (1-3): Using the privacy data reconstruction algorithm, obtain the set resultSet of global identifiers of the original data in DOPV that meets the condition RC;

Step (1-4): for each id in resultSet do

for i = 0 to chunk.length - 1 do

Calculate the DSID of id in chunk[i] and delete its corresponding fragment in chunk[i].
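A minimal sketch of the deletion steps, under several assumptions: the physical view is modeled as nested dictionaries, the delete condition RC as a predicate, and the DSID derivation as a hash of the global ID and block number. The claim does not specify how a DSID is computed, so `dsid` here is purely hypothetical, and the result of the privacy data reconstruction algorithm is passed in rather than computed.

```python
import hashlib


def dsid(global_id: int, chunk_no: int) -> str:
    """Hypothetical DSID derivation; the claim does not give the formula."""
    return hashlib.sha256(f"{global_id}:{chunk_no}".encode()).hexdigest()[:12]


def delete_data(dopv, u, key_chunks, low_weight_chunks, rc, result_set):
    """Sketch of steps (1-1) to (1-4).

    dopv: {chunk_no: {dsid: fragment}}  stand-in for the physical view DOPV
    u: {global_id: record}              stand-in for the temporary update table U
    rc: predicate over a record         stand-in for the delete condition RC
    result_set: global IDs matching RC in DOPV, assumed to come from the
                privacy data reconstruction algorithm (step 1-3)
    """
    # Step (1-1): drop matching rows from the temporary update table.
    for gid in [g for g, rec in u.items() if rc(rec)]:
        del u[gid]
    # Step (1-2): every block except key blocks and low-weight blocks.
    chunk = [c for c in sorted(dopv)
             if c not in key_chunks and c not in low_weight_chunks]
    # Step (1-4): remove the fragment of each matching record in those blocks.
    for gid in result_set:
        for c in chunk:
            dopv[c].pop(dsid(gid, c), None)
    return chunk
```

Note that fragments in key blocks and low-weight blocks are left in place, which matches the idea of retaining key fragments so that neither the deleted nor the remaining data can be reassembled from what is left.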

8. The method according to claim 6, characterized in that the forged data recovery algorithm is as follows:

When the proportion of remaining data in a horizontal group falls below the set lower limit T_remain, the remaining data is inserted into the temporary update table U, all data in the group is deleted, and the data in table U is then regrouped and stored again; the algorithm is described as follows:

Input: group information table group_info, remaining data ratio threshold T_remain

Step (2-1): for each gid in group_info do

Step (2-2): if remain/total < T_remain

Step (2-3): call the privacy data reconstruction algorithm to obtain the set resultSet of the global ID corresponding to the remaining data in the group gid;

Step (2-4): for each id in resultSet do

Step (2-5): Insert the original data corresponding to the id into U;

Step (2-6): Call the data deletion algorithm to delete all data of the group gid from the storage layer.
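The recycling loop can be sketched as below. This is an illustration under stated assumptions: `group_info` is modeled as a dictionary of (remaining, original) record counts per group, and the reconstruction, record fetch and group deletion are injected as callables (`reconstruct`, `fetch_original`, `delete_group`), all of which are hypothetical names standing in for the algorithms referenced in the claim.

```python
def recycle_forged_data(group_info, t_remain, reconstruct, fetch_original,
                        delete_group, u):
    """Sketch of steps (2-1) to (2-6); all callables are hypothetical stand-ins.

    group_info: {gid: (remain, total)}, where total is an assumed name for
                the number of real records the group was created with
    reconstruct(gid) -> iterable of global IDs still present in the group
    fetch_original(global_id) -> the reassembled original record
    delete_group(gid) -> removes every fragment of the group from storage
    u: {global_id: record}, stand-in for the temporary update table U
    """
    recycled = []
    for gid, (remain, total) in list(group_info.items()):   # step (2-1)
        if total == 0 or remain / total >= t_remain:         # step (2-2)
            continue
        for global_id in reconstruct(gid):                   # steps (2-3), (2-4)
            u[global_id] = fetch_original(global_id)         # step (2-5)
        delete_group(gid)                                    # step (2-6)
        # Assumption: the bookkeeping entry is dropped with the group.
        del group_info[gid]
        recycled.append(gid)
    return recycled
```

Because the surviving records are moved back into U, they are regrouped with fresh data the next time the grouping conditions are met, and the forged data that padded the old group is discarded with it.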

9. The method of claim 6, wherein the group generation and block algorithm is as follows:

Input: temporary data table U, minimum group size N, minimum block entropy H1 and minimum conditional entropy H2

Output: group g

Step (3-1): Create an empty group g (the element in g is the unique identifier of the record in U);

Step (3-2): sort the values of the key attributes in table U according to their frequency of occurrence from small to large;

Step (3-3): while g.length < N do

According to the sorting result of step (3-2), select key attribute values in turn, and add the IDs of the data containing each value to g;

Step (3-4): Select data t in turn from the remaining data in table U, if the entropy value of each corresponding block is not less than H1 after adding it to g, then add t.ID to g;

Step (3-5): call the block strategy to block the data in g;

Step (3-6): Construct forged data according to H1 and H2 to balance each block;

Step (3-7): Store the data of each block in the storage layer;

Step (3-8): return.
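Steps (3-1) to (3-4) can be sketched as follows. Several points are assumptions rather than part of the claim: block entropy is taken to be the Shannon entropy of the value combinations a block holds within the group, the key attribute is a single column, and the block policy is a list of attribute tuples. Blocking, forged data construction and storage (steps (3-5) to (3-7)) are omitted, so H2 is not used here.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (bits) of a list of hashable values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def generate_group(u, key_attr, blocks, n_min, h1):
    """Sketch of steps (3-1) to (3-4).

    u: {record_id: {attribute: value}}   stand-in for temporary table U
    key_attr: name of the key attribute  (assumed to be a single column)
    blocks: list of attribute tuples     stand-in for the block policy
    """
    g = []                                                   # step (3-1)
    freq = Counter(rec[key_attr] for rec in u.values())
    ordered = sorted(freq, key=lambda v: (freq[v], str(v)))  # step (3-2)
    for value in ordered:                                    # step (3-3)
        if len(g) >= n_min:
            break
        g.extend(rid for rid, rec in u.items() if rec[key_attr] == value)

    def block_entropies(ids):
        # Entropy of each block over the records identified by ids.
        return [entropy([tuple(u[i][a] for a in blk) for i in ids])
                for blk in blocks]

    for rid in u:                                            # step (3-4)
        if rid not in g and all(h >= h1 for h in block_entropies(g + [rid])):
            g.append(rid)
    return g
```

Starting from the rarest key attribute values means that uncommon values, which are the easiest to single out, are grouped first; the entropy check then admits further records only while every block stays sufficiently mixed.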

10. The method according to claim 6, wherein the privacy data reconstruction algorithm is as follows:

Input: data object physical view DOPV, privacy block policy PPS, reconstruction condition RC

Output: original data global ID set resultSet

Step (4-1): Divide the reconstruction condition RC into per-block conditions rc_i according to PPS;

Step (4-2): for i = 1 to k do

Step (4-3): Query the DSIDs matching rc_i on the block Chunk_i, and put their corresponding global IDs into the set IDSet_i;

Step (4-4): Intersect all IDSet_i (i = 1, ..., k) to obtain resultSet;

Step (4-5): Filter out of resultSet the global IDs that correspond to forged data;

Step (4-6): Store RC and resultSet in cache;

Step (4-7): return resultSet.
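The reconstruction steps amount to a per-block lookup followed by a set intersection, which can be sketched as below. The data layout is an assumption: each block is a list of (global ID, fragment) pairs, so the DSID-to-global-ID mapping is collapsed away, the set of forged IDs is supplied by the caller, and `rc_key` is a hypothetical hashable description of RC used as the cache key.

```python
def reconstruct(dopv, rc_parts, forged_ids, rc_key, cache):
    """Sketch of steps (4-1) to (4-7).

    dopv: {chunk_no: [(global_id, fragment_dict), ...]}  stand-in for DOPV
    rc_parts: {chunk_no: predicate}  RC already divided per block (step 4-1)
    forged_ids: set of global IDs known to belong to forged records
    rc_key: hashable description of RC, used only as the cache key
    cache: dict that receives the result (step 4-6)
    """
    id_sets = []
    for chunk_no, predicate in rc_parts.items():              # steps (4-2), (4-3)
        id_sets.append({gid for gid, frag in dopv[chunk_no] if predicate(frag)})
    # Step (4-4): only IDs matching in every queried block survive.
    result = set.intersection(*id_sets) if id_sets else set()
    result -= forged_ids                                      # step (4-5)
    cache[rc_key] = result                                    # step (4-6)
    return result                                             # step (4-7)
```

A record is returned only if its fragments satisfy the condition in every queried block, and forged records that happen to match are removed before the result is cached and returned.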