CN101175011A

CN101175011A - A Method of Obtaining Highly Available Data Redundancy in DHT-Based P2P System

Info

Publication number: CN101175011A
Application number: CNA2007101350286A
Authority: CN
Inventors: 陈贵海; 吴帆; 李宏兴; 邱彤庆
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2007-11-02
Filing date: 2007-11-02
Publication date: 2008-05-07

Abstract

The invention discloses a redundant mechanism method in a P2P system, which bases on DHT and obtains highly available data. The method includes steps mentioned below. Firstly, Erasure Coding is implemented on the original documents. Secondly, the slices obtained in the coding are distributed to the relative nodes for storing. Thirdly, a user puts forward a download requirement and the query is implemented. Fourthly, the user downloads the documents and reserves a part of the documents as copies. Fifthly, the third step is repeated until services are stopped. Sixthly, the method is over. Compared with the prior art, the invention has the obvious advantages of the traditional replication strategies and the slices redundant strategies. The invention can be easily realized. Compared with the traditional redundant strategies, the invention can save more network bandwidths under various network environments. The redundancy is moderate.

Description

A Method of Obtaining Highly Available Data Redundancy in DHT-Based P2P System

一、技术领域1. Technical field

本发明涉及一种基于DHT的P2P系统中高可用数据冗余机制，这种数据冗余策略在各种网络环境中均较传统冗余策略更节省网络带宽，并且冗余度适中(不高于9.4)。当节点平均可用性低于0.7和文件请求率相对于节点扰动频率较低时，该策略的优越性尤为突出。混合式策略不仅在动态性低的网络环境中表现良好，而且在高度动态的网络中性能卓越。The invention relates to a highly available data redundancy mechanism in a DHT-based P2P system. This data redundancy strategy saves more network bandwidth than the traditional redundancy strategy in various network environments, and the redundancy is moderate (not higher than 9.4 ). The superiority of this strategy is particularly prominent when the average node availability is lower than 0.7 and the file request rate is low relative to the node perturbation frequency. Hybrid strategies not only perform well in low dynamic network environments, but also perform well in highly dynamic networks.

二、背景技术2. Background technology

随着互联网和网络计算技术的迅猛发展，一类基于分布式哈希表(DHT)的对等计算系统应运而生。DHT是一种分布式存储方法。在不需要服务器的情况下，每个客户端负责一个小范围的路由，并负责存储一小部分数据，从而实现整个DHT网络的寻址和存储。DHT提供了一种确定性的对象定位服务，并已有诸多应用。在不能保证节点100％可用的情况下，要实现数据文件的高可用性，则需要某种数据冗余策略。目前应用于基于DHT的P2P系统的数据冗余策略主要有两种：复制(replication)和分片冗余(erasure coding)。With the rapid development of Internet and network computing technology, a peer-to-peer computing system based on Distributed Hash Table (DHT) emerges as the times require. DHT is a distributed storage method. In the absence of a server, each client is responsible for a small range of routing and is responsible for storing a small part of data, thereby realizing the addressing and storage of the entire DHT network. DHT provides a deterministic object location service and has been used in many applications. To achieve high availability of data files without guaranteeing 100% availability of nodes requires some kind of data redundancy strategy. Currently, there are two main data redundancy strategies applied to DHT-based P2P systems: replication and erasure coding.

在以往的应用中，复制和奇偶校验冗余机制(parity scheme)(如RAID)是两种常用的实现数据高可用性的冗余机制。前者会引入极高的冗余度；后者适应环境动态性的能力较差，在高动态性的网络环境中难以保证数据的可用性。In previous applications, replication and parity schemes (such as RAID) are two commonly used redundancy mechanisms to achieve high data availability. The former will introduce extremely high redundancy; the latter has poor ability to adapt to the dynamics of the environment, and it is difficult to guarantee the availability of data in a highly dynamic network environment.

分片冗余克服了复制冗余度高的缺点。它将数据对象分割成m片，再编码成n片(n＞m)。我们将r＝n/m定义为编码冗余度。以冗余度r编码意味着编码后占用的存储空间是原始数据的r倍。分片冗余最重要的特性是只要获得任意m个不同的分片就能重构原文件，而且这m个分片的体积之和与原文件大致相等。例如，当编码冗余度为r＝4，原数据分割为m＝16片，则编码后为n＝64片，所需存储空间增大到原先的4倍。复制和RAID可以看作是分片冗余的特例。冗余度为4的复制系统相当于(m＝1，n＝4)的分片冗余；1、4和5级的RAID用分片冗余分别表示为)(m＝1，n＝2)，(m＝4，n＝5)，(m＝4，n＝5)。Shard redundancy overcomes the disadvantage of high replication redundancy. It divides the data object into m slices, and then encodes into n slices (n>m). We define r=n/m as coding redundancy. Encoding with redundancy r means that the storage space occupied by encoding is r times that of the original data. The most important feature of fragment redundancy is that the original file can be reconstructed as long as any m different fragments are obtained, and the sum of the volumes of the m fragments is roughly equal to the original file. For example, when the coding redundancy is r=4, the original data is divided into m=16 pieces, then after encoding, n=64 pieces, and the required storage space is increased by 4 times. Replication and RAID can be seen as special cases of shard redundancy. A replication system with a redundancy of 4 is equivalent to (m=1, n=4) fragment redundancy; RAID levels 1, 4 and 5 are represented by fragment redundancy as) (m=1, n=2 ), (m=4, n=5), (m=4, n=5).

三、发明内容3. Contents of the invention

本发明目的是：提出了一种混合式的数据冗余策略，一种基于DHT的P2P系统中获得高可用数据冗余的方法，它既像复制策略那样共享用户下载的文件以供后续的下载，又利用分片冗余维持文件的可用性。这种混合式的数据冗余策略在各种网络环境中均能较传统冗余策略更节省网络带宽，并且冗余度适中。The purpose of the present invention is to: propose a hybrid data redundancy strategy, a method for obtaining highly available data redundancy in a DHT-based P2P system, which shares files downloaded by users for subsequent downloads like a copy strategy , and use fragmentation redundancy to maintain the availability of files. This hybrid data redundancy strategy can save network bandwidth more than traditional redundancy strategies in various network environments, and the redundancy is moderate.

为了实现上述目的，本发明的技术方案是：基于DHT的P2P系统中获得高可用数据冗余的方法，其特征是采用混合式的获得数据冗余的方法，该方法包括以下步骤：1、对P2P系统中原文件进行分片冗余Erasure Coding编码；2、将编码得到的分片分配到P2P系统的相应节点储存；3、用户提出下载要求并进行所有节点查询；4、用户从有用文件的节点下载文件并将部分文件保存作为副本；5、重复步骤3直到服务停止；6、结束。In order to achieve the above object, the technical solution of the present invention is: the method for obtaining highly available data redundancy in the DHT-based P2P system is characterized in that a hybrid method for obtaining data redundancy is adopted, and the method comprises the following steps: 1. In the P2P system, the original files are coded with fragmentation redundancy Erasure Coding; 2. The coded fragments are allocated to the corresponding nodes of the P2P system for storage; 3. The user makes a download request and performs all node queries; 4. The user starts from the node of the useful file Download the file and save part of the file as a copy; 5. Repeat step 3 until the service stops; 6. End.

观察目前的P2P文件共享系统，一方面，发现热点文件的下载次数相当多，无需维护，其可用性自动保持在较高水平，这说明了混合式的数据冗余策略的可行性。另一方面，从计算机硬件的发展趋势看，存储器的发展速度大大高于带宽的发展速度，相对于存储资源来说，带宽资源变得越来越贫乏，表明了研究更加有效的节省带宽的数据冗余策略的必要性。Observing the current P2P file sharing system, on the one hand, it is found that hot files are downloaded quite a lot without maintenance, and their availability is automatically maintained at a high level, which shows the feasibility of a hybrid data redundancy strategy. On the other hand, from the perspective of the development trend of computer hardware, the development speed of memory is much higher than the development speed of bandwidth. Compared with storage resources, bandwidth resources are becoming more and more poor, which shows that the research of more effective bandwidth-saving data The need for redundancy strategies.

本发明的有益效果：本发明与现有技术相比，其显著优点是，在各种网络环境中均较传统冗余策略更节省网络带宽，并且冗余度适中(不高于9.4)。当节点平均可用性低于0.7和文件请求率相对于节点扰动频率较低时，该策略的优越性尤为突出。混合式方法或策略不仅在动态性低的网络环境中表现良好，而且在高度动态的网络中性能卓越。它既像复制策略那样共享用户下载的文件以供后续的下载，又利用分片冗余维持文件的可用性。这种混合式的数据冗余策略在各种网络环境中均能较传统冗余策略更节省网络带宽，并且冗余度适中。Beneficial effects of the present invention: Compared with the prior art, the present invention has the remarkable advantages of saving network bandwidth compared with traditional redundancy strategies in various network environments, and the redundancy is moderate (not higher than 9.4). The superiority of this strategy is particularly prominent when the average node availability is lower than 0.7 and the file request rate is low relative to the node perturbation frequency. A hybrid approach or strategy not only works well in low dynamic network environments, but also works well in highly dynamic networks. It not only shares the files downloaded by users for subsequent downloads like a replication strategy, but also maintains the availability of files by using shard redundancy. This hybrid data redundancy strategy can save network bandwidth more than traditional redundancy strategies in various network environments, and the redundancy is moderate.

四，附图说明4. Description of drawings

图1是混合式的数据冗余策略的执行流程；Figure 1 is the execution flow of the hybrid data redundancy strategy;

图2是文件可用性为0.999时，复制策略、分片冗余策略和混合式策略三种策略所需冗余度理论值随节点平均可用性变化的情况。Figure 2 shows the change of the theoretical value of redundancy required by the three strategies of replication strategy, fragmentation redundancy strategy and hybrid strategy with the average availability of nodes when the file availability is 0.999.

图3是文件可用性为0.999时，复制策略、分片冗余策略和混合式策略的冗余度随节点平均可用性变化的情况。Figure 3 shows how the redundancy of replication strategy, shard redundancy strategy and hybrid strategy varies with the average availability of nodes when the file availability is 0.999.

图4是文件可用性为0.999时，复制策略、分片冗余策略和混合式策略的维护带宽占用率随节点平均可用性变化的情况。Figure 4 shows how the maintenance bandwidth occupancy rate of replication strategy, fragmentation redundancy strategy and hybrid strategy varies with the average availability of nodes when the file availability is 0.999.

图5是文件可用性为0.999(图5a)，节点平均可用性为0.807时(图5b)，复制策略、分片冗余策略和混合式策略的维护带宽占用率和维护中文件传输数量随节点生命期内平均文件请求率变化的情况。Figure 5 shows that when the file availability is 0.999 (Figure 5a) and the average node availability is 0.807 (Figure 5b), the maintenance bandwidth occupancy rate and the number of file transfers during maintenance of the replication strategy, fragmentation redundancy strategy and hybrid strategy vary with the node lifetime. Changes in the average file request rate within a given period.

五、具体实施方式5. Specific implementation

混合式策略既像复制策略那样共享用户下载的文件以供后续的下载，又利用分片冗余维持文件的可用性。混合式策略自动将下载的文件视为共享文件，以供随后的下载；当修复某文件可用性水平时，指定任何一个储存该文件完整副本的节点生成并分发新分片即可。一方面，混合式策略像复制策略那样共享用户下载的文件以供后续的下载，节省了网络带宽；另一方面，增加相同的可用性，分片冗余策略产生的新数据的体积比复制策略少，即分片冗余策略需要传送的数据较少，混合式策略利用这一特性进一步节省了网络带宽。The hybrid strategy not only shares the files downloaded by users for subsequent downloads like a replication strategy, but also uses fragmentation redundancy to maintain the availability of files. The hybrid strategy automatically treats downloaded files as shared files for subsequent downloads; when repairing a certain file availability level, designate any node that stores a complete copy of the file to generate and distribute new shards. On the one hand, the hybrid strategy shares the files downloaded by users for subsequent downloads like the replication strategy, saving network bandwidth; on the other hand, increasing the same availability, the volume of new data generated by the shard redundancy strategy is smaller than that of the replication strategy , that is, the fragmentation redundancy strategy needs to transmit less data, and the hybrid strategy uses this feature to further save network bandwidth.

使用混合式策略时的文件可用性和冗余度可由如下方法推出。首先，我们假设同一节点不存放同一文件的副本和分片，可以存放多个不同文件的副本和分片，并且系统中不存在重复的分片。文件的可用性a可根据概率计算出来，即至少获得一个完整的副本或n个分片中的任意m个的概率。或者通过1减去所有副本都无法得到且没有足够的分片重构原文件(可用分片数不足m个)的概率：File availability and redundancy when using a hybrid strategy can be deduced as follows. First, we assume that the same node does not store copies and shards of the same file, but multiple copies and shards of different files can be stored, and there are no duplicate shards in the system. The availability a of a file can be calculated according to the probability, that is, the probability of obtaining at least one complete copy or any m of n shards. Or subtract the probability from 1 that all copies cannot be obtained and there are not enough fragments to reconstruct the original file (the number of available fragments is less than m):

$a a = = 11 - - {((11 - - p p))}^{h h} ((11 - - {Σ Σ}_{i i = = m m}^{n no} (\begin{matrix} n no \\ i i \end{matrix}) {p p}^{i i} {((11 - - p p))}^{n no - - i i})) - - - - - - ((11))$

这里，h为完整副本的个数。Here, h is the number of complete copies.

混合式策略的冗余度可由叠加复制和分片冗余的冗余度得到：The redundancy of the hybrid strategy can be obtained from the redundancy of superimposed replication and shard redundancy:

$r r = = h h + + \frac{n no}{m m} = = h h + + {((\frac{{σ σ}_{a a} ((h h)) \sqrt{\frac{p p ((11 - - p p))}{m m}} + + \sqrt{\frac{{σ σ}_{a a}^{22} ((h h)) p p ((11 - - p p))}{m m} + + 44 p p}}{22 p p}))}^{22} - - - - - - ((22))$

这里，σ_a(h)是关于h的函数，其值和分片冗余需要达到的可用性水平a’的对应关系Here, σ _a (h) is a function of h, and its value corresponds to the availability level a' that shard redundancy needs to achieve

如表1所示。As shown in Table 1.

aa σ_a σ _a

0.8000.9000.9900.9950.9980.9990.8000.9000.9900.9950.9980.999 0.841.282.482.812.883.100.841.282.482.812.883.10

表1.文件可用性和正态分布标准差对照表Table 1. Comparison table of file availability and normal distribution standard deviation

a’可由(1)式推出：a' can be deduced from formula (1):

${a a}^{' '} = = {Σ Σ}_{i i = = m m}^{n no} (\begin{matrix} n no \\ i i \end{matrix}) {p p}^{' '} {((11 - - p p))}^{n no - - 11} = = 11 - - \frac{11 - - a a}{{((11 - - p p))}^{h h}} - - - - - - ((33))$

从(2)式看出冗余度与m和完整副本的个数h有关，而与n无关。给定h，当m增大时，冗余度降低，但系统的复杂度增加、下载延迟增大。分片冗余策略的两个新分片再生策略依然适用于混合式策略，并且第二条策略在混合式策略中优越性更加显著。It can be seen from formula (2) that redundancy is related to m and the number h of complete copies, but has nothing to do with n. Given h, when m increases, the redundancy decreases, but the complexity of the system increases and the download delay increases. The two new shard regeneration strategies of the shard redundancy strategy are still applicable to the hybrid strategy, and the second strategy has more significant advantages in the hybrid strategy.

图2展示了复制策略、分片冗余策略和混合式策略为达到文件可用性为0.999所需冗余度理论值随节点平均可用性变化的情况。混合式策略的冗余度曲线有两个因素决定：节点平均可用性p和完整副本的个数h。对任意给定h值，都有一条与之相对应的曲线。由图可见，当节点平均可用性一定时，分片冗余策略的冗余度最小；除了当节点平均可用性极高时，混合式策略的冗余度略高于分片冗余策略，但大大低于复制策略。Figure 2 shows the change of the theoretical value of the redundancy required by the replication strategy, the fragmentation redundancy strategy and the hybrid strategy to achieve a file availability of 0.999 with the average availability of nodes. The redundancy curve of the hybrid strategy is determined by two factors: the average node availability p and the number of complete copies h. For any given value of h, there is a corresponding curve. It can be seen from the figure that when the average availability of nodes is constant, the redundancy of the shard redundancy strategy is the smallest; except when the average availability of nodes is extremely high, the redundancy of the hybrid strategy is slightly higher than that of the shard redundancy strategy, but much lower. in the replication strategy.

步骤2至4的具体步骤是：(2、将编码得到的分片分配到P2P系统的相应节点储存；3、用户提出下载要求并进行所有节点查询；4、用户从有用文件的节点下载文件并将部分文件保存作为副本；)每个节点周期性地将其储存的文件和分片的标识符注册到M个分散且相互独立的索引节点。节点通过两种索引查找文件：文件副本索引和文件分片索引。当某节点请求文件d时，它首先随机查询一个或多个d的索引节点。如果查询过的索引节点未能提供足够的文件副本和分片位置信息，则继续随机查询其他索引节点。如果所有索引节点都不能提供足够的文件副本和分片位置信息，该节点将等待一段时间(例如10ms)，后重做查询操作，直到查询超时。若找到完整文件副本，则下载该副本；否则，节点转而下载足够的分片并重构原文件。用户下载的文件自动作为共享文件以供后续的下载。The specific steps of steps 2 to 4 are: (2. assign the coded slices to the corresponding nodes of the P2P system for storage; 3. the user makes a download request and queries all nodes; 4. the user downloads the file from the node of the useful file and Save some files as copies;) Each node periodically registers the identifiers of its stored files and fragments to M distributed and independent index nodes. Nodes look up files through two types of indexes: file replica indexes and file shard indexes. When a node requests a file d, it first randomly queries one or more index nodes of d. If the queried inode fails to provide sufficient file copy and shard location information, continue to randomly query other inodes. If all index nodes cannot provide enough file copies and fragment location information, the node will wait for a period of time (for example, 10ms), and then redo the query operation until the query times out. If a complete copy of the file is found, it is downloaded; otherwise, the node moves on to download enough shards and reconstruct the original file. Files downloaded by users are automatically shared as files for subsequent downloads.

各索引节点周期性地维护在其上注册的文件和分片的可用性。若文件d的可用性低于要求水平，则指定某一个储存该文件完整副本的节点生成并分发新分片到随机选择的节点。若当前系统中没有完整的副本，首先等待一段时间(例如10ms)，若有节点下载分片并重构原文件，则指定该节点修复丢失的冗余度；否则，索引节点需要自己下载分片，重构原文件，再生成足够的新分片。实验表明几乎所有文件都至少有一个副本存在于系统中，所以混合式策略没有必要像分片冗余策略那样指定主节点保存文件的永久副本。这样，混合式策略节省了维护主节点所需的网络带宽。Each inode periodically maintains the availability of files and shards registered on it. If the availability of file d is lower than the required level, a node that stores a complete copy of the file is designated to generate and distribute new shards to randomly selected nodes. If there is no complete copy in the current system, first wait for a period of time (for example, 10ms), if a node downloads the fragment and reconstructs the original file, then designate the node to repair the lost redundancy; otherwise, the index node needs to download the fragment by itself , reconstruct the original file, and generate enough new fragments. Experiments show that almost all files have at least one copy in the system, so the hybrid strategy does not need to designate the primary node to keep a permanent copy of the file like the fragmentation redundancy strategy. In this way, the hybrid strategy saves the network bandwidth required to maintain masternodes.

我们在p2psim上实现了上述三种数据冗余策略。模拟的网络环境包含1024个节点。各节点交替离开和重新加入网络，相邻事件的时间间隔呈指数分布。当节点失效或离开，其上的所有文件、分片和索引均被丢弃。每次有节点加入网络，都会使用不同的IP和DHT标识符。对文件的访问请求按照Zipf-like分布，将文件按访问频率从高到低排列，第i个文件的访问频率和1/i^α成正比，这里设置α为0.74。我们设计了两套实验：变化节点平均可用性和变化用户请求频率。在变化节点平均可用性实验中，节点平均可用性变化范围为30％到90％，其它条件不变。在变化用户请求频率实验中，节点生命期内平均文件请求数变化范围为2到20，其它条件不变。每组实验运行模拟器时间6小时，只收集后半段模拟时间的数据。每组实验重复5次，结果取平均值。设置m＝7，这是CFS中使用的重构原文件所需分片数。设置文件可用性为0.999，这是目前大多数网络服务所能达到的可用性。We implement the above three data redundancy strategies on p2psim. The simulated network environment contains 1024 nodes. Each node leaves and rejoins the network alternately, and the time interval between adjacent events is exponentially distributed. When a node fails or leaves, all files, shards and indexes on it are discarded. Each time a node joins the network, it uses a different IP and DHT identifier. The access requests to files are distributed according to Zipf-like, and the files are arranged according to the access frequency from high to low. The access frequency of the i-th file is proportional to 1/i ^α , where α is set to 0.74. We designed two sets of experiments: varying node average availability and varying user request frequency. In the experiment of changing the average availability of nodes, the average availability of nodes varies from 30% to 90%, and other conditions remain unchanged. In the experiment of changing the frequency of user requests, the average number of file requests during the node life span ranges from 2 to 20, and other conditions remain unchanged. Each set of experiments runs the simulator for 6 hours, and only collects data for the second half of the simulation time. Each experiment was repeated 5 times, and the results were averaged. Set m=7, which is the number of fragments required to reconstruct the original file used in CFS. Set the file availability to 0.999, which is the availability that most web services can achieve at present.

图3展示了复制策略、分片冗余策略和混合式策略三种冗余策略在达到文件可用性0.999的情况下，冗余度随节点平均可用性变化的曲线。图3中分片冗余策略的冗余度曲线与图2中的大体一致。但图3中复制策略的冗余度曲线却与图2中的相差甚远，尤其是当节点平均可用性大于0.6时。这是因为分片冗余策略不共享用户下载的文件，仅在缓存中保留一个分片，受用户下载行为的影响小；但复制策略完全共享用户下载的文件，网络中文件副本的数量随用户的下载而增多。同时，节点平均可用性越高，系统中文件副本的丢失率就越低。过多的副本残留在系统中，造成了复制策略的冗余度没有随节点平均可用性提高而明显下降。Figure 3 shows the curves of redundancy changing with the average availability of nodes for the three redundancy strategies of replication strategy, shard redundancy strategy and hybrid strategy when the file availability is 0.999. The redundancy curve of the shard redundancy strategy in Figure 3 is roughly consistent with that in Figure 2. But the redundancy curve of the replication strategy in Figure 3 is far from that in Figure 2, especially when the average availability of nodes is greater than 0.6. This is because the shard redundancy strategy does not share the files downloaded by the user, and only keeps one shard in the cache, which is less affected by the user's download behavior; but the replication strategy completely shares the files downloaded by the user, and the number of file copies in the network varies with the user. downloads increased. At the same time, the higher the average availability of nodes, the lower the loss rate of file copies in the system. Too many copies remain in the system, causing the redundancy of the replication strategy to not decrease significantly with the increase of the average availability of nodes.

由图3可见，虽然节点平均可用性由0.3变化到0.9，但混合式策略的冗余度变化并不明显，徘徊在8.5到9.4之间。当节点的扰动强烈时，即节点平均可用性低时，系统中的文件副本较少，混合式策略主要利用分片冗余来节省存储空间。当节点平均可用性高时，混合式策略的冗余度不降且升，其原因与复制策略相似：过多的副本残留在系统中。这些多余的冗余文件或分片不会对系统的稳定性造成危害，随着时间的推移，多余的文件和分片会被节点逐渐置换出缓存。It can be seen from Figure 3 that although the average availability of nodes changes from 0.3 to 0.9, the redundancy of the hybrid strategy does not change significantly, hovering between 8.5 and 9.4. When the disturbance of nodes is strong, that is, when the average availability of nodes is low, there are fewer file copies in the system, and the hybrid strategy mainly uses fragmentation redundancy to save storage space. When the average availability of nodes is high, the redundancy of the hybrid strategy does not decrease but increases, the reason is similar to the replication strategy: too many copies remain in the system. These redundant redundant files or fragments will not cause harm to the stability of the system. Over time, redundant files and fragments will be gradually replaced out of the cache by the nodes.

由图4可见，当节点平均可用性高于0.47时，复制策略比分片冗余策略更节省维护带宽；当节点平均可用性低于0.47时，分片冗余策略消耗的维护带宽较复制策略少。复制策略通过共享用户下载的文件来减轻维护操作的负担。当节点平均可用性较高时，复制策略在节约维护带宽方面非常有效，原因在于，依赖于用户的下载行为大部分文件的可用性被自动维持在目标水平。但在高度动态的网络环境中，节点频繁地加入和离开，使用户的下载的速度无法弥补文件副本和分片的丢失。在这种情况下，分片冗余策略的优势就显现出来：消耗同样的网络带宽，分片冗余策略比复制策略更能提高文件可用性水平；换言之，提高同样的文件可用性水平，分片冗余策略比复制策略需要更少的带宽。It can be seen from Figure 4 that when the average node availability is higher than 0.47, the replication strategy saves more maintenance bandwidth than the shard redundancy strategy; when the node average availability is lower than 0.47, the shard redundancy strategy consumes less maintenance bandwidth than the replication strategy. Replication strategies ease the burden of maintenance operations by sharing files downloaded by users. When the average node availability is high, the replication strategy is very effective in saving maintenance bandwidth, because the availability of most files is automatically maintained at the target level depending on the user's download behavior. However, in a highly dynamic network environment, nodes join and leave frequently, so that the user's download speed cannot compensate for the loss of file copies and fragments. In this case, the advantages of the sharding redundancy strategy emerge: consuming the same network bandwidth, the sharding redundancy strategy can improve the file availability level more than the replication strategy; in other words, to improve the same file availability level, the sharding redundancy strategy The redundancy strategy requires less bandwidth than the replication strategy.

图4最大的作用是显示了，在节省维护带宽方面，混合式策略具有最好的整体效果。混合式策略既像复制策略那样共享用户下载的文件以供后续的下载，又利用分片冗余维持文件的可用性。当节点平均可用性高于0.7时，由于绝大部分文件的可用性被自动维持在目标水平，故复制策略和混合式策略消耗的维护带宽大致相等。当节点平均可用性低于0.7时，混合式策略的优越性就显现出来了。混合式策略优于分片冗余策略的原因在于：其一，共享用户下载的文件以节省维护带宽；其二，无需维护主节点。The biggest function of Figure 4 is to show that the hybrid strategy has the best overall effect in terms of saving maintenance bandwidth. The hybrid strategy not only shares the files downloaded by users for subsequent downloads like a replication strategy, but also uses fragmentation redundancy to maintain the availability of files. When the average availability of nodes is higher than 0.7, since the availability of most files is automatically maintained at the target level, the maintenance bandwidth consumed by the replication strategy and the hybrid strategy is approximately equal. When the average node availability is lower than 0.7, the superiority of the hybrid strategy appears. The reason why the hybrid strategy is better than the shard redundancy strategy is: first, the files downloaded by users are shared to save maintenance bandwidth; second, there is no need to maintain the master node.

图5所示的节点平均可用性为0.807，这是在一般的公司和高校中我们能够观察到的网络状况。由图5(a)可见，文件请求率越高，维护带宽占用率越低。维护带宽占用率是一个关于维护带宽和文件下载服务带宽的相对指标，图5(b)展示了维护带宽的绝对值。由图5(b)可见，随着平均文件请求率的提高，复制策略和混合式策略用于维持文件可用性传输的文件数目明显减少，但分片冗余策略的曲线却仅在小范围内波动。这一现象证明了共享用户下载的文件以供后续的下载能够有效降低维护带宽开销。The average availability of nodes shown in Figure 5 is 0.807, which is the network condition we can observe in general companies and universities. It can be seen from Figure 5(a) that the higher the file request rate, the lower the maintenance bandwidth occupancy rate. The maintenance bandwidth occupancy rate is a relative indicator of maintenance bandwidth and file download service bandwidth, and Figure 5(b) shows the absolute value of maintenance bandwidth. It can be seen from Figure 5(b) that as the average file request rate increases, the number of files transferred by the replication strategy and the hybrid strategy to maintain file availability is significantly reduced, but the curve of the fragmentation redundancy strategy only fluctuates within a small range . This phenomenon proves that sharing files downloaded by users for subsequent downloads can effectively reduce maintenance bandwidth overhead.

图5(b)还表明，当文件请求率较低时，混合式策略在节省维护带宽方面优于其他两种冗余策略。当文件请求率大于10时，复制策略和混合式策略维护带宽占用率非常接近，并且贴近水平坐标轴，这是因为依赖于用户的大量下载行为使大部分文件的可用性自动维持在目标水平。Figure 5(b) also shows that the hybrid strategy outperforms the other two redundant strategies in saving maintenance bandwidth when the file request rate is low. When the file request rate is greater than 10, the replication strategy and the hybrid strategy maintain very close bandwidth occupancy and are close to the horizontal axis, because the availability of most files is automatically maintained at the target level depending on the user's massive download behavior.

Claims

1. based on the method that obtains high data available redundancy in the P2P system of DHT, it is characterized in that adopting the method for hybrid-type acquisition data redundancy, this method may further comprise the steps: 1), original in the P2P system is carried out the redundant Erasure Coding coding of burst; 2), the burst that coding is obtained is assigned to the respective nodes storage of P2P system; 3), the user proposes downloading request and carries out all querying nodes; 4), the user preserves as copy from the node file in download of useful file and with partial document; 5), repeating step 3) up to service stopping; 6), finish.

2. according to claim 1 a kind of, it is characterized in that step 2 based on obtaining high data available redundancy approach in the P2P system of DHT) to 4) concrete steps be: each node periodically is registered to M dispersion and separate index node with the file of its storage and the identifier of burst; Node passes through two kinds of index search files: duplicate of the document index and file fragmentation index; When certain node demand file d, it is the index node of the one or more d of random challenge at first; If the index node of inquiring about fails enough duplicate of the document and burst positional informations are provided, then continue other index nodes of random challenge; If all index nodes all can not provide enough duplicate of the document and burst positional information, the query manipulation of reforming after this node will wait for a period of time is up to query timeout; If find the complete file copy, then download this copy, otherwise node transfers to download enough burst and reconstruct original; The file of user's download automatically as shared file for follow-up download;

Each index node is periodically safeguarded the file of registration thereon and the availability of burst, if the availability of file d is lower than the requirement level, then specifies the node generation of some storage this document complete copy and distributes new burst to the node of selecting at random; If do not have complete copy in the current system, at first wait for a period of time, if node downloading slicing and reconstruct original are arranged, the redundancy of then specifying this node reparation to lose; Otherwise index node needs own downloading slicing, reconstruct original, the new burst that regeneration is enough.

3. high data available redundancy scheme in a kind of P2P system based on DHT according to claim 1 is characterized in that: in service in system, calculate the availability of current file by following formula:

a = 1 - {(1 - p)}^{h} (1 - Σ_{i = m}^{n} (\begin{matrix} n \\ i \end{matrix}) p^{i} {(1 - p)}^{n - i})

Wherein: a is the availability of file, and p is the mean availability of node, and h is a complete copy quantity, and m is that file can recover required minimum burst quantity, and n is a burst quantity total in the system;

The redundancy of hybrid strategy can be obtained by the redundancy that stack is duplicated with the burst redundancy:

r = h + \frac{n}{m} = h + {(\frac{σ_{a} (h) \sqrt{\frac{p (1 - p)}{m}} + \sqrt{\frac{σ_{a}^{2} (h) p (1 - p)}{m} + 4 p}}{2 p})}^{2}