
HK1173245B - Using index partitioning and reconciliation for data deduplication - Google Patents

Using index partitioning and reconciliation for data deduplication

Info

Publication number: HK1173245B
Application number: HK13100419.8A
Authority: HK (Hong Kong)
Prior art keywords: subspace, index, hash, chunk, data
Other languages: Chinese (zh)
Other versions: HK1173245A (en)
Inventors: J. Li, S. Sengupta, R. Kalach, R. N. Desai, P. A. Oltean, J. R. Benton
Original Assignee: Microsoft Technology Licensing, LLC
Priority date: (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Microsoft Technology Licensing, LLC
Publication of HK1173245A
Publication of HK1173245B


Description

Data deduplication using index partitioning and reconciliation
Technical Field
The invention relates to data deduplication using index partitioning and reconciliation.
Background
Data deduplication (also sometimes referred to as data optimization) refers to reducing the physical amount of bytes of data that need to be stored on disk or transmitted over a network without compromising the fidelity and integrity of the original data, i.e., the reduction in bytes is lossless and the original data can be fully recovered. Data deduplication thus results in savings in hardware costs (for storage and network transmission) as well as data management costs (e.g., backup) by reducing the resources required to store and/or transfer data. These cost savings become important as the amount of digitally stored data grows.
Data deduplication typically uses a combination of techniques for eliminating redundancy within and between persistently stored files. One technique identifies identical regions of data in one or more files and physically stores only one unique region (chunk), while maintaining a pointer to that chunk in association with each file that contains it. Another technique is to mix data deduplication with compression, for example, by storing compressed chunks.
To identify the chunks, the server storing the chunks maintains a hash index service of the hashes of the chunks in the system. Hashes have no locality, i.e., the chunk hashes for chunks in the same file are unrelated, and any edit to a given chunk's content produces a very different (unrelated) hash value. Thus, conventional database techniques, such as B-tree indexing, result in poor indexing-service performance. Maintaining the entire index in memory provides better performance, but consumes too many resources. Server memory resources are needed by other server applications (e.g., in a primary data deduplication scenario) and for caching.
Previous backup-oriented data deduplication optimizations relied on look-ahead caching to reduce the amount of resources used on the server to access the index. However, data deduplication is no longer limited to data backup scenarios, and has evolved to be used on primary data storage that is accessed like any other storage device. Using only a look-ahead cache to reduce resource usage is not a suitable solution in such scenarios.
Disclosure of Invention
This summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, aspects of the subject matter described herein relate to deduplication technology by which an index of a hash index service is partitioned into subspace indexes such that less than the entire range of indexed data sets of the hash index service are loaded into memory at one time, thereby conserving available memory.
When data to be deduplicated is received and chunked, and possibly also compressed, the subspace index determines whether the hash value computed for a chunk matches the hash value of an entry in a main memory (e.g., RAM) cache. If so, information is returned that can be used to locate the existing chunk; otherwise the chunk is stored as a new chunk in the chunk store, and an entry corresponding to the hash value is added to the subspace index in association with a reference to the chunk.
In an aspect, the hash index service may be partitioned into a plurality of subspace indexes based on one or more criteria, which may correspond to how the data to be deduplicated is partitioned. Example partitioning/dividing criteria include a file type of data, a data type, a location, an application that created the data, file usage pattern data, file access pattern data, a file owner, a file user type, a namespace, file content, file metadata, learned criteria or adaptive criteria, or any other type of information that can be inferred as characteristics of the file, and any combination of these.
In another aspect, one subspace may be periodically (or occasionally) reconciled with one or more other subspaces to eliminate any duplicate entries from the subspace being reconciled, and to delete its associated (duplicate) chunks accordingly. Such duplicate chunks may appear across subspaces because the subspaces may have been deduplicated separately up to the point of reconciliation. As part of the reconciliation, each reference to a chunk associated with a duplicate entry is updated to reference the remaining chunk. The reconciled subspaces and the chunk stores from which the chunks were deleted may be compacted and/or overwritten with new data in various ways.
The subspace and the one or more other subspaces to reconcile may be selected based on similar subspace types, similar subspace signatures, and/or based on the subspace being an unreconciled subspace while the other subspaces have previously been reconciled with one another. A signature comprises a compact representation of the subspace's hashes, and may be computed/determined in a number of ways, e.g., based on a minimum hash computation, a bloom filter, a combination of minimum hash and bloom filter, and so forth.
Other advantages of the present invention will become apparent from the following detailed description when read in conjunction with the accompanying drawings.
Drawings
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or like elements and in which:
FIG. 1 is a block diagram representing example components of a data storage service configured for deduplication using a subspace-based hash index service.
FIGS. 2-4 are representations illustrating the relationships between subspaces and chunk stores used in data deduplication.
FIG. 5 is a flow diagram representing example steps for data deduplication in which less than the entire index is cached, thereby facilitating subspace-based deduplication.
FIG. 6 is a flowchart representing example steps for reconciling two subspaces to locate equivalent entries corresponding to equivalent blocks in a block store.
FIG. 7 is a block diagram representing exemplary non-limiting networked environments in which various embodiments described herein can be implemented.
FIG. 8 is a block diagram representing an exemplary non-limiting computing system or operating environment in which one or more aspects of various embodiments described herein can be implemented.
Detailed Description
Aspects of the technology described herein generally relate to a deduplication system that operates by dividing a global hash index into a plurality of subspaces. Each of the subspaces may be stored in a secondary storage device, e.g., on a hard disk. Depending on the data set currently in use, a subspace or a portion of a subspace may be actively loaded into memory, e.g., based on the current system workload, resource availability, and which subspace is required to index a data chunk. When a subspace fills up, it may become a read-only (sealed) subspace that can be used to locate chunks (e.g., as a cache), and a new active subspace is created to handle new chunks. Note that sealed subspaces are often the same size, but not necessarily so.
Furthermore, different subspaces may be reconciled to increase the quality of deduplication (e.g., a higher deduplication rate of their respective data chunks), and thereby increase the scalability of the data storage server. To this end, consider two subspaces selected for reconciliation, one as the source and one as the destination. The hashes of one subspace are compared to the hashes of the other subspace to find equivalent chunks, whereby any equivalent chunk is removed from one chunk store, an entry is correspondingly removed from one of the respective subspaces, and references to the removed chunk are updated to point to the remaining chunk. Note that repeating the reconciliation operation between all subspaces in the system will eventually detect all duplicates in the system, thus maximizing the savings obtained from data deduplication.
In one aspect, to predict which subspaces may reconcile well with each other, the reconciliation process computes signatures of the indexes in each subspace. The signatures may be used by the data storage server to identify which subspaces to select for reconciliation. For example, the subspace selection process may first process the subspaces that are most likely to give the most space savings from reconciliation. Because the subspace signatures may be very compact, at off-peak times the data storage server may be able to afford to load the signatures of all subspaces so that it can optimally identify which subspaces are good deduplication candidates. Each subspace may be optimized separately, with a "hot" subspace (whose content is modified or regularly read) being optimized for performance, and a "cold" subspace (whose content comprises older snapshots or infrequently accessed content) being optimized for data deduplication quality. Subspaces may be reconciled during off-peak hours to improve data deduplication performance and/or quality.
It should be understood that any of the examples herein are non-limiting. Thus, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in data processing, data indexing, and data deduplication in general.
FIG. 1 illustrates example components of a content-aware data deduplication data storage system, such as implemented in a data/file storage service 102. The service 102 receives data 104 (files, blobs, etc.), and deduplication logic 106 processes the data 104 for deduplication. To this end, the deduplication logic 106 provides the data 104 to a chunking module 108, which processes the content into chunks, such as according to the structure of the file (e.g., partitioning a media file into a media header and a media body), or by using a fast hashing technique computed repeatedly over a sliding window (such fast hash functions include CRC and the Rabin family of functions), where a chunk boundary is selected when the hash function and the size/content of the current chunk satisfy certain heuristics. Chunk boundaries are thus generally determined in a data-dependent manner, at locations where the hash function satisfies a certain condition. The following description refers to one chunk 110, although it is understood that the data is generally divided into a plurality of chunks.
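The sliding-window idea can be illustrated with a small sketch. The following Python is not code from the patent; the window size, modulus, minimum/maximum chunk sizes, and boundary mask are illustrative assumptions, and a production chunker would typically use a Rabin fingerprint rather than this simple polynomial rolling hash.

```python
# Illustrative content-defined chunking (assumed parameters, not the patented
# algorithm): declare a boundary where a rolling hash over a sliding window
# satisfies a mask condition, subject to minimum/maximum chunk sizes.
WINDOW = 48                # sliding-window size in bytes (assumption)
MIN_CHUNK = 2 * 1024       # smallest chunk emitted (assumption)
MAX_CHUNK = 64 * 1024      # forced boundary if no natural one appears (assumption)
MASK = (1 << 13) - 1       # ~8 KB average chunk size (assumption)
BASE = 257
MOD = (1 << 61) - 1

def chunk_boundaries(data: bytes):
    """Yield (start, end) byte offsets of content-defined chunks of `data`."""
    start = 0
    rolling = 0
    out_power = pow(BASE, WINDOW, MOD)   # weight of the byte that leaves the window
    for i, byte in enumerate(data):
        rolling = (rolling * BASE + byte) % MOD
        if i - start + 1 > WINDOW:
            # Slide the window: subtract the contribution of the byte that fell out.
            rolling = (rolling - data[i - WINDOW] * out_power) % MOD
        size = i - start + 1
        natural_boundary = (rolling & MASK) == 0 and size >= MIN_CHUNK
        if natural_boundary or size >= MAX_CHUNK:
            yield start, i + 1
            start, rolling = i + 1, 0
    if start < len(data):
        yield start, len(data)           # trailing chunk

if __name__ == "__main__":
    sample = bytes(range(256)) * 1024    # 256 KB of synthetic data
    print(list(chunk_boundaries(sample))[:5])
```

Because boundaries depend only on local content, inserting bytes near the start of a file shifts only nearby boundaries, which is what makes the later chunks still deduplicable.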
The deduplication logic 106 passes the chunk 110 to a hashing mechanism 112, which computes a hash of the chunk, referred to as the chunk hash 114. A strong hash function, such as a cryptographically secure SHA-256 or SHA-512 hash function or the like (which ensures an extremely low probability of hash collisions), may be used to compute the chunk hash 114 that uniquely identifies the chunk 110. Note that with such a secure hash, the probability of a hash collision is negligible, e.g., on the order of 30 orders of magnitude smaller than the hardware error rate of the most reliable hardware currently available.
The chunk hash 114 is provided to a hash index service 116 (which includes or is coupled to a reconciliation mechanism 117). If the chunk hash 114 is found (i.e., already exists) in the hash index service 116, a copy of the chunk 110 is considered to be already stored in the chunk store 118, and the current chunk need not be stored again. Instead, any reference to this chunk simply refers to the prior, existing chunk.
If the chunk hash 114 is not found in the hash index service 116, the chunk 110 is stored in the chunk store 118, and the chunk hash 114 is stored in the hash index service 116. As can be readily appreciated, given sufficient data over time, a large amount of storage may be saved by referencing a chunk rather than maintaining many separate instances of the same chunk of data. Chunks are also often compressed, saving even more storage; note that the hash may be computed on the uncompressed chunk before compression, and/or the hash may be computed after compression.
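The look-up-or-store behavior just described reduces to a few lines; in the sketch below the ChunkStore class, the dedup_chunk function, the dictionary standing in for the hash index service, and the (store, offset) chunk ID shape are assumptions for illustration only.

```python
import hashlib
import zlib

class ChunkStore:
    """Toy chunk store: appends (optionally compressed) chunks, returns (store, offset) IDs."""
    def __init__(self, store_id: int):
        self.store_id = store_id
        self.blob = bytearray()

    def append(self, chunk: bytes) -> tuple:
        compressed = zlib.compress(chunk)       # optional compression of the chunk
        offset = len(self.blob)
        self.blob += len(compressed).to_bytes(4, "big") + compressed
        return (self.store_id, offset)          # chunk ID as a (store, offset) pair

def dedup_chunk(chunk: bytes, index: dict, store: ChunkStore) -> tuple:
    """Return a chunk ID for `chunk`, storing the chunk only if its hash is unseen."""
    chunk_hash = hashlib.sha256(chunk).digest()  # strong hash uniquely identifies the chunk
    existing = index.get(chunk_hash)
    if existing is not None:
        return existing                          # duplicate: reference the existing chunk
    chunk_id = store.append(chunk)               # new chunk: persist it, then index it
    index[chunk_hash] = chunk_id
    return chunk_id

if __name__ == "__main__":
    index, store = {}, ChunkStore(store_id=1)
    first = dedup_chunk(b"hello world" * 100, index, store)
    second = dedup_chunk(b"hello world" * 100, index, store)
    assert first == second and len(index) == 1   # the second write was deduplicated
```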
The technology described herein relates to the architecture and algorithms of the hash index service 116, and more particularly to a subspace-based hash index service, and the concept of subspace reconciliation.
In general, a subspace is a smaller portion of the global index of the entire system, and generally corresponds to an initial partitioning of the available data to be deduplicated into multiple datasets. The partitioning is generally planned so that the data corresponding to any subspace is likely to deduplicate well (many chunks will match) against other data of that subspace. The hashes of the global data chunks are divided into subspace indexes, e.g., one subspace index for each respective data set. The data/hashes may be divided into subspaces based on virtually any criteria, such as file type (e.g., by extension), data type (e.g., image, text document, etc.), location (e.g., machine, volume), the application that created the file, file usage/access patterns (e.g., last modified time, etc.), user ID (file owner), user type, namespace (e.g., disk volume), content and/or metadata (e.g., to cluster similar files), file classification information (provided by manual or automatic means), learned/adaptive criteria (e.g., external feedback from analysis of previously deduplicated data on this or another system), or any other type of information that may be inferred or determined as characteristic of the file. Note that any combination of some or all of the above partitionings may be used, e.g., data with the same file type that was modified in the last month.
By way of example, if it is assumed that related data often appears in the storage system near the same time frame, using the last modification or last access time of a file as a criterion to divide the file namespace into index subspaces may be a useful partitioning scheme. For example, consider a file sent as an attachment to one hundred users, eighty of which save the file to their own respective document repositories. The system may divide the namespace into temporal ranges and treat each temporal range as a subspace.
As another example, partitioning by file type is based on the assumption that files of the same type, or of the same group of types, have a greater chance of having duplicate chunks among themselves. For example, text-based formats (e.g., .txt, .xml, .html, etc.) may deduplicate well with each other, but not with files maintained in a zip format. Thus, a system (e.g., an administrator) may define sets of related file types, such as all text formats, Office 2007, zip, etc., and treat the files belonging to each set as a subspace.
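As a rough illustration of such partitioning criteria, the sketch below maps a file to a subspace key by combining an assumed file-type grouping with a month-granularity modification-time range; the specific groups and the granularity are illustrative assumptions, not from the patent.

```python
from datetime import datetime

# Illustrative grouping of related file types (an administrator-defined policy
# in the text; the concrete groups below are assumptions for the sketch).
TYPE_GROUPS = {
    ".txt": "text", ".xml": "text", ".html": "text",
    ".docx": "office", ".xlsx": "office",
    ".zip": "archive", ".7z": "archive",
}

def subspace_key(extension: str, last_modified: datetime) -> str:
    """Map a file to a subspace by (type group, month of last modification)."""
    group = TYPE_GROUPS.get(extension.lower(), "other")
    time_range = last_modified.strftime("%Y-%m")   # one time range per month
    return f"{group}/{time_range}"

if __name__ == "__main__":
    print(subspace_key(".TXT", datetime(2011, 3, 14)))   # -> "text/2011-03"
    print(subspace_key(".zip", datetime(2011, 3, 2)))    # -> "archive/2011-03"
```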
By partitioning the data such that each respective hash index is a subspace, the subspace indexes can be stored on a secondary device (e.g., a hard disk), while some form of active subspace index is loaded into fast memory (e.g., RAM) to act as an efficient (online) hash index service. For example, a group of files/blobs of data may be grouped by their partition (and corresponding subspace) before being sent to the service 102, so that the active index changes only when the group changes. Furthermore, various methods may be used to select which subspaces to load (relating the subspaces to the files to be deduplicated, e.g., based on insertion time) before starting an optimization session, so as to maximize savings during the subsequent deduplication process.
The partitioning and subspace indexing of the data reduces the deduplication ratio (i.e., reduces deduplication space savings), because equivalent chunks may be indexed by different index subspaces and thus stored in multiple locations. At the same time, because a subspace is small (its size can be controlled by configurable parameters), the system can ensure that a loaded subspace will fit into main storage (RAM) for the entire duration of the deduplication process. As described below, different subspaces may be reconciled (e.g., in an offline operation when the server's CPU, memory, and I/O load is low) such that only one copy of any equivalent chunk present in two (or more) subspaces need be maintained, while the other copies are discarded. While in some environments each subspace may be able to be reconciled with every other subspace, in other environments there are too many subspaces to reconcile them all in this manner. As described below, each index subspace may be identified by a signature, where signature similarity (e.g., vector distance)/matching is used to select for reconciliation those subspaces whose signatures indicate a significant probability of duplication and that are thus likely to reconcile well with each other.
FIG. 2 shows an example architectural design of a hash index service 116 configured for subspace partitioning. As described above, the data/file storage service 102 includes a chunk store 118 and a hash index service 116, where the chunk store 118 maintains the chunk data, which are regions of data from one or more files. The chunk store 118 may comprise two or more component stores, e.g., S1, S2, ..., Sn, where each store Si comprises a set of chunks identified by a set of chunk IDs ci,1, ci,2, ..., ci,k, as shown in FIG. 3. A chunk ID ci,j identifies chunk j in store Si, and may be associated with an optional backward reference link to the file to which it belongs.
The hash index service 116 includes subspace indexes P1, P2, ..., Pm, where each subspace Pj comprises a set of hashes hj,1, hj,2, ..., hj,l and the associated chunk IDs cj,1, cj,2, ..., cj,l, as generally represented in FIGS. 2 and 4. Note that each chunk ID corresponds to a chunk storage location, as indicated by the dashed arrows in FIG. 2 (although only one such arrow is shown for simplicity). As described below, for deduplication purposes each subspace Pj may also be identified by a signature sigj. A subspace Pj may also include a plurality of other sub-sampled representations, as described below.
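A minimal sketch of these data structures follows, assuming a (store id, offset) pair for chunk IDs and a per-subspace hash-to-chunk-ID table; the class and field names are illustrative only, not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

ChunkId = Tuple[int, int]   # (store id, offset), mirroring the (k, off) pair in the text

@dataclass
class ChunkStoreS:
    """Component store S_i: a flat byte log addressed by offset."""
    store_id: int
    blob: bytearray = field(default_factory=bytearray)

    def put(self, chunk: bytes) -> ChunkId:
        offset = len(self.blob)
        self.blob += chunk
        return (self.store_id, offset)

@dataclass
class SubspaceP:
    """Index subspace P_j: a hash -> chunk ID table plus an optional compact signature."""
    name: str
    entries: Dict[bytes, ChunkId] = field(default_factory=dict)
    signature: Optional[object] = None
    sealed: bool = False                     # becomes read-only once the subspace fills up

    def add(self, chunk_hash: bytes, chunk_id: ChunkId) -> None:
        assert not self.sealed, "sealed subspaces accept no new entries"
        self.entries[chunk_hash] = chunk_id

if __name__ == "__main__":
    s1, p1 = ChunkStoreS(store_id=1), SubspaceP(name="P1")
    p1.add(b"\x01" * 32, s1.put(b"some chunk bytes"))
    print(p1.entries)   # {b'\x01...': (1, 0)}
```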
It is possible in the system to establish a correspondence between chunk stores and index subspaces, e.g., each chunk store Si corresponds to one and only one index subspace Pj; however, an absolute correspondence is not necessary. It is also possible to allow an index subspace Pj to point to chunks in multiple chunk stores (i.e., to have chunk IDs belonging to multiple chunk stores). It is likewise possible for the chunks in a chunk store Sj to be indexed by multiple subspaces. However, if the chunk store resides on a hard disk drive or another device with low random access performance (e.g., compared to an SSD) and good read performance is desired, it may be desirable to limit the number of chunk stores to which an index subspace Pj points.
If the data/file storage service 102 includes multiple machines, multiple chunk stores and multiple index subspaces may be allocated to each machine. An indexing service, implemented either as a centralized indexing service with failover support (e.g., via a clustering service in Server 2008) or as a decentralized indexing service, can be used to assign a particular chunk store Si or a particular index subspace Pj to a machine.
By way of summary, and to provide additional detail, FIG. 5 is a flowchart representing steps that occur when a new file is written to the file storage service 102 configured for subspace partitioning. At step 502, a chunk store and an index subspace are allocated for the file, so that new chunks (if any) and their hash index entries can be deposited. Step 504 represents using a content-aware chunking module to split the file into chunks (it is possible for the entire file to be a single chunk).
At step 506, the chunk hash is computed via a secure hash service (e.g., SHA-256). Note that one option, prior to computing the chunk hash, is to compress the chunk (if appropriate).
As represented via steps 508 and 510, the chunk hash is checked against one or more current caches of a hash index service that includes a current subspace for adding any new hashes, and possibly other (e.g., read-only) subspaces when memory permits. If the chunk hash is present in the current cache of the hash index service at step 510, the associated chunk ID of the previous chunk is returned at step 512. Otherwise, a null value is returned at step 514, indicating that the chunk hash was not found in the current cache of the hash index service.
Note that via the subspace Index (and possibly other factors as generally described in the aforementioned patent application entitled "Adaptive Index for data deduplication"), only a portion of the entire hash Index may be kept in the memory cache by the hash Index service, i.e., the hash Index service may selectively cache a portion of the hashes of the subspaces in the system. Thus, it is possible that the chunk hash does exist in the secondary storage of the hash index service, but is not immediately found by checking the current cache. The block hash access of the hash index service (step 508) triggers cache management of such a look-ahead cache in the hash index service. Depending on the hash hit pattern, the hash index service intelligently manages the current cache depending on the available memory resources. The hash of the least recently used subspace is not cached, leaving memory to cache the more actively used subspace. Furthermore, accesses in a less frequently used subspace may be subsampled (as generally described in the aforementioned patent application entitled "Adaptive Index for Data Deduplication") to reduce memory consumption. Note that in addition to the current subspace where new hashes are being added, other subspaces in the hash index service may be read-only, in which event the cache replacement policy is for a read-only cache, whose implementation and management is relatively straightforward. Further note that cache management may use the signature associated with the subspace.
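The cache management described above can be sketched as a least-recently-used cache of whole subspaces; the capacity unit (a number of subspaces) and the simple LRU eviction shown below are simplifying assumptions rather than the adaptive policy the text refers to.

```python
from collections import OrderedDict

class SubspaceCache:
    """Keep at most `capacity` subspaces' hash tables in memory, LRU-evicted.
    A minimal stand-in for the cache management described above."""
    def __init__(self, capacity: int, load_subspace):
        self.capacity = capacity
        self.load_subspace = load_subspace      # callable: name -> {hash: chunk_id}
        self.loaded = OrderedDict()             # subspace name -> hash table

    def lookup(self, subspace_name: str, chunk_hash: bytes):
        table = self.loaded.get(subspace_name)
        if table is None:
            if len(self.loaded) >= self.capacity:
                self.loaded.popitem(last=False)         # evict least recently used
            table = self.load_subspace(subspace_name)   # read from secondary storage
            self.loaded[subspace_name] = table
        else:
            self.loaded.move_to_end(subspace_name)      # mark as recently used
        return table.get(chunk_hash)
```

In practice load_subspace would read the subspace's hash/chunk-ID pairs from disk, and a subsampled representation could be loaded instead of the full table for less frequently used subspaces.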
Step 516 represents optionally compressing the block, e.g., via the LZW algorithm or a variant thereof; (note that chunk compression may instead occur before step 506). The compression may be different for different types of data. It is further noted that not all data is compressed well, and thus step 516 may not be performed for certain types of data, or step 516 may not be performed at all.
When the hash index service returns null (step 514), the chunk is deposited to the chunk store, returning a chunk ID indicating the location (offset) of the chunk in the store, as represented by step 518. At step 520, the hash and chunk ID for the chunk are deposited in the current subspace of the hash index service. Note that in this process, only the block store write operation at step 518 needs to be persisted before the block is considered to have been successfully written. The chunk hash and chunk ID pairs may be written (step 520) to the current subspace of the hash index service in a lazy manner. In the event that the write operation of the subspace fails (e.g., due to a power failure, etc.), the hash index service may scan the chunk store and reconstruct the chunk hash and chunk ID pair to reconstruct the unwritten portion of the index subspace of the hash index service.
Steps 522 and 523 return to step 506 to repeat the process for any remaining chunks. When all chunks of the entire file have been persisted in the chunk store, or found to be duplicate copies with existing chunk IDs, the chunk IDs for the entire file are returned at step 524. These are used to form a stream map of the file, allowing the file data to be accessed from the chunk store as needed via the chunk IDs.
An alternative implementation at this stage is: instead, the chunk hashes for the chunks of the entire file are returned, as well as the current subspace. This alternative uses hashes in the stream map to identify the chunks. This alternative implementation has advantages and disadvantages, as described herein.
More specifically, there are alternative implementations of the subspace-based hash index service, with different implementation choices for the chunk, chunk ID, hash, and reference count. As generally described herein, in data deduplication a chunk is the basic unit of deduplication, and deduplication is achieved by finding chunks with the same hash value and storing only one copy of the chunk. In a simple implementation, the chunks do not overlap. As also described herein, each chunk is identified by a chunk ID, which is a reference used by the file storage service to locate the chunk. Because each chunk is stored in a chunk store, the chunk ID is a pair of numbers (k, off) identifying the chunk as stored in chunk store Sk at offset off. Each chunk is uniquely identified by its hash, which is typically computed by applying the hash function to the compressed chunk stream; (in a concept called hierarchical chunking, the hash of a parent chunk may also be computed by applying the hash directly to the hash values of its child chunks).
A file in the file storage service is identified by the chunks that compose it, which may take the form of chunk IDs or hashes. Thus, the file may be described as:

File = {chunk ID1, chunk ID2, ..., chunk IDm}

or

File = {hash1, hash2, ..., hashm}

These two schemes are called the chunk ID description scheme and the hash description scheme, respectively. The chunk ID description scheme has been generally described above, and thus the hash description scheme is described below.
In the hash description scheme, the file descriptor contains a relative reference, i.e., the hash of each chunk. A second table, in this case a <hash, chunk_id, ref_count> mapping table in the hash index service, is consulted to obtain the location of the chunk. This adds an additional layer of indirection during file access and may degrade read access performance. However, the advantage is that the deduplication (and garbage collection) operations can be performed more quickly.
In the hash description scheme, no separate chunk ID table needs to be maintained; the subspace corresponds directly to the chunk store. Because the number of hashes in a data/file storage service can be large, and the scheme assumes that not all hashes can be loaded into main system memory, the file descriptor may further record the subspaces into which the hashes were deposited or later deduplicated. With the subspace information, the file storage service is given a hint as to which subspaces it needs to consult to retrieve the chunk_id of each relevant chunk.
Thus, a file identifier takes the form:

File = {hash1, hash2, ..., hashm, P1, ..., Pk}

where P1, ..., Pk represent the indexes of the subspaces. When the file is accessed, a lookup operation is performed, e.g., hash1, hash2, ..., hashm are looked up in the hash index service to retrieve the relevant chunk_id information.
Note that because each hash corresponds to a unique chunk, when looking up a hash hashi, if a hash hashj is found that is identical to hashi but resides in a different subspace, hashj may be used to retrieve the chunk for hashi. (The chunks corresponding to hashj and hashi are deduplicable, e.g., later during reconciliation as described below.)
If no corresponding hash is found in the current cache of the hash index service, the hashes of the subspaces P1, ..., Pk are loaded into the current cache. If memory is exhausted, the hashes of the least recently used subspace are removed to free memory.
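A sketch of file access under the hash description scheme follows; the descriptor shape (a dict of hashes plus subspace hints) and the in-memory subspace tables are assumptions chosen to keep the example self-contained.

```python
def resolve_file(file_descriptor: dict, subspaces: dict) -> list:
    """Resolve File = {hash_1..hash_m, P_1..P_k} into chunk IDs.

    `file_descriptor` is assumed to be {"hashes": [...], "subspaces": [...]} and
    `subspaces` maps a subspace name to its {hash: chunk_id} table; both shapes
    are illustrative, not taken from the patent.
    """
    chunk_ids = []
    for h in file_descriptor["hashes"]:
        for name in file_descriptor["subspaces"]:   # hint: which indexes to consult
            chunk_id = subspaces[name].get(h)
            if chunk_id is not None:
                chunk_ids.append(chunk_id)
                break
        else:
            raise KeyError(f"hash {h!r} not found in the listed subspaces")
    return chunk_ids

if __name__ == "__main__":
    subspaces = {"P1": {b"h1": (1, 0)}, "P2": {b"h2": (2, 0)}}
    descriptor = {"hashes": [b"h1", b"h2"], "subspaces": ["P1", "P2"]}
    print(resolve_file(descriptor, subspaces))      # [(1, 0), (2, 0)]
```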
The data deduplication operation is then straightforward. When deduplicating a destination subspace Pi against a source subspace Pj, the hashes of the source subspace Pj are loaded into memory. Each hash of the destination subspace Pi is then checked against the hashes of the source subspace Pj. Whenever a duplicate is found, its hash entry in the destination subspace is removed.
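This subspace-against-subspace deduplication reduces to a dictionary intersection, sketched below with plain hash-to-chunk-ID tables standing in for the subspaces; the data shapes are assumptions for illustration.

```python
def dedup_subspace(destination: dict, source: dict) -> dict:
    """Deduplicate destination subspace P_i against source subspace P_j.

    Both arguments are {hash: chunk_id} tables (the source's table is what would
    be loaded into memory). Returns the hashes removed from the destination,
    mapped to the surviving chunk ID in the source.
    """
    removed = {}
    for chunk_hash in list(destination):
        if chunk_hash in source:                 # duplicate chunk across subspaces
            removed[chunk_hash] = source[chunk_hash]
            del destination[chunk_hash]          # keep only the source's copy
    return removed

if __name__ == "__main__":
    p_i = {b"h1": (2, 0), b"h2": (2, 64)}
    p_j = {b"h1": (1, 0)}
    print(dedup_subspace(p_i, p_j))              # {b'h1': (1, 0)}; p_i keeps only b'h2'
```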
The garbage collection operation on an associated chunk store Sj can also be performed in a straightforward manner, as follows:
1. Load the hashes of the subspaces P1, ..., Pk associated with chunk store Sj. Copy the stream of the source chunk store Sj to a new destination chunk store Sk; chunks whose hashes are found in subspaces P1, ..., Pk are copied in full, while any chunk whose hash is not found in subspaces P1, ..., Pk is not copied.
2. Update subspaces P1, ..., Pk with the chunk IDs in the new destination chunk store Sk.
3. The source chunk store Sj may then be deleted.
Because there is no back-reference link in the hash description scheme, both deduplication optimization and garbage collection operations are straightforward.
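The three garbage-collection steps above might look roughly like the following sketch; the data shapes (old chunk ID to (hash, bytes), subspace tables as dicts) are assumptions chosen to keep the example self-contained, not the patent's on-disk layout.

```python
def collect_chunk_store(source_chunks: dict, subspaces: list) -> dict:
    """Copy-forward garbage collection for one chunk store, per the steps above.

    `source_chunks` maps old chunk ID -> (hash, bytes); `subspaces` is a list of
    {hash: chunk_id} tables indexing that store. Chunks whose hash appears in
    some subspace are copied to a new store and the subspaces are repointed;
    unreferenced chunks are dropped.
    """
    live_hashes = set()
    for table in subspaces:
        live_hashes.update(table.keys())                   # step 1: load the hashes

    new_store, offset = {}, 0
    relocation = {}                                        # hash -> new chunk ID
    for old_id, (chunk_hash, data) in source_chunks.items():
        if chunk_hash in live_hashes:                      # still referenced: copy forward
            new_id = ("S_k", offset)
            new_store[new_id] = (chunk_hash, data)
            relocation[chunk_hash] = new_id
            offset += len(data)

    for table in subspaces:                                # step 2: update chunk IDs
        for chunk_hash in table:
            if chunk_hash in relocation:
                table[chunk_hash] = relocation[chunk_hash]

    source_chunks.clear()                                  # step 3: delete the old store
    return new_store
```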
Turning to subspace reconciliation (which is typically performed during off-peak hours, when the system has a relatively low workload and available system resources), subspace reconciliation operations may be performed to find and remove duplicate entries, and thus duplicate chunks, between subspaces. Note that the reconciliation operation may be interrupted if significant user and/or system activity is detected. In the case of an interruption, the hash index service continues to use one of the subspaces, e.g., subspace A if subspaces A and B were being reconciled.
There are various ways to perform the coordination, including coordinating each subspace with every other subspace. Note that a subspace may be reconciled with multiple subspaces at the same time, e.g., when memory space permits. Further, note that not all subspaces need be loaded at the same time, or even in the same session, e.g., subspace X may be reconciled against subspaces A, B and C, and then later reconciled against, e.g., subspaces D and E.
Another way to reconcile subspaces is to reconcile the new subspace with other subspaces that have already been reconciled with each other. For example, if subspaces are added to the deduplication system one at a time, the added subspaces may be reconciled against other subspaces (e.g., they are likewise reconciled one at a time as each is added, and need not reconcile with each other), and so on.
Another way is by type of subspace (e.g., corresponding to the partition). For example, groups of subspaces may be defined based on those having the same file type or data type. The reconciliation may be limited to subspaces in the same group, for example, because it is likely that subspaces in a group of subspaces for text data are well reconciled with each other, but it is unlikely that these subspaces are well reconciled with the subspace for indexing image data.
FIG. 6 is a flow diagram illustrating example steps for reconciling one subspace against one or more other subspaces. Step 602 represents loading the one or more other subspaces into memory as an index; note that an index may be expensive in terms of memory consumption, whereas scanning hashes is inexpensive because it uses sequential read I/O, and thus it is advantageous to index the smaller group of subspaces and scan the larger group. Step 604 begins scanning the subspace being reconciled, e.g., sequentially selecting its hashes one at a time and looking up each one in the index for a matching hash (step 606). If no matching hash is found (step 608), the next hash is selected and the process repeats (step 614) until this subspace has been reconciled.
If a matching hash is found at step 608, step 610 marks the subspace hash entry and the chunk as deleted. Step 612 updates the pointers in any files or other entities that reference the duplicate chunk so that they reference the remaining (deduplicated) chunk; note that a database of chunks and the files referencing those chunks is maintained, so that these files can be found and updated. When no file or other entity references the chunk any longer, the subspace hash entry and the chunk may be safely deleted, e.g., as described below with respect to compaction at step 616.
As described above, the process repeats via step 614 until the subspace has been reconciled against the loaded index (entries for one or more subspaces). At this point, compaction, as represented by step 616, may be performed, although compaction may occur at a later time, for example, in a garbage collection type operation.
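A compact sketch of the FIG. 6 flow follows, with in-memory dictionaries standing in for the scanned subspace, the loaded index, and the file-reference database; the step numbers in the comments refer to FIG. 6, while the data shapes are assumptions.

```python
def reconcile(scanned: dict, loaded_index: dict, references: dict) -> list:
    """Reconcile one subspace (scanned sequentially) against an in-memory index
    built from one or more other subspaces, along the lines of FIG. 6.

    `scanned` and `loaded_index` are {hash: chunk_id}; `references` maps a file
    name to its list of chunk IDs (a stand-in for the chunk/file database the
    text mentions). Duplicate entries are deleted from `scanned` and all file
    references are repointed at the surviving chunk.
    """
    deleted_chunk_ids = []
    for chunk_hash in list(scanned):                      # steps 604/606: scan and look up
        surviving_id = loaded_index.get(chunk_hash)
        if surviving_id is None:
            continue                                      # no match: keep the entry
        duplicate_id = scanned.pop(chunk_hash)            # step 610: mark entry/chunk deleted
        deleted_chunk_ids.append(duplicate_id)
        for chunk_ids in references.values():             # step 612: repoint references
            for i, cid in enumerate(chunk_ids):
                if cid == duplicate_id:
                    chunk_ids[i] = surviving_id
    return deleted_chunk_ids                              # candidates for compaction (step 616)
```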
Compaction may be performed by copying the remaining index entries of the reconciled subspace (i.e., those that are not marked as deleted/eliminated) to the new subspace. An alternative is to move the index entries from the end of the subspace into the "holes" present in the subspace, i.e. overwrite those entries marked as deleted. Block storage can be similarly compacted.
Compaction results in smaller subspaces (and one or more smaller chunk stores), which is permissible. Alternatively, rather than keeping a smaller subspace (or chunk store), the subspace may be reopened to add new entries; for efficiency, the point at which new entries begin may be tracked so that a later reconciliation can start from that point, since the earlier entries were already found not to match during the previous reconciliation. A smaller chunk store may similarly be reopened to add new chunks.
In another alternative, rather than moving entries or blocks, a bitmap may be used to track where new entries may be added to a subspace, or where new blocks are added to a block store.
It is readily understood that a new subspace may be reconciled against a set of one or more older subspaces (new-to-old reconciliation); old-to-new reconciliation is also possible, but doing so typically involves more data movement. More specifically, during the reconciliation process, the files associated with the subspace being deduplicated incur higher access costs because access to those files becomes more fragmented. The hash index service may select which subspace to deduplicate based on the access patterns of the content, so as to improve access performance, e.g., so that commonly used files are not overly fragmented. The hash index service may also adjust reconciliation to balance deduplication savings against the amount of fragmentation.
For example, consider two subspaces A and B that are similar to each other, and observe that subspace B is accessed (read or updated) more frequently than subspace A, or that subspace B contains an updated version of the contents of subspace A, so that subspace A is more likely to be considered the version history of subspace B and is expected to be accessed less frequently. The hash index service may choose to de-duplicate subspace A against subspace B so that the more frequently accessed subspace B retains better access performance.
In an alternative embodiment, the deduplication chain may be limited to control the cost of file access. With deduplication optimization, the number of I/O operations required to access a file may be evaluated by counting the number of branches in the deduplication graph. For example, if subspace A is deduplicated against subspaces B and C, and subspace B is further deduplicated against subspace D, there are three branches in the deduplication graph, and the number of I/O accesses for the files associated with subspace A is 4 (= 3 + 1). To limit the number of I/Os needed to access a file to no more than K, the reconciliation tree cannot have more than K-1 branches.
If the deduplication system is storing periodic backup data, e.g., B0 for day 0, B1 for day 1, B2 for day 2, and so forth, then to ensure that restoring the most recent backup remains reasonably efficient (e.g., incurs no more than K accesses), the deduplication optimization and reconciliation operations may be performed as follows. First, the full backup of day 0 is stored as B0. When backup B1 of day 1 is performed, it is deduplicated against B0. When backup B2 of day 2 is performed, it is deduplicated against B0 and B1, and so on. On day K, backup BK is performed, and then a reconciliation operation is applied to deduplicate the backup B0 against BK. Backup BK+1 of day K+1 is deduplicated against BK; the backup of day K+2 is deduplicated against BK and BK+1. On day 2K, backup B2K is performed, and then a reconciliation operation is applied to deduplicate the backup BK against B2K.
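For concreteness, the sketch below computes which earlier backups a given day's backup would be deduplicated against under this schedule; it is a small illustration of the schedule described above, not code from the patent.

```python
def dedup_targets(day: int, K: int) -> list:
    """Return the earlier backups that day `day`'s backup B_day is deduplicated
    against under the periodic-backup schedule described above."""
    if day == 0:
        return []                          # day 0: full backup, nothing to dedup against
    anchor = (day // K) * K                # most recent reconciliation anchor
    if day % K == 0:
        anchor -= K                        # on day K, 2K, ...: targets start at the prior anchor
    return [f"B{d}" for d in range(anchor, day)]

if __name__ == "__main__":
    K = 7
    for day in (1, 2, 7, 8, 14):
        print(day, dedup_targets(day, K))
    # Day 7 additionally triggers reconciliation of B0 against B7, and day 14 of B7 against B14.
```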
In many environments, there are too many subspaces/insufficient resources to reconcile all subspaces with each other. In this case, a selection mechanism of the hash index service is used to select which subspaces to reconcile. As described herein, in a hash index service, the hashes of a file storage service are divided into subspaces to allow the hash index service to scale, and one or more signatures may be computed for each subspace so that the subspace may be compactly represented. The signatures of the subspaces are loaded into the hash index service (e.g., to determine which subspaces are to be loaded/unloaded by the cache management unit of the hash index service and at what sampling rate), and which two subspaces are to be reconciled in further data deduplication.
To this end, generally speaking, during a reconciliation operation, the hash index service examines the signatures (or other representative data) of the index subspaces in the system and computes similarities between any two subspaces based on the signatures to determine potential deduplication savings between the two subspaces. Because the subspace signature is very compactly represented, the operation of finding a subspace to be deduplicated can be performed very quickly.
The signature of a subspace may take various forms. Let subspace Pj comprise a set of hashes hj,1, hj,2, ..., hj,l. A number of sample signatures of subspace Pj are:
1. A minimum hash and its variants (e.g., the hash value closest to a particular constant); for example, a maximum hash is one such variant:
Sig1(Pj) = min{hj,1, hj,2, ..., hj,l}
2. A k-minimum hash (k-min hash) or a variant thereof:
Sig2(Pj) is the set of the k minimum hash values among {hj,1, hj,2, ..., hj,l}.
Other algorithms that generate a deterministic set of k hash values from the complete set of hash values can also be used, e.g., the signature may be the k hash values closest to a constant H.
3. A bloom filter:
The signature is a bloom filter formed from the complete set of hash values hj,1, hj,2, ..., hj,l.
4. A k-minimum hash + bloom filter:
The signature is a bloom filter formed from the k minimum hash values obtained in 2.
5. A b-bit minwise hash (as described in P. Li and A. C. König, "b-Bit Minwise Hashing", WWW 2010):
The signature is a b-bit minwise hash formed from the complete set of hash values hj,1, hj,2, ..., hj,l.
6. A k-minimum hash + b-bit minwise hash:
The signature is a b-bit minwise hash over the k minimum hash values obtained in 2.
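The first few of these signatures can be sketched in a few lines; in the sketch below the bloom filter size and probe count are illustrative assumptions, and hash values are represented as integers for simplicity.

```python
import hashlib

def min_hash(hashes) -> int:
    """Signature 1: the single minimum hash value of the subspace."""
    return min(hashes)

def k_min_hashes(hashes, k: int) -> list:
    """Signature 2: the k smallest hash values of the subspace."""
    return sorted(hashes)[:k]

def bloom_filter(hashes, m_bits: int = 1024, probes: int = 3) -> int:
    """Signature 3: a bloom filter, returned as an m_bits-wide integer bitmap.
    The filter size and probe count are illustrative assumptions."""
    bitmap = 0
    for h in hashes:
        for seed in range(probes):
            digest = hashlib.sha256(seed.to_bytes(1, "big") + h.to_bytes(32, "big")).digest()
            bit = int.from_bytes(digest, "big") % m_bits
            bitmap |= 1 << bit
    return bitmap

if __name__ == "__main__":
    subspace_hashes = [int.from_bytes(hashlib.sha256(bytes([i])).digest(), "big")
                       for i in range(100)]
    print(min_hash(subspace_hashes) == k_min_hashes(subspace_hashes, 1)[0])   # True
    sig4 = bloom_filter(k_min_hashes(subspace_hashes, 16))   # signature 4: k-min hash + bloom
```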
For each signature computation method, a similarity metric R(Pi, Pj) is defined, representing how similar (i.e., how deduplicable) two subspaces are. In one implementation, the similarity metric R(Pi, Pj) takes a value from 0 to 1; the greater the similarity metric, the more likely it is that the subspaces Pi and Pj share a significant amount of duplication. The similarity metric for each signature may be:
1. Minimum hash:
R(Pi, Pj) = 1 if the minimum hashes of the two subspaces are equal;
R(Pi, Pj) = 0 otherwise.
2. k-minimum hash:
R(Pi, Pj) = q/k if there are q equal (common) minimum hash values between subspaces Pi and Pj.
3. Bloom filter and/or k-minimum hash + bloom filter:
R(Pi, Pj) = length(BF(Pi) AND BF(Pj)) / length(BF(Pi) OR BF(Pj)), where BF(P) denotes the bloom filter signature of subspace P.
(Note that one known property of bloom filters is that the bitwise AND of two bloom filters (which are arrays of bits) contains the bloom filter of the "intersection" of the two sets Pi and Pj. Likewise, the bitwise OR of two bloom filters represents the union of the two sets Pi and Pj. In addition, the "length" operator (which may be implemented simply by counting the number of bits set in the bloom filter) provides a way to estimate the number of hash values inserted into the bloom filter. Summarizing the above formula: this metric approximates the amount of commonality between the two sets.)
4. b-bit minwise hash and/or k-minimum hash + b-bit minwise hash (based on Algorithm 1 in P. Li and A. C. König, "b-Bit Minwise Hashing", WWW 2010).
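The k-minimum-hash and bloom-filter similarity metrics can likewise be sketched directly from their definitions; the signatures are assumed to be a list of integers and an integer bitmap, respectively, matching the earlier signature sketch.

```python
def r_k_min(sig_a: list, sig_b: list) -> float:
    """Similarity metric 2: q/k, the fraction of common values among the two
    k-minimum-hash signatures (assumes both signatures use the same k)."""
    k = len(sig_a)
    q = len(set(sig_a) & set(sig_b))
    return q / k

def r_bloom(bf_a: int, bf_b: int) -> float:
    """Similarity metric 3: length of the bitwise AND (intersection estimate)
    over the length of the bitwise OR (union estimate) of the two filters."""
    union = bf_a | bf_b
    if union == 0:
        return 0.0
    return bin(bf_a & bf_b).count("1") / bin(union).count("1")

if __name__ == "__main__":
    print(r_k_min([1, 2, 3, 4], [2, 3, 5, 7]))   # 0.5: two of the four minima are shared
```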
Note that the various subspaces can be compared to one another, forming a distance map or other similar clustering arrangement. The selection may then be based on such clusters of subspaces.
Once the subspace to be reconciled is identified, deduplication may be performed between the selected subspaces as described above (e.g., deduplication of subspace A against subspace B). Coordination may be performed by loading the hash index of subspace B into memory (e.g., into a hash table).
In an alternative, the deduplicated subspace A' of A may be generated during reconciliation if memory permits, thereby performing compaction substantially dynamically. To this end, for each hash index entry in subspace A, the process checks whether the same hash entry already exists in subspace B. If the hash does not exist, the hash index entry is added to subspace A' in this alternative. If the hash already exists, the hash index entry in subspace A may be deduplicated and effectively "removed" from subspace A but not added to subspace A'. After the entire subspace A has been deduplicated against subspace B, the original subspace A may be removed and replaced with the compacted/deduplicated subspace A'.
Exemplary networked and distributed environments
Those skilled in the art will appreciate that the embodiments and methods described herein may be implemented in connection with any computer or other client or server device, which may be deployed as part of a computer network or in a distributed computing environment, and which may be connected to any type of data store or stores. In this regard, the embodiments described herein may be implemented in any computer system or environment having any number of memory or storage units, and any number of applications and processes occurring across any number of storage units. This includes, but is not limited to, an environment with server computers and client computers deployed in a network environment or distributed computing environment, having remote or local storage.
Distributed computing provides sharing of computer resources and services through the exchange of communications between computing devices and systems. These resources and services include the exchange of information, cache storage and disk storage for objects such as files. These resources and services also include processing power sharing among multiple processing units for load balancing, resource expansion, processing specialization, and so forth. Distributed computing makes use of network connections, allowing clients to leverage their collective power to benefit the entire enterprise. In this regard, various devices may have applications, objects, or resources that may participate in a resource management mechanism as described with reference to various embodiments of the invention.
FIG. 7 provides a schematic diagram of an exemplary networked or distributed computing environment. The distributed computing environment comprises computing objects 710, 712, etc. and computing objects or devices 720, 722, 724, 726, 728, etc. that may include programs, methods, data stores, programmable logic, etc. as represented by example applications 730, 732, 734, 736, 738. It is to be appreciated that computing objects 710, 712, etc. and computing objects or devices 720, 722, 724, 726, 728, etc. can comprise different devices, such as Personal Digital Assistants (PDAs), audio/video devices, mobile phones, MP3 players, personal computers, laptop computers, etc.
Each computing object 710, 712, etc. and computing objects or devices 720, 722, 724, 726, 728, etc. can communicate with one or more other computing objects 710, 712, etc. and computing objects or devices 720, 722, 724, 726, 728, etc. by way of the communications network 740, either directly or indirectly. Although illustrated as a single element in fig. 7, communications network 740 may comprise other computing objects and computing devices that provide services to the system of fig. 7, and/or may represent multiple interconnected networks, which are not shown. Each computing object 710, 712, etc. or computing object or device 720, 722, 724, 726, 728, etc. can also contain an application, such as applications 730, 732, 734, 736, 738, that can utilize an API, or other object, software, firmware, and/or hardware, suitable for communication with the application implementations provided in accordance with the various embodiments of the invention.
There are a variety of systems, components, and network configurations that support distributed computing environments. For example, computing systems may be connected together by wired or wireless systems, by local networks, or by widely distributed networks. Currently, many networks are coupled to the Internet, which provides an infrastructure for widely distributed computing and encompasses many different networks, but any network infrastructure can be used to facilitate exemplary communications with the system as described in various embodiments.
Thus, a host of network topologies and network infrastructures, such as client/server, peer-to-peer, or hybrid architectures, can be utilized. A "client" is a member of a class or group that uses the services of another class or group to which it is not related. A client may be a process, for example, roughly a set of instructions or tasks, that requests a service provided by another program or process. The client process uses the requested service without having to "know" any working details about the other program or the service itself.
In a client/server architecture, particularly a networked system, a client is typically a computer that accesses shared network resources provided by another computer (e.g., a server). In the illustration of FIG. 7, as a non-limiting example, computing objects or devices 720, 722, 724, 726, 728, etc. can be thought of as clients and computing objects 710, 712, etc. can be thought of as servers where computing objects 710, 712, etc. act as servers providing data services, such as receiving data from client computing objects or devices 720, 722, 724, 726, 728, etc., storing data, processing data, transmitting data to client computing objects or devices 720, 722, 724, 726, 728, etc., although any computer can be considered a client, a server, or both, depending on the circumstances.
A server is typically a remote computer system accessible over a remote or local network, such as the Internet or wireless network infrastructure. A client process may be active in a first computer system and a server process may be active in a second computer system, communicating with each other over a communications medium, thereby providing distributed functionality and allowing multiple clients to take advantage of the information-gathering capabilities of the server.
In a network environment in which the communications network 740 or bus is the Internet, for example, the computing objects 710, 712, etc. can be Web servers with which other computing objects or devices 720, 722, 724, 726, 728, etc. communicate via any of a number of known protocols, such as the Hypertext transfer protocol (HTTP). Computing objects 710, 712, etc. acting as servers may also serve as clients, e.g., computing objects or devices 720, 722, 724, 726, 728, etc., as may be characteristic of a distributed computing environment.
Exemplary computing device
As noted above, the techniques described herein may be advantageously applied to any device. It should be understood, therefore, that handheld, portable and other computing devices and computing objects of all kinds are contemplated for use in connection with the various embodiments. Accordingly, the general purpose remote computer described below in FIG. 8 is but one example of a computing device.
Embodiments may be implemented in part via an operating system, for use by a developer of services for a device or object, and/or included within application software that operates to perform one or more functional aspects of the embodiments described herein. Software may be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers, such as client workstations, servers or other devices. Those skilled in the art will appreciate that computer systems have a variety of configurations and protocols that can be used to communicate data, and thus no particular configuration or protocol should be considered limiting.
FIG. 8 thus illustrates an example of a suitable computing system environment 800 in which one or more aspects of the embodiments described herein can be implemented; although, as made clear above, the computing system environment 800 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality. Moreover, the computing system environment 800 is not intended to be interpreted as having any dependency relating to any one or combination of components illustrated in the exemplary computing system environment 800.
With reference to FIG. 8, an exemplary remote device for implementing one or more embodiments includes a general purpose computing device in the form of a computer 810. Components of computer 810 may include, but are not limited to, a processing unit 820, a system memory 830, and a system bus 822 that couples various system components including the system memory to the processing unit 820.
Computer 810 typically includes a variety of computer readable media and can be any available media that can be accessed by computer 810. The system memory 830 may include computer storage media in the form of volatile and/or nonvolatile memory such as Read Only Memory (ROM) and/or Random Access Memory (RAM). By way of example, and not limitation, system memory 830 may also include an operating system, application programs, other program modules, and program data.
A user may enter commands and information into the computer 810 through input device(s) 840. A monitor or other type of display device is also connected to the system bus 822 via an interface, such as output interface 850. In addition to a monitor, computers can also include other peripheral output devices such as speakers and a printer, which may be connected through output interface 850.
The computer 810 may operate in a networked or distributed environment using logical connections to one or more other remote computers, such as remote computer 870. The remote computer 870 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, or any other remote media consumption or transmission device, and may include any or all of the elements described above relative to the computer 810. The logical connections depicted in FIG. 8 include a network 872, such as a Local Area Network (LAN) or a Wide Area Network (WAN), but may also include other networks/buses. Such networking environments are commonplace in homes, offices, enterprise-wide computer networks, intranets and the Internet.
As mentioned above, while exemplary embodiments have been described in connection with various computing devices and network architectures, the underlying concepts may be applied to any network system and any computing device or system in which it is desirable to improve the efficiency of resource usage.
Moreover, there are numerous ways of implementing the same or similar functionality, e.g., an appropriate API, tool kit, driver code, operating system, control, standalone or downloadable software object, etc., which enables applications and services to use the techniques provided herein. Thus, embodiments herein are contemplated from the standpoint of an API (or other software object), as well as from a software or hardware object that implements one or more embodiments as described herein. Thus, various embodiments described herein may have aspects that are wholly in hardware, partly in hardware and partly in software, as well as in software.
The word "exemplary" is used herein to mean serving as an example, instance, or illustration. For the avoidance of doubt, the subject matter disclosed herein is not limited to these examples. Moreover, any aspect or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Furthermore, to the extent that the terms "includes," "has," "includes," and other similar words are used, for the avoidance of doubt, such terms are intended to be inclusive in a manner similar to the term "comprising" as an open transition word without precluding any additional or other elements when employed in a claim.
As noted, the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. As used herein, the terms "component," "module," "system," and the like are likewise intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computer and the computer can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
The system as described above has been described with reference to interaction between several components. It will be understood that these systems and components may include components or specified sub-components, some specified components or sub-components, and/or additional components, and according to various permutations and combinations of the above. Sub-components can also be implemented as components communicatively coupled to other components rather than included within parent components (hierarchical). Additionally, it should be noted that one or more components may be combined into a single component providing aggregate functionality or divided into several separate sub-components, and any one or more middle layers, such as a management layer, may be provided to communicatively couple to such sub-components in order to provide integrated functionality. Any components described herein may also interact with one or more other components not specifically described herein but generally known by those of skill in the art.
In view of the exemplary systems described herein, methodologies that may be implemented in accordance with the described subject matter can also be understood from the flow charts with reference to the figures. While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of blocks, it is to be understood and appreciated that the embodiments are not limited by the order of the blocks, as some blocks may occur in different orders and/or concurrently with other blocks from what is depicted and described herein. Although non-sequential or branched flow is illustrated via flowchart, it can be appreciated that various other branches, flow paths, and orders of the blocks, may be implemented which achieve the same or similar result. Moreover, some illustrated blocks may be optional in implementing the methodologies described hereinafter.
Conclusion
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.
In addition to the embodiments described herein, it is to be understood that other similar embodiments may be used or modifications and additions may be made to the described embodiments for performing the same or equivalent function of the corresponding embodiments without deviating therefrom. Further, multiple processing chips or multiple devices may share the performance of one or more functions described herein, and similarly, storage may be effected across multiple devices. Accordingly, the present invention should not be limited to any single embodiment, but rather should be construed in breadth, spirit and scope in accordance with the appended claims.

Claims (11)

1. A method for data deduplication, comprising:
loading a subspace index (P1-Pm) associated with a subspace from a secondary media into a main memory cache, the subspace index including less than all index entries of a hash index service (116);
reconciling the subspace with another subspace based on signature similarity to remove at least one duplicate index entry between the subspace and the other subspace, including using a similarity metric to determine signature similarity between the subspace and the other subspace; and
de-duplicating data using the subspace index, including chunking (504) the data into one or more chunks, and for each chunk, determining (510) whether a hash value computed for the chunk matches a hash value of an index entry in the main memory cache, and if not, storing (518) the chunk and adding (518) the index entry corresponding to the hash value to the subspace index in association with a reference to the chunk, and if so, returning (512) information that can be used to locate an existing chunk.
2. The method of claim 1, further comprising: loading at least a portion of another subspace index into the main memory cache, or partitioning the hash index service into a plurality of subspace indexes based on one or more criteria, or partitioning data to be deduplicated into a plurality of data sets based on one or more criteria of the data, and partitioning the hash index service into a plurality of subspace indexes corresponding to the plurality of data sets, or any combination of: loading at least a portion of another subspace index into the main memory cache, or partitioning the hash index service into a plurality of subspace indexes based on one or more criteria, or partitioning data to be deduplicated into a plurality of data sets based on one or more criteria of the data, and partitioning the hash index service into a plurality of subspace indexes corresponding to the plurality of data sets.
3. The method of claim 1, wherein returning information that can be used to locate an existing chunk comprises returning a chunk identifier, or returning the hash value and an identifier of the subspace, wherein the chunk identifier comprises a reference to a chunk associated with the index entry.
4. The method of claim 1, further comprising: selecting the other subspace to reconcile based on a similarity between at least one signature representative of the subspace and at least one signature representative of the other subspace.
5. A system for data deduplication, comprising:
means for loading a subspace index (P1-Pm) associated with a subspace from a secondary media into a main memory cache, the subspace index including fewer than all index entries of a hash index service (116);
means for reconciling the subspace with another subspace to remove at least one duplicate index entry between the subspace and the other subspace based on signature similarity, comprising determining signature similarity between the subspace and the other subspace using a similarity metric; and
means for de-duplicating data using the subspace index, the de-duplicating data comprising chunking (504) the data into one or more chunks, and for each chunk, determining (510) whether a hash value computed for the chunk matches a hash value of an index entry in the main memory cache, and if not, storing (518) the chunk and adding (518) the index entry corresponding to the hash value to the subspace index in association with a reference to the chunk, and if so, returning (512) information that can be used to locate an existing chunk.
6. The system of claim 5, further comprising: means for loading at least a portion of another subspace index into the main memory cache, or partitioning the hash index service into a plurality of subspace indexes based on one or more criteria, or partitioning data to be deduplicated into a plurality of data sets based on one or more criteria of the data, and means for partitioning the hash index service into a plurality of subspace indexes corresponding to the plurality of data sets, or means for any combination of: loading at least a portion of another subspace index into the main memory cache, or partitioning the hash index service into a plurality of subspace indexes based on one or more criteria, or partitioning data to be deduplicated into a plurality of data sets based on one or more criteria of the data, and partitioning the hash index service into a plurality of subspace indexes corresponding to the plurality of data sets.
7. The system of claim 5, wherein means for returning information that can be used to locate an existing chunk comprises means for returning a chunk identifier, or returning the hash value and an identifier of the subspace, wherein the chunk identifier comprises a reference to a chunk associated with the index entry.
8. The system of claim 5, further comprising: means for selecting the other subspace to reconcile based on a similarity between at least one signature representative of the subspace and at least one signature representative of the other subspace.
9. A method of reconciling a subspace and one or more other subspaces, wherein a subspace index associated with a subspace includes less than all index entries of a hash index service, comprising:
(a) selecting (604) a hash value in the subspace;
(b) determining (608) whether the hash value matches a hash value in the one or more other subspaces, and if not, proceeding to step (d), wherein the other subspaces are selected based on a similarity measure for determining a similarity between the subspace and the other subspaces;
(c) eliminating (610) duplicate index entries in the subspace, wherein the duplicate index entries are index entries corresponding to hash values in step (b) that match hash values in the one or more other subspaces; and
(d) returning (614) to step (a) to process a plurality of different hash values.
10. The method of claim 9, further comprising: selecting the subspace and the one or more other subspaces based on: a similar subspace type, a similar subspace signature, or the subspace is an unreconciled subspace and the one or more other subspaces comprise a plurality of subspaces that have been previously reconciled with one another, or any combination of: similar subspace types, similar subspace signatures, or the subspace is an unreconciled subspace and the one or more other subspaces comprise a plurality of subspaces that have been previously reconciled with one another.
11. The method of claim 9, wherein eliminating duplicate index entries in the subspace comprises:
not copying the index entry to a new subspace, or
compacting the subspace by marking index entries containing the hash value as eliminated, and eliminating the index entries by copying only unmarked index entries to a new subspace after reconciliation, or overwriting index entries marked as eliminated with unmarked index entries.
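By way of a non-limiting illustration of the deduplication flow recited in claim 1, the following Python sketch holds a single subspace index in a main memory cache, chunks incoming data, and for each chunk either stores a new chunk or returns a reference to an existing one. It is a minimal sketch under stated assumptions rather than the claimed implementation: fixed-size chunking and SHA-256 hashing are arbitrary choices, and the names SubspaceIndex, ChunkStore, and deduplicate are invented for this example; the claims prescribe no particular chunking method, hash function, or data structure.

import hashlib

CHUNK_SIZE = 64 * 1024                       # illustrative fixed-size chunking

class SubspaceIndex:
    """Main-memory cache of one subspace: chunk hash -> chunk reference."""
    def __init__(self, entries=None):
        self.entries = dict(entries or {})   # in practice loaded from secondary media
    def lookup(self, chunk_hash):
        return self.entries.get(chunk_hash)  # existing chunk reference, or None
    def add(self, chunk_hash, chunk_ref):
        self.entries[chunk_hash] = chunk_ref

class ChunkStore:
    """Trivial chunk store; a real store would persist chunks on disk."""
    def __init__(self):
        self.chunks = []
    def store(self, chunk):
        self.chunks.append(chunk)
        return len(self.chunks) - 1          # chunk identifier usable as a reference

def deduplicate(data, index, store):
    """Return chunk references that reconstruct `data`, storing only new chunks."""
    refs = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunk_hash = hashlib.sha256(chunk).digest()
        ref = index.lookup(chunk_hash)
        if ref is None:                      # miss: store the chunk and index it
            ref = store.store(chunk)
            index.add(chunk_hash, ref)
        refs.append(ref)                     # hit or miss: the reference locates the chunk
    return refs

Calling deduplicate a second time on the same data stores no new chunks; every lookup hits the cached subspace index and only references to existing chunks are returned.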
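Claims 2 and 6 further recite partitioning the hash index service into a plurality of subspace indexes based on one or more criteria of the data. The fragment below is an assumed illustration only: it partitions by file extension, which is merely one possible criterion (file type, file path, age, or other properties could equally serve), and the dictionary-of-dictionaries layout and the names partition_key and add_entry are invented for the example.

def partition_key(file_path):
    """Illustrative criterion: partition index entries by file extension."""
    return file_path.rsplit('.', 1)[-1].lower() if '.' in file_path else 'other'

def add_entry(hash_index_service, file_path, chunk_hash, chunk_ref):
    """Route an index entry to the subspace index covering its data set."""
    subspace = hash_index_service.setdefault(partition_key(file_path), {})
    subspace[chunk_hash] = chunk_ref

Here hash_index_service is simply a dictionary of subspaces keyed by the partitioning criterion, so that only the subspaces relevant to the data being deduplicated need be loaded into the main memory cache.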
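The reconciliation loop of claims 9 through 11, together with the signature-based selection of subspaces in claims 4 and 10, can be sketched as follows. This is a hedged approximation rather than the claimed method: a subspace is modeled as a plain dictionary mapping chunk hashes to chunk references, the signature is assumed to be the k smallest hashes in a subspace, and similarity is a Jaccard ratio; the claims leave the signature construction and the similarity metric open.

def signature(subspace, k=16):
    """Illustrative subspace signature: the k smallest chunk hashes it contains."""
    return frozenset(sorted(subspace)[:k])

def similarity(sig_a, sig_b):
    """Jaccard-style similarity between two signatures (0.0 to 1.0)."""
    if not sig_a or not sig_b:
        return 0.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)

def reconcile(subspace, other_subspaces, threshold=0.1):
    """Drop entries of `subspace` that duplicate entries in similar subspaces."""
    sig = signature(subspace)
    candidates = [s for s in other_subspaces          # claims 4 and 10: pick similar subspaces
                  if similarity(sig, signature(s)) >= threshold]
    eliminated = set()
    for chunk_hash in subspace:                       # steps (a) and (d): visit each hash value
        if any(chunk_hash in s for s in candidates):  # step (b): does it match elsewhere?
            eliminated.add(chunk_hash)                # step (c): mark the duplicate entry
    # claim 11: compact by copying only unmarked index entries to a new subspace
    return {h: ref for h, ref in subspace.items() if h not in eliminated}

The returned dictionary is the compacted subspace of claim 11: entries whose hashes also appear in a similar, previously reconciled subspace are simply not copied, so the corresponding chunks remain reachable through the other subspace's index.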
HK13100419.8A 2010-12-28 2013-01-10 Using index partitioning and reconciliation for data deduplication HK1173245B (en)

Applications Claiming Priority (1)

Application Number  Priority Date
US12/979,748  2010-12-28

Publications (2)

Publication Number  Publication Date
HK1173245A (en)  2013-05-10
HK1173245B (en)  2019-02-01
