WO2010050943A1

WO2010050943A1 - Creating snapshots of a file system

Info

Publication number: WO2010050943A1
Application number: PCT/US2008/081661
Authority: WO
Inventors: Michael Callahan; Terence Rokop; Samuel Revitch
Original assignee: Hewlett Packard Development Co LP
Current assignee: Hewlett Packard Development Co LP
Priority date: 2008-10-30
Filing date: 2008-10-30
Publication date: 2010-05-06
Anticipated expiration: 2011-04-30

Abstract

One embodiment is a method that captures multiple snapshots of a file system represented as a tree (300), stores states of the snapshots in a single history tree 9340), updates the history tree to include changes between a current snapshot and a previous snapshot (360), and uses one of the snapshots to backup or recover data in the file system (370).

Description

CREATING SNAPSHOTS OF A FILE SYSTEM

BACKGROUND

File systems organize and track where data is stored in memory and where free or available space exists in memory. Distributed or clustered file systems can store thousands or even millions of files. These files and the corresponding metadata are distributed across a large number of storage devices, such as disk drives.

In order to safeguard data, some fϋe systems track old versions of files with a snapshot, which is a copy of a set of files and directories as they existed at a particular point in time. Snapshots are used for backup and recovery of data. For example, snapshots can be used to recover lost data or identify a state of the file system at a previous time.

Snapshots pose a variety of challenges. For example, a large amount of space can be needed to store multiple snapshots of a file system. Further, maintaining muitipie snapshots of the file system can affect system performance or jeopardize consistency of data.

Methods and systems to improve creating snapshots of file systems are needed.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 shows a duster file system in accordance with an exemplary embodiment of the present invention.

Figure 2A shows a hierarchical directory for a file system in accordance with an exemplary embodiment of the present invention.

Figure 2B shows a history tree and a plurality of branch trees in accordance with an exemplary embodiment of the present invention.

Figure 3 shows a flow diagram for creating snapshots in a file system in accordance with an exemplary embodiment of the present invention.

Figure 4 shows an exemplary computer system for implementing one or more of the methods in accordance with an exemplary embodiment of the present invention.

DETAILED DESCRIPTION

Exemplary embodiments relate to systems and methods for creating snapshots of a file system.

Methods and systems implement a disk file system that includes built-in snapshot support mechanism. The file system is based on a tree structure with the ability to record snapshots of the complete state of the tree at a given point in time. SVfuittpie snapshots can be taken of the tree, which permit the state of the tree at multiple points in time to be preserved.

With exemplary embodiments, snapshots incur only incremental costs in space and performance and provide an efficient representation of the current state of the tree. Furthermore, the snapshot can be the basis for new branches of modifications.

In one exemplary embodiment, snapshots are based on a novel indexing structure referred to herein as a Sparse Logical History Tree. This structure uses a set of B-Pi-trees and a current version counter to store the state of a single attribute index with one or more branching versions and the states at all branch points and intervening snapshot points. Here, a branch tip is a version of the index that can be modified and snapshotted (i.e., copied).

Each branch tip derives from a branch point and can share contents with other branches derived from the same branch point. Data specific to each branch is stored in that branch's "branch tree" which is a private B~PS-tree index structure.

Aside from the branch trees, the indexing structure features a single "history tree," which stores the states of read-only snapshot points. The history tree includes entry data and key space delegation data required to recover a full state of the fiiβ system. The indexing structure is space efficient since, between al! trees, any entry that remains the same between muitipie snapshots and/or branches is represented exactly once.

in one exemplary embodiment, the indexing structure can be used with a B-tree like index structure for its branch trees or history tree. Exemplary embodiments can also be used with B-Pi-Trees.

With exempiary embodiments, methods and systems build a tree structure that can be snapshotted and used in clustered or disk file systems. Furthermore. exemplary embodiments enable methods and systems to perform one or more of the following:

1. Represent multiple snapshots of a tree in a space-efficient way.

2. Provide snapshots consuming storage space that is proportional to the amount of data that has been later overwritten, not proportional to the totai amount of data in the tree.

3. Store parts of the data structure representing snapshots in the same physical blocks on disk as the parts representing a current state of the file system so it is possible for a single on-disk block to represent a full tree and multiple snapshots of that tree. In other words, the representation places multiple snapshots of a tree, aiong with the current state of the tree, in a single block,

4. Represent multiple snapshots of a tree in a way which is consistent with high performance for accessing and updating values stored in the current version of the tree.

5. Allow an update of a vaiue in the tree to overwrite the old value in- piace to preserve favorable spatial properties of the layout of data corresponding to the current state of the tree {i.e., avoiding fragmentation), all while still preserving the old value for reference by snapshots. 6. Represent multiple snapshots of a tree in a way that is consistent with high performance for accessing values stored in snapshots of the tree.

7. Represent multiple snapshots of a tree in a way that allows the physical locations of blocks comprising the tree to be relocated for better efficiency with little effort {i.e., supporting easy def rag mentation).

8. Represent multiple snapshots of a tree without creating multiple physical pointers to individual blocks making up the tree.

9. Represent snapshots that can themselves be used as the basis for new branches (i.e., logical versions of the tree that themselves can be updated and snapshotted).

10. Support snapshots of just parts of the file system (as opposed to snapshots that necessarily affect the whole file system).

11.Support snapshots that can be taken by users of their own files, as opposed to only supporting snapshots that administrators make of the file system as a whole.

12. Support writable branches, including writable branches of just parts of a file system, including snapshots of individual files.

13. Support fast reversion of applications" view of file system state to a prior snapshot without losing the ability to go back to the later state (i.e., to undo the reversion).

14. Implement a file system that supports incremental, on-line checking of file system integrity.

15. Provide a file system that performs all of the above in a way consistent with high-performance in a clustered environment where multiple servers are updating the trees on-disk simultaneously.

In one exemplary embodiment, the file system uses a hierarchical tree to track and store data. Directories are represented as tree structures on storage devices (such as disk drives). Entries in the tree structure refer to separate blocks of memory that include the corresponding data for the file in the directory. One exemplary embodiment is a clustered file system in which many computers simultaneously access disk data. Figure 1 shows a distributed or duster fiie storage system 100 in accordance with an exemplary embodiment of the present invention. By way of example, the system is a duster storage network and/or a storage area network (SAN) that includes a plurality of client computers, nodes, or host computers 102A to 102N and one or more storage devices or arrays 103A to 103N that inciude one or more storage controiSers 104 (shown by way of example as an array controller), a plurality of storage devices 106 (shown by way of exampie as disk array 1 to disk array N), and a fiie system manager (FSM) 108 in communication with the storage controllers and devices. The filesystem manager 108 (such as a server or storage device) stores and organizes computer files so the files and corresponding data can be managed and discovered for the hosts 102A to 102N.

The host computers are grouped to form one or more clusters (shown as cluster 114A to 114N). For exampie, hosts 102A are grouped to form a one cluster 114A which includes a plurality of host computers (shown as host 1 to host N). Hosts 102N are grouped to form another cluster 114N.

The dusters 114A to 114N and file system manager 108 are coupled to the array controllers 104 through one or more fabrics or networks 110, and the storage devices or arrays 103 are coupled to the storage devices 106 through one or more fabrics or networks 111. For instance, the hosts communicate with an array controller using a Small Computer System Interface (SCSI) or other interface/commands over a fiber channel (FC). By way of example, networks 110 and 111 include one or more of the Ethernet, fibre channel (FC), serial attached SCSI (SAS), iSCSi, internet, local area network (LAN), wide area network (WAN), public and/or private networks, etc. Communications links 112 are shown in the figure to represent communication paths or couplings between the hosts, controllers, and storage devices. In one exemplary embodiment, the storage devices (such as array controller 104 and disk arrays 106) are network attached devices providing random access memory (RAM) and/or disk space (for storage and as virtual RAM) and/or some other form of storage or storage device, such as magnetic memory (example, tapes), micromechanical systems (MEMS), or optical disks, to name a few examples. Typically, storage devices include larger amounts of RAM and/or disk space and one or more specialized devices, such as network disk drives or disk drive arrays, (example, redundant array of independent disks (RASD)), high speed tape, magnetic random access memory (IVI RAM) systems or other devices, and combinations thereof, in one exemplary embodiment, the storage devices include one or more servers

The storage controller 104 manages various data storage and retneval operations. Storage controller 104 receives I/O requests or commands from the host computers 102A to 102N. such as data read requests, data write requests, maintenance requests, etc. Storage controller 104 handles the storage and retrieval of data on the muitipie disk arrays 106 and disk groups, in one exemplary embodiment, storage controller 104 is a separate device or may be part of a computer system, such as a server. Additionally, the storage controller 104 may be located with, proximate, or a great geographical distance from the disk arrays 106 or from each other.

The array controller 104 includes numerous electronic devices, circuit boards, electronic components, etc. By way of example, the array controller 104 includes firmware 120, an input/output (I/O) scheduler 122, a queue 124, one or more interfaces 126, one or more processors 128 (shown by way of example as a CPU, central processing unit), and memory 130 (including read and write cache). CPU 128 performs operations and tasks necessary to manage the various data storage and data retrieval requests received from host computers 102A to 102N. For instance, processor 128 is coupled to a host interface 126A that provides bidirectional data communications to one or more host computers 102A to 102N. Processor 128 is also coupled to an array interface 126B that provides bidirectional data communications to the disk arrays 106.

SVfemory 130 is aiso coupled to processor 128 and stores various information used by processor when carrying out its tasks. By way of example, memory 130 includes one or more of volatile memory, non-volatile memory, or a combination of volatile and non-voiatiie memory. The memory 130, for example, stores applications, data, control programs, algorithms (including software to implement or assist in implementing embodiments in accordance with the present invention), and other data associated with the storage device (example, state data such as mapping metadata, configuration metadata, and cached user data). The processor 128 communicates with memory 130, interfaces 126, and the other components via one or more buses 132.

Figure 2A shows a partial directory or hierarchical tree structure 200 for a file system in accordance with an exemplary embodiment of the present invention. The directory includes a root node 210 with branches leading to a plurality of directories 220A, 220B, to 220N {shown as directory A to directory N). Directory 220A includes plural subdirectories 230A to 230N (shown as subdirectory 1 to subdirectory N). Each subdirectory further includes one or more files. For example, directory 220B has file 240A, and subdirectory 230A has files 240B to 240N (shown as file 2 to file N).

In one exemplary embodiment, each file contains a reference to a separate data structure that stores data or metadata about the file. For example, each file contains a storage location (for example, a block location for data or an inode number that refers to a block storing the i nodes).

Figure 2B shows a history tree (HT) 260 and a plurality of branch trees (BT) 250A to 250 N in accordance with an exemplary embodiment of the present invention. The history tree stores states of read-only snapshot points, and tha branch trees store data specific for a branch or contents shared by several branches.

According to block 300, multiple snapshots are taken of a distributed or clustered ftie system over a period of time.

According to block 310, the snapshots are implemented in an indexing structure that uses B-trees and a current version counter to store a state of a single attribute index with one or more branching versions, states at al! branch points, and intervening snapshot points.

According to block 320, a branch tip is defined to be a version of the indexing structure that can be modified and snapshotted. Branch tips originate from branch points and can share contents with other branches originating from the same branch point.

According to block 330, data specific to each branch is stored in a branch tree.

According to block 340, the states of read-only snapshots are stored in a history tree. The history tree also stores entry data and key space delegation data used to recover a full state of the file system.

According to block 350, a current or most recent snapshot is compared with a previous or iast taken snapshot.

According to block 360, the history tree is updated to reflect changes between the current snapshot and the previous snapshot. Entries that remain the same between the current snapshot and the previous snapshot are represented only once (i.e., not stored multiple times).

According to block 370, one or more of the snapshots are used to backup and/or recover data in the file system.

Figure 4 shows an exemplary computer system 410 for implementing one or more of the methods in accordance with an exemplary embodiment of the present invention.

The computer system 410 includes a computer 420 coupled to storage devices 430. The computer 420 comprises a processing unit 450 (such as one or more processors of centra! processing units, CPUs) for controlling the overall operation of memory 460 (such as random access memory (RAM) for temporary data storage and read only memory (ROM) for permanent data storage) and one or more algorithms or programs 480 (such as a file system manager and/or a fiie snapshot aigorithm for taking snapshots of the file system as discussed in Fig. 3). The memory 460 stores data, control programs, and other data associate with the computer 420. Although shown separately, the memory 480 can store the programs 480. The processing unit 450 communicates with memory 460, storage devices 430, programs 480, and many other components via buses 490.

Embodiments in accordance with the present invention are not limited to any particular type or number of storage devices and/or computer. The computer system, for example, includes various portable and non-portable computers and/or electronic devices. Exemplary computer include, but are not limited to, servers, main frame computers, distributed computing devices, laptops, and other eiectronic devices and systems whether such devices and systems are portable or non-portable.

H) In order to facilitate an explanation of the data structures of the file system in accordance with exemplary embodiments, the description is divided into various headings.

Dei in itioπs for Search and U pdate Operations

For search operations, the following words are defined as follows;

Lookup-Exact: Find an entry with a key equa! to a search key.

Lαokup-LE: Find an entry with the key equal to, or the numerically closest key not greater than, a search key.

Lookup-GE' Find an entry with the key equai to, or the numerically closest key not less than, a search key,

Next-Entry: Retrieve the entry immediately foilowing a previously found entry.

Prev-Entry: Retrieve the entry immediately preceding a previously found entry.

For update operations, the following words are defined as follows:

Insert: insert an entry with a given key and given data.

Delete: Remove an entry.

Functionality

The sparse history tree in accordance with an exemplary embodiment supports a variety operations, including one or more of the following;

1 , Ail lookup and update operations on the tip of a specific branch, with performance equivalent to a non-versioned B-tree. 2. All lookup operations on a specific branch Bi any prior point in time, with reasonable and consistent performance.

3. Efficient relocation and defragrnentation of associated metadata and indirect blocks. This generally implies that each block has a small number (such as one) of direct references from all other blocks associated with the sparse history free.

4. Efficient copy-on-write of snapshot-dependent indirect data. When replacing the contents of a block that is referred to by an entry in a sparse history free, and the contents are preserved for snapshot purposes. The previous contents of the block can be saved at a new location rather than writing the new contents to a new location if policies or measurements suggest that this wouid lead to the best overall performance. The algorithm gives the option of relocating the old contents instead of the new contents. Also, relocating the new data results in snapshot-induced fragmentation is considered a problem with Write Anywhere File Layout

(VVAFL) like schemes. This requirement implies that it is possible to replace and reorganize history data associated with a snapshot to avoid fragmenting a file as it appears in a branch tip.

5. The ability to reorganize data that is shared between multiple branch tips to optimize for access from a specific branch tip.

6. Condensed, space-efficient representation of snapshot data. Metadata associated with multiple snapshots is stored in the same filesystem block.

7. Efficient creation of filesystem-wide snapshots and branches. These operations do not necessarily modify any sparse history tree. 8. Support for millions of snapshots per fiiesysfem.

9. Explicit removal of data that is not relevant to any live snapshot or branch.

10. Support for non-versioned items present in a branch tip. One embodiment consistently determines which items are non-versioned, and such items will not have prior versions preserved as a matter of policy.

Revision Structure The revision structure is a data structure externa! to the sparse history tree, but which any number of sparse history trees associate with and use to make organizational decisions. In one embodiment, each filesystem has one revision structure.

In one exemplary embodiment, the revision structure is a directed graph. The graph contains two types of nodes: branch tips and snapshots, each of which is uniquely identified by a snapshot ID (SSiD). Each node can have an outbound edge, pointing to the node it is derived from, or no outbound edge if all of its ancestors were deleted or it is the first snapshot/branch in the revision structure. Snapshots can have any number of inbound edges. Branch tips have no inbound edges. Each node a! so has associated with it a branch /D (BRiD) which is common among nodes in a linear progression of snapshots and simplifies the process of seeking between nodes based on derivation relationships,

The initial state of a revision structure is a single branch tip node. It supports four update operations:

1. Create-snapshot (SSiD) - creates a snapshot from a branch tip. The snapshot receives the branch tip's SSiD, and the branch tip is allocated a new SSiD. The new snapshot is marked as deriving from the same snapshot that the branch tip derived from, or none if the branch tip did not and the branch tip is marked as deriving from the new snapshot.

2. Create-branch (SSiD) - creates a new branch tip that is marked as deriving from the target snapshot. A new SSID is allocated for the branch tip.

3. Delete-snapshot (SSiD) - marks a snapshot as no longer being pertinent. The target snapshot may have only one snapshot or branch tip derived from it.

4. Delete-branch (SSID) - removes a branch tip. The revision structure exists independently of its associated sparse history trees. The revision structure may be updated without any specific knowledge of the set of associated sparse history trees. Sparse history frees associated with a revision structure are responsible for "following" changes to the revision structure and maintaining their own consistency with it. Each node in the revision structure has a branch ID value associated with if, which is assigned when the node is created. For snapshot creation, the branch SD is set to the branch !D of the branch tip that it was created from. For branch creation, the branch !D of the new branch tip is set to the snapshot ID that was allocated for it.

The sparse history tree specifically maintains sets of entries. Each node in the revision structure represents one state of that entry set. The create-snapshot operation, in concept, preserves the entry set state of a branch tip in a read-only fashion. The create-branch operation, in concept, takes the entry set state preserved in a snapshot and propagates it into a new branch tip where it can be modified.

Sparse history trees are consistent with their associated revision structure, but a great amount of flexibility is used to support the performance bounds. An arbitrary number of sparse history trees may associate with a single revision structure, and updates to the revision structure are possible without updating or even acknowledging the existence of all of the associated sparse history trees. The sparse history trees have the property that if the state of a sparse history tree is consistent with its associated revision structure, it remains consistent after an arbitrary set of updates are applied to the revision structure, without any corresponding updates to the sparse history tree. Practically, branch and snapshot deletions use background processing to remove contents that are no longer pertinent, but all states of partial removal are consistent.

Schema

The following terms and symbols are relevant to the schema: SSSD: Snapshot ID, a unique identifier given to each snapshot and branch tip node in the revision structure

BRID; Branch SD, an identifier given to all snapshots deriving from the same branch point.

K: item key, a byte string of fixed or iimited iength depending on the index and the item type.

D: Item data, a byte string of limited length.

R: End key of an entry that includes a key range, of the same form as K, but stored with entry data,

B: A pointer to a block containing a subordinate disk parcel, for the data portion of internal node entries.

L: A numeric tree level, where 0 is the leaf, 1 is the roof of a two-level tree, etc., used as a portion of the key for history tree entries

P; A snapshot SD, used as part of the data portion of an entry, typicaily used as a backpointing reference.

Q: A branch !D, used as part of the data portion of an entry. These are used to refer to branch tips in some entry types, or to supplement a snapshot ID in backpointer entry types. In backpointer entries the associated branch SD can be determined from the snapshot ID, but including it in the data portion can save having to Sook it up in the revision structure,

V: A snapshot SD, used as part of the key of an entry. 1: A branch ID, used as part of the key of an entry.

#: The canonical prehistoric snapshot and branch SD value, used as key, data, and search target values.

The sparse tree makes a distinction between entries and items. Fiiesystem items are expressed as mapping keys K to data D. Sparse history tree entries, on the other hand, are the subject of this schema.

The structure as a whole will exist as a single tree, with instances of two other types of trees rooted at entries in its leaves. Collectively, there are three types of indices: top-levei index, branch tip tree, and history tree. These indices are discussed below.

First, the top-level index includes an entry for each branch tip and an entry for the history component. This is structured as a Lomet B-Pi-Tree mapping:

1. I -> [branch tip tree root]

The root component of each branch tip tree belonging to the sparse history tree is stored as an entry here, keyed by the branch ID of each. The branch tip tree is described in detail below.

2. # -> [history tree root]

All top-ieve! index trees have a single entry with the prehistoric branch SD as the key, and the root component of the history tree as the data. The history tree is described in detail below.

The top-level index combines the muitipie indices required to support verstoning into a single rooted structure, possibly a single disk block. One exemplary embodiment stores additional metadata in the top-levei index, e.g. statistics for an tnode as a whole, covering all branches and snapshots it is part of. If so, the key space may require minor changes. Second, a branch tip tree for each modifiable branch tip that has any applicable state in the sparse history tree. The branch tip tree is structured as an IML Lomet B-Pi-trβe mapping K -> {[P, Q, D] | [P, Q, B] j [P, Q, R] } with node ievel version annotations (discussed beiow). The tree's physical structure is marked up in a way to support incremental backup of state pertinent to snapshots. The entry types will have meanings as foliows;

1 . K ~> [P, Q₁ D]: Standard key/value entry for an item with key "K" and data "D" that is active at the branch tip. "P" describes the snapshot version in which the entry was created, and may be equal to the branch tip version.

2. K -> [P, 8]: Block pointer for an internal node. In this case, "8" refers to a block containing the next level down within the tree, has left-delimiting key "K", and was last updated for snapshot version "P". The annotation of "B" should be "P", as described below. Referenced disk parcels represent the next lower level in the tree versus the containing parcel, and this type of entry may not be present at the leaf level for obvious reasons.

3. K -> [P, Q, R]: Delegation entry for key "K". In this case, the entry upholds the divide-and conquer search property. "P¹TQ" refers to an ancestor snapshot with a more specific entry for "K", and "R" refers to the key delimiting the end of the range. In most every case, these represent key spaces and entries that are inherited from an ancestor branch. Delegation entries may appear at any level of a tree, although they are only introduced through the process of populating a new branch tip.

The branch tip tree physical data structure is version-annotated. Each disk parcel of a branch tip tree contains an annotation field describing the last version in which its authoritative key space has been enumerated. Additionally, entries of higher level internal disk parcels incSude the annotation fields of the parcels they refer to. Right sibling pointers in branch tip tree nodes are also specialized. These pointers need not contain a copy of the right neighbor's annotation version, but still contain the right neighbor's left delimiting key. For cases where a key range within a parent B-tree node is delegated to a snapshot, right neighbor pointers that would need to point into that node are marked as null, and may potentially include a snapshot reference. When a branch tip tree is populated through the Transcribe operation, which replaces internal delegations with new private child nodes, left-neighbor children may need to have right sibiing pointers adjusted to point to them. The Transcribe operation is described below.

Delegation entries may be merged together for space efficiency, provided that they are leaf-level, their target versions are identical, and their key ranges are contiguous.

Third, the history tree records all prior states, including al! data pertinent to one or more snapshots but no longer pertinent to any branch tip.

The history tree is structured as an !1VlL Lomef B-Pi-tree. Unlike a branch tip tree, the interna! structure of a history tree is irrelevant, and its abiSity to store and locate keyed entries is the only required record-keeping task of the physical structure. Entries in the history tree follow the mapping [!, V, L, K] -> ([D, Pj j [P, Q, R] \ [Q, R] ! [R] S [P= Q] }■ The key structure [I, V, L, K] can be complex, and its uses may not be immediately apparent, it supports grouping entries in two ways: to facilitate searching for the nearest relevant snapshot on a particular branch, and to search for content records vs. structure (key-delimiting) records, which is further described below.

As suggested by the mapping, there are five types of entries;

1 , [I₁ V, 0, K] -> [D, P]: Content record describing the initial contents of an entry, for an entry that has been modified or deleted in a later version. Here "I" will be the grouping branch ID, "V" will be the first version the entry came into existence, "K" will be the entry's key, and "D" wili be the original contents of the entry. All client-visible entries wili be assigned level 0 regardless of layout decisions made in the branch tip tree. 2. [I, V, 0, K] -> [P, Q, R]; Content record delegating to an ancestor snapshot.

Here "I" will be the grouping branch ID, "V" the version in which this key space deiegation exists, "K" will be the left delimiting key of the delegation, "R" the right delimiting key, and "P" and "Q" wili be a reference to a previous version that should be examined for more detail of this vβrsion- key range. These key ranges remain unmodified between versions P and

V, and exact information about the values can be found at version P or earlier (and possibly the current version). The level L in the key is set to 0 for all first-level delegations.

3. [I, V, 0, KJ -> [Q, R]: Content record delegating to a branch tip. Here "I" will be the grouping branch ID, "V" version in which the key space or entry came into existence, "K" the left delimiting key, "R" the right delimiting key, and "Q" the branch tip to which the key range is deiegated. Effectively, these are initial version entries for entries that remain in the same state in the current version. The branch tip "1" is derived from snapshot "V" in order for this entry to make sense, but need not be the same branch as "I".

4. [I, V, L, K] -> [RJ; Structure record describing a delimiting key in an internal node of the original branch tip tree from which this snapshot is derived. Key delimiting records are only inserted as part of replacing a "frontier" delegation record (one with L = 0) with delegation records for the individual entries in the disk parcel of the branch tip free. The old frontier record is removed and reinserted with its actual tree ievel as L. This is described in more detail below in the operations section.

5. [I, #] -> [P, Q]: Branch point delegation. Here, "I" will be the grouping branch ID, "#" is the prehistoric version - the smallest value in the namespace of snapshot SDs so that this type of entry will appear before ali others with the same value for "I", "Q" will refer to an ancestor branch iD, and "P" an ancestor version. [P₁ Q] need not refer exactly to the branch point for branch ¹T as described in the revision structure, but should instead refer to the most recent ancestor of the branch point for branch T that has any record in that specific history tree.

As mentioned above, the physical layout of the history tree is irrelevant to the sparse history tree algorithms and does not require node-level version annotations as the branch tip tree does.

There are two subtle, subjective details to the choice of key grouping [!, V, L, K]. First, the [branch SD, snapshot SD] pair is the most significant key component, in that order, to support grouping and fast state retrieval of snapshots on a particular branch. Suppose we have two branch tips A and B. We make changes on A, take a snapshot of A, alternate modifying branch B and taking snapshots of A and B eight times, and finally modify branch A. The snapshot !Ds allocated to each snapshot will be interleaved, and the snapshots on branch A might be numbered [1. 2, 4, 6, 8] and on branch B [3, 5, 7, 9]. Because of the grouping, entries will be sorted in the history free as ([A, 1], [A. 2], [A, 4], [A, 6], [A, 8], [B, 3], [B, 5], [B. 7]). Because there should be no entries for snapshots on branch A other than SS!D 1 , if we were to search for state related to SSID 6 in Lookup-LE mode, we would find the last entry for SSID 1 and could conclude that SSlD 1 is the correct place to find state related to SSIO 6. Specifically, we would not have to skip over state related to SSSDs 3 and 5 that are unrelated to branch A,

The second key grouping detail has to do with the level, lnciuding it in the key, and placing it before the item key isolates content records from structure records. Structure records are only used when creating new internal nodes for branch tip trees in the Transcribe operation and would otherwise interfere with searches for state information Delegation entries in a history tree can be merged together for space efficiency, provided that the "V" versions are the same, the key ranges are contiguous, and the delegation target versions are the same. In one exemplary embodiment structure entries are not merged.

While the history tree is intended to be a repository for data that is no longer pertinent to any branch tip, it is allowed, and in some cases preferred, to contain data that is pertinent to branch tips. One goal of the sparse logical history tree is to trade near-optimal search performance in branch tips for lesser search performance within snapshots. When data pertinent to one branch tip is stored in the history tree, as it will be for a newly created branch, this goal is undermined to some extent.

Operations The following terms and symbols are used for a description of operations and supplement the terms in the schema section.

C: The total number of entry updates that have been performed on ai! branch tip trees of the index.

Nl: The total number of entries in the branch tip tree of branch S.

E: The maximum number of entries that can fit in a disk parcel of a branch tip tree.

Ml: The nesting level of branch I, i.e. the number of parent branch points between the branch tip of branch I and the root branch.

iatest(i): The snapshot ID assigned to the branch tip of branch !. This value will become the snapshot ID of the next snapshot taken of that branch tip. The revision structure is the primary location of this information. braπch(V): The branch ID associated with snapshot V. The revision structure is the primary location of this information.

branchpoint I); The snapshot SD of the snapshot from which branch I was initially derived. For unrooted branches, this is equal to #. The revision structure is the primary iocation of this information.

For the purpose of describing the sparse history tree operations, we assume that entries in the history tree are merged to the greatest extent possible, and that this merging occurs transparently as part of insert/delete operations on the history tree,

Beiow are high level descriptions of basic operations that can be performed on the sparse logical history tree, along with exemplary time bounds.

Lookup

Lookup searches the index in mode "Type" for entry data given a key [V, K], where "V" is a specific version of interest and "K" is the entry key to search for, where the nearest key less than or equal to "K" will be retrieved. Type refers to the search mode, and can be "LE" to search for the entry with the greatest key less than or equal, "GE" to search for the entry with the least key greater than or equal, or "Exact" to search for the entry with an exact key match only. The version "V¹¹ may be some specific version prior to the current version, or the latest on the branch, latest(i). The result is the entry that was valid at time "V" with the key equal to or the nearest less than "K", or NULL if there were no entries with key less than or equal to "K" at version "V".

Lookup depends on a utility operation called find-nonsparse-snapshoi which probes the history tree for the nearest ancestor snapshot of an arbitrary target snapshot. For example, if a state is set up on a branch tip, five snapshots are

">o taken, the state is modified, a sixth snapshot is taken, and we wish to enumerate the state as of the fourth snapshot, this routine wil! recommend searching the state associated with the first snapshot, as it will be the nearest ancestor of the fourth snapshot with recorded state. The same logic extends to nested branches.

find-nonsparse-snapshot(l, V, L, K, Type) { rep robe_b ranch: ent - Lookup-LE for [I, V₁ L₁ K] in the history tree if ent is not found, return nil

if {ent J != l) {

// Follow to the base snapshot for i

V = branchpoint I) if V " #, return ni! i - branch(V) goto reprobe__foranch

}

if (ent.V — #) { // Follow the branch base record, it shortcuts to the nearest branch/snapshot I = ent.Q

V = ent P goto reprobe_branch }

if (Type is GE and (ent.V \- V or ent.L != L or ent.K \- K)) { VX = entV ent = Nexf-Entry(ent) if enf was not found or ent,! N i, return CHECKjπP(l)

// Did we walk off the end of a target snapshot? if V — VX and eni.V \- VX, return nil }

// At this point the branch has been resolved, resolve the snapshot if (ent V N V) { if (Type is not GE) { xent - Next-Entry(ent) if xent was not found, return CHECK_T!P(!) if xent.! -- ! and xent.V -- V, return nil }

V = ent.V if Type is Exact ent ~ Lookup-LE for [S, V₁ L₁ K] in the history tree eise ent = l_0Qkup-{Type} for [\, V, L, K] in the history tree if ent was not found or ent.! S= I or ent.V != V, return ni!

} return ent

}

SHT-LOOkUp(V₁ K, Type) f = branch(V) if (V == latβst(l)) { ent = Lookup-{Type} for K in the branch tip tree of branch ! if ent was not found or ent is not a deiegation entry, return ent

V = ent. P i = ent.Q assert that V < latest(f)

}

if (V < latest(l)) { eni - find~nonsparse-snapshot(i, V. 0. K, Type) if {ent is CHECK-TIP(X)) { I = X goto check-tip }

re-evaS: if ent was not found or ent.V !- V, return nil if {ent is a delegation to a snapshot) { if K is not inside [ent. K, ent.R), return ni!

I = ent.Q V = entP

if Type is Exact ent - Lookup~LE for [!, V, 0, K] in the history tree else ent = Lookup-{Type} for [!, V, 0, K] in the history tree goto re-evai

}

if (ent is a delegation to a branch tip) {

// Confine the branch tip search to the delegation range of enf

1 = ent. I xeπt = Lookup-(Type) for K in the branch tip tree of branch ! if xent is not in the range [eπt.K, ent.R), return nil return xent

}

if {ent is an initial version entry) { if Type is Exact and eπt.K != K, return nil return en }

return nil

}

check-tip:

Lookup-{Type} for K in the branch tip tree of branch I, return the result

The claimed common-case time bound of find-nonsparse-snapshot is O({1 + MS) log(C)), optimisticafly assuming 0(1) time bounds for the branchpoinf(l) and branch(i) primitives. For a degenerate tree, the time bound is much worse at 0(C + CMS). A degenerate is tree caused by the underlying physical B-pi-free structure; if the sparse history tree algorithm used standard B~link trees, then it would not have "degenerate tree" worst cases (but it would also not allow as much concurrency.)

The claimed time bound of SHT-Lookup is O((1 + M!) log(C) + log2{C)) for the common genera! case, 0{C2 + CMI) for the degenerate general case, and Q(JOg(NI)) for the common case of V -Jatest(l)-

Archive

Archive is an operation used as part of insert and delete, ft creates delegation entries within the history tree for a specific node in the branch tip tree. In the process, the node's annotation version ("Node. P") is set to the current snapshot ID assigned to the branch tip. A node whose annotation version is equal to the branch tip SSID is said to be archived. Ail nodes having authority over a particular key range, including the target leaf node and ail of its parents, are archived before any existing entries within that key range may be modified or deleted, or any new entries with keys inside that key range may be inserted. Archive(l, Node) if Node. P " latest(l), return

if (Node is not the root node of its branch tip tree) { ent = Lookup-LE for [!, Node, P, 0, Node.lkeyJ in the history tree assert that ent is a branch tip deSegafion record assert that ent covers range [Node.lkey, Node.rkey) delete/split ent such to remove its coverage of [Node. Skey, Node.rkey) insert a structure entry into the history tree for

[I, Node. P, Node. Level, Node.lkey] -> [Node.rkey] }

// Populate the history tree with a branch tip delegation for each entry for each ent in Node.entries {

// The right delimiting key of the delegation record is either the key of the

// leaf-level entry if (ent is a key/value record) rk = ent.K eise if (ent is not the last entry in Node) rk = next(ent).K eise rk - Node.rkey

if (ent. P -- node. P)

/7 Branch tip delegation insert [\, Node. P, 0, ent.K] -> [I, rk] into the history tree else

// Prior snapshot delegation insert [S₁ Node. P, 0, ent.K] -> [ent.P, ent.Q, rk] into the history tree

}

Node.P ^ latest(i) Update the version annotation for Node in its parent's index term, if the term is posted.

The common-case performance if Archive is claimed to be 0(E log(C)). The degenerate performance is claimed to be O(EC).

Transcribe

Transcribe is a primitive operation used to expand internal node delegations in branch tip trees into new disk nodes that are physically part of the branch tip tree and thus modifiable. The new nodes are created from structure entries stored in the history tree or, in some cases, nodes belonging to another branch tip tree. When a branch is created, the branch tip tree's initial contents are a single delegation to the target snapshot, which are transcribed into a root node. When an entry reached through an internal node delegation is updated on a branch tip tree, nodes along the path from the root to the entry are transcribed, to create a local site of modification,

// Search a history tree for entries related to: // Snapshot version [1,V] // Key range [K₁R) // Height level (all entries with level > 0 are structural entries}

transcribe~search(l, V, levei, K, R) xent = Lookup-LE for [I, V, level, K] in the history tree if (xent was not found or is not inside the range of [K, R)) { xent = Lookup-LE for [!, V, 0, K] in the history tree assert that xent was found and is inside the range [K, R) } assert thai xent is not a branch base record return xent

// Build a new node for a branch tip with parameters; // Snapshot version [!,V] // Key range [K₇R) // Height level

// Returns a new node suitable to be inserted at height ieve!

transcribe-range(l, V, Sevel, K₁ R) hent = transcribe-search(L V, level, K, R) create a new branch tip tree node node node. P = iatest(l)

carry_on:

// We create direct backpointers unless hent is a base entry, which // either contains the original data or a delegation to the branch tip. if {hent is a delegation record to a prior snapshot) { bpv = hent. P bpi = hent.Q } eise { bpv = hentV bpi = hent! } if (hent is an initial content record, i.e. [\, V, 0, K] -> [D, P]) { rkey - hent. K } else { rkey = hent. R } if (rkey < R) { goto carry_on }

Insert K -> [bpv, bpi, rkey] into node hent - transcribe-search(!_s V, level, hent.K, R)

return node

Transcribe RoOt(I, V) ent = Lookup-LE for [I, V₁ MAXLEVEL, MAXKEY] in the history tree return transcribe-range{S, V, entL MINKEY, MAXKEY)

Transcribe Notfe{l, node, ent) xnode - transcπbe~raπge(l, ent.P, node. L - 1 , eπt.K_s ent. R) remove ent from node insert K -> [xnode. P, I, btock(xnode)] into node return xnode

Transcribe fills a node at an arbitrary level in the B-tree, the contents of which would otherwise need to be searched for in the history tree, and could not be modified without potentially disturbing state shared with other snapshots. Snapshots preserve al! structural details of branch tip trees at the time they are taken, including key space partitioning of internal nodes. When creating nodes for new branch tip trees deriving from such snapshots. Transcribe will duplicate the key space partitioning and node contents as closely as possible. The structure of the original branch tip tree at the time the snapshot was taken follows:

In terms of the !ML, Transcribe operations will populate new virtual parcels, which may be merged with neighbors. The common-case performance of both TranscribeRoot and TranscribeNode are claimed to be O(E log(C)). The degenerate performance is daimed to be O(EC).

LookupPrβpare LookυpPrepare is a tow-level operation that both searches for and prepares a specific parcei on a branch tip tree to be updated. This may include archiving and/or transcribing the target parcel and some or ail of its parents. LookupPrepare is used by both insert and delete.

LookupPrepare(l, K, Type)

Node = find the root node of the branch tip tree for branch I if (Node does not exist) { assert that the history tree contains no branch base record [I, #]

// Find the nearest ancestor branch

XI = I

search-next-parent:

XV = branchpoint^!) if (XV == #) {

// There is no ancestor, the branch tip is independent X! = #

} else { X! = branch(XV) ent = Lookup-LE for [X!, XV_S MAXLEVEL, O] in the history tree if ent was not found or ent. I != X!, goto search-next-parent

// An ancestor was found and the branch tip derives from it if ent is a branch base record, XV = ent. P and Xl = ent. I else, XV = ent. V

}

// insert a shortcut branch base record insert [I, #1 -> [XV, XS] into the history tree

if (XV == #) {

// The branch tip tree must start out empty

Node - create an empty branch tip tree root for branch Node. P = latest(f)

} else {

// The branch tip tree must be transcribed

Node = TranscribeRoot(XI, XV)

} insert Node into the top-level index

}

// Does the branch tip root need to be archived? if (Node.P != latβst(l)) { ArchivefS, Node)

}

next-node; ent = search~{Type} Node for K if ent refers to a key range not accounted for, return not found else if (ent is a block pointer and ent.B — 0) { Node = TraπscribeNode(l, Node, ent) goto next-node

} eise if {ent is a block pointer and ent.P N fatest(l)) { Parent = Node Node = πii child - RetrieveNode(ent.B) whiie (childxkey != ent.R) {

Archive(child) if child contains K, Node = child if necessary, try to post the index term for child into Parent chiSd - RetrieveNode(child.rsibSing)

} goto next-node } return Node

The path dealing with searching for the most recent snapshot ancestor as part of creating a new branch tip has common-case performance O{(1 + M!) log(C)). The traversal path may execute TranscribeNode for every Sevei of the structure data in the snapshot from which it is derived, which would have common-case performance O(E iog(C) log(NI)). In the common case, the traversal path may execute Archive for one node of every Sevei of the branch tip tree, which has performance 0(E iog(C) iog(Ni)}. In the degenerate case, the traversal path may execute Archive for every node in the branch tip tree, which would have performance O(NI E !og{C)), In the best case, which should be the most common, neither TranscribeNode nor Archive are invoked, and LookupPrepare has performance O(Sog{NI)).

insert

Insert [!, K, D] adds a new entry with key "K" and data "D" to the branch tree for L SVfost of the work is done inside LookupPrepare.

insertO, K₁ D) Node = LookupPreparefl, K, LE) if Node is not found, return failure if Node contains an entry for K₁ return already exists insert K -> [iatest(i), I, D] info Node

The claimed non-degenerate performance is O(E log2(C)). The best case, and perhaps most common performance is O(log(Ni)).

Deiete

Deiete [!, K] removes an entry with key "K" from branch I, possibly preserving its vaiue. As with Insert, most of the work is done inside LookupPrepare.

Delete^ K)

Node = LookupPreρare(!, K, LE) if Node is not found, return failure ent = search Node for K if ent is not found, return failure if (ent.P < iatest(ϊ)) { hist = Lookup-LE for [ent. Q, ent.P, 0, K] in the history tree if hist covers more than [K, ent.R), split hist into one component covering [K, enf.R) delete hist from the history tree insert [ent.Q, ent.P, 0, K] -> [ent. D, S] into the history tree

} remove ent from Node

The claimed common-case performance is 0(E log2(C}). The best case, and perhaps most common performance is O(log(NI)).

Update

Update(f, K, D) alters an existing entry with key "K", possibly preserving its previous vaiue. Update can be modeled as Delete(l, K), !nsert(l, K, D). Exterπai Interfaces

The Sparse History Tree will have a few simpler exterπai interfaces, modeled after the !ML interfaces. They will diverge only where required. The SHT index class will mirror the !ML index class, but will remain TBD for one exemplary embodiment.

typedef sht_index_s sht_index_t; typedef sht_indexciass_s sht_incSexclass_t;

imi_status_t sht_index_init(sht_index_t ^*, sht_indexclass_t ^*); void shtjndex_mcleanυp(shMndex_t ^*);

The Sparse History Tree will depend on the higher level snapshot management module to implement its revision structure,

typedef uint64_t ssm_brid_t; typedef uint64_t ssm_ssid_t; typedef ssm_snapcfx_s ssm_snapctx_t; typedef ssni_branchctx_s ssm_branchctx_t;

os__status_t ssm_snapctxjnit{ssm_snapctx_t ^*, swift_context_t \ ssm_ssid_t); void ssm_msnaρctx_cleanup(ssrn_snaρctx_t *); os_status__t ssm_branchctxjnit{ssm_branchctx__t ^*, swift_context__t ^*, ssm_brid_t): void ssm_branchcfx_cleanup(ssm_braπchctx_t ^*};

In particular, the ssm_snapctx_t structure provides at least the following information to the sparse history tree;

1 , The snapshot SD of the snapshot

2. Tha snapshot ID of the snapshot's immediate ancestor 3. The branch ID of the branch on which the snapshot resides

4. Whether the snapshot has been marked "dead" or no longer pertinent

Likewise, the ssm_branchctxj structure provides at least the following:

1. The branch ID of the branch

2. The snapshot SD of the branch tip

3. The snapshot !D of the initia! branch tip

4. The branch ID of the parent branch 5. The snapshot iD of the branch point within the parent branch

6. Whether the branch tip has been marked "dead" or no longer pertinent

The snapshot management module also defines interfaces for creating snapshots and branches, marking snapshots as "dead" or no longer pertinent, uitimate on-disk storage of the snapshot data, etc. These interfaces are not directly reievant to the sparse history tree, as the SHT is oniy a consumer of the branch structure information and has no reason to modify it.

typedef sht_cursor_s shf_cursor_f;

void sht_mcursorjnit{sht_cursorj *, shtjndexj *, swiftjxhandlej ^*); void sht_mcursorjreset(sht_mcursorj ^*); void sht_cursor__cleanup(sht_cursor_t ^*);

im!_status_t sht_cursor_search_tip(sh1_cursor_f ^*, iml_searchmode_t, ssm_branchctx_t ^*, iml_kext_t ^*);

iml_statusj sht_mcursor_msearch_snap(sht_mcursor_mt ^*, ssm_snapctxj ^*, imLkextJ *);

im!_status_J sht_mcursor_mprev(sht_cursorj ^*); imi_status_t sht_cursor_next(sht_cursor_t ^*); im!_status_t sht_cursor_get_entry(sht_cursor_t *, im!_kext_t ^*, iml_dext_t ^*}; iml_statusj: sht_mcursorjget_key{sht_cursorj *, iml_kextj ^*}; imi_status_t sht_mcursor_mget_mdata(sht_mcursorj ^*, imljjextj *); /^*

* The below interfaces are only relevant for branch tips and return failure if the

* cursor refers to a snapshot */ imi_status_t sht_cursor_replace_entry(sht_cursor_t *, iml_dext_t ^*); imi_status_t sht_cursor_deiete_entry(sht_cursor_t ^*); imi_status_t sht_mcursorjnsert_mafter(sht_cursorj *, iml_mkextj *, imLdextJ ^*);

The Sparse History Tree also provides interfaces for maintenance operations. These include freeing index entries beiongiπg to no-ionger-pertinent snapshots and branch tips, and relocating pertinent entries from branch tips that are being destroyed into "replacement" branch tips, in the case of a revert operation.

iml_status_t shtjrecSaim_snap(sht_index_t *, ssm_snapctx_t ^*);

imi_statusj shtj"eciaim_mbranch{shtjndex_mt *, ssrn_branchctx_t *, ssmj3ranchctx_t *hint_replacement).

Definitions

As used herein and in the claims, the following words are defined as follows:

A "B-tree" is a tree data structure in a database that sorts data and enables searches, insertions, and deletions to occur in logarithmic amortized time. Internal nodes have a variable number of child nodes with all leaf nodes (i.e., externa! nodes) being maintained at a same depth for balance. A B-pi-tree or "Lomet B~pi~tree" is a B~liπk tree (which is a B-tree in which each node contains a pointer to its right sibling) in which transacted modifications to the tree occur one ieve! of the tree at a time.

A "block" is a sequence of bytes or bits having a specified length or block size. Data is blocked (i.e., placed into blocks) to facilitate haπdiing streams of data since data is typically read as a single block at a time. Most fiie systems are based on block devices that store and retrieve specified blocks of data. For example, a single file can be stored in multiple blocks.

A "block device" refers to a device through which the file system moves data in the form of blocks. Block devices are addressable storage devices, such as hard disks, CD-ROM, or memory regions.

A "duster" is a group of two or more computers that work closely together so that in many respects form a single computer. For examples, a cluster is formed by Sinking multiple computers through a fast local area network (LAN) to improve performance and/or availability over a single computer.

A "directory" is an entity in a file system that contains a group of files or other directories. Related files are typically stored in a same directory, A directory contained inside a directory is a subdirectory. Together, multiple directories form a hierarchy or free structure,

A "filesysteπf or "file system" refers to a system that an operating system or a program uses to organize and keep track of files. A filesystem stores and organizes computer files so the fiies and corresponding data can be managed and discovered, Filesystems use a data storage device (such as a hard disk, CD-ROM, etc) and manage the physical location of the files. A file system is a set of abstract data types that are implemented for the storage, hierarchical organization, manipulation, navigation, access, and/or retrieval of data. In one embodiment, the files are stored in storage devices (such as disk drives) in fixed-size blocks (such as 512 bytes, 1KB, 2KB, or 4KB). The file system manager organizes the biocks into files and directories and tracks which blocks belong to which files.

An "IML Lomet B-pi-tree" is a B-pi-tree managed by the ISVIL (i.e., an Index

Manipulation Layer which is a component of the fiiesystem that manages trees of various kinds).

An "inode" is a data structure that contains information about files (such as basic information about a regular file, directory, or other file system object). Iπodes include information on files, such as, but not limited to, user ownership, access mode (read, write, execute permissions) and type. In one exemplary embodiment, each file has an inode and is identified by an inode number (i-number) in the file system where it resides, (nodes contain metadata (i.e., data about data) about the file.

The term "metadata" refers to data about data. Metadata can be stored in a separate file apart from the data itself. For example, file systems can store metadata in directory entries or in specialized structures, such as inodes. By way of example, metadata can include length of data stored as the number of blocks allocated for the file, a time and date when the file was modified, created, or last accessed, ownership identification, access permissions, etc.

A "node" is a basic unit used to build a tree or data structure. Each node includes data and possibly links (such as pointers or references) to other nodes. By way of example, nodes include root nodes, child nodes, and leaf nodes. A root node is the top most node of a tree data structure. A child node or internal node is an intermediate node between the root node and a Seaf node. A Seaf node or external node ts a node of a tree data structure that has no child nodes (i.e., an end or bottom node).

30 A "snapshot" is a copy of a set of fiSes and directories in a file system as the fiies and directories existed at a particular point in time in the past. Snapshots can be read-only or writable,

A "sparse history tree" is a tree that maps keys to data and retains old ("historical") versions of the mappings.

A "sparse logical history tree" is a sparse history tree in which the key ranges delineating historical data are tracked logically {by entries in the tree) rather than physically (by block).

A "storage device" refers to any data storage device capable of storing data including, but not limited to, one or more of a disk array, a disk dnve, a tape drive, optical drive, a SCSI device, or a fiber channel device. Further, a "disk array" or "array" is a storage system that includes plural disk drives, a cache, and controller. Arrays include, but are not limited to, networked attached storage (NAS) arrays, modular SAN arrays, monolithic SAN arrays, utility SAN arrays, and storage virtual ization.

A "tree" is a data structure (i.e. , a way to store data) that emulates a tree structure with a set of linked nodes.

In one exemplary embodiment, one or more blocks or steps discussed herein are automated. !n other words, apparatus, systems, and methods occur automatically. The terms ^"automated" or "automatically" (and like variations thereof) mean controlled operation of an apparatus, system, and/or process using computers and/or mechanical/electrical devices without the necessity of human intervention, observation, effort and/or decision.

The methods in accordance with exemplary embodiments of the present invention are provided as examples and should not be construed to limit other embodiments within the scope of the invention. Further, methods or steps discussed within different figures can be added to or exchanged with methods of steps in other figures. Further yet, specific numerical data values (such as specific quantities, numbers, categories, etc.) or other specific information should be interpreted as iSlustrative for discussing exempSary embodiments. Such specific information is not provided to limit the invention.

In the various embodiments in accordance with the present invention, embodiments are implemented as a method, system, and/or apparatus. As one example, exemplary embodiments and steps associated therewith are implemented as one or more computer software programs to implement the methods described herein. For example, the software is implemented as one or more modules. The location of the software will differ for the various alternative embodiments. The software programming code, for example, is accessed by a processor or processors of the computer or server from long-term storage media of some type, such as a CD-ROM drive or hard drive. The software programming code is embodied or stored on any of a variety of known media for use with a data processing system or in any memory device such as semiconductor, magnetic and optical devices, inciuding a disk, hard drive, CD-ROM, ROM, etc. The code is distributed on such media, or is distributed to users from the memory or storage of one computer system over a network of some type to other computer systems for use by users of such other systems. Alternatively, the programming code is embodied in the memory and accessed by the processor using the bus. The techniques and methods for embodying software programming code in memory, on physical media, and/or distributing software code via networks are well known and will not be further discussed herein.

The above discussion is meant to be illustrative of the principles and various embodiments of the present invention. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fuliy

4 ] appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

Claims

What is claimed is:

1 ) A method, comprising: capturing multiple snapshots of a file system represented as a tree (300); storing states of the snapshots in a single history tree (340); updating the history tree to include changes between a current snapshot and a previous snapshot (360); and using one of the snapshots to backup or recover data in the file system (370).

2) The method of claim 1 further comprising, implementing the snapshots in an indexing structure that uses B-trees and a current version counter to store a state of a single attribute index with one or more branching versions, states at all branch points, and intervening snapshot points (310).

3) The method of claim 1 further comprising, representing the snapshots of the tree without creating multiple physical pointers to individual blocks making up the tree.

4) The method of claim 1 , wherein the snapshots represent a logical version of the tree and are updated and snapshotted.

5) The method of claim 1 further comprising, updating a value in the tree to overwrite an old value in-place to preserve spatial properties of a layout of data corresponding to a current state of the tree to avoid fragmentation and to preserve an old value for reference by the snapshots.

6) A tangible computer readable storage medium having instructions for causing a computer to execute a method, comprising: creating plural read-only snapshots of a clustered file system represented as a tree data structure (300); storing states of the read-only snapshots in a history tree (340); comparing a current snapshot to a previous snapshot to determine updates to the states in the history tree (350); and using one of the read-only snapshots to backup or recover data in the file system (370).

7) The tangible computer readable storage medium of claim 6 further comprising, storing the read-only snapshots in same physical blocks on disk as a data structure representing a current state of the file system so that a single on-disk block represents the tree data structure and multiple snapshots of the tree data structure.

8) The tangible computer readable storage medium of claim 6, wherein the readonly snapshots consume space proportionate to an amount of data that has been overwritten and not proportionate to a total amount of data in the tree data structure.

9) The tangible computer readable storage medium of claim 6 further comprising, deriving branch tips from branch points in the tree data structure, wherein the branch tips are modified and snapshotted.

10) The tangible computer readable storage medium of claim 6, wherein the read-only snapshots are stored with a current state of the tree data structure in a single storage block.

11 ) A server (420), comprising: a snapshot algorithm (480) that captures snapshots of a clustered file system (100) represented as a tree (200), stores states of the snapshots in a history tree, updates the history tree to includes changes between a current snapshot and a previous snapshot, and uses one of the snapshots to backup or recover data in the file system.

12) The server of claim 11 , wherein the snapshots are taken in the clustered file system where multiple servers simultaneously update trees stored on disks.

13) The server of claim 11 , wherein the snapshots include snapshots of individual files and not the entire clustered file system.

14) The server of claim 11 , wherein the snapshots are writable snapshots of only part of the clustered file system.

15) The server of claim 11 , wherein the snapshots are taken by users of files of the users.