
HK1093481B - Virtual disk drive system and method


Info

Publication number: HK1093481B
Authority: HK (Hong Kong)
Prior art keywords: data, raid, disk, volume, page
Application number: HK07100663.9A
Other languages: Chinese (zh)
Other versions: HK1093481A1 (en)
Inventors: P. E. Soran, J. P. Guider, L. E. Aszmann, M. J. Klemm
Original assignee: Dell International L.L.C.
Application filed by Dell International L.L.C.
Priority claimed from PCT/US2004/026499 (WO2005017737A2)
Publication of HK1093481A1
Publication of HK1093481B

Description

Virtual disk drive system and method
Technical Field
The present invention relates generally to disk drive systems and methods, and more particularly to designing disk drive systems with capabilities such as dynamic data allocation and disk drive virtualization.
Background
Existing disk drive systems are designed such that the virtual volume data storage space is statically associated with physical disks of a particular size and location for storing data. These disk drive systems must know and monitor/control the exact location and size of the virtual volumes of the data storage space in order to store data. In addition, such systems often must add more RAID devices in order to provide additional data storage space. However, these additional RAID devices are expensive and are not required until the additional data storage space is actually needed.
FIG. 13A illustrates a prior art disk drive system containing virtual volume data storage space associated with physical disks of a particular size and location for storing, reading/writing, and/or recovering data. The disk drive system statically allocates data based on the particular location and size of the virtual volumes of the data storage space. As a result, emptied data storage space cannot be reused, while additional, often expensive, data storage devices, such as RAID devices, must be acquired in advance for storing, reading/writing, and/or recovering data, even though this additional storage space is not needed or used until later.
Thus, there is a need for improved disk drive systems and methods. There is also a need for an efficient, dynamic data allocation and disk drive space and time management system and method.
Summary of the Invention
The present invention provides an improved disk drive system and method capable of dynamically allocating data. The disk drive system may include a RAID subsystem including a matrix of disk storage blocks and a disk manager including at least one disk storage system controller. The RAID subsystem and disk manager dynamically allocate data across the matrix of disk storage blocks and a plurality of disk drives based on RAID-to-disk mapping. The RAID subsystem and disk manager determine whether additional disk drives are needed and send a notification if they are. Dynamic data allocation allows a user to acquire disk drives later, when they are actually needed. Dynamic data allocation also allows for efficient data storage of snapshots/point-in-time copies of the virtual volume matrix or pool of disk storage blocks, instant data replay and instant data fusion for data backup, recovery, etc., remote data storage, and data staging management (data progression). Data staging management also allows drive purchases to be deferred, since cheaper disk drives can be purchased later.
In one embodiment, a matrix or pool of virtual volumes or disk storage blocks is provided for association with physical disks. The matrix or pool of virtual volumes or disk storage blocks is dynamically monitored/controlled by a plurality of disk storage system controllers. In one embodiment, the size of each virtual volume may take a default value or be predefined by the user, while the location of each virtual volume defaults to empty. Before data is allocated, the virtual volume is empty. Data may be allocated in any grid of the matrix or pool (once data is allocated in a grid, it becomes a "point" in the grid). Once the data is deleted, the virtual volume is again available and indicated as "empty". Thus, additional data storage space and sometimes expensive disk storage devices, such as RAID devices, may be acquired at a later time on an as-needed basis.
In one embodiment, a disk manager may manage multiple disk storage system controllers, and multiple redundant disk storage system controllers may be implemented to cover a failure of an operational disk storage system controller.
In one embodiment, the RAID subsystem includes a combination of at least one of the RAID types, such as RAID-0, RAID-1, RAID-5, and RAID-10. It will be appreciated that other RAID types may be used in alternative RAID subsystems, such as RAID-3, RAID-4, RAID-6, RAID-7, and the like.
The invention also provides a dynamic data allocation method comprising the following steps: providing a default size of the logical blocks or disk storage blocks such that the disk space of the RAID subsystem forms a matrix of disk storage blocks; writing data and allocating data in the matrix of disk storage blocks; determining the occupancy rate of the disk space of the RAID subsystem based on the historical occupancy rate of the disk space of the RAID subsystem; determining whether an additional disk drive is needed; and sending a notification to the RAID subsystem if additional disk drives are needed. In one embodiment, the notification is sent via email.
One advantage of the disk drive system of the present invention is that the RAID subsystem is able to employ RAID techniques across a virtual number of disks. The remaining storage space is freely available. By monitoring the storage space and determining the occupancy rate of the storage space of the RAID subsystem, a user does not have to acquire a large number of drives that are expensive and provide no benefit at the time of purchase. Adding drives only when they are actually needed to meet the increasing demand for storage space significantly reduces the overall cost of the disk drive system, while the efficiency of disk usage is substantially improved.
It is a further advantage of the present invention that the disk storage system controller is generic to any computer file system and not only to a particular computer file system.
The invention also provides a method for instant replay of data. In one embodiment, the data instant replay method comprises the following steps: providing a default size of logical blocks or disk storage blocks such that the disk space of the RAID subsystem forms a page pool of storage or a matrix of disk storage blocks; automatically generating snapshots of volumes of the page pool, or snapshots of the matrix of disk storage blocks, at predetermined time intervals; and storing an address index of the snapshots, or deltas thereof, of the page pool or matrix of disk storage blocks, such that a snapshot or delta of the matrix of disk storage blocks can be located instantly from the stored address index.
The data instant replay method automatically generates snapshots of the RAID subsystem at user-defined time intervals, user-configured dynamic timestamps (e.g., every few minutes or hours, etc.), or times indicated by the server. These time-stamped virtual snapshots allow data to be replayed instantly and recovered instantly, on the order of minutes or hours, in the event of a system failure or virus attack. This technique is also referred to as instant replay fusion: the data is fused at a point in time shortly before the crash or attack, and the snapshot stored before the crash or attack can be used instantly for future operations.
In one embodiment, snapshots may be stored in a local RAID subsystem or in a remote RAID subsystem so that if a major system crash occurs due to, for example, a terrorist attack, the integrity of the data is not affected and the data can be recovered on the fly.
Another advantage of the data instant replay method is that snapshots can be used for testing while the system remains in operation. Real-time data may be used for real-time testing.
The present invention also provides a system for instant replay of data comprising a RAID subsystem and a disk manager having at least one disk storage system controller. In one embodiment, the RAID subsystem and disk manager automatically allocate data across disk space of the plurality of drives based on a RAID-to-disk mapping, wherein the disk space of the RAID subsystem forms a matrix of disk storage blocks. The disk storage system controller automatically generates snapshots of the matrix of disk storage blocks at predetermined time intervals and stores an address index of the snapshots or increments of the matrix of disk storage blocks so that the snapshots or increments of the matrix of disk storage blocks can be located instantaneously by the stored address index.
In one embodiment, the disk storage system controller monitors the frequency of data usage from a snapshot of a matrix of disk storage blocks and applies aging rules so that less used or accessed data is moved to a less expensive RAID subsystem. Similarly, when data located in a less expensive RAID subsystem is to be used more frequently, the controller moves the data to the more expensive RAID subsystem. Thus, a user can select a desired RAID subsystem portfolio to meet their own storage needs. Thus, the cost of the disk drive system can be significantly reduced and dynamically controlled by the user.
These and other features and advantages of the present invention will become readily apparent to those skilled in this art from the following detailed description, wherein there is shown and described an illustrative embodiment of the invention, including the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of modifications in various obvious aspects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.
Brief description of the drawings
FIG. 1 illustrates one embodiment of a disk drive system in a computer environment in accordance with the principles of the present invention.
FIG. 2 illustrates one embodiment of dynamic data allocation of a pool of storage pages with a RAID subsystem for a disk drive in accordance with the principles of the present invention.
FIG. 2A illustrates a conventional data allocation in a RAID subsystem of a disk drive system.
FIG. 2B illustrates data allocation in a RAID subsystem of a disk drive system according to the principles of the present invention.
Fig. 2C illustrates a dynamic data allocation method in accordance with the principles of the present invention.
Fig. 3A and 3B are schematic diagrams of snapshots of disk storage blocks of a RAID subsystem at multiple time intervals, according to the principles of the present invention.
Figure 3C illustrates a method of data instant replay in accordance with the principles of the present invention.
FIG. 4 is a schematic diagram of a data just-in-time fusion function using snapshots of disk storage blocks of a RAID subsystem in accordance with the principles of the present invention.
FIG. 5 is a schematic diagram of local-remote data copy and instant replay functions through the use of snapshots of disk storage blocks of a RAID subsystem, in accordance with the principles of the present invention.
FIG. 6 illustrates a snapshot using the same RAID interface to perform I/O and concatenating multiple RAID devices into a volume in accordance with the principles of the present invention.
FIG. 7 illustrates one embodiment of a snapshot structure in accordance with the principles of the present invention.
Fig. 8 illustrates one embodiment of a PITC life cycle in accordance with the principles of the present invention.
Fig. 9 illustrates one embodiment of a PITC table structure with multi-level indexing, in accordance with the principles of the present invention.
Fig. 10 illustrates one embodiment of recovery of a PITC table in accordance with the principles of the present invention.
FIG. 11 illustrates one embodiment of a write process with an owned-page sequence and a non-owned-page sequence in accordance with the principles of the invention.
Fig. 12 illustrates exemplary snapshot operations in accordance with the principles of the present invention.
FIG. 13A illustrates a prior art disk drive system containing virtual data storage space associated with a physical disk of a particular size and location for statically allocating data.
FIG. 13B illustrates a volume logical block mapping in the prior art disk drive system of FIG. 13A.
FIG. 14A illustrates one embodiment of a disk drive system containing a virtual volume matrix of disk blocks for dynamically allocating data in the system in accordance with the principles of the present invention.
FIG. 14B illustrates one embodiment of dynamic data allocation in a disk storage block virtual volume matrix as shown in FIG. 14A.
FIG. 14C illustrates a schematic diagram of volume-RAID page remapping for one embodiment of a storage virtual volume page pool, in accordance with the principles of the present invention.
FIG. 15 illustrates an example of three disk drives mapped to multiple disk storage blocks of a RAID subsystem in accordance with the principles of the present invention.
FIG. 16 shows an example of remapping of disk drive storage blocks after adding a disk drive to three disk drives as shown in FIG. 15.
FIG. 17 illustrates one embodiment of an accessible data page in a data hierarchy management operation in accordance with the principles of the present invention.
FIG. 18 illustrates a flow diagram for one embodiment of data hierarchy management operations in accordance with the principles of the present invention.
FIG. 19 illustrates one embodiment of a compressed page layout in accordance with the principles of the present invention.
FIG. 20 illustrates one embodiment of hierarchical management of data in an advanced disk drive system in accordance with the principles of the present invention.
FIG. 21 illustrates one embodiment of external data flow in a subsystem in accordance with the principles of the present invention.
FIG. 22 illustrates one embodiment of internal data flow within a subsystem.
FIG. 23 illustrates one embodiment of each subsystem independently maintaining coherency.
Fig. 24 illustrates one embodiment of a hybrid RAID waterfall data progression management in accordance with the principles of the present invention.
FIG. 25 illustrates one embodiment of storing multiple free lists of a page pool, in accordance with the principles of the present invention.
FIG. 26 illustrates one embodiment of an example database in accordance with the principles of the present invention.
Figure 27 illustrates one embodiment of an example of an MRI map in accordance with the principles of the present invention.
Detailed description of the preferred embodiments
The present invention provides an improved disk drive system and method capable of dynamically allocating data. The disk drive system may include a RAID subsystem including a page pool of storage or a matrix of disk storage blocks that maintains a free list of RAID devices, and a disk manager including at least one disk storage system controller. The RAID subsystem and disk manager dynamically allocate data across the page pool of storage or matrix of disk storage blocks and a plurality of disk drives based on RAID-to-disk mapping. The RAID subsystem and disk manager determine whether additional disk drives are needed and send a notification if they are. Dynamic data allocation allows a user to acquire disk drives later, when they are actually needed. Dynamic data allocation also allows for efficient data storage of snapshots/point-in-time copies of the virtual volume matrix or pool of disk storage blocks, instant data replay and instant data fusion for data backup, restore, etc., remote data storage, and data staging management. Data staging also allows drive purchases to be deferred, since cheaper disk drives can be purchased later.
FIG. 1 illustrates one embodiment of a disk drive system 100 in a computer environment 102 in accordance with the principles of the present invention. As shown in FIG. 1, the disk drive system 100 includes a RAID subsystem 104 and a disk manager 106 having at least one disk storage system controller (FIG. 16). The RAID subsystem 104 and disk manager 106 dynamically allocate data across the disk space of the plurality of disk drives 108 based on RAID-to-disk mapping. In addition, the RAID subsystem 104 and disk manager 106 can determine whether additional disk drives are needed based on the allocation of data across disk space. If additional disk drives are needed, a notification is sent to the user so that additional disk space can be added if desired.
In accordance with the principles of the present invention, a disk drive system 100 with dynamic data allocation (also referred to as "disk drive virtualization") is shown in FIG. 2 in one embodiment, and in FIGS. 14A and 14B in another embodiment. As shown in FIG. 2, the disk storage system 110 includes a page pool 112, i.e., a data storage pool containing a list of data storage space in which data may be freely stored. Page pool 112 maintains a free list of RAID devices 114 and manages read/write allocations based on user requests. The disk storage volumes 116 requested by the user are sent to the page pool 112 to obtain storage space. Each volume may request the same or different storage device classes with the same or different RAID levels (e.g., RAID 10, RAID 5, RAID 0, etc.).
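As a non-limiting illustration of the page pool behavior described above, the following Python sketch (class and method names, storage classes, and page counts are illustrative assumptions, not part of the disclosed system) shows a pool that keeps a free list per storage class and hands pages to volumes on request:

```python
# Minimal sketch of a page pool that maintains per-RAID-class free lists and
# serves page requests from volumes. Names and sizes are illustrative only.
class PagePool:
    def __init__(self, pages_per_class):
        # e.g. {"RAID-10": 1000, "RAID-5": 4000} -> free page indices per class
        self.free = {cls: list(range(n)) for cls, n in pages_per_class.items()}

    def allocate(self, raid_class):
        """Hand one free page of the requested class to a volume."""
        if not self.free[raid_class]:
            raise RuntimeError(f"no free {raid_class} pages: add disks")
        return self.free[raid_class].pop()

    def release(self, raid_class, page):
        """Return a page to the pool when its data is deleted."""
        self.free[raid_class].append(page)


pool = PagePool({"RAID-10": 1000, "RAID-5": 4000})
page = pool.allocate("RAID-10")   # a volume requests storage on demand
pool.release("RAID-10", page)     # freed space becomes reusable ("empty")
```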
Another embodiment of the dynamic data allocation of the present invention is illustrated in FIGS. 14A and 14B, wherein a disk storage system 1400 having a plurality of disk storage system controllers 1402 and a matrix of disk storage blocks 1404 controlled by the plurality of disk storage system controllers 1402 dynamically allocates data in the system in accordance with the principles of the present invention. A matrix of virtual volumes or blocks 1404 is provided for association with physical disks. The matrix of virtual volumes or blocks 1404 is dynamically monitored/controlled by the plurality of disk storage system controllers 1402. In one embodiment, the size of each virtual volume 1404 may be predefined, such as 2 megabytes, and the location of each virtual volume 1404 defaults to empty. Each of the virtual volumes 1404 is empty before data is allocated. Data may be allocated in any grid of the matrix or pool (once data is allocated in a grid, it becomes a "point" in the grid). Once the data is deleted, the virtual volume 1404 is again available and indicated as "empty". Thus, additional and sometimes expensive disk storage devices, such as RAID devices, may be acquired later on an as-needed basis.
Thus, the RAID subsystem is able to employ RAID techniques across a virtual number of disks. The remaining storage space is freely available. By monitoring the storage space and determining the occupancy rate of the storage space of the RAID subsystem, a user does not have to acquire a large number of drives that are expensive and provide no benefit at the time of purchase. Adding drives only when they are actually needed to meet the increasing demand for storage space significantly reduces the overall cost of the disk drive system, while the efficiency of disk usage is substantially improved.
Also, the dynamic data allocation of the disk drive system of the present invention allows for efficient data storage of snapshots/point-in-time copies of virtual volume page pools or virtual volume matrices of disk storage blocks, instant data replay and instant data fusion for data recovery and remote data storage, and hierarchical management of data.
The above features and advantages resulting from the dynamic data allocation system and method and its implementation in the disk drive system 100 will be discussed in detail below.
Dynamic data allocation
FIG. 2A illustrates a conventional data allocation in a RAID subsystem of a disk drive system, in which emptied data storage space is captured and cannot be reallocated for data storage.
FIG. 2B illustrates data allocation in a RAID subsystem of a disk drive system according to the principles of the present invention, in which emptied data storage space available for data storage is blended together to form a page pool, e.g., a single page pool in one embodiment of the present invention.
Fig. 2C illustrates a dynamic data allocation method 200 in accordance with the principles of the present invention. The dynamic data allocation method 200 includes a step 202 of defining a default size of logical blocks or disk storage blocks such that disk space of the RAID subsystem forms a matrix of disk storage blocks; and a step 204 of writing data and allocating data in the disk storage blocks of the matrix where the disk storage blocks are indicated as "empty". The method further includes a step 206 of determining an occupancy rate of a disk space of the RAID subsystem based on the historical occupancy rate of the disk space of the RAID subsystem; and a step 208 of determining if additional disk drives are needed and, if so, sending a notification to the RAID subsystem. In one embodiment, the notification is sent via email. Further, the size of the disk storage blocks may be set to default or may be changed by the user.
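A minimal sketch of steps 202 through 208 follows; the send_email notifier and the 0.8 occupancy threshold are hypothetical stand-ins, since the description specifies only that a notification (e.g., email) is sent when additional drives are needed:

```python
# Sketch of steps 202-208: write into "empty" blocks, track occupancy history,
# and notify when the trend suggests more disk drives are needed.
def write_page(matrix, data):
    for i, slot in enumerate(matrix):
        if slot is None:              # "empty" grid point
            matrix[i] = data
            return i
    raise RuntimeError("matrix full")

def occupancy(matrix):
    return sum(s is not None for s in matrix) / len(matrix)

def send_email(msg):
    print("NOTIFY:", msg)             # stand-in for an e-mail notification

def check_capacity(history, threshold=0.8):
    # Steps 206/208: use the historical occupancy to decide whether to notify.
    if history and history[-1] >= threshold:
        send_email("Additional disk drives are needed")

matrix = [None] * 10                  # default-sized disk storage blocks
history = []
for block in (b"a", b"b", b"c", b"d", b"e", b"f", b"g", b"h", b"i"):
    write_page(matrix, block)
    history.append(occupancy(matrix))
    check_capacity(history)
```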
In one embodiment, dynamic data allocation, sometimes referred to as "virtualization" or "disk space virtualization," efficiently handles a large number of read and write requests per second. The architecture may require that the interrupt handler call the cache subsystem directly. Since dynamic data allocation does not queue requests, it may not optimize the requests, but it may have a large number of pending requests at a time.
Dynamic data allocation may also maintain data integrity and protect the contents of the data from any controller failures. To this end, dynamic data allocation writes status information to the RAID devices for reliable storage.
Dynamic data allocation may also maintain the order of read and write requests and complete read or write requests in the exact order in which the requests were received. Dynamic data allocation allows maximum system availability and supports remote replication of data to different geographical locations.
In addition, dynamic data allocation provides the ability to recover from data corruption. Through the snapshot, the user can view past disk states.
Dynamic data allocation manages RAID devices and provides storage abstraction to create and augment large devices.
Dynamic data allocation presents a virtual disk device to the server; this device is called a volume. To the server, the volume behaves like a disk: it may return different information for the serial number, but it otherwise behaves substantially like a disk drive. A volume provides a storage abstraction over multiple RAID devices to create a larger dynamic volume device. A volume includes multiple RAID devices for efficient use of disk space.
FIG. 13B illustrates an existing volume logical block mapping. FIG. 14C illustrates the volume-RAID page remapping of one embodiment of a page pool of storage virtual volumes, in accordance with the principles of the present invention. Each volume is divided into a set of pages, e.g., 1, 2, 3, etc., and each RAID device is likewise divided into a set of pages. In one embodiment, the volume page size and the RAID page size are the same. Thus, in one example of the volume-to-RAID page mapping of the present invention, volume page #1 using RAID-2 is mapped to RAID page #1.
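To make the volume-to-RAID page remapping concrete, the following small sketch (the table contents are illustrative only) keeps a per-volume map recording which RAID device and RAID page currently back each volume page:

```python
# Sketch of volume-to-RAID page remapping: volume pages and RAID pages share a
# page size, and a per-volume table records where each volume page lives.
volume_map = {
    1: ("RAID-2", 1),   # volume page #1 -> RAID-2, RAID page #1
    2: ("RAID-5", 7),   # another page may live on a different RAID device
}

def remap(volume_page):
    device, raid_page = volume_map[volume_page]
    return device, raid_page

print(remap(1))   # ('RAID-2', 1)
```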
Dynamic data allocation maintains data integrity for the volume. The data is written to the volume and acknowledged to the server. Data integrity covers various controller configurations, including independence and redundancy through controller failure. Controller failures include power failures, power cycles, software exceptions, and hard resets. Dynamic data allocation generally does not handle disk drive failures covered by RAID.
Dynamic data allocation provides the highest level of data abstraction for the controller. It accepts requests from the front end and eventually writes data to disk using the RAID devices.
Dynamic data allocation includes various internal subsystems:
Cache - smooths read and write operations to a volume by providing fast response times to the server and bundling writes for the data plug-in.
Configuration - methods for creating, deleting, retrieving, and modifying data allocation objects. Provides components for a toolkit used by higher-level system applications.
Data plug-in - distributes volume read and write requests to the various subsystems depending on the volume configuration.
RAID interface - provides a RAID device abstraction to users and to the other dynamic data allocation subsystems in order to create larger volumes.
Copy/mirror/swap - copies volume data to local and remote volumes. In one embodiment, only blocks written by the server are replicated.
Snapshot - provides incremental volume recovery of data. It creates view volumes (ViewVolume) of past volume states on the fly.
Proxy volume - enables request communication to a remote destination volume to support remote replication.
Billing - charges the user for allocated storage, activity, performance, and data recovery.
Dynamic data allocation also logs any errors and significant changes in the configuration.
FIG. 21 illustrates one embodiment of external data flow within the subsystem. External requests come from the front end. Requests include get volume information, read, and write. All requests contain a volume ID. Volume information requests are processed by the volume configuration subsystem. Read and write requests contain an LBA; a write request also contains data.
Depending on the volume configuration, dynamic data allocation passes requests to multiple external layers. The remote copy passes the request to the front end, destined for the remote destination volume. The RAID interface passes the request to the RAID. The copy/mirror/swap passes the request back to the dynamic data allocation to the destination volume.
FIG. 22 illustrates one embodiment of the internal data flow within the subsystem. The internal data stream starts with the cache. The cache may place the write request in the cache or pass the request directly to the data plugin. The cache supports direct DMA from the front-end HBA device. The request can be completed quickly and a response returned to the server. The data plug-in manager is central to the flow of requests below the cache. For each volume, it calls the registered subsystem object for each request.
Dynamic data allocation subsystems that affect data integrity may require support for controller coherency. As shown in FIG. 23, each subsystem independently maintains coherency. The coherency updates avoid copying blocks of data across the coherency link. Cache coherency may require data to be copied to the peer controller.
Disk storage system controller
FIG. 14A illustrates a disk storage system 1400 having a plurality of disk storage system controllers 1402 and a matrix of disk storage blocks or virtual volumes 1404 controlled by the plurality of disk storage system controllers 1402 for dynamically allocating data in the system, in accordance with the principles of the present invention. FIG. 14B illustrates one embodiment of dynamic data allocation in a virtual volume matrix of disk storage blocks or virtual volumes 1404.
In one operation, the disk storage system 1400 automatically generates a snapshot of the matrix of disk storage blocks or virtual volumes 1404 at predetermined time intervals and stores an address index of the snapshot or increments therein of the matrix of disk storage blocks or virtual volumes 1404 such that the snapshot or increments of the matrix of disk storage blocks or virtual volumes 1404 can be located instantaneously by the stored address index.
In another operation, the disk storage system controller 1402 monitors the frequency of data usage from snapshots of the matrix of disk storage blocks 1404 and applies aging rules so that less used or accessed data is moved to a less expensive RAID subsystem. Similarly, when data located in a less expensive RAID subsystem begins to be used more frequently, the controller moves the data to the more expensive RAID subsystem. Thus, a user can select a desired RAID subsystem portfolio to meet their own storage needs. Thus, the cost of the disk drive system can be significantly reduced and dynamically controlled by the user.
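A rough sketch of the aging rule described above is given below; the tier names, access-count thresholds, and monitoring window are assumptions for illustration, not values taken from the patent:

```python
# Sketch of data progression: move pages between RAID tiers by access frequency.
# Thresholds and tier names are assumptions, not values from the description.
CHEAP, EXPENSIVE = "RAID-5 (SATA)", "RAID-10 (FC)"

def progress(pages, cold_limit=2, hot_limit=10):
    for page in pages:
        if page["tier"] == EXPENSIVE and page["accesses"] < cold_limit:
            page["tier"] = CHEAP          # age out rarely used data
        elif page["tier"] == CHEAP and page["accesses"] > hot_limit:
            page["tier"] = EXPENSIVE      # promote frequently used data
        page["accesses"] = 0              # start a new monitoring window

pages = [{"tier": EXPENSIVE, "accesses": 0}, {"tier": CHEAP, "accesses": 25}]
progress(pages)
print([p["tier"] for p in pages])  # ['RAID-5 (SATA)', 'RAID-10 (FC)']
```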
RAID-disk mapping
The RAID subsystem and disk manager dynamically allocate data based on RAID-to-disk mapping of disk space across multiple disk drives. In one embodiment, the RAID subsystem and disk manager determine if additional disk drives are needed and send a notification if additional disk drives are needed.
FIG. 15 illustrates an example of three disk drives 108 (FIG. 1) mapped to multiple disk storage blocks 1502 through 1512 in a RAID-5 subsystem 1500 in accordance with the principles of the present invention.
FIG. 16 shows an example of a remapping 1600 of disk drive storage blocks after a disk drive 1602 is added to three disk drives 108 as shown in FIG. 15.
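The remapping of FIGS. 15 and 16 can be pictured as re-striping block numbers over the enlarged disk set; the sketch below simply assigns blocks round-robin over the available disks (an assumed layout used only to illustrate that a portion of the blocks move when a disk is added):

```python
# Sketch of RAID-to-disk mapping: stripe storage blocks round-robin over disks,
# and recompute the layout when a disk is added (cf. FIGS. 15 and 16).
def map_blocks(num_blocks, disks):
    return {blk: disks[blk % len(disks)] for blk in range(num_blocks)}

before = map_blocks(12, ["disk1", "disk2", "disk3"])
after = map_blocks(12, ["disk1", "disk2", "disk3", "disk4"])  # disk added
moved = [b for b in before if before[b] != after[b]]
print(f"{len(moved)} of 12 blocks are remapped after adding disk4")
```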
Disk manager
As shown in FIG. 1, disk manager 106 generally manages disks and disk arrays, including grouping/resource consolidation, disk attribute abstraction, formatting, adding/subtracting disks, and tracking disk service times and error rates. The disk manager 106 does not differentiate between the various disk models and provides a common storage device for the RAID components. The disk manager 106 also provides grouping capabilities that facilitate constructing RAID groups with specific characteristics, such as 10,000 RPM disks.
In one embodiment of the invention, the disk manager 106 has at least three layers: abstraction, configuration, and I/O optimization. The disk manager 106 presents "disks" to higher layers; these may be, for example, locally or remotely attached physical disk drives or remotely attached disk systems.
A common underlying feature is that any of these devices may be the target of an I/O operation. The abstraction service provides a uniform data path interface for higher layers (particularly the RAID subsystem) and provides a generic mechanism for an administrator to manage the target device.
The disk manager 106 of the present invention also provides grouping capabilities to simplify management and configuration. Disks may be named and placed in groups, and groups may also be named. Grouping is a powerful feature that simplifies tasks such as migrating volumes from one grouping of disks to another, dedicating a grouping of disks to a particular function, designating a grouping of disks as spare, and the like.
The disk manager also interfaces with devices such as a SCSI device subsystem that is responsible for detecting the presence of external devices. The SCSI device subsystem is capable of determining, at least for fibre channel/SCSI type devices, a subset of devices that are block type target devices. It is these devices that are managed and abstracted by the disk manager.
In addition, the disk manager is responsible for responding to flow control from the SCSI device layer. The disk manager has the ability to queue, which provides an opportunity to aggregate I/O requests as a method to optimize disk drive system throughput.
Further, the disk manager of the present invention manages a plurality of disk storage system controllers. Also, multiple redundant disk storage system controllers may be implemented to cover failures of operational disk storage system controllers. The redundant disk storage system controller is also managed by a disk manager.
Disk manager relationship to other subsystems
The disk manager interacts with several other subsystems. The RAID subsystem is the primary client of the services provided by the disk manager for data path activity. The RAID subsystem uses the disk manager as an exclusive path to the disks for I/O. The RAID system also listens for events from the disk manager to determine the presence and operational status of the disks. The RAID subsystem also works with a disk manager to allocate ranges for the fabric of the RAID devices. The management control listens for disk events to learn about the existence of a disk and to learn about operational state changes. In one embodiment of the invention, RAID subsystem 104 may comprise a combination of at least one RAID type, such as RAID-0, RAID-1, RAID-5, and RAID-10. It will be appreciated that other RAID types may be used in alternative RAID subsystems, such as RAID-3, RAID-4, RAID-6, RAID-7, and the like.
In one embodiment of the invention, the disk manager utilizes a configuration access service to store persistent configuration and to present transient read-only information, such as statistics, to the presentation layers. The disk manager registers handlers with configuration access to access these parameters.
The disk manager also utilizes services at the SCSI device layer to understand the existence and operational status of block devices and contains I/O paths to these block devices. The disk manager queries the SCSI device subsystem for devices as a support method for uniquely identifying the disk.
Data instant replay and data instant fusion
The invention also provides a method for instant replay and instant fusion of data. FIGS. 3A and 3B show schematic diagrams of snapshots of disk storage blocks of a RAID subsystem at multiple time intervals in accordance with the principles of the present invention. FIG. 3C shows a data instant replay method 300, which includes a step 302 of defining a default size of logical blocks or disk storage blocks such that the disk space of the RAID subsystem forms a page pool of storage or a matrix of disk storage blocks; a step 304 of automatically generating snapshots of volumes of the page pool, or snapshots of the matrix of disk storage blocks, at predetermined time intervals; and a step of storing an address index of the snapshots, or deltas thereof, of the page pool or matrix of disk storage blocks, such that a snapshot or delta of the matrix of disk storage blocks can be located instantly from the stored address index.
As shown in FIG. 3B, at each predetermined time interval, for example every 5 minutes, such as T1 (12:00 PM), T2 (12:05 PM), T3 (12:10 PM), and T4 (12:15 PM), a snapshot of the page pool or disk block matrix is automatically generated. A snapshot or delta thereof of the page pool or disk block matrix is stored in the page pool or disk block matrix so that the snapshot or delta of the page pool or disk block matrix can be located on the fly by the stored address index.
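A sketch of the bookkeeping behind instant replay is shown below, assuming a simple in-memory dictionary as the stored address index; each timestamped snapshot records only the pages changed since the previous snapshot (the delta), so any point in time can be located immediately through its index entry:

```python
# Sketch of data instant replay: take timestamped delta snapshots of a page
# matrix and look them up instantly through a stored address index.
import copy

address_index = {}   # timestamp -> {page number: page contents at that time}
current = {}         # live page matrix (page number -> contents)
dirty = set()        # pages written since the last snapshot

def write(page, data):
    current[page] = data
    dirty.add(page)

def take_snapshot(timestamp):
    # Store only the delta; the index makes it locatable on the fly.
    address_index[timestamp] = {p: copy.copy(current[p]) for p in dirty}
    dirty.clear()

write(3, b"v1"); take_snapshot("12:00PM")   # T1
write(3, b"v2"); take_snapshot("12:05PM")   # T2
print(address_index["12:00PM"][3])          # b'v1' -- replay page 3 as of T1
```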
Thus, the data instant replay method automatically generates snapshots of the RAID subsystem at user-defined time intervals, user-configured dynamic timestamps (e.g., every few minutes or hours, etc.), or times indicated by the server. These time-stamped virtual snapshots allow data to be replayed instantly and recovered instantly, on the order of minutes or hours, in the event of a system failure or virus attack. This technique is also referred to as instant replay fusion: the data is fused at a point in time shortly before the crash or attack, and the snapshot stored before the crash or attack can be used instantly for future operations.
FIG. 4 also shows a schematic diagram of a data just-in-time fusion function 400 using multiple snapshots of disk storage blocks of a RAID subsystem, in accordance with the principles of the present invention. At T3, a parallel chain T3′-T5′ of snapshots is generated, whereby data fused and/or restored from the fused data at T3′ may be used to replace the data to be fused at T4. Similarly, multiple parallel chains of snapshots, e.g., T3″, T4‴, may be generated for replacing the data to be fused at T4′-T5′ and at T4″-T5″. In an alternative embodiment, the snapshots at T4, T4′-T5′, and T5″ may still be stored in the page pool or matrix.
The snapshots may be stored in the local RAID subsystem or in a remote RAID subsystem so that if a major system crash occurs due to, for example, a terrorist attack, the integrity of the data is not affected and the data can be recovered on the fly. FIG. 5 illustrates a schematic diagram of the local-remote data copy and instant replay functions 500 using snapshots of disk storage blocks of a RAID subsystem in accordance with the principles of the present invention.
The remote copy performs a service of copying volume data to a remote system. It attempts to maintain as close synchronization of local and remote volumes as possible. In one embodiment, the data of the remote volume may not reflect a perfect copy of the data of the local volume. Network connectivity and performance may cause the remote volume to be out of sync with the local volume.
Another feature of the data instant replay and data instant fusion method is that snapshots can be used for testing while the system still maintains its operation. Real-time data may be used for real-time testing.
Snapshot and point-in-time copy (PITC)
One example of data instant replay, in accordance with the principles of the present invention, is a snapshot of disk storage blocks utilizing a RAID subsystem. The snapshot records writes to the volume so that a view can be created to view the contents of the past volume. Snapshots therefore also support data recovery by creating a view of a previous point-in-time copy of a volume (PITC).
The core of the snapshot implements the creation, aggregation, management, and I/O operations of the snapshot. The snapshot monitors writes to the volume and creates a point-in-time copy (PITC) for access through the view volume. It adds a Logical Block Address (LBA) remapping layer to the data path within the virtualization layer. This is another virtual LBA mapping layer within the I/O path. The PITC may not copy all volume information, it may only modify the tables used for remapping.
The snapshot tracks changes to the volume data and provides the ability to view the volume data from a previous point in time. The snapshot performs this function by maintaining a list of incremental writes for each PITC.
The snapshot provides a number of methods for PITC creation, including application-initiated and time-initiated creation. The snapshot gives an application the ability to create PITCs: the application controls creation through an API on the server, which passes the request to the snapshot API. The snapshot also provides the ability to create PITCs on a schedule.
The snapshot may not implement a journaling system or restore all writes to the volume; it may save only the last write to a single address within the PITC window. Snapshots allow users to create PITCs that cover a defined short period of time, such as minutes or hours. To handle failures, the snapshot writes all of this information to disk. The snapshot maintains volume data page pointers covering the incremental writes. Because the table provides the mapping to the volume data, and the volume data is inaccessible without it, the table data must survive controller failure conditions.
The view volume function provides access to a PITC. A view volume may be attached to any PITC within the volume other than the active PITC. Attaching to a PITC is a relatively fast operation. Uses of the view volume functionality include testing, training, backup, and restore. The view volume function allows writes without modifying the underlying PITC on which it is based.
In one embodiment, the snapshot is designed to optimize performance and be easy to use at the expense of disk space:
Snapshots provide quick responses to user requests. User requests include I/O operations, creating a PITC, and creating/deleting a view volume. To this end, the snapshot uses more disk space to store table information than is minimally required. For I/O, the snapshot summarizes the current state of the volume into a single table so that all read and write requests can be satisfied from that single table. Snapshots reduce the impact on normal I/O operations as much as possible. Second, when operating on a view volume, the snapshot uses the same table mechanism as the main volume data path.
Snapshots minimize the amount of data replicated. To do so, the snapshot maintains a pointer table for each PITC. The snapshot copies and moves pointers, but it does not move the data on the volume.
Snapshots manage volumes using fixed-size data pages. Tracking individual sectors may require a large amount of memory for a single reasonably sized volume. By using pages of data that are larger than sectors, some pages may contain a percentage of information that is copied directly from another page.
Snapshots use the data space on the volume to store the data page table. The look-up table is regenerated after a controller failure. The lookup table allocates pages and further subdivides them.
In one embodiment, the snapshot handles controller failures by requiring the volume using the snapshot to operate on a single controller. This embodiment does not require any coherency. All changes to the volume are recorded on disk or to a reliable cache for recovery by the replacement controller. In one embodiment, recovery from a controller failure requires that snapshot information be read from disk.
Snapshots use a virtualized RAID interface to access storage. The snapshot may use multiple RAID devices as a single data space.
Snapshots support 'n' PITCs per volume and 'm' views per volume. The limits on 'n' and 'm' are a function of disk space and controller memory.
Volume and volume allocation/layout
The snapshot adds an LBA remapping layer to the volume. Remapping uses the I/O request LBA and the lookup table to translate the address into a data page. As shown in FIG. 6, a presented volume using snapshots behaves the same as a volume without snapshots: it has a linear LBA space and handles I/O requests. The snapshot uses a RAID interface to perform I/O and includes a plurality of RAID devices in a volume. In one embodiment, the combined size of the RAID devices backing a snapshot volume need not equal the size of the presented volume; the RAID devices allow the snapshot to expand the space for data pages within the volume.
A new volume with snapshots enabled from the start needs only enough space for new data pages. The snapshot does not create a page list to place in the underlying PITC; in this case, the bottom PITC is empty, and at allocation time all PITC pages are on the free list. By creating a volume with snapshots enabled from the start, less physical space may be allocated than the volume presents. The snapshot tracks writes to the volume. In one embodiment of the invention, NULL volumes are not replicated and/or stored in the page pool or matrix, thereby increasing the efficiency of storage space usage.
In one embodiment, for both allocation schemes, the PITC places a virtual NULL volume at the bottom of the list. Reads to NULL volumes return zero blocks. The NULL volume handles sectors that have not been previously written by the server. Writes to NULL volumes are not likely to occur. Volumes use NULL volumes for reads to unwritten sectors.
The number of free pages depends on the size of the volume, the number of PITCs, and the expected rate of data change. The system determines the number of allocated pages for a given volume. The number of data pages may expand over time. The augmentation may support faster data changes, more PITC, or larger volumes than expected. The new page is added to the free list. Adding pages to the free list may occur automatically.
Snapshots use data pages to manage volume space. Each data page may include several megabytes of data. Operating systems tend to write multiple sectors in the same area of a volume. Memory requirements also dictate that the snapshot use pages to manage the volume: maintaining a single 32-bit pointer for each sector of a 1-terabyte volume may require 8 gigabytes of RAM. Different volumes may have different page sizes.
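The memory figure quoted above follows directly from the sector count: a 1-terabyte volume contains 2^31 512-byte sectors, so one 32-bit pointer per sector costs 8 gigabytes, whereas pointers per multi-megabyte page cost only a few megabytes. A quick check, using the 2-megabyte page size mentioned elsewhere in this description:

```python
# Worked check of the table-size arithmetic for a 1-terabyte volume.
volume = 2**40                       # 1 TB
sector, page = 512, 2 * 2**20        # 512-byte sectors vs. 2 MB data pages

per_sector = (volume // sector) * 4  # one 32-bit pointer per sector
per_page = (volume // page) * 4      # one 32-bit pointer per data page

print(per_sector // 2**30, "GB of pointers per sector")   # 8 GB
print(per_page // 2**20, "MB of pointers per page")       # 2 MB
```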
FIG. 7 illustrates one embodiment of a snapshot structure. The snapshot adds a plurality of objects to the volume structure. Other objects include PITC, pointers to active PITC, free list of data pages, sub-view volume, and PITC aggregate object.
The active PITC (AP) pointer is maintained by the volume. The AP handles the mapping of read and write requests to the volume. The AP contains an overview of the current location of all data within the volume.
The free list of data pages tracks the available pages on the volume.
The optional child view volume provides access to the volume PITC. View volumes contain their own AP to record writes to PITC without modifying the underlying data. A volume may support multiple sub-view volumes.
Snapshot aggregate object temporarily links two PITCs for the purpose of removing the previous PITC. Aggregation of PITC involves moving ownership of a data page and releasing the data page.
The PITC contains tables and data pages for pages written when the PITC is active. The PITC contains a freeze timestamp at which point the PITC stops accepting write requests. The PITC also contains a time-to-live value that determines when the PITC will aggregate.
Also, at the time a PITC is taken, the snapshot summarizes the data page pointers for the entire volume in order to provide predictable read and write performance. Other solutions may require a read to check multiple PITCs for the most recent pointer; such solutions require a table caching algorithm and have poor worst-case performance.
The snapshot summary in the present invention also reduces the worst-case memory usage of the table: the entire table may need to be loaded into memory, but only a single table needs to be loaded.
The summary includes pages owned by the current PITC and may include pages from all previous PITCs. To determine which pages the PITC can write, it tracks page ownership for each data page. It also tracks ownership through the aggregation process. To this end, each data page pointer includes a page index.
Fig. 8 illustrates one embodiment of a PITC life cycle. Each PITC goes through a number of the following states before committing as read-only:
1. Create table - at creation time, the table is created.
2. Commit to disk - this allocates on-disk storage for the PITC. Writing the table at this point ensures that the space required to store the table information is allocated before the PITC is taken. At the same time, the PITC object is also committed to disk.
3. Accept I/O - the PITC becomes the active PITC (AP) and now handles read and write requests for the volume. This is the only state that accepts write requests to the table. The PITC generates an event indicating that it is currently active.
4. Commit the table to disk - the PITC is no longer the AP and accepts no further pages; a new AP has taken over. After this point the table does not change unless it is removed in an aggregation operation; it is read-only. At this point the PITC generates an event indicating that it is frozen and has been committed. Any service may listen for this event.
5. Release table memory - the memory required for the table is released. This step also clears the log to record that all changes have been written to disk.
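The life cycle above can be viewed as a small linear state machine; the following sketch uses hypothetical state names that mirror the five steps listed:

```python
# Sketch of the PITC life cycle as a linear state machine (states are
# illustrative names for the five steps listed above).
STATES = ["created", "committed_to_disk", "active", "frozen_read_only",
          "table_memory_released"]

class PITC:
    def __init__(self):
        self.state = "created"

    def advance(self):
        i = STATES.index(self.state)
        if i + 1 < len(STATES):
            self.state = STATES[i + 1]
            print("PITC ->", self.state)   # e.g. an event when it becomes active

p = PITC()
for _ in range(4):
    p.advance()   # created -> ... -> table_memory_released
```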
The top-level PITC of a volume or view volume is referred to as the active PITC (AP). The AP satisfies all read and write requests to the volume. For a volume, the AP is the only PITC that can accept write requests. The AP contains an overview of the data page pointers for the entire volume.
For the aggregation process, the AP may be the destination, not the source. As a destination, the AP increases the number of owned pages, but it does not change the view of the data.
For volume expansion, the AP grows with the volume immediately; the new pages point to the NULL volume. Non-AP PITCs require no modification for volume expansion.
Each PITC maintains a table that maps incoming LBAs to data page pointers on the underlying volume. The table includes pointers to the data pages. The table may need to address more physical disk space than the logical space presented. FIG. 9 illustrates one embodiment of a table structure containing a multi-level index. The structure decodes a volume LBA into a data page pointer. As shown in FIG. 9, each level decodes lower and lower bits of the address. This table structure allows fast lookup and provides the ability to expand the volume. For fast lookup, the multi-level index structure keeps the table shallow, with multiple entries at each level; the index performs an array lookup at each level. To support volume expansion, the multi-level index structure allows additional layers to be added. In general, volume expansion is an expansion of the LBA count presented to higher layers, rather than an expansion of the actual amount of storage space allocated for the volume.
The multi-level index contains an overview of the entire volume data page remapping. Each PITC contains a complete remapping list of the volume at the point in time that the PITC was committed.
The multi-level index structure uses different entry types for each level of the table. Different entry types support the need to read information from disk and store information in memory. The underlying entry may contain only a data page pointer. The top and middle level entries contain two arrays, one for the LBAs of the next level table entry, and the other for the memory pointers to the table.
When the presented volume size is expanded, the size of the previous PITC tables need not be increased, and these tables need not be modified. Because those tables are read-only, the information in them does not change; the expansion process adds NULL page pointers to the end of the AP's table. The snapshot does not present tables from previous PITCs directly to the user.
I/O operations require the table to map LBAs to data page pointers. The I/O then multiplies the data page pointer by the data page size to obtain the LBA of the underlying RAID. In one embodiment, the data page size is a power of 2.
This table provides the API to remap LBAs, add pages, and aggregate tables.
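A sketch of that lookup path is shown below: the volume LBA is decoded level by level into a data page pointer, and the pointer is then multiplied by the page size (a power of two) to obtain the LBA on the underlying RAID. The level widths and the 2-megabyte page size are illustrative assumptions:

```python
# Sketch of a three-level PITC table lookup followed by the RAID LBA computation.
# Page size and index widths are illustrative assumptions (2 MB pages, 512 B sectors).
PAGE_SECTORS = (2 * 2**20) // 512          # sectors per data page = 4096
L2_BITS = L1_BITS = 10                     # entries per middle/bottom level

def lookup(table, volume_lba):
    page_index = volume_lba // PAGE_SECTORS        # which volume page
    offset = volume_lba % PAGE_SECTORS             # sector offset inside the page
    top = page_index >> (L1_BITS + L2_BITS)        # top level decodes the high bits
    mid = (page_index >> L1_BITS) & ((1 << L2_BITS) - 1)
    bot = page_index & ((1 << L1_BITS) - 1)        # bottom level holds page pointers
    data_page = table[top][mid][bot]               # data page pointer (or NULL)
    if data_page is None:
        return None                                # unwritten: served by the NULL volume
    return data_page * PAGE_SECTORS + offset       # LBA on the underlying RAID

table = {0: {0: {0: 7, 1: None}}}                  # toy table: volume page 0 -> RAID page 7
print(lookup(table, 100))                          # 7*4096 + 100 = 28772
print(lookup(table, 5000))                         # None (NULL volume returns zeros)
```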
The snapshot uses data pages to store PITC objects and LBA mapping tables. The table directly accesses the RAID interface for I/O to its table entries. The table minimizes modifications when reading and writing the table to the RAID device: table information can be read and written directly into the table entry structure without modification, which reduces the number of copies required for I/O.
The snapshot may use a change log to prevent the creation of hot spots on disk. A hot spot is a location that is reused to track updates to a volume. The change log records updates to the PITC table and to the free lists of volumes. During the recovery process, the snapshot uses the change log to recreate the AP and the free list in memory. FIG. 10 illustrates one embodiment of recovery of a table, clarifying the relationship between the AP in memory, the AP on disk, and the change log; it also shows the same relationship for the free list. The in-memory AP table may be reconstructed from the AP on disk and the log: for any controller failure, the AP in memory is rebuilt by reading the AP on disk and applying the change log to it.
Depending on the system configuration, the change log uses different physical resources. For multi-controller systems, the change log relies on battery-backed cache memory for storage. The use of cache memory allows snapshots to reduce the number of table writes to disk while still maintaining data integrity; the change log is copied to the backup controller for recovery. For a single-controller system, the change log writes all the information to disk. This has the side effect of creating a hot spot on the disk at the location of the log, but it allows multiple changes to be written to a single device block.
Periodically, snapshots write PITC tables and free lists to disk, creating checkpoints in the log and clearing checkpoints. The period varies depending on the number of updates to the PITC. The aggregate process does not use the change log.
Snapshot data page I/O may require that the request fit within a data page boundary. If the snapshot encounters an I/O request that crosses a page boundary, it splits the request. It then passes the request down to the request handler. The write and read portions assume that the I/O fits within the page boundary. The AP provides LBA remapping to satisfy the I/O request.
The AP satisfies all write requests. Snapshots support two different write sequences for owned and non-owned pages. Different write sequences allow adding pages to the table. FIG. 11 illustrates one embodiment of a write process having a sequence of owned pages and a sequence of non-owned pages.
For the owned-page sequence, the process includes the following steps:
1) Find the table mapping; and
2) Write - remap the LBA through the page and write the data to the RAID interface.
A write to a previously written page is a simple write request: the snapshot writes the data to the page, overwriting the current contents. Only data pages owned by the AP are written; pages owned by other PITCs are read-only.
For a non-owned page sequence, the process includes the following:
1) Find the table mapping;
2) Read the previous page - perform a read of the data page so that the write request and the read data together constitute a complete page. This is the start of the copy-on-write process.
3) Combine data - place the read page data and the write request payload in a single contiguous block.
4) Free list allocation - get a new data page pointer from the free list.
5) Write - the combined data is written to the new data page.
6) Log - the information for the new page is committed to the log.
7) Update table - change the LBA remapping in the table to reflect the new data page pointer. The data page is now owned by this PITC.
Adding a page may require blocking read and write requests until the page is added to the table. The snapshot implements controller coherency by writing table updates to disk and saving multiple cached copies for the log.
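The copy-on-write sequence for a non-owned page can be condensed as in the sketch below; the read, write, and log callables are placeholders for the RAID interface and change log, and the in-memory usage at the end is only a toy check:

```python
# Sketch of the non-owned-page (copy-on-write) write sequence from the steps above.
def cow_write(ap_table, owned, free_list, lba, new_data, read_page, write_page, log):
    old_page = ap_table[lba]
    if old_page in owned:                        # owned page: overwrite in place
        write_page(old_page, new_data)
        return old_page
    full_page = read_page(old_page)              # 2) read the previous page
    combined = new_data + full_page[len(new_data):]   # 3) combine into a full page
    new_page = free_list.pop()                   # 4) allocate from the free list
    write_page(new_page, combined)               # 5) write the combined data
    log(("add", lba, new_page))                  # 6) commit the new page to the log
    ap_table[lba] = new_page                     # 7) update the table mapping
    owned.add(new_page)                          # the AP now owns this data page
    return new_page

# Toy check with in-memory stand-ins for the RAID interface and change log.
pages = {1: b"OLDDATA!", 9: b"........"}
table, owned, free = {0: 1}, set(), [9]
cow_write(table, owned, free, 0, b"NEW",
          read_page=lambda p: pages[p],
          write_page=lambda p, d: pages.__setitem__(p, d),
          log=print)
print(table, pages[table[0]])   # {0: 9} b'NEWDATA!'
```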
The AP fulfills all read requests. Using the AP table, the read request's LBAs are remapped to the LBAs of the data pages, and the remapped LBAs are passed to the RAID interface to satisfy the request. A volume may fulfill read requests for data pages that have not previously been written to the volume; these pages are marked with the NULL address (all 1s) in the PITC table, and requests for this address are satisfied by the NULL volume, which returns a constant data pattern. A read request that crosses page boundaries may be satisfied by pages owned by different PITCs.
Snapshots use NULL volumes to satisfy read requests for previously unwritten data pages. It returns all 0 s for each sector read. It does not have RAID devices or allocated space. It is contemplated that blocks of all 0's are kept in memory to meet the data requirements of a read to a NULL volume. All volumes share a NULL volume to satisfy read requests.
In one embodiment, the aggregation process removes PITC and some of its own pages from the volume. Removing the PITC creates more available space to track new discrepancies. The aggregate compares the differences for two adjacent tables and only saves the newer differences. Aggregation occurs periodically or manually, depending on user configuration.
The process may include two PITCs, a source and a destination. In one embodiment, the rules for a qualified object are as follows:
1) The source must precede the destination - the source must have been created before the destination.
2) The destination may not simultaneously be a source.
3) A source may not be referenced by multiple PITCs. Multiple references occur when a view volume is created from a PITC.
4) The destination may support multiple references.
5) An AP may be a destination but not a source.
The aggregation process writes all changes to disk and does not require any coherency. If the controller fails, the volume recovers the PITC information from the disk and resumes the aggregation process.
The process tags the two PITCs for aggregation and comprises the following steps:
1) the source state is set to aggregation source: this state is committed to disk for failure recovery. From this point on the source is no longer accessed, because its data pages may become invalid; each such data page is either returned to the free list or has its ownership transferred to the destination;
2) the destination state is set to aggregation destination: this state is committed to disk for controller failure recovery;
3) load and compare the tables: the process moves the data page pointers, and freed data pages are immediately added to the free list;
4) the destination state is set to normal: the process is complete;
5) adjust the list: update the neighboring PITC's pointer that referenced the source so that it points to the destination, effectively removing the source from the list; and
6) release the source: return any data pages used for its control information to the free list.
The above process combines two PITCs. One skilled in the art will appreciate that aggregation can be designed to remove multiple PITCs, using multiple sources, in a single pass. A sketch of the two-PITC case follows.
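The following sketch, under the same illustrative assumptions as the earlier write-path example, shows one way the table comparison could be carried out; the state strings, the previous pointer and the control_pages field are assumptions, including the assumption that each PITC's previous pointer refers to the next older PITC.

    def aggregate(source, destination, free_list):
        """Coalesce the source PITC into the destination, keeping only the newer pages."""
        source.state = "AGGREGATE_SOURCE"            # committed to disk in the real system
        destination.state = "AGGREGATE_DESTINATION"  # likewise committed for failure recovery
        for page_index, src_entry in list(source.table.items()):
            if page_index in destination.table:
                # The destination already holds a newer page; free the older source page.
                free_list.append(src_entry.pointer)
            else:
                # The source page is still current; ownership transfers to the destination.
                destination.table[page_index] = src_entry
        destination.state = "NORMAL"                 # aggregation complete
        destination.previous = source.previous       # unlink the source from the PITC list
        free_list.extend(source.control_pages)       # release the source's control information pages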
As shown in FIG. 2, a page pool maintains a free list of data pages for use by all volumes associated with the page pool. The free list manager uses data pages from the page pool to commit the free list to persistent storage. Updates to the free list come from more than one source: the write process allocates pages, the control page manager allocates pages, and the aggregation process returns pages.
The free list maintains a trigger that automatically expands it when a threshold is reached. The trigger adds pages to the free list using the page pool expansion method. Whether automatic expansion is allowed may be determined by volume policy: more important data volumes are allowed to expand, while less important volumes are forced to aggregate.
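A minimal sketch of such a threshold trigger is shown below; the expand_page_pool callback and the 10% low-water mark are assumptions chosen for illustration, and volume policy enforcement is reduced to the callback possibly returning no new pages.

    class FreeList:
        """Free data-page list that asks the page pool to expand when it runs low."""

        def __init__(self, pages, expand_page_pool, low_water=0.10):
            self.pages = list(pages)
            self.capacity = len(self.pages)
            self.expand_page_pool = expand_page_pool  # page pool expansion method
            self.low_water = low_water                # trigger threshold, as a fraction of capacity

        def allocate(self):
            if len(self.pages) <= self.low_water * self.capacity:
                added = self.expand_page_pool()       # volume policy may return no new pages
                self.pages.extend(added)
                self.capacity += len(added)
            return self.pages.pop()

        def free(self, page):
            self.pages.append(page)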
The view volume provides access to a previous point in time and supports normal volume I/O operations. A PITC tracks the differences written before it, and the view volume allows the user to access the information contained within that PITC. The view volume branches from the PITC. View volumes support restore, test, backup operations, and the like. Creation of a view volume occurs almost instantaneously, since it does not require a copy of the data. The view volume may require its own AP to support writes to the view volume.
A view taken from the current state of the volume copies the current volume AP. Using its own AP, the view volume allows writes to the view volume without modifying the underlying data. The OS may require a file system or file rebuild to use the data. The view volume allocates space for its AP and for written data pages from the parent volume, and it has no associated RAID device information of its own. Deleting the view volume frees that space back to the parent volume.
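The following sketch illustrates why view volume creation is nearly instantaneous: only the PITC's table of page pointers is copied, never the data pages themselves. The dictionary layout and field names are assumptions for the example.

    import copy

    def create_view_volume(parent_volume, pitc):
        """Branch a writable view from a PITC without copying any data pages."""
        return {
            "parent": parent_volume,                # new writes allocate pages from the parent volume
            "ap_table": copy.deepcopy(pitc.table),  # page pointers only; the data itself is not copied
            "raid_devices": None,                   # a view volume has no RAID device information of its own
        }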
FIG. 12 illustrates exemplary snapshot operation, showing the transitions of a volume having 10 pages. Each state includes the read request fulfillment list for the volume; the shaded blocks indicate owned data page pointers.
The transition from the left side of the figure (i.e., the initial state) to the middle of the figure shows writes to pages 3 and 8. A write to page 3 requires a change to PITC I (the AP). PITC I follows the new page write process to add page 3 to its table: the PITC reads the unchanged information from page J and stores the page using drive page B. All future writes to page 3 in this PITC can be handled without moving the page. The write to page 8 shows the second case for writing to a page: because PITC I already contains page 8, PITC I overwrites that portion of the data in page 8; in this case the data resides on drive page C.
The transition from the middle of the figure to the right side of the figure (i.e., the final state) shows the aggregation of PITC II and PITC III. Snapshot aggregation removes the older duplicate pages while still maintaining all of the changes held in the two PITCs. Both PITCs contain page 3 and page 8; the process keeps the newer pages from PITC II and frees the corresponding pages from PITC III, returning drive pages A and D to the free list.
The snapshot allocates data pages from the page pool for storing the free list and PITC table information. Control page allocation uses a secondary allocation within the data pages to match the sizes required by these objects.
The volume contains a page pointer to the top of the control page information. From this page, all other information can be read.
The snapshot keeps track of the number of pages in use over a certain time interval. This allows the snapshot to predict when the user needs to add more physical disk space to the system, preventing the snapshot from running out of space.
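One simple way to express such a prediction is a linear projection over the tracked samples, as in the sketch below; the sampling interval and the projection method are assumptions made for illustration, not a description of the actual algorithm.

    def days_until_exhaustion(samples, total_pages, interval_days=1.0):
        """Project, from periodic usage samples, how long until the snapshot runs out of pages."""
        if len(samples) < 2:
            return None                             # not enough history to predict
        growth_per_day = (samples[-1] - samples[0]) / ((len(samples) - 1) * interval_days)
        if growth_per_day <= 0:
            return None                             # usage is flat or shrinking
        return (total_pages - samples[-1]) / growth_per_day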
Data progression
In one embodiment of the invention, data progression (DP) is used to gradually move data to storage space of an appropriate cost. The present invention allows the user to add drives only when they are actually needed, which significantly reduces the overall cost of the disk drives.
Data progression moves non-recently accessed data and historical snapshot data to less expensive storage. For non-recently accessed data, it gradually reduces the cost of storage for any page that has not been accessed recently; it need not move the data immediately to the lowest-cost storage. For historical snapshot data, it moves read-only pages to more efficient storage, such as RAID 5, and, if a page is no longer accessible by the volume, to the least expensive storage.
Other advantages of data progression according to the present invention include maintaining fast I/O access to currently accessed data and reducing the need to purchase fast but expensive disk drives.
In operation, data progression uses the cost of the physical media and the efficiency of the RAID devices used for data protection to determine the cost of storage. It also determines storage efficiency and moves data accordingly. For example, data progression may convert RAID 10 space into a RAID 5 device in order to use physical disk space more efficiently.
Data progression defines accessible data as data that can currently be read or written by a server. It uses accessibility to determine the storage class a page should use. If a page belongs to a historical PITC, it is read-only. If the server has not updated the page in the most recent PITC, the page is nevertheless still accessible.
FIG. 17 illustrates one embodiment of data page accessibility in a data progression operation. The data pages are divided into the following categories:
Accessible, recently accessed: the most actively used read-write pages of the volume.
Accessible, not recently accessed: read-write pages that have not been used recently.
Historical, accessible: read-only pages that are still readable by the volume; applies to snapshot volumes.
Historical, non-accessible: read-only pages that are no longer accessible by the volume; applies to snapshot volumes. Snapshot maintains these pages for recovery purposes, and they are typically placed on the lowest-cost storage available.
In FIG. 17, three PITCs of a snapshot volume, each owning different pages, are shown. A dynamic capacity volume is represented by PITC C alone; all of its pages are accessible and read-write, although they may have different access times.
The following table lists various storage devices in order of increasing efficiency, or decreasing monetary cost. The list is also in approximate order of progressively slower write I/O access. Data progression calculates efficiency as the logically protected space divided by the total physical space of the RAID device.
Table 1: RAID type
RAID 5 efficiency increases as the number of drives in a stripe increases; so does the failure domain (fault domain), and so does the minimum number of disks necessary to create a RAID device. In one embodiment, data progression does not use RAID 5 stripe sizes larger than 9 drives, because of the increased failure domain size and the limited additional efficiency. Data progression uses a RAID 5 stripe size that is an integer multiple of the snapshot page size; this allows full-stripe writes when moving pages to RAID 5, making the move more efficient. For data progression purposes, all RAID 5 configurations have the same write I/O characteristics. Some combinations are undesirable; for example, RAID 5 on 2.5 inch FC disks may not be able to use the performance of those disks effectively. To prevent such combinations, data progression supports the ability to prevent certain RAID types from running on certain disk types. The data progression configuration may also prevent the system from using RAID 10 or RAID 5 space.
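The efficiency figure used here, the logically protected space divided by the total physical space, can be computed per RAID type as in the following sketch; the function name is an assumption, and the 3-to-9 drive bounds for RAID 5 stripes reflect the constraints discussed in this embodiment.

    def raid_efficiency(raid_type, stripe_drives=None):
        """Usable fraction of physical space: logically protected space over total physical space."""
        if raid_type == "RAID10":
            return 0.5                              # every block is mirrored
        if raid_type == "RAID5":
            if not stripe_drives or stripe_drives < 3:
                raise ValueError("RAID 5 needs at least 3 drives in a stripe")
            if stripe_drives > 9:
                raise ValueError("stripes wider than 9 drives are not used in this embodiment")
            return (stripe_drives - 1) / stripe_drives  # one parity drive per stripe
        raise ValueError("unknown RAID type")

    # For example, raid_efficiency("RAID5", 5) returns 0.8, versus 0.5 for RAID 10.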
The disk types are shown in the following table:
Table 2: Disk type
Data progression includes the ability to automatically classify disk drives relative to the other drives within the system. The system examines each disk to determine its performance relative to the other disks in the system: faster disks are placed in a higher value class and slower disks in a lower value class. When a disk is added to the system, the system automatically rebalances the value classes of the disks. This approach handles both systems that never change and systems that change frequently as new disks are added. Automatic classification may place multiple disk types in the same value class; if drives are determined to be close enough in value, they may share a class.
In one embodiment, the system includes the following drives:
High - 10K FC drives
Low - SATA drives
With the addition of 15K FC drives, data progression automatically reclassifies the disks and demotes the 10K FC drives. This results in the following classification:
High - 15K FC drives
Middle - 10K FC drives
Low - SATA drives
In another embodiment, the system may contain only Fibre Channel drives:
High - 15K FC drives
Low - 10K FC drives
Here the 10K FC drives fall into the lower value class, even though the same drives occupy the higher value class in the first system above.
If SATA drives are then added to this system, data progression automatically reclassifies the disks, resulting in the following classification (a sketch of the classification logic follows the list):
High - 15K FC drives
Middle - 10K FC drives
Low - SATA drives
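One way to realize this automatic value classification is sketched below; representing each disk by a single relative performance score and the 10% merge tolerance are assumptions made for illustration.

    def classify_disks(disks, merge_tolerance=0.10):
        """Group disks into value classes by relative performance; faster is higher value."""
        if not disks:
            return []
        ranked = sorted(disks, key=lambda d: d["performance"], reverse=True)
        classes, current = [], [ranked[0]]
        for disk in ranked[1:]:
            top = current[0]["performance"]
            if (top - disk["performance"]) / top <= merge_tolerance:
                current.append(disk)        # close enough in value: same classification
            else:
                classes.append(current)
                current = [disk]
        classes.append(current)
        return classes                      # index 0 is the highest value class

Re-running the classification whenever a disk is added gives the rebalancing behavior described above, for example demoting 10K FC drives from the high class to the middle class when 15K FC drives arrive.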
Data progression may include waterfall progression. Typically, waterfall progression moves data to a less expensive resource only when the current resource is fully utilized. Waterfall progression effectively maximizes the use of the most expensive system resources, and it also minimizes the cost of the system. Adding inexpensive disks to the lowest pool creates a larger pool at the bottom.
A typical waterfall progression uses RAID 10 space and then the next RAID space, such as RAID 5, which forces the waterfall to proceed directly to RAID 10 on the next type of disk. Alternatively, data progression may use a hybrid RAID waterfall progression as shown in FIG. 24. This alternative approach addresses maximizing both disk space and performance, and allows storage to be converted to a more efficient form within the same disk class. It also supports the requirement that RAID 10 and RAID 5 share the total resources of a disk class, which may require configuring a fixed percentage of the disk space of a class that each RAID level may use. Thus, this alternative approach maximizes the use of expensive storage while allowing space for another RAID class to coexist.
The hybrid RAID waterfall approach moves pages to less expensive storage only when storage is limited. A threshold, such as a percentage of total disk space, limits the amount of storage of a certain RAID type; this maximizes the use of the most expensive storage in the system. Data progression automatically moves pages to lower-cost storage as a storage class approaches its limit, and it keeps a buffer available for write peaks.
It will be appreciated that the waterfall approach may also move pages immediately to the lowest-cost storage, since in some cases historical and non-accessible pages need to move to less expensive storage in a timely manner. Historical pages can therefore be moved to less expensive storage immediately.
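The threshold behavior of the hybrid waterfall can be sketched as follows; the ordering of storage classes, the dictionary-based accounting and the 90% limit are assumptions for the example.

    # Storage classes ordered from most expensive to least expensive (assumed ordering).
    STORAGE_ORDER = ["RAID10-FC", "RAID5-FC", "RAID10-SATA", "RAID5-SATA"]

    def demotion_target(page_class, used_pages, capacity_pages, limit=0.90):
        """Return the class a page should move to when its current class nears its limit."""
        index = STORAGE_ORDER.index(page_class)
        if used_pages[page_class] < limit * capacity_pages[page_class]:
            return page_class                      # plenty of room: leave the page where it is
        for cheaper in STORAGE_ORDER[index + 1:]:
            if used_pages[cheaper] < limit * capacity_pages[cheaper]:
                return cheaper                     # first less expensive class with room
        return page_class                          # everything is full; more disks are needed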
FIG. 18 shows a flow diagram of a data progression process 1800. Data progression continuously checks the access pattern and storage cost of each page in the system to determine whether there is data to move. It may also determine whether a storage class has reached its maximum allocation.
The data progression process determines whether a page is accessible by any volume. The process checks the PITC of each volume attached to the history to determine whether the page is referenced. If the page is being actively used, it is eligible for promotion or for slow demotion. If the page is not accessible by any volume, it is moved to the lowest-cost storage available. Data progression also factors in the time before a PITC expires: if a scheduled snapshot PITC is about to expire, its pages are not progressed, unless the page pool is operating in aggressive mode, in which case they may be.
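The eligibility test of FIG. 18 can be summarized in a sketch such as the following; the field names, the references method on a volume and the handling of the aggressive-mode flag are assumptions rather than the actual flow.

    def page_action(page, volumes, pitc_about_to_expire, aggressive_mode):
        """Decide what data progression may do with a single page."""
        if pitc_about_to_expire and not aggressive_mode:
            return "leave"                     # the owning PITC is about to expire
        accessible = any(volume.references(page) for volume in volumes)
        if not accessible:
            return "move_to_lowest_cost"       # historical, non-accessible page
        if page.recently_accessed:
            return "promote_or_hold"           # active page: keep it on fast storage
        return "demote_slowly"                 # accessible but not recently used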
Recent access detection in data progression needs to eliminate bursts of activity so that they do not promote a page. Data progression tracks read and write access separately, which allows it to keep data that is only read, for example by virus scanning or reporting operations, on accessible RAID 5 devices. If storage is in short supply, data progression changes what qualifies as recent access, allowing it to demote pages more aggressively. This also helps fill the system from the bottom up when storage is short.
Data progression can aggressively move data pages when system resources become scarce. In all of these cases, more disks or a configuration change are eventually still necessary; data progression merely lengthens the amount of time the system can continue to operate in such a condition. Data progression attempts to keep the system running for as long as possible, and continues until all of its storage classes are exhausted.
When both RAID 10 space and total available disk space are in short supply, data progression may convert RAID 10 disk space to more efficient RAID 5 space. This increases the overall capacity of the system at the expense of write performance; more disks are still eventually necessary. If a particular storage class is fully used, data progression allows borrowing from a class that would not normally be acceptable in order to keep the system running. For example, if a volume is configured to use RAID 10-FC for its accessible information, it may allocate pages from RAID 5-FC or RAID 10-SATA until more RAID 10-FC space is available.
Data progression also supports compression to increase the perceived capacity of the system. Compression may be used only for historical pages that are not accessed, or as storage for recovery information. Compressed storage appears as another storage type near the bottom of the storage cost hierarchy.
As shown in FIG. 25, the page pool basically contains free lists and device information. The page pool needs to support multiple free lists, an enhanced page allocation scheme, and classification of the free lists. The page pool maintains a separate free list for each class of storage. The allocation scheme allows pages to be allocated from one of multiple free lists while honoring a minimum or maximum allowed class. The classification of each free list comes from the device configuration. Each free list provides its own counters for statistics gathering and display, and also provides RAID device efficiency information for the aggregated storage efficiency statistics.
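A sketch of per-class free lists with a bounded allocation scheme appears below; the list-of-lists layout and the convention that index 0 is the most expensive class are assumptions for illustration.

    class PagePool:
        """Page pool with one free list per storage class (index 0 = most expensive class)."""

        def __init__(self, free_lists):
            self.free_lists = free_lists              # list of lists of free data page pointers

        def allocate(self, min_class=0, max_class=None):
            """Allocate from the best allowed class, falling back to cheaper classes."""
            max_class = len(self.free_lists) - 1 if max_class is None else max_class
            for cls in range(min_class, max_class + 1):
                if self.free_lists[cls]:
                    return cls, self.free_lists[cls].pop()
            raise RuntimeError("no free pages in the allowed storage classes")

        def counters(self):
            """Per-class free page counts, as used for statistics gathering and display."""
            return [len(free_list) for free_list in self.free_lists]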
In one embodiment, the device list may require additional capability to track the storage class and cost of each device; the combination determines the class of the storage. This may change if the user wishes more or less granularity in the configured classes.
FIG. 26 illustrates one embodiment of a high-performance database in which all accessible data, even if not recently accessed, resides only on 2.5 inch FC drives. The non-accessible historical data is moved to RAID 5 Fibre Channel storage.
FIG. 27 illustrates one embodiment of an MRI image volume in which the accessible storage for the dynamic volume is SATA RAID 10 and RAID 5. If an image has not been recently accessed, it is moved to RAID 5; new writes initially go to RAID 10.

FIG. 19 illustrates one embodiment of a compressed page layout. Data progression achieves compression by making a secondary allocation within a fixed-size data page. The secondary allocation information tracks the location of the free portion and the allocated portion of the page. Data progression cannot predict the efficiency of compression and therefore handles variable-size pages within its secondary allocation.
Compressed pages can significantly impact CPU performance. A write to a compressed page would require the entire page to be decompressed and recompressed; accordingly, pages that are actively accessed are not compressed, and a compressed page that is written returns to its uncompressed state. Such writes should only be necessary in cases where storage is extremely limited.
The PITC remapping table points to the secondary allocation information and is marked to indicate a compressed page. Accessing a compressed page may require a higher I/O count than a non-compressed page: the access may require a read of the secondary allocation information to retrieve the location of the actual data, after which the compressed data is read from disk and decompressed on the processor.
Data progression may require that compression be able to decompress portions of a page independently; this allows a read access to decompress only a small part of the page. The read-ahead feature of the read cache may help, since a single decompression can then serve multiple server I/Os. Data progression marks pages that are not good candidates for compression so that it does not have to attempt to compress them repeatedly.
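The secondary allocation for a compressed page might look like the following sketch, which uses zlib as a stand-in compressor and a 2 MB page size as an assumption; the partial-page decompression mentioned above is omitted for brevity.

    import zlib

    PAGE_SIZE = 2 * 1024 * 1024          # assumed snapshot data page size

    def compress_into_page(raw_page):
        """Compress a data page and record the secondary allocation inside the fixed-size page."""
        packed = zlib.compress(raw_page)
        if len(packed) >= len(raw_page):
            return None                  # poor candidate: mark it so compression is not retried often
        return {
            "allocated_bytes": len(packed),            # portion of the page holding compressed data
            "free_bytes": PAGE_SIZE - len(packed),     # remaining free portion of the page
            "payload": packed,
        }

    def read_compressed(entry):
        """Read access: fetch the secondary allocation and decompress it on the processor."""
        return zlib.decompress(entry["payload"])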
FIG. 20 illustrates one embodiment of data progression in an advanced disk drive system in accordance with the principles of the present invention. Data progression does not change the external behavior of a volume or the operation of the data path. It may, however, require modification of the page pool as described above: multiple free lists (one per storage class), an enhanced page allocation scheme honoring a minimum or maximum allowed class, classification of the free lists from the device configuration, and per-list counters and RAID efficiency information for statistics gathering and display.
The PITC identifies candidates for movement and blocks I/O to an accessible page while that page is being moved. Data progression continually checks the PITCs for candidates. The accessibility of pages changes constantly because of server I/O, new snapshot page updates, and view volume creation and deletion. Data progression also continually checks for volume configuration changes and summarizes the current page classes and counts; it evaluates this summary to determine whether there are pages that need to be moved.
Each PITC presents a counter of the number of pages it stores in each class. Data progression uses this information to identify, when a threshold is reached, the PITCs that are good candidates for page movement.
RAID allocates devices from a group of disks based on disk cost. RAID also provides an API to retrieve the efficiency of a device or potential device, and it needs to return information about the number of I/Os required for a write operation. Data progression may also require a RAID NULL level in order to use a third-party RAID controller as part of data progression. RAID NULL consumes an entire disk and is essentially a pass-through layer.
The disk manager may also automatically determine and store the disk classification. Automatically determining the disk class may require a change to the SCSI initiator.
From the above description and drawings, it will be appreciated by those of ordinary skill in the art that the specific embodiments shown and described are for purposes of illustration only and are not intended to limit the scope of the present disclosure. Those skilled in the art will recognize that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. References to details of particular embodiments are not intended to limit the scope of the invention.

Claims (8)

1. A disk drive system capable of dynamically allocating data in a storage pool, the system comprising:
a RAID subsystem containing the storage pool;
a plurality of virtual volumes comprising disk space blocks from a RAID subsystem, wherein at least a portion of the plurality of virtual volumes each comprise disk space blocks from a plurality of RAID devices; and
a disk manager including at least one disk storage system controller;
wherein the disk manager is configured to:
maintain a list of empty disk space blocks for the plurality of virtual volumes;
dynamically allocate a block of disk space; and
write data to the allocated block of disk space.
2. The system of claim 1, wherein the disk manager manages a plurality of disk storage system controllers.
3. The system of claim 2, further comprising a plurality of redundant disk storage system controllers to mask failure of an operating disk storage system controller.
4. The system of claim 1, wherein the RAID subsystem further comprises at least one of a RAID type of RAID-0, RAID-1, RAID-5, and RAID-10.
5. The system of claim 4, further comprising RAID types such as RAID-3, RAID-4, RAID-6, and RAID-7.
6. A method of dynamically allocating data in a RAID storage system, the method comprising the steps of:
generating a plurality of virtual volumes comprising disk space blocks from a plurality of RAID devices, wherein at least a portion of the plurality of virtual volumes each comprise disk space blocks from a plurality of RAID devices;
managing a page pool of storage that maintains a list of empty disk space blocks for a plurality of virtual volumes;
dynamically allocating disk space blocks of the plurality of virtual volumes using the page pool of storage; and
writing data to the allocated blocks of disk space.
7. The method of claim 6, further comprising setting the size of the disk space blocks to a default size that is changeable by a user.
8. The method of claim 6, wherein the plurality of virtual volumes comprise disk space blocks from at least one of a plurality of RAID types, such as RAID-0, RAID-1, RAID-5, and RAID-10.
HK07100663.9A 2003-08-14 2004-08-13 Virtual disk drive system and method HK1093481B (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US49520403P 2003-08-14 2003-08-14
US60/495,204 2003-08-14
PCT/US2004/026499 WO2005017737A2 (en) 2003-08-14 2004-08-13 Virtual disk drive system and method
US10/918,329 2004-08-13
US10/918,329 US7613945B2 (en) 2003-08-14 2004-08-13 Virtual disk drive system and method

Publications (2)

Publication Number Publication Date
HK1093481A1 HK1093481A1 (en) 2007-03-02
HK1093481B true HK1093481B (en) 2009-07-31
