WO1996041249A2 - Intelligent disk-cache memory - Google Patents
- Publication number
- WO1996041249A2 / WO9641249A2 (PCT/US1996/006520, US9606520W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- memory
- data
- disk
- cache
- memory bank
- Prior art date
Links
Classifications
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C29/00—Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
- G11C29/70—Masking faults in memories by using spares or by reconfiguring
- G11C29/74—Masking faults in memories by using spares or by reconfiguring using duplex memories, i.e. using dual copies
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1076—Parity data used in redundant arrays of independent storages, e.g. in RAID systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/2015—Redundant power supplies
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0804—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with main memory updating
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0866—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2211/00—Indexing scheme relating to details of data-processing equipment not covered by groups G06F3/00 - G06F13/00
- G06F2211/10—Indexing scheme relating to G06F11/10
- G06F2211/1002—Indexing scheme relating to G06F11/1076
- G06F2211/1009—Cache, i.e. caches used in RAID system with parity
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2211/00—Indexing scheme relating to details of data-processing equipment not covered by groups G06F3/00 - G06F13/00
- G06F2211/10—Indexing scheme relating to G06F11/10
- G06F2211/1002—Indexing scheme relating to G06F11/1076
- G06F2211/109—Sector level checksum or ECC, i.e. sector or stripe level checksum or ECC in addition to the RAID parity calculation
Definitions
- the present invention relates to methods and apparatus for computer disk subsystems and more specifically to cache memories suitable for computer disk subsystems.
- Computer disk subsystems generally include the media on which data are stored, plus one or more disk-subsystem controllers, one or more memories for caching and/or buffering data being transferred between the system and the disk devices, and an interface to the main computer or computers.
- the media holding data within a computer disk subsystem can include magnetic, optical, or other disk-storage technologies, and often also include tape or other types of removable data storage, particularly for backup and interchange of the data stored on the disk-storage technologies.
- the memories in a computer can include static random access memories (SRAMs), dynamic random access memories (DRAMs), or dual-ported static random-access memories (DPSRAMs).
- “Caching” in these memories is defined as storing data which, it is anticipated, the main computer will request in the near future (this is also called “read caching”). Read caching can involve holding read data which was requested in anticipation that it will be requested again in the future, or reading ahead of what was requested and holding the read-ahead data in anticipation that sequential data will be requested in the future.
- “Write caching” is defined as storing data, which has been sent from the main computer for storage on the disk devices, into the memories.
- “Read buffering” in these memories is defined as storing data which has already been requested by the main computer, but which comes from the disk devices at a different rate than the rate at which the data are transferred to the main computer system; the buffering accommodates the different transfer rates (typically, storing data as they are relatively slowly read and gathered from the disk devices, then quickly transferring blocks of data to the main computer).
- “Write buffering” accommodates data which are being transferred from the system to the disk devices, and which come from the system at a different rate (typically a quicker rate) than the rate of transfer to the disk devices. Read buffering and write buffering are often performed in the disk devices themselves.
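The rate-matching role of such a buffer can be pictured as a small ring buffer: the slower side deposits data at its own pace while the faster side drains it in larger bursts. The sketch below is a generic illustration only, assuming a fixed 4 KB buffer; the `ring_buffer` type and function names are not taken from the patent.

```c
#include <stddef.h>

/* A fixed-size ring buffer: the disk side deposits bytes at its own pace,
 * and the host side drains them in larger bursts (or vice versa for writes). */
typedef struct {
    unsigned char data[4096];
    size_t head;   /* next byte to write   */
    size_t tail;   /* next byte to read    */
    size_t count;  /* bytes currently held */
} ring_buffer;

/* Producer side: store up to 'len' bytes, return how many actually fit. */
size_t ring_put(ring_buffer *rb, const unsigned char *src, size_t len)
{
    size_t stored = 0;
    while (stored < len && rb->count < sizeof rb->data) {
        rb->data[rb->head] = src[stored++];
        rb->head = (rb->head + 1) % sizeof rb->data;
        rb->count++;
    }
    return stored;
}

/* Consumer side: drain up to 'len' bytes, return how many were available. */
size_t ring_get(ring_buffer *rb, unsigned char *dst, size_t len)
{
    size_t taken = 0;
    while (taken < len && rb->count > 0) {
        dst[taken++] = rb->data[rb->tail];
        rb->tail = (rb->tail + 1) % sizeof rb->data;
        rb->count--;
    }
    return taken;
}
```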
- the main computer must wait until the disk subsystem has completed a write operation and has also indicated this completion to the main computer before the main computer can proceed with other operations.
- the completion indication can be sent to the main computer once the data are successfully written into a data cache, and before the data have actually been written to the disk surface. There is a danger of data loss if the data are transferred into the cache and the corresponding completion indication is sent to the main computer, but later some subsequent error (such as a parity error in the data, a disk-subsystem-controller failure, or a loss of power to the subsystem) prevents the data from ever being written to the disk.
- the improved cache should improve performance and be able to recover data in case of a detected parity error in the data, in the case of a controller failure, and in the case of a power failure.
- the present invention teaches a method and apparatus for intelligently caching data in an intelligent disk subsystem connected to a main computer having a main memory.
- the disk subsystem includes a disk-cache memory having a first and a second memory bank.
- a first copy and a second copy of data are held in the first and second memory bank, respectively, wherein the first memory bank is coupled to a first battery and the second memory bank is coupled to a second battery.
- a detected failure occurring in either memory bank or either battery causes either the first copy or the second copy of the data to be read, based on where the detected failure occurred (i.e., the copy not associated with the error is read).
- the detected failures include error-correction- code (ECC) errors.
- successive read operations are routed to alternating memory banks.
- each memory bank is periodically scrubbed to purge the memory of correctable errors.
- the disk-cache memory is packaged on a removable cache module.
- read operations going to disk devices and returning data to the disk-cache memory are given a higher priority than write operations.
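The priority given to disk reads that return data to the disk-cache memory over write operations can be sketched as a simple two-queue dispatcher. The `disk_op` and `op_queues` structures and the `next_op` function below are illustrative assumptions, not the patent's firmware.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct disk_op {
    struct disk_op *next;
    bool is_read;        /* read from a disk device into the cache */
    unsigned long lba;
    unsigned count;
} disk_op;

typedef struct {
    disk_op *read_q;     /* higher priority: reads returning data to the cache */
    disk_op *write_q;    /* lower priority: cached writes being flushed to disk */
} op_queues;

/* Always dispatch a pending read first; fall back to writes only when
 * no read is waiting. */
disk_op *next_op(op_queues *q)
{
    if (q->read_q) {
        disk_op *op = q->read_q;
        q->read_q = op->next;
        return op;
    }
    if (q->write_q) {
        disk_op *op = q->write_q;
        q->write_q = op->next;
        return op;
    }
    return NULL;  /* nothing pending */
}
```

A strict priority like this can starve queued writes under a heavy read load; a production scheduler would presumably bound how long a write may wait, a detail the patent text does not specify.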
- a method describes caching data originating from write operations but not caching data originating from read operations in the cache, wherein write operations are defined as transferring data from the main computer to the disk subsystem and read operations are defined as transferring data from the disk subsystem to the main computer.
- later write operations are examined to determine whether data from the later write operation are intended to be written to the intended address of an earlier write operation, and if so, the data from the earlier write operation are overwritten in the cache with the data from the later one.
- FIG. 1 is a block diagram illustrating a computer system 100 including an intelligent SCSI subsystem 200 according to the present invention.
- FIG. 2 is a block diagram illustrating an intelligent SCSI subsystem 200 including an intelligent disk-cache memory 300 according to the present invention.
- FIG. 3 is a block diagram illustrating details of one embodiment of an intelligent disk-cache memory 300 according to the present invention.
- FIG. 4 is a diagram illustrating an embodiment of removable cache module 350.
- Figure 1 is a block diagram illustrating a computer system 100 according to the present invention, including main computer 102 and intelligent SCSI subsystem (ISS) 200 connected by high-performance bus 198.
- Main computer 102 includes system memory 104.
- a plurality of disk-connection busses 239 are provided, each capable of connecting to a plurality of disk devices 240.
- disk-connection busses 239 are standard 16-bit-wide/fast differential-drive SCSI (Small Computer System Interface) busses with differential SCSI terminators, and disk devices 240 are SCSI disk devices.
- disk-connection busses 239 are standard 8-bit-wide/fast differential-drive SCSI busses, 8-bit-wide/slow differential-drive SCSI busses, 8-bit-wide/fast single-ended-drive SCSI busses, or 8-bit-wide/slow single-ended-drive SCSI busses, depending on the interface chips used in ISS 200.
- four SCSI disk-connection busses 239 are provided.
- up to seven SCSI disk devices 240 can connect to each disk-connection bus 239.
- up to fifteen SCSI disk devices 240 can connect to each disk-connection bus 239.
- high-performance bus 198 is a 64-bit-wide bus capable of 267 MB/second transfer rates.
- multiple ISS 200 subsystems can be connected to one main computer 102, providing scalability to the needs of the user.
- portions of disk-processing tasks are offloaded from main computer 102 into each ISS 200, allowing scalable system performance improvements by adding ISS 200 subsystems.
- multiple main computers 102 are also connected to high-performance bus 198, providing additional system performance or redundancy (for additional reliability) or both.
- FIG. 2 is a block diagram illustrating one embodiment of intelligent SCSI subsystem (ISS) 200, including intelligent disk-cache memory 300, processor 202, cache buffer 204, dual-port memory 206, FLASH ROM (read-only memory) 208, NVRAM (non-volatile random-access memory) 210, system registers 212, local registers 214, SRAM 216, and a plurality of disk-connection bus processors 220.1 through 220.N.
- disk-connection bus processors 220.1 through 220.N are NCR53C720-type SCSI processors by NCR Corporation, and are four in number.
- processor 202 is a 33-MHz 486SX processor by Intel Corporation, and is used to control the overall flow of data and status information within ISS 200.
- a socket is provided to allow a processor 202 upgrade to an Intel Overdrive processor.
- system registers 212 and local registers 214 are implemented in a fifteen-nanosecond (15-ns) MACH230 chip.
- dual-port memory 206 is implemented using two 30-ns 4K-by-16-bit dual-port SRAMs, with one port coupled to high-performance bus 198 and the other port coupled to local bus 199.
- writing to a particular location (called an 'ISS mailbox') in dual-port memory 206 from high-performance bus 198 causes an interrupt (called an 'ISS mailbox interrupt') to be issued to processor 202 (in one embodiment, by an Intel 8259A interrupt controller chip) in order to indicate that information has arrived in the ISS mailbox.
- writing to a particular location (called a 'system mailbox') in dual-port memory 206 from local bus 199 causes a system interrupt (called a 'system mailbox interrupt') to be issued to main computer 102 (in one embodiment, via a slot-specific attention signal, i.e., one identifying a specific ISS 200) in order to indicate that information has arrived in the system mailbox.
- each dual-port memory 206 on each ISS 200 in a single system 100 is given a different address to which to respond, in order that main computer 102 can communicate to each ISS 200 separately.
- SRAM 216 is implemented as two sections of four chips each, wherein each chip is a 25-ns 128K-by-8-bit SRAM, and contains the executable firmware which controls operation of ISS 200.
- FLASH ROM 208 is an 80-ns 256K-by-8-bit electronically-erasable-and-rewritable ROM comprising a 28F200BXT chip, and containing at least part of the firmware which controls the operation of processor 202.
- the subsystem firmware in FLASH ROM 208 can be updated, using a DOS-based utility operating on main computer 102, without turning off or disabling system 100; subsystem firmware in FLASH ROM 208 is copied into SRAM 216 for faster execution by processor 202 and bus processors 220.1-220.N.
- NVRAM 210 is a 100-ns 32K-by-8-bit nonvolatile SRAM chip and is used to store RAID (redundant arrays of inexpensive disks) configuration information which is used to automatically restore the RAID-level configuration after a power failure, and the status of in-progress write operations in order that disk data integrity can be reconstructed after a failure.
- ISS 200 provides RAID data protection, and supports various levels of RAID protection including data striping, disk mirroring, and disk-fault recovery.
- RAID level 0 provides higher potential performance by placing part of a data file on one disk (e.g., on the first disk device 240 connected to disk-connection bus 239 from bus processor 220.1) and another part of the same data file on another disk (e.g., on the first disk device 240 connected to disk-connection bus 239 from bus processor 220.2), such that both disk devices 240 can be accessed in parallel, and data can thus be transferred at a higher rate, up to N times faster if N disk devices 240 are being used in parallel.
- a RAID-level-0 subsystem appears to main computer 102 as a single, large logical disk device having the capacity of all disk devices 240 combined.
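The striping idea can be expressed as a simple address mapping: a logical block number is split into a disk index and a block offset on that disk. This is a generic RAID-0 sketch assuming one block per stripe unit; it is not the patent's firmware.

```c
#include <stdio.h>

/* Map a logical block address onto N striped disks. */
typedef struct {
    unsigned disk;          /* which disk device holds the block */
    unsigned long offset;   /* block offset within that disk     */
} stripe_loc;

stripe_loc raid0_map(unsigned long lba, unsigned num_disks)
{
    stripe_loc loc;
    loc.disk = (unsigned)(lba % num_disks);
    loc.offset = lba / num_disks;
    return loc;
}

int main(void)
{
    /* With 4 disks, consecutive logical blocks land on different spindles,
     * so they can be transferred in parallel. */
    for (unsigned long lba = 0; lba < 8; lba++) {
        stripe_loc loc = raid0_map(lba, 4);
        printf("lba %lu -> disk %u, offset %lu\n", lba, loc.disk, loc.offset);
    }
    return 0;
}
```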
- RAID level 1 provides higher potential performance by placing a 'mirrored' separate copy of a data file on each of two or more disk devices 240 (e.g., one copy on the first disk device 240 connected to disk-connection bus 239 from bus processor 220.1, and one copy on the first disk device 240 connected to bus 239 from bus processor 220.2, and if more than two copies are to be made, one copy to each other mirror disk device 240), such that all disk devices 240 are written roughly in parallel, and data are read from any disk, thus freeing the other disks for other work, which gives better overall performance for system 100.
- the disk device 240 finishing last provides the completion status signal to main computer 102, thus ensuring that all devices have received the new data before the data are released by the main computer 102.
- the disk device 240 finishing first provides the completion status signal to main computer 102, thus providing faster average write times, since the fastest completion time is as fast or faster than the average completion time.
- RAID level 1 also provides data protection, in that if an error is detected in the data read from one disk, or even if one disk-connection bus 239 fails entirely, the data can still be retrieved from one or more other disks on other disk-connection busses 239.
- RAID-level-1 mirroring is performed across two or more ISS 200 subsystems (a configuration called 'disk controller duplexing') in order that a failure of an ISS 200 will not cause the loss of data availability to main computer 102 (for example, in one embodiment one complete copy of all system data are stored on disk devices 240 connected to one ISS 200, and another complete copy of all system data are stored on disk devices 240 connected to another ISS 200, so that if either ISS 200 or any component connected to them fails, all of the data are available through the other, surviving ISS 200 and associated devices).
- RAID level 10 provides even higher potential performance as well as providing data protection, by combining RAID-level-1 mirroring with RAID-level-0 striping.
- RAID level 4 provides data protection but uses fewer disk drives than RAID-level-1 mirroring, by combining RAID-level-0 striping with exclusive-or checksum-type parity data redundancy.
- data are striped across two to twenty-seven disk devices 240, while a separate disk device 240 holds the parity data; and all disk devices 240 are configured into a single, large logical device.
- data are striped across two to fifty-nine disk devices 240, while a separate disk device 240 holds the parity data; and all disk devices 240 are configured into a single, large logical device.
- RAID level 5 provides data protection but uses fewer disk drives than RAID-level-1 mirroring, by combining RAID-level-0 striping with exclusive-or checksum-type parity data redundancy.
- data are striped across three to twenty-eight disk devices 240, while the parity data are interleaved among each disk device 240.
- data are striped across three to sixty disk devices 240, while the parity data are interleaved among each disk device 240.
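The exclusive-or parity used by RAID levels 4 and 5 works because XOR-ing all data blocks in a stripe yields a parity block from which any single missing block can be rebuilt. The sketch below is a minimal illustration of that property, not the patent's implementation; the 8-byte block size is an arbitrary assumption.

```c
#include <stdio.h>
#include <string.h>

#define BLOCK 8  /* illustrative block size in bytes */

/* parity = XOR of all data blocks in the stripe */
void compute_parity(unsigned char data[][BLOCK], int ndisks,
                    unsigned char parity[BLOCK])
{
    memset(parity, 0, BLOCK);
    for (int d = 0; d < ndisks; d++)
        for (int i = 0; i < BLOCK; i++)
            parity[i] ^= data[d][i];
}

/* Rebuild one failed disk's block by XOR-ing parity with the surviving blocks. */
void rebuild_block(unsigned char data[][BLOCK], int ndisks, int failed,
                   const unsigned char parity[BLOCK], unsigned char out[BLOCK])
{
    memcpy(out, parity, BLOCK);
    for (int d = 0; d < ndisks; d++)
        if (d != failed)
            for (int i = 0; i < BLOCK; i++)
                out[i] ^= data[d][i];
}

int main(void)
{
    unsigned char data[3][BLOCK] = { "block0", "block1", "block2" };
    unsigned char parity[BLOCK], rebuilt[BLOCK];

    compute_parity(data, 3, parity);
    rebuild_block(data, 3, 1, parity, rebuilt);   /* pretend disk 1 failed */
    printf("rebuilt: %s\n", (char *)rebuilt);     /* prints "block1" */
    return 0;
}
```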
- one or more hot-spare disk devices 240 are connected to one or more of the disk-connection busses 239 and kept powered-on and ready, but are unused relative to storage of data until a failure is detected (for RAID-levels-1, -4, -5, or -10) in one of the other disk devices 240.
- the mirrored data from the disk device (or devices) 240 which mirror the failed device is reconstructed onto the hot-spare disk device 240.
- data for write operations are written into both memory banks 308A and 308B in intelligent disk-cache memory 300, and a write-completion status signal is sent to main computer 102 after these writes are complete, but before data are actually written to disk devices 240.
- These write operations are called “delayed writes,” since they are actually completed after the completion is indicated to main computer 102. This allows main computer 102 to proceed to other tasks without having to wait until the data are actually written to disk devices 240, thus enhancing performance in certain circumstances. Because two redundant copies of the data exist in battery-backed-up memory banks 308A and 308B, there is little chance that the data will get irretrievably destroyed before they are actually written to the disk devices 240.
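A delayed write can be outlined as: copy the data into both battery-backed banks, acknowledge completion to the host immediately, and queue the real disk write for later. The sketch below is a loose illustration of that sequence; the `cache_line` structure, the 512-byte sector size, and the `signal_write_complete`/`queue_disk_write` hooks are all assumptions, not the patent's interfaces.

```c
#include <string.h>
#include <stdbool.h>

#define SECTOR 512

typedef struct {
    unsigned char bank_a[SECTOR];   /* copy held in memory bank A      */
    unsigned char bank_b[SECTOR];   /* redundant copy held in bank B   */
    unsigned long lba;
    bool dirty;                     /* still needs to reach the disk   */
} cache_line;

/* Hypothetical hooks: stand-ins for the real firmware/hardware interfaces. */
void signal_write_complete(unsigned long lba);
void queue_disk_write(cache_line *line);

/* Delayed write: both copies are written and completion is reported to the
 * host before the data ever reach a disk surface. */
void delayed_write(cache_line *line, unsigned long lba,
                   const unsigned char data[SECTOR])
{
    memcpy(line->bank_a, data, SECTOR);   /* first copy  */
    memcpy(line->bank_b, data, SECTOR);   /* second copy */
    line->lba = lba;
    line->dirty = true;

    signal_write_complete(lba);   /* host may proceed with other work  */
    queue_disk_write(line);       /* flushed to the disk device later  */
}
```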
- an elevator-seek algorithm operating in the firmware which is controlling processor 202 optimizes the sequence of operations sent to any one disk device 240 (although certain operations which must be performed in a determined order, such as a write to a particular sector followed by a read to the same sector, are left in that order).
- some operations are re-ordered in order to shorten the seek time needed (e.g., a first read operation to a sector which required a long seek time might be re-ordered to take place after a second read operation which came into ISS 200 after the first operation but which required a shorter seek time: the second operation would be performed by the disk arm while the arm was "on its way" to the first).
- operations are reordered so that disk addresses are accessed in an alternating ascending-and-descending sequence, and the disk arm is scanned first in an ascending sequence, and then in a descending sequence (hence the name 'elevator-seek').
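A minimal elevator-seek sketch: pending requests are sorted by disk address and serviced in an ascending sweep from the current arm position, then a descending sweep over the remainder. This is a generic illustration of the classic algorithm, not the patent's firmware; the cylinder numbers are made up.

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_ulong(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/* Service pending cylinder addresses with an ascending sweep starting at the
 * current arm position, then a descending sweep over the remaining addresses. */
void elevator_order(unsigned long *req, int n, unsigned long arm_pos)
{
    qsort(req, n, sizeof *req, cmp_ulong);

    int start = 0;
    while (start < n && req[start] < arm_pos)
        start++;                          /* first request at or above the arm */

    for (int i = start; i < n; i++)       /* ascending sweep */
        printf("service %lu\n", req[i]);
    for (int i = start - 1; i >= 0; i--)  /* descending sweep */
        printf("service %lu\n", req[i]);
}

int main(void)
{
    unsigned long pending[] = { 95, 180, 34, 119, 11, 123, 62, 64 };
    elevator_order(pending, 8, 50);       /* arm currently at cylinder 50 */
    return 0;
}
```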
- by performing the elevator-seek algorithm in firmware rather than in the operating system running on main computer 102, bus traffic on high-performance bus 198 is reduced, thus enhancing overall system performance.
- the firmware which is controlling processor 202 optimizes operations by detecting multiple SCSI requests to adjacent locations and combining such operations into a single, larger SCSI operation, thus reducing the number of SCSI commands passed, and saving "missed revolutions" (i.e., if a second operation is sent immediately following the completion of the first, the disk head will already have moved across the sector needed ('missing it') by the time the second command is recognized, and the head must therefore wait nearly an entire revolution before the desired sector is again reached).
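Combining requests to adjacent locations amounts to merging contiguous LBA ranges before they are issued; the small sketch below shows one way that check could look. The `io_request` structure and `try_coalesce` name are illustrative assumptions.

```c
#include <stdbool.h>

typedef struct {
    unsigned long lba;     /* starting logical block address */
    unsigned long count;   /* number of blocks               */
} io_request;

/* If 'next' begins exactly where 'cur' ends, fold it into 'cur' so that a
 * single, larger SCSI command can be issued instead of two. */
bool try_coalesce(io_request *cur, const io_request *next)
{
    if (next->lba == cur->lba + cur->count) {
        cur->count += next->count;
        return true;
    }
    return false;
}
```

A queue scanner would call `try_coalesce` on each newly arrived request against the request already queued for the same disk, issuing the merged command only when no further adjacent request arrives.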
- the firmware which is controlling processor 202 optimizes write operations by detecting multiple SCSI writes to the same location and combining such operations into a single write operation reflecting only the data of the last write to arrive at ISS 200 (since the data of the earlier operations would have been overwritten anyway, those data are no longer needed, and only the last operation need be performed).
- write-completion status is then signaled to main computer 102 for every collapsed write operation. Performance is improved, since the unneeded earlier write operations are never actually performed to the disk devices 240. These combined write operations are called "collapsed writes.”
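Collapsed writes can be sketched as a lookup by target address: if a cached, not-yet-flushed write to the same sector is found, the newer data simply replaces it, so only one disk write is ever performed. The cache structure, slot count, and sector size below are assumptions for illustration.

```c
#include <string.h>
#include <stdbool.h>

#define SECTOR      512
#define CACHE_SLOTS 64

typedef struct {
    bool          valid;
    unsigned long lba;
    unsigned char data[SECTOR];
} pending_write;

static pending_write cache[CACHE_SLOTS];

/* Cache a host write. If an earlier, not-yet-flushed write targets the same
 * sector, overwrite it in place (a "collapsed write"); otherwise take a slot. */
bool cache_write(unsigned long lba, const unsigned char data[SECTOR])
{
    int free_slot = -1;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].lba == lba) {
            memcpy(cache[i].data, data, SECTOR);  /* collapse onto older write */
            return true;
        }
        if (!cache[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;                 /* cache full; caller must flush first */
    cache[free_slot].valid = true;
    cache[free_slot].lba = lba;
    memcpy(cache[free_slot].data, data, SECTOR);
    return true;
}
```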
- operating system performance is also improved by allowing main computer 102 to perform read-ahead caching into system memory 104.
- Figure 3 is a block diagram illustrating details of one embodiment of intelligent disk-cache memory 300.
- address bus 301 is twenty-three bits wide, and data bus 303 is 32 bits wide.
- Address bus 301 is coupled to two redundant address buffers, 302A and 302B.
- Data bus 303 is coupled to two redundant bi-directional data buffers, 304A and 304B.
- Address buffer 302A is twenty-three bits wide in the embodiment shown, is comprised of three FCT2244 chips (available from Quality Semiconductor, 851 Martin Ave., Santa Clara, CA 95050), and drives an address to all of the memory chips in memory bank 308A.
- Address buffer 302B is also twenty-three bits wide in the embodiment shown, is also comprised of three FCT2244 chips, and drives all of the memory chips in memory bank 308B.
- Data buffer 304A is thirty-two bits wide in the embodiment shown, is comprised of two ABT16245 chips (available from Texas Instruments, P.O. Box 655303, Dallas, TX 75265), and drives data to all of the memory chips in memory bank 308A.
- Data buffer 304B is also thirty-two bits wide in the embodiment shown, is also comprised of two ABT16245 chips, and drives data to all of the memory chips in memory bank 308B.
- memory bank 308A includes twenty pseudo-SRAM (static random-access memory) chips, each having 512K eight-bit bytes, each having an eighty-nanosecond (80-ns) access time, arranged as a memory array which is four sections by five bytes by 512K, or eight megabytes of data plus error-correction code (ECC).
- the five bytes to memory bank 308A are coupled to the four bytes of the DATA0 bus and the one byte of the ECC0 bus.
- memory bank 308B also includes the same number and type of chips, coupled to the four bytes of the DATA1 bus and the one byte of the ECC1 bus.
- ECC chip 306A is a nine-nanosecond (9-ns) AM29C660 chip (available from AMD (Advanced Micro Devices), 901 Thomson Place, P.O. Box 3453, Sunnyvale, CA 94088) that provides detection of all double- and single-bit errors and correction of all single-bit errors (DBED/SBEC) on the DATA0 bus.
- ECC chip 306A generates ECC0 bits as DATA0 data are being written into memory bank 308A, and uses the read-out ECC0 bits to provide DBED/SBEC on the DATA0 bus as data are being read.
- ECC chip 306B is also an AM29C660 chip that provides DBED/SBEC on the DATA1 bus.
- ECC chip 306B also generates ECC1 bits as DATA1 data are being written into memory bank 308B, and uses the read-out ECC1 bits to provide DBED/SBEC on the DATA1 bus as data are being read from memory bank 308B.
- battery 311A provides power for memory bank 308A, and is monitored by gas-gauge (GG) chip 310A and BVCC (battery Vcc) control chip 313A and controlled via PNP transistor 314A.
- GG chip 310A is a BQ2010 chip (Benchmark
- BVCC control chip 313A is an LTC1235 chip (Linear Technology Corp., 1630 McCarthy Boulevard, Milpitas, CA 95035) which controls the supply of power, from main voltage supply Vcc or from battery 311A, via PNP transistor 314A, depending on which provides the "best" supply voltage (i.e., if Vcc fails or falls below a specified voltage, power is instead supplied from battery 311A).
- gas gauge control 312 is a fifteen-nanosecond (15-ns) MACH220 37047200 chip (available from AMD (Advanced Micro Devices), 901 Thomson Place, P.O. Box 3453, Sunnyvale, CA 94088) which provides monitoring of GG chips 310A and 310B, and provides output gas-gauge values onto the low-order bits of bus DATA0 when requested.
- cache control 322 is a fifteen-nanosecond (15-ns) MACH445 37047100 chip (available from AMD (Advanced Micro Devices), 901 Thomson Place, P.O. Box 3453, Sunnyvale, CA 94088) which provides overall monitoring and control of the functions of intelligent disk-cache memory 300.
- Input signals which are monitored by cache control 322 include a single-bit and a multi-bit error-detection signal from each of the ECC chips 306A and 306B.
- input signals from other subsystems such as BVCC controls 313A and 313B, gas gauges 310A and 310B, gas gauge control 312, etc. are also monitored by cache control 322.
- Control signals generated by cache control 322 include output-enable signals and read/write signals to memory banks 308A and 308B, and a battery-off signal to BVCC controllers 313A and 313B.
- cache control 322 under normal conditions (i.e., when no errors have been detected), in response to a write operation causes each written datum to be simultaneously written twice, one copy each to memory banks 308A and 308B, in order to have a redundant copy available if one or the other memory bank (308A or 308B), or the respective associated circuitry for that memory bank, happens to fail.
- in response to a read operation, cache control 322 causes either memory bank 308A or 308B to read its respective copy of the data into its respective ECC chip 306A or 306B (i.e., alternating banks are used on successive read operations), and if no error is detected, cache control 322 causes the data from memory bank 308A (or 308B as applicable) to be passed to data bus 303 by data buffer 304A (or 304B as applicable); if an ECC error is detected by ECC chip 306A (or 306B as applicable), data will be read from the opposite bank, and if no error is detected by ECC chip 306B (or 306A as applicable), then cache control 322 causes the data from the opposite memory bank (i.e., memory bank 308B if 308A detected an error on its side) to be passed to data bus 303 by data buffer 304B (or 304A as applicable).
- in the case of a single-bit error, the ECC chip 306A/306B of the corresponding bank will correct the single-bit error it detects, and cache control 322 causes the corrected data from the affected ECC chip 306A/306B to be passed to data bus 303 by the corresponding data buffer 304A/304B, and to be rewritten to the affected memory bank 308A/308B (and the other bank is not read from).
- ECC chip 306B will correct the single-bit error it detects, and cache control 322 causes the corrected data from ECC chip 306B to be passed to data bus 303 by data buffer 304B.
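The read-path policy, at the logic level only: alternate banks on successive reads, use corrected data and scrub it back on a single-bit error, and fall back to the redundant copy in the other bank on an uncorrectable error. The status codes and the `bank_read`/`bank_rewrite` helpers below are hypothetical stand-ins for the ECC-chip and bank interfaces, not the patent's hardware.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE } ecc_status;

/* Hypothetical hardware hooks: read a word from one bank through its ECC
 * chip, and rewrite a corrected word back into that bank. */
ecc_status bank_read(int bank, uint32_t addr, uint32_t *word);
void       bank_rewrite(int bank, uint32_t addr, uint32_t word);

/* Alternate banks on successive reads; on a single-bit error use the
 * corrected data and write it back; on an uncorrectable error read the
 * redundant copy from the opposite bank. */
bool cache_read(uint32_t addr, uint32_t *word)
{
    static int next_bank = 0;
    int bank = next_bank;
    next_bank ^= 1;                       /* alternate 0,1,0,1,... */

    ecc_status st = bank_read(bank, addr, word);
    if (st == ECC_OK)
        return true;
    if (st == ECC_CORRECTED) {
        bank_rewrite(bank, addr, *word);  /* purge the soft error */
        return true;
    }
    /* Uncorrectable in this bank: try the redundant copy. */
    st = bank_read(bank ^ 1, addr, word);
    if (st == ECC_CORRECTED)
        bank_rewrite(bank ^ 1, addr, *word);
    return st != ECC_UNCORRECTABLE;
}
```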
- cache control 322 causes memory bank 308A and 308B to be sequentially read, corrected, and rewritten during otherwise unused cycles, in order that single-bit errors (which may "spontaneously" appear from time-to-time) are detected and corrected.
- a pointer is maintained, and sequenced through each successive location of memory banks 308A and 308B, in order to correct all single-bit soft errors (a soft error is one which can be corrected by overwriting the location with the correct data).
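A sketch of the scrub pointer: during otherwise unused cycles, step through each bank one location at a time and rewrite any word whose single-bit error the ECC logic corrected. The bank size assumes the eight-megabyte banks described above, and the `bank_read`/`bank_rewrite` hooks are the same hypothetical interface used in the previous sketch.

```c
#include <stdint.h>

typedef enum { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE } ecc_status;

/* Hypothetical hooks into the bank/ECC hardware (same assumptions as above). */
ecc_status bank_read(int bank, uint32_t addr, uint32_t *word);
void       bank_rewrite(int bank, uint32_t addr, uint32_t word);

#define BANK_WORDS (2u * 1024u * 1024u)   /* 8 MB per bank / 4 bytes per word */

/* Called during otherwise unused cycles: scrub one location per call,
 * stepping a pointer through both banks so every word is periodically
 * read, corrected if needed, and rewritten. */
void scrub_step(void)
{
    static uint32_t addr = 0;
    uint32_t word;

    for (int bank = 0; bank < 2; bank++)
        if (bank_read(bank, addr, &word) == ECC_CORRECTED)
            bank_rewrite(bank, addr, word);   /* fix the soft error in place */

    addr = (addr + 1) % BANK_WORDS;           /* advance the scrub pointer */
}
```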
- cache control 322 also detects certain failures (such as insufficient voltage or charge) in batteries 311A and 311B and in their respective associated circuitry (i.e., PNP transistors 314A and 314B, GG chips 310A and 310B, and BVCC controls 313A and 313B). These failures are collectively called 'battery failures', even though some may actually be caused by failures in the other associated components. Under normal conditions (i.e., when no battery failures have been detected), writes go to both memory banks 308A and 308B, and reads come alternately from memory banks 308A and 308B, as described above. In this embodiment, if a battery failure associated with memory bank 308A is detected, no new data is written to memory bank 308A, and that memory bank is cleared of data, in order that the battery failure does not allow undetected errors to pass from memory bank 308A.
- if cache control 322 detects any failures in either memory bank or in their respective support circuitry, then no new data are written into intelligent disk-cache memory 300, but instead read and write operations thereafter bypass the cache until a repair of the failed device is effected.
- data already in memory bank 308A and/or 308B are "flushed.” This flushing operation involves writing the write data to the disk devices 240 as specified by the respective write operations which were previously cached, and all cache resources are released as the corresponding write operations complete to disk devices 240.
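The flush-on-failure behaviour can be sketched as: stop accepting new cache entries, write every still-valid cached write out to its disk address, and release each slot as its disk write completes. The structures reuse the illustrative pending-write cache from the collapsed-write sketch, and the `disk_write` hook is hypothetical.

```c
#include <stdbool.h>

#define SECTOR      512
#define CACHE_SLOTS 64

typedef struct {
    bool          valid;
    unsigned long lba;
    unsigned char data[SECTOR];
} pending_write;

extern pending_write cache[CACHE_SLOTS];  /* the illustrative write cache */
static bool cache_bypassed;               /* set once a failure is detected */

/* Hypothetical hook: synchronously write one sector to the disk devices. */
void disk_write(unsigned long lba, const unsigned char data[SECTOR]);

/* On a detected failure: bypass the cache for new operations and flush every
 * previously cached write to the disks, releasing each slot as it completes. */
void flush_cache_on_failure(void)
{
    cache_bypassed = true;                 /* later reads/writes skip the cache */
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid) {
            disk_write(cache[i].lba, cache[i].data);
            cache[i].valid = false;        /* release the cache resource */
        }
    }
}
```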
- only write data from write operations from main computer 102 to ISS 200 are cached into intelligent disk-cache memory 300, while read data from read operations bypass intelligent disk-cache memory 300.
- these write data are then eventually written to disk devices 240 (as specified by the RAID level currently running on ISS 200) as described above (a write cache); the cached write data are also available for read operations (a read cache); however, no other read data (e.g., from read operations, read-ahead operations or read-buffering operations) are placed or held in intelligent disk-cache memory 300.
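The write-only caching policy, in outline: host reads are served from the cache only when they hit previously cached write data; all other reads go straight to disk and their data are never stored in the cache. The sketch below reuses the same illustrative pending-write cache; the `disk_read` hook is an assumption.

```c
#include <stdbool.h>
#include <string.h>

#define SECTOR      512
#define CACHE_SLOTS 64

typedef struct {
    bool          valid;
    unsigned long lba;
    unsigned char data[SECTOR];
} pending_write;

extern pending_write cache[CACHE_SLOTS];  /* the illustrative write cache */

/* Hypothetical hook: read one sector directly from the disk devices. */
void disk_read(unsigned long lba, unsigned char data[SECTOR]);

/* Reads are satisfied from the cache only if they hit data placed there by an
 * earlier write; otherwise they go straight to disk and are NOT cached. */
void host_read(unsigned long lba, unsigned char out[SECTOR])
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].lba == lba) {
            memcpy(out, cache[i].data, SECTOR);   /* hit on cached write data */
            return;
        }
    }
    disk_read(lba, out);   /* miss: bypass the cache, no read data is stored */
}
```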
- intelligent disk-cache memory 300 is fabricated as a removable cache module 350.
- a removable cache module 350 can be removed, with its data intact, from the failed ISS 200, and plugged into a replacement ISS 200, which in turn is replaced into system 100. Data stored in the removable plug-in module are then used to complete operations which were in process, but not yet completed, at the moment of failure.
- FIG. 4 is a diagram illustrating an embodiment of removable cache module 350 having a signal connector 399.
- Removable cache module 350 of one embodiment is designed such that it can be unplugged or plugged without losing any data in memory banks 308A and 308B, due, for example, to voltage spikes caused by the unplugging or plugging processes.
- protection from data loss due to electro-static discharge (ESD) is also provided.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
- Techniques For Improving Reliability Of Storages (AREA)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU57905/96A AU5790596A (en) | 1995-06-07 | 1996-05-20 | Intelligent disk-cache memory |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US47953495A | 1995-06-07 | 1995-06-07 | |
| US08/479,534 | 1995-06-07 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO1996041249A2 true WO1996041249A2 (en) | 1996-12-19 |
| WO1996041249A3 WO1996041249A3 (en) | 1997-08-21 |
Family
ID=23904417
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US1996/006520 WO1996041249A2 (en) | 1995-06-07 | 1996-05-20 | Intelligent disk-cache memory |
Country Status (2)
| Country | Link |
|---|---|
| AU (1) | AU5790596A (en) |
| WO (1) | WO1996041249A2 (en) |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4942575A (en) * | 1988-06-17 | 1990-07-17 | Modular Computer Systems, Inc. | Error connection device for parity protected memory systems |
| GB8815239D0 (en) * | 1988-06-27 | 1988-08-03 | Wisdom Systems Ltd | Memory error protection system |
| CA2072728A1 (en) * | 1991-11-20 | 1993-05-21 | Michael Howard Hartung | Dual data buffering in separately powered memory modules |
| WO1993018461A1 (en) * | 1992-03-09 | 1993-09-16 | Auspex Systems, Inc. | High-performance non-volatile ram protected write cache accelerator system |
| US5408644A (en) * | 1992-06-05 | 1995-04-18 | Compaq Computer Corporation | Method and apparatus for improving the performance of partial stripe operations in a disk array subsystem |
| US5448719A (en) * | 1992-06-05 | 1995-09-05 | Compaq Computer Corp. | Method and apparatus for maintaining and retrieving live data in a posted write cache in case of power failure |
| US5437022A (en) * | 1992-12-17 | 1995-07-25 | International Business Machines Corporation | Storage controller having additional cache memory and a means for recovering from failure and reconfiguring a control unit thereof in response thereto |
-
1996
- 1996-05-20 AU AU57905/96A patent/AU5790596A/en not_active Abandoned
- 1996-05-20 WO PCT/US1996/006520 patent/WO1996041249A2/en active Application Filing
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115114200A (en) * | 2022-06-29 | 2022-09-27 | 海光信息技术股份有限公司 | Multi-chip system and starting method based on same |
| CN115114200B (en) * | 2022-06-29 | 2023-11-17 | 海光信息技术股份有限公司 | Multi-chip system and starting method based on same |
Also Published As
| Publication number | Publication date |
|---|---|
| WO1996041249A3 (en) | 1997-08-21 |
| AU5790596A (en) | 1996-12-30 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US5708668A (en) | Method and apparatus for operating an array of storage devices | |
| US5758054A (en) | Non-volatile memory storage of write operation identifier in data storage device | |
| JP2831072B2 (en) | Disk drive memory | |
| US5487160A (en) | Concurrent image backup for disk storage system | |
| JP3164499B2 (en) | A method for maintaining consistency of parity data in a disk array. | |
| CN1038710C (en) | Method and apparatus for recovering parity-protected data | |
| US5566316A (en) | Method and apparatus for hierarchical management of data storage elements in an array storage device | |
| US5548711A (en) | Method and apparatus for fault tolerant fast writes through buffer dumping | |
| US5233618A (en) | Data correcting applicable to redundant arrays of independent disks | |
| JP3129732B2 (en) | Storage array with copy-back cache | |
| US5488701A (en) | In log sparing for log structured arrays | |
| US7228381B2 (en) | Storage system using fast storage device for storing redundant data | |
| US5379417A (en) | System and method for ensuring write data integrity in a redundant array data storage system | |
| US7464322B2 (en) | System and method for detecting write errors in a storage device | |
| JP3230645B2 (en) | Data processing method, system and device | |
| US7130973B1 (en) | Method and apparatus to restore data redundancy and utilize spare storage spaces | |
| JP3270959B2 (en) | Parity storage method in disk array device and disk array device | |
| JP2857288B2 (en) | Disk array device | |
| WO1996041249A2 (en) | Intelligent disk-cache memory | |
| JP3845239B2 (en) | Disk array device and failure recovery method in disk array device | |
| GB2402803A (en) | Arrangement and method for detection of write errors in a storage system | |
| JPH0962461A (en) | Automatic data restoring method for disk array device | |
| JP2857289B2 (en) | Disk array device | |
| KR100205289B1 (en) | How to prevent loss of recorded data | |
| JPH08137627A (en) | Disk array device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AK | Designated states |
Kind code of ref document: A2 Designated state(s): AL AM AT AU AZ BB BG BR BY CA CH CN CZ DE DK EE ES FI GB GE HU IS JP KE KG KP KR KZ LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG UZ VN AM AZ BY KG KZ MD RU TJ TM |
|
| AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML |
|
| DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
| AK | Designated states |
Kind code of ref document: A3 Designated state(s): AL AM AT AU AZ BB BG BR BY CA CH CN CZ DE DK EE ES FI GB GE HU IS JP KE KG KP KR KZ LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG UZ VN AM AZ BY KG KZ MD RU TJ TM |
|
| AL | Designated countries for regional patents |
Kind code of ref document: A3 Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML |
|
| REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
| 122 | Ep: pct application non-entry in european phase | ||
| NENP | Non-entry into the national phase |
Ref country code: CA |