WO2008029146A1 - A distributed file system operable with a plurality of different operating systems - Google Patents
- Publication number
- WO2008029146A1 (PCT/GB2007/003363)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- metadata
- file system
- data
- distributed file
- node
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Abstract
A distributed file system for managing access to data, including: a plurality of storage nodes, each node including a memory arranged for storing data items and a memory management unit arranged for execution on one of a plurality of different operating systems and arranged for providing access to a data item to a client node; a master node, arranged for storing a metadata database including a plurality of metadata, each metadata defining a mapping between a data item and one or more storage nodes containing the data item; and a client node arranged for requesting metadata related to a data item from the master node and accessing the data item from one or more storage nodes containing the item using the requested metadata.
Description
A DISTRIBUTED FILE SYSTEM OPERABLE WITH A PLURALITY OF DIFFERENT OPERATING SYSTEMS
Field of Invention
The present invention relates to a distributed file system, particularly a file system distributed over a heterogeneous operating system environment.
Background
Data is stored by computer systems within secondary memory, such as on hard disk, using a file system. The file system manages access to the data, for example, for read, write, and delete purposes.
Access to the data can be managed for multiple users by storing the data on a file server which is accessible to the users over a network.
For some organisations, or for some purposes (for example, archiving the internet), there is a requirement to store very large quantities of data.
One way of storing very large quantities of data is by use of a specialised computer system with multiple large hard disks. However, specialised hardware is very expensive.
There are several alternative methods for providing access to very large quantities of data using a distributed model.
One method is implemented by Google and is known as the Google File System (GFS). GFS stores data within chunks across multiple storage nodes. All the metadata related to the data, such as on which storage node the data is stored, is stored on a single master node. Client nodes request access to the data by obtaining metadata from the master node and using the metadata to access the data directly from the storage nodes. GFS includes the use of a customised Linux kernel for the storage nodes.
Another distributed file system is a POSIX-compliant open-source system called Lustre.
Each file stored on a Lustre file system is considered as an object. Lustre provides clients with concurrent read and write access to the objects. A Lustre file system has four functional units: a metadata server (MDS) for storing the metadata; an object storage target (OST) for storing the actual data; an object storage server (OSS) for managing the OSTs; and clients for accessing and using the data. OSTs are block-based devices. An MDS, an OSS, and an OST can be on the same node or on different nodes.
Lustre is deployable only on systems running variants of the Linux operating system.
The disadvantage with GFS and Lustre is that neither system provides the capability for implementation across systems with heterogeneous operating system deployment.
Within organisations there is generally an established hardware and software infrastructure. Within this infrastructure a large portion of processing cycles and storage space goes unused, particularly the processing cycles and storage space of user machines.
There is a desire to utilise the unused resources of an organisation's existing infrastructure for storing very large quantities of data.
However, user machines are unreliable in that a user may dominate processing cycles or the user machine may be temporarily unavailable due to user-caused reboots.
In addition, in most infrastructures there is heterogeneity in hardware and operating systems across the organisation.
It is an object of the present invention to provide a reliable distributed file system for providing access to very large quantities of data stored across a heterogeneous infrastructure which overcomes the above disadvantages, or to at least provide a useful alternative.
Summary of the Invention
According to a first aspect of the invention there is provided a distributed file system for managing access to data, including: a plurality of storage nodes, each node including a memory arranged for storing data items and a memory management unit arranged for execution on one of a plurality of different operating systems and arranged for providing access to a data item to a client node; a master node arranged for storing a metadata database including a plurality of metadata, each metadata defining a mapping between a data item and one or more storage nodes containing the data item; and a client node arranged for requesting metadata related to a data item from the master node and accessing the data item from one or more storage nodes containing the item using the requested metadata.
The metadata may include a universal identifier.
It is preferred that the memory of each storage node is secondary memory.
It is also preferred that the metadata database is stored within primary memory of the master node.
The data may be web content, such as web pages.
The client node may be further arranged for requesting the metadata by transmitting a URL for the web content to a master node.
Preferably, the client node is further arranged for requesting metadata for data to be stored in the file system. The master node is preferably further arranged,
in response to the request for metadata for data to be stored, for providing a metadata corresponding to an empty block on a storage node. Alternatively, the master node may be further arranged, in response to the request for metadata for data to be stored, for providing a plurality of metadata, the metadata corresponding to empty blocks on a plurality of storage nodes.
The client node may be further arranged for writing the data to each of the storage nodes.
The metadata may include checksums related to the corresponding data item. The client node may be further arranged for calculating checksums for a data item and comparing the calculated checksums to the checksums within the metadata for the data item.
It is preferred that the memory management unit is executing within user space of the operating system.
At least some of the nodes may exist on the same computing device.
Preferably, the client node includes a file management unit and the file management unit is arranged for requesting metadata related to a data item from the master node and accessing the data item from one or more storage nodes containing the item using the requested metadata. The file management unit is preferably arranged for execution within kernel space of an operating system.
Brief Description of the Drawings
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawing in which:
Figure 1: shows a schematic diagram illustrating a distributed file system in accordance with an embodiment of the invention.
Figure 2: shows a schematic diagram illustrating a storage node for use within the distributed file system.
Detailed Description of the Preferred Embodiments
The present invention provides a file system distributed over a heterogeneous operating system environment. The system comprises a number of storage nodes executing on different computer platforms containing data for the file system, a master node containing metadata for the file system and a client node which accesses data on the storage nodes by first querying the master node for the relevant metadata.
Figure 1 shows storage nodes 1, 2, and 3.
Each storage node includes a storage management unit 4 connected with a memory 5. The memory 5 is secondary non-volatile non-removable memory such as a hard disk. The storage management unit may be connected with the memory via a hard disk interface such as IDE or SCSI.
In other embodiments the memory 5 may be removable or volatile.
The storage management unit 4 may be connected through a network 6 or inter-network to master nodes and client nodes.
Master nodes 7 and 8 are shown in Figure 1.
Each master node includes a master management unit 9 connected with a memory 10. The memory 10 is primary memory such as RAM. The master management unit 9 may be connected with the memory 10 via a bus 11.
In other embodiments the memory 10 may be secondary memory, such as a hard disk or flash RAM.
The master management unit 9 may be connected through the network 6 to client nodes and storage nodes.
Client nodes 12 and 13 are also shown.
Each client node includes an application 14 and a client management unit 15. The application 14 is executing within user space 16 on the operating system of the client node 13.
The client management unit 15 may be executing within kernel space 17 of the operating system of the client node 13. A request by the application 14 for data stored within the distributed file system is sent to the client management unit 15 by the operating system.
The client management unit 15 may provide a standard interface (Application Programming Interface - API) for the distributed file system such that applications executing in user space can utilise the client management unit in the same/similar way to use of the local file system of the operating system.
The client management unit 15 may be connected through the network 6 to master nodes and storage nodes.
An embodiment of the invention will now be described.
The client management unit 15 receives a request for access to data from an application 14 executing on the client node 13. The access can be for read, write, or delete access.
The request includes a universal identifier for the data such as a fixed length key. A 64-bit key may be used.
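By way of illustration only, the derivation of such a fixed-length 64-bit key might be sketched as follows. Python and the specific hash function are assumptions of this sketch and not part of the described system, which only requires a fixed-length key:

```python
import hashlib

def make_key(identifier: str) -> int:
    # Derive a fixed-length 64-bit key from an arbitrary identifier
    # (e.g. a path or URL). blake2b with an 8-byte digest is one
    # illustrative choice; any fixed-length keying scheme would do.
    digest = hashlib.blake2b(identifier.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

key = make_key("/archive/pages/example.html")
assert 0 <= key < 2**64  # fits in a fixed-length 64-bit key
```

The same key can then be transmitted to the master node in place of a path and file name.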
In alternative embodiments other identifiers may be used such as path and file name identifiers.
The client management unit 15 of the client node 13 transmits the request to the master node 7 over the network 6.
The master management unit 9 of the master node 7 receives the request for access to data from the client node 13.
The master management unit 9 includes a database for storing metadata about the data within the distributed file system. The metadata is stored within a database, such as Berkeley DB, or within a relational database. The database is stored within memory 10.
The metadata database maps identifiers for the data to the metadata for the data using an index. The database can use any indexing system, such as hashing or tries.
The data may be stored as files. The metadata includes attributes about the file such as inodes. The inodes include which blocks comprise the file, on which storage nodes the blocks are located, and checksums for the blocks.
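The shape of such a metadata record, and the index mapping identifiers to it, may be sketched as follows. The class names (`BlockRef`, `Inode`) and the use of a plain dictionary as the index are assumptions of this illustration; the described system permits any indexing structure, such as hashing or tries:

```python
from dataclasses import dataclass, field

@dataclass
class BlockRef:
    block_id: int        # unique block identifier
    storage_nodes: list  # storage nodes holding a replica of this block
    checksum: str        # checksum clients use to verify read data

@dataclass
class Inode:
    key: int                                    # 64-bit universal identifier
    blocks: list = field(default_factory=list)  # ordered list of BlockRefs

# The metadata database maps identifiers to inodes via an index;
# a dict stands in here for a hash- or trie-based index.
metadata_db: dict = {}
metadata_db[0xDEADBEEF] = Inode(
    key=0xDEADBEEF,
    blocks=[BlockRef(1, ["node-1", "node-2"], "ab12")],
)
```

A lookup of the identifier returns the inode, from which the client learns which blocks comprise the file, where they are located, and their checksums.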
The master management unit 9 includes a free extents list - a list of free blocks across the storage nodes.
The master management unit 9 may include a locking system to ensure that multiple write accesses are not given on the same file to different client nodes, and write accesses are not given when the file is open for read access. The locking system may be a lease-based locking where client nodes obtain leases for specific access to specific data from the master node.
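The lease-based locking described above may be sketched as follows. Lease duration, renewal, and the in-memory structure are assumptions of this sketch; the system only requires that leases expire so a failed client cannot hold a lock indefinitely:

```python
import time

class LeaseManager:
    # Minimal lease-based locking: one writer, or any number of
    # readers, per file; leases expire after lease_seconds.
    def __init__(self, lease_seconds=30):
        self.lease_seconds = lease_seconds
        self.leases = {}  # key -> list of (client, mode, expiry)

    def _active(self, key):
        # Drop expired leases before deciding on a new request.
        now = time.monotonic()
        self.leases[key] = [l for l in self.leases.get(key, []) if l[2] > now]
        return self.leases[key]

    def acquire(self, key, client, mode):
        # mode is "read" or "write".
        active = self._active(key)
        if mode == "write" and active:
            return False  # no write lease while any lease is held
        if mode == "read" and any(m == "write" for _, m, _ in active):
            return False  # no read lease while a write lease is held
        active.append((client, mode, time.monotonic() + self.lease_seconds))
        return True
```

Under this scheme a client that crashes simply lets its lease lapse, after which other clients may acquire access.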
The master management unit 9 uses the index within the metadata database to map the identifier to the metadata and then sends the metadata to the client node 13.
The client management unit 15 uses the received metadata to access the storage node 1 which contains the data. The unit forms a request for access to the data held by the storage node 1 and transmits it to the storage node 1 over the network 6.
It will be appreciated that in some embodiments of the invention the blocks for the data may all be stored on more than one storage node and requests for these blocks will go to all the relevant storage nodes.
The request may include identifiers of the blocks held by the storage node and the type of access that is required.
The storage node 1 receives the request and the storage management unit 4 locates the data or portion of data that is stored at the storage node. The data may be stored within blocks in the secondary memory 5 of the storage node 1.
In accordance with the access that is required the storage management unit 4 receives data from the client node 13 to write to blocks on the storage node 1 (write access), transmits blocks of the data from secondary memory 5 to the client node 13 (read access), or marks the data blocks as free and notifies the master node 7 (delete access).
Where blocks of data are transmitted to the client node 13, the client management unit 15 on the client node 13 may use the checksums to verify the data received from the storage node 1. This helps to ensure the integrity of the data as it enables the detection of corrupted data.
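The verification step may be sketched as follows. The choice of SHA-256 is an assumption of this sketch; the system only requires that a checksum per block is carried in the metadata and recomputed by the client:

```python
import hashlib

def block_checksum(data: bytes) -> str:
    # Checksum of one block; the algorithm is illustrative only.
    return hashlib.sha256(data).hexdigest()

def verify_blocks(blocks, expected_checksums):
    # Compare checksums computed over the received blocks against
    # those carried in the metadata; a mismatch reveals corruption.
    return all(block_checksum(b) == c
               for b, c in zip(blocks, expected_checksums))
```

On a mismatch the client can discard the block and re-read it from another storage node holding a replica.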
Where blocks of data are written by the client node 13 to the storage node 1, the storage node 1 may notify the client node 13 of a successful action when the process of writing the data is completed.
The storage nodes 1 , 2, 3 are executing on more than one type of operating system.
Referring to Figure 2, one embodiment of a storage node 20 in accordance with the present invention will be further described.
The storage management unit 21 executes within user space 22 of the operating system. The unit 21 connects with the network 23 or inter-network to communicate with client and master nodes.
The unit 21 is connected to the local file system 24 of the operating system. The storage management unit 21 uses the local file system 24 to store and retrieve data in a secondary memory 25.
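Because the storage management unit runs in user space and goes through the local file system, the same logic is portable across operating systems. A sketch, in which the block-file naming scheme and directory layout are assumptions of this illustration:

```python
import os

class StorageManagementUnit:
    # Stores each block as an ordinary file through the local file
    # system, so no operating-system-specific kernel code is needed.
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, block_id: int) -> str:
        return os.path.join(self.root, f"{block_id:016x}.blk")

    def write_block(self, block_id: int, data: bytes) -> None:
        with open(self._path(block_id), "wb") as f:
            f.write(data)

    def read_block(self, block_id: int) -> bytes:
        with open(self._path(block_id), "rb") as f:
            return f.read()

    def free_block(self, block_id: int) -> None:
        # In the described system the master node would also be
        # notified that this block is free.
        os.remove(self._path(block_id))
```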
It will be appreciated that the local file system 24 may store data within a primary memory buffer.
User applications 26 may also be executing within user space 22 on the operating system. The user applications 26 can access other data stored within the secondary memory 25 for their own purposes through the local file system 24 using known methods.
In alternative embodiments the storage node 20 may also be a client node and include a client management unit. The user applications 26 may be connected with the client management unit to access data stored within the distributed file system, which may or may not be stored in the secondary memory 25 of the storage node 20. In this embodiment the client management unit may be connected with the network 23 or inter-network and the storage management unit 21.
An example of a distributed file system providing access to data for a web crawler client application in accordance with the invention will be described.
The web crawler application is executing on a client node and accessing the internet to download web pages. Also on the client node is a database, such as a Berkeley DB. The database converts each URL for the web page into a 64-bit unique identifier (UUID) using a hashing technique.
The database then requests free blocks from a client management unit on the client node in order to store the downloaded web page. The management unit sends the request to the master nodes in the distributed file system. A master node provides free data block identifiers and the responsible storage nodes using its free extents list. Each identifier is a unique block identifier.
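The master's free extents list in this allocation step may be sketched as follows. The flat list of (block identifier, storage node) pairs is an assumption of this illustration; any structure tracking free blocks across the storage nodes would serve:

```python
class FreeExtentsList:
    # Pool of free blocks across the storage nodes, handed out on
    # request; each entry pairs a block identifier with the node
    # responsible for it.
    def __init__(self):
        self.free = []  # list of (block_id, storage_node)

    def add(self, block_id, node):
        self.free.append((block_id, node))

    def allocate(self, count):
        # Hand out up to `count` free blocks and remove them from the pool.
        taken, self.free = self.free[:count], self.free[count:]
        return taken
```

The identifiers returned to the client are then recorded in its database against the web page's UUID.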
The unit provides the identifiers to the database, which stores the block identifiers for each web page with its associated UUID. To store the web page data, the database then uses the unit, which maps the identifiers to the storage nodes and accesses the corresponding blocks on those nodes.
A web page analyzer is also executing on the client node. The web analyzer uses the database to access a specific web page by providing the URL for the web page. The database converts the URL for the web page into a UUID and maps this to the associated block identifiers. The block identifiers are provided to the web page analyzer.
The web page analyzer requests the blocks from the client management unit. The unit sends the request to the master nodes which return the storage nodes responsible for the blocks.
The unit maps access to the blocks for the web page analyzer to the correct storage node.
To provide for replication of data, the master nodes may provide multiple block identifiers for a new block request or for write access. The client management unit stores the same data for a block within multiple blocks corresponding to the multiple identifiers. When the client node receives notification from all storage nodes that the block containing the data has been written, the client management unit can treat the writing action as successful. Replicating the data, and reporting success only when every write has completed, provides a robust redundancy method that protects against failure of storage nodes and permits replacement of corrupted data.
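The replicated write just described may be sketched as follows. The `write_fn` callable standing in for the network call to a storage node is an assumption of this sketch:

```python
def replicated_write(storage_nodes, block_ids, data, write_fn):
    # Write the same block data to every replica; write_fn(node,
    # block_id, data) models the network call and returns True on a
    # successful acknowledgement from that storage node.
    acks = [write_fn(node, bid, data)
            for node, bid in zip(storage_nodes, block_ids)]
    # The action is successful only if all replicas acknowledged.
    return all(acks)
```

If any storage node fails to acknowledge, the client can retry or request replacement block identifiers from the master node.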
Potential advantages of embodiments of the present invention include the ability to efficiently store very large amounts of data; the ability to deploy the system on existing infrastructure, utilizing already deployed computers with a variety of operating systems; the ability to efficiently satisfy the most frequent access pattern, namely random read/write requests; a high degree of distribution; and fault tolerance.
Utilizing existing network and workstation infrastructure makes system deployment cheaper and easier. In addition, the storage capacity of the system can be expanded by adding computers to the system configuration.
While the present invention has been illustrated by the description of the embodiments thereof, and while the embodiments have been described in considerable detail, it is not the intention of the applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departing from the spirit or scope of the applicant's general inventive concept.
Claims
1. A distributed file system for managing access to data, including: a plurality of storage nodes, each node including a memory arranged for storing data items and a memory management unit arranged for execution on one of a plurality of different operating systems and arranged for providing access to a data item to a client node; a master node arranged for storing a metadata database including a plurality of metadata, each metadata defining a mapping between a data item and one or more storage nodes containing the data item; and a client node arranged for requesting metadata related to a data item from the master node and accessing the data item from one or more storage nodes containing the item using the requested metadata.
2. A distributed file system as claimed in claim 1 wherein the metadata includes a universal identifier.
3. A distributed file system as claimed in one of the preceding claims wherein the memory of each storage node is secondary memory.
4. A distributed file system as claimed in one of the preceding claims wherein the metadata database is stored within primary memory of the master node.
5. A distributed file system as claimed in one of the preceding claims wherein the data is web content, such as web pages.
6. A distributed file system as claimed in one of the preceding claims wherein the client node is further arranged for mapping a URL for the web content to a universal identifier and requesting the metadata by transmitting the universal identifier to the master node.
7. A distributed file system as claimed in one of the preceding claims wherein the client node is further arranged for requesting metadata for data to be stored in the file system.
8. A distributed file system as claimed in claim 7 wherein the master node is further arranged, in response to the request for metadata for data to be stored, for providing a metadata corresponding to an empty block on a storage node.
9. A distributed file system as claimed in claim 7 wherein the master node is further arranged, in response to the request for metadata for data to be stored, for providing a plurality of metadata, the metadata corresponding to empty blocks on a plurality of storage nodes.
10. A distributed file system as claimed in claim 9 wherein the client node is further arranged for writing the data to each of the storage nodes.
11. A distributed file system as claimed in one of the preceding claims wherein metadata includes checksums related to the corresponding data item.
12. A distributed file system as claimed in claim 11 where the client node is further arranged for calculating checksums for a data item and comparing the calculated checksums to the checksums within the metadata for the data item.
13. A distributed file system as claimed in one of the preceding claims wherein the memory management unit is arranged for execution within user space of the operating system.
14. A distributed file system as claimed in one of the preceding claims wherein at least some of the nodes exist on the same computing device.
15. A distributed file system as claimed in one of the preceding claims wherein the client node includes a file management unit, the file management unit being arranged for requesting metadata related to a data item from the master node and accessing the data item from one or more storage nodes containing the item using the requested metadata.
16. A distributed file system as claimed in claim 15 wherein the file management unit is arranged for execution within kernel space of an operating system.
17. A storage node for a distributed file system as claimed in any one of the preceding claims including: a memory arranged for storing data items; and a memory management unit arranged for execution on one of a plurality of different operating systems and arranged for providing access to a data item to a client node.
18. A master node for a distributed file system as claimed in any one of claims 1 to 16 including: a memory arranged for storing a metadata database including a plurality of metadata, each metadata defining a mapping between a data item and one or more storage nodes containing the data item.
19. A client node for a distributed file system as claimed in any one of claims 1 to 16 including: a file management unit arranged for requesting metadata related to a data item from a master node and accessing the data item from one or more storage nodes containing the item using the requested metadata.
20. A computer program arranged for effecting the system as claimed in any one of claims 1 to 16.
21. A computer program arranged for effecting the node as claimed in any one of claims 18 to 19.
22. Storage media arranged for storing a computer program as claimed in any one of claims 20 to 21.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| UA200609645 | 2006-09-07 | | |
| UAA200609645 | 2006-09-07 | | |
| GB0624664.9 | 2006-12-11 | | |
| GB0624664A GB2442285A (en) | 2006-09-07 | 2006-12-11 | A distributed file system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2008029146A1 true WO2008029146A1 (en) | 2008-03-13 |
Family
ID=38658514
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/GB2007/003363 Ceased WO2008029146A1 (en) | A distributed file system operable with a plurality of different operating systems | 2006-09-07 | 2007-09-07 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2008029146A1 (en) |
2007-09-07: WO PCT/GB2007/003363 patent/WO2008029146A1/en, not_active Ceased
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20050251500A1 (en) * | 1999-03-03 | 2005-11-10 | Vahalia Uresh K | File server system providing direct data sharing between clients with a server acting as an arbiter and coordinator |
| US20050114291A1 (en) * | 2003-11-25 | 2005-05-26 | International Business Machines Corporation | System, method, and service for federating and optionally migrating a local file system into a distributed file system while preserving local access to existing data |
| US20050251522A1 (en) * | 2004-05-07 | 2005-11-10 | Clark Thomas K | File system architecture requiring no direct access to user data from a metadata manager |
| US20060041580A1 (en) * | 2004-07-09 | 2006-02-23 | Intransa, Inc. | Method and system for managing distributed storage |
| US20060053157A1 (en) * | 2004-09-09 | 2006-03-09 | Pitts William M | Full text search capabilities integrated into distributed file systems |
Non-Patent Citations (1)
| Title |
|---|
| ANASTASIADIS S V ET AL: "Lerna: an active storage framework for flexible data access and management", HIGH PERFORMANCE DISTRIBUTED COMPUTING, 2005. HPDC-14. PROCEEDINGS. 14TH IEEE INTERNATIONAL SYMPOSIUM ON RESEARCH TRIANGLE PARK, NC, USA JULY 24-27, 2005, PISCATAWAY, NJ, USA,IEEE, 24 July 2005 (2005-07-24), pages 176 - 187, XP010843417, ISBN: 0-7803-9037-7 * |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2013091212A1 (en) * | 2011-12-22 | 2013-06-27 | 华为技术有限公司 | Partition management method, device and system in distributed storage system |
| CN103299296A (en) * | 2011-12-22 | 2013-09-11 | 华为技术有限公司 | Partition management method, device and system in distributed storage system |
| CN103299296B (en) * | 2011-12-22 | 2016-03-09 | 华为技术有限公司 | Partition management method, equipment and system in a kind of distributed memory system |
| US20150113010A1 (en) * | 2013-10-23 | 2015-04-23 | Netapp, Inc. | Distributed file system gateway |
| US9507800B2 (en) | 2013-10-23 | 2016-11-29 | Netapp, Inc. | Data management in distributed file systems |
| US9575974B2 (en) * | 2013-10-23 | 2017-02-21 | Netapp, Inc. | Distributed file system gateway |
| CN105446794A (en) * | 2014-09-30 | 2016-03-30 | 北京金山云网络技术有限公司 | Disc operation method, apparatus and system based on virtual machine |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 07804164; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 07804164; Country of ref document: EP; Kind code of ref document: A1 |