
WO2023033100A1 - Processing apparatus - Google Patents


Info

Publication number
WO2023033100A1
WO2023033100A1 PCT/JP2022/032921
Authority
WO
WIPO (PCT)
Prior art keywords
metadata
tree
chunk
sub
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2022/032921
Other languages
French (fr)
Inventor
Lukasz Slusarczyk
Michal Welnicki
Krzysztof Lichota
Rafal WIJATA
Tadeusz KOPEC
Mateusz KIELAR
Andrzej JACKOWSKI
Cezary Dubnicki
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp filed Critical NEC Corp
Publication of WO2023033100A1 publication Critical patent/WO2023033100A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/14Details of searching files based on file metadata
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • G06F16/137Hash-based

Definitions

  • the present invention relates to a processing apparatus, a processing method, and a recording medium.
  • a technique used for a storage apparatus and the like for storing data is known.
  • Patent Document 1 describes a storage apparatus that includes: a data dividing unit configured to divide data to be written into a plurality of block data; a block detecting unit configured to confirm whether or not each of the block data obtained by division by the data dividing unit is already stored in a storage device; and a data writing unit configured to store each of the block data obtained by division by the data dividing unit into the storage device and, in the case of storing other block data having the same content as the block data already stored in the storage device into the storage device, refer to the block data already stored in the storage device as the other block data.
  • the block detecting unit detects a common rate that represents the rate of a common part between a plurality of consecutive block data forming a predetermined range in the data to be written among the block data obtained by division by the data dividing unit and a plurality of block data of a predetermined range already stored consecutively in the storage device. Moreover, the data writing unit newly stores the block data obtained by division by the data dividing unit into the storage device in accordance with the common rate detected by the block detecting unit.
  • Patent Document 2 discloses a system that includes a first file storage having a file system, a second file storage accepting an object request from the first file storage, and an object storage for storing a file or directory transferred as an object.
  • the first file storage detects the updated file or directory of the file system.
  • the first file storage aggregates a plurality of difference data of the detected file or directory and update information of the difference data into a difference aggregation object, and transfers the difference aggregation object to the second file storage.
  • the second file storage receives the difference aggregation object, and reflects each of the difference data to the object of the object storage in accordance with the difference data update information of the difference aggregation object.
  • Metadata containing information indicating the storage location of the data may be referred to.
  • metadata is generated at the time of storing data, and it takes time to generate metadata, which may become a bottleneck in storage performance. Thus, there is a problem that it may be difficult to efficiently generate metadata.
  • an object of the present invention is to provide a processing apparatus, a processing method, and a recording medium that solve the abovementioned problem.
  • a processing apparatus as an aspect of the present disclosure includes an object driver including an acquisition unit and a generation unit.
  • the acquisition unit is configured to acquire a chunk distributed based on a hash value of the chunk.
  • the chunk is obtained by division of an object.
  • the generation unit is configured to generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the chunk acquired by the acquisition unit.
  • the sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
  • a processing method as another aspect of the present disclosure is executed by an information processing apparatus.
  • the processing method includes: acquiring a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and generating a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunk.
  • the sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
  • a recording medium as another aspect of the present disclosure is a non-transitory computer-readable recording medium having a program recorded thereon.
  • The program includes instructions for causing an information processing apparatus to realize processes to: acquire a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunk.
  • the sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
  • Fig. 1 is a view showing an example of a configuration of a storage system in a first example embodiment of the present disclosure
  • Fig. 2 is a view showing an example of a configuration of a processing node
  • Fig. 3 is a view showing an example of processing by an object driver and a processing unit
  • Fig. 4 is a view showing an example of a configuration of the object driver that is characteristic of this example embodiment
  • Fig. 5 is a view showing an example of a configuration of the processing unit that is characteristic of this example embodiment
  • Fig. 6 is a view showing an example of metadata tree generation
  • Fig. 7 is a view showing an example of metadata tree generation
  • Fig. 8 is a flowchart showing an example of an operation of the processing node
  • Fig. 9 is a view showing an example of a hardware configuration of a processing apparatus in a second example embodiment of the present disclosure
  • Fig. 10 is a block diagram showing an example of a configuration of the processing apparatus
  • Fig. 11 is a view showing an example of a sample of capacity changes over a week
  • Fig. 12 is a view showing an example of the number of samples with a weekly writes-to-deletes ratio
  • Fig. 13 is a view showing an example of a peak increase of capacity as a percentage of current system capacity
  • Fig. 14 is a view showing an example of a peak decrease of capacity as a percentage of current system capacity
  • Fig. 15 is a view showing an example of system capacity usage
  • Fig. 16 is a view showing an example of block organization
  • Fig. 17 is a view showing an example of a write path architecture
  • Fig. 18 is a view showing an example of an object driver architecture
  • Fig. 19 is a view showing an example of an ObjectMetadataTree
  • Fig. 20 is a view showing an example of an ObjectMetadataLog
  • Fig. 21 is a view showing an example of a distributed merge
  • Fig. 22 is a view showing an example of merging two SubObjectMetadataTrees
  • Fig. 23 is a view showing an example of the number of OPS
  • Fig. 24 is a view showing an example of write throughput
  • Fig. 25 is a view showing an example of the time of an ObjectMetadataTree merge
  • Fig. 26 is a view showing an example of the time of an ObjectMetadataTree merge depending on key and metadata size
  • Fig. 27 is a view showing an example of merge time (random keys) with and without prefetch
  • Fig. 28 is a view showing an example of merge time (sequential keys) - AllInternal vs OMT Prefetch
  • Fig. 29 is a view showing an example of distributed merge time
  • Fig. 1 is a view showing an example of a configuration of a storage system 100.
  • Fig. 2 is a view showing an example of a configuration of a processing node 200.
  • Fig. 3 is a view showing an example of processing by an object driver 210 and an execution unit 240.
  • Fig. 4 is a view showing an example of a configuration of the object driver 210 that is characteristic of this example embodiment.
  • Fig. 5 is a view showing an example of a configuration of the execution unit 240 that is characteristic of this example embodiment.
  • Figs. 6 and 7 are views showing examples of metadata tree generation.
  • Fig. 8 is a flowchart showing an example of an operation of the processing node 200.
  • a storage system 100 in which a plurality of object drivers 210 generate sub metadata trees based on metadata logs thereof, respectively, and a metadata tree is generated by merging the generated sub metadata trees will be described.
  • the storage system 100 divides the acquired object into a plurality of chunks based on any criterion, and also calculates the hash values of the respective chunks obtained by division.
  • the storage system 100 distributes the chunks obtained by division to a plurality of object drivers 210 based on the calculated hash values.
  • the storage system 100 distributes the chunks to the object drivers 210 based on the calculated hash values so that a process to generate a sub metadata tree is evenly performed by the respective object drivers 210.
  • Each of the object drivers 210 generates a sub metadata tree for the object driver 210 based on the log of metadata indicating a chunk storage location and the like, stored in the object driver 210.
  • the storage system 100 can efficiently generate a metadata tree by merging the sub metadata trees generated by the respective object drivers 210.
  • Metadata generated in this example embodiment contains information necessary for executing a storage and reproduction operation process corresponding to an instruction from an external device or the like, for example, at the time of retrieving data.
  • the metadata can contain a content address indicating the storage location of data such as a chunk and an object name.
  • the metadata may contain information other than the information illustrated above, such as information indicating the type of the metadata and information indicating the name of a bucket including the object.
  • the metadata tree has a structure that holds metadata necessary to support the storage and reproduction operation process, for example, at the time of retrieving data.
  • the metadata tree has a tree structure such as a B+ tree.
  • leaf nodes as terminal nodes hold metadata
  • the size of each of the nodes is determined in advance to be within a predetermined range (may be any range).
  • the metadata tree is generated so that the lengths of paths from the root node to the respective leaf nodes are equal, for example.
  • the sub metadata tree may have the same structure as the metadata tree, for example, holding metadata corresponding to a partial range of the entire range of the hash values calculated from the chunks obtained by division of the object.
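  • as a concrete illustration of such a structure, the following is a minimal Go sketch of a B+-tree-like metadata tree in which only leaf nodes hold metadata and internal nodes hold references to child nodes; all type and field names (MetadataEntry, Node, Lookup) are illustrative assumptions, not taken from the embodiment.

```go
package metatree

// MetadataEntry is a hypothetical leaf payload: a key (for example a
// bucket name plus object key) mapped to metadata such as a content
// address indicating where the data is stored.
type MetadataEntry struct {
	Key            string
	ContentAddress []byte
}

// Node is one node of a B+-tree-like metadata tree: internal nodes keep
// only separators and references to child nodes, leaves keep metadata.
type Node struct {
	Separators []string        // low key of each child (internal nodes only)
	Children   []*Node         // non-nil for internal nodes
	Entries    []MetadataEntry // non-nil for leaf nodes
}

// IsLeaf reports whether the node holds metadata directly.
func (n *Node) IsLeaf() bool { return n.Children == nil }

// Lookup walks from the root to the leaf responsible for key. Because all
// root-to-leaf paths have equal length, the walk costs O(tree height).
func Lookup(root *Node, key string) (MetadataEntry, bool) {
	n := root
	for !n.IsLeaf() {
		i := len(n.Children) - 1
		for i > 0 && key < n.Separators[i] {
			i--
		}
		n = n.Children[i]
	}
	for _, e := range n.Entries {
		if e.Key == key {
			return e, true
		}
	}
	return MetadataEntry{}, false
}
```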
  • Fig. 1 shows an example of a configuration of the storage system 100.
  • the storage system 100 has a configuration in which a plurality of server computers are connected so as to be able to communicate with each other.
  • the storage system 100 includes a processing node 200 that is a server computer controlling the storage and reproduction operation process, and a storage node 300 that is a server computer including a storage device for storing data.
  • the processing node 200 is an information processing apparatus that functions as a gateway to the storage system 100 and provides a data access API (Application Programming Interface).
  • the storage node 300 is an information processing apparatus that holds data acquired via the processing node 200.
  • the number of processing nodes 200 and the number of storage nodes 300 are not limited to those illustrated in Fig. 1.
  • the storage system 100 may have, for example, a hybrid node that combines the function of the processing node 200 and the function of the storage node 300, instead of the processing node 200 or in addition to the processing node 200.
  • the storage system 100 in this example embodiment may be a content address storage system that divides data to be stored such as an object into a plurality of chunks and stores the chunks into a plurality of storage devices and also identifies a storage location where the chunk or the like is stored by using a unique content address set in accordance with the content of the stored data.
  • the storage system 100 may make the chunks obtained by division redundant and store into the storage devices in a distributed manner.
  • the storage location of each data stored in the storage system 100 can be identified by using the content address.
  • a hash value contained by the content address of each data such as the chunk is calculated based on the content of the data by using a hash function such as 160-bit SHA1.
  • the storage system 100 may have a deduplication function or the like that avoids duplicately storing data such as the chunk having already been stored.
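  • as a minimal sketch of how such a content address can be derived, assuming plain SHA-1 over the chunk content and ignoring the additional location information a real content address would carry:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// contentAddress derives an address for a chunk from its content using
// 160-bit SHA-1, as mentioned above.
func contentAddress(chunk []byte) string {
	sum := sha1.Sum(chunk) // [20]byte, i.e. 160 bits
	return hex.EncodeToString(sum[:])
}

func main() {
	a := contentAddress([]byte("example chunk payload"))
	b := contentAddress([]byte("example chunk payload"))
	// Identical content yields an identical address, which is what makes
	// duplicate detection possible.
	fmt.Println(a == b, a)
}
```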
  • the storage system 100 is not necessarily limited to including the processing node 200 and the storage node 300 as illustrated in Fig. 1, and may have any configuration.
  • the storage system 100 may be configured by a single information processing apparatus that has the function of the processing node 200 and the function of the storage node 300.
  • The processing node 200 (processing apparatus) is a computer that controls the storage and reproduction operation process in the storage system 100.
  • Fig. 2 shows an example of a configuration of the processing node 200.
  • the processing node 200 includes an object driver 210 that controls a process relating to an object, a file system driver 220 that controls a process relating to a file system, a VTL (virtual tape library) driver 230 that controls a process relating to a VTL, and an execution unit 240.
  • the processing node 200 includes an operation device such as a CPU (Central Processing Unit) and a storage device such as a hard disk drive and a memory.
  • the processing node 200 can implement the abovementioned functions and processing units of the respective functions by execution of a program stored in the storage device by the operation device.
  • the processing node 200 may include, instead of the CPU, a GPU (Graphic Processing Unit), a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), an FPU (Floating point number Processing Unit), a PPU (Physics Processing Unit), a TPU (Tensor Processing Unit), a quantum processor, a microcontroller, or a combination thereof.
  • the processing node 200 may have one or more of each of the components described above.
  • the object driver 210, the file system driver 220 and the VTL driver 230 perform, in accordance with a corresponding instruction from outside, a storage and reproduction operation process such as writing and reading data into and from the respective storage nodes 300 configuring a storage cluster via the execution unit 240.
  • the storage system 100 in this example embodiment has a characteristic in the object driver 210 and the execution unit 240 particularly among the abovementioned components. Below, the object driver 210 and the execution unit 240 will be described in more detail.
  • Fig. 3 is a view for describing processing by the object driver 210 and the execution unit 240.
  • the object driver 210 includes an HTTP server 211, a request processing unit 212, and a metadata log writing unit 213.
  • the object driver 210 includes a metadata information storage unit 214 such as a memory. Metadata information such as a metadata log that is the log of metadata, a sub metadata tree, and a metadata tree that will be described later is stored mainly in the metadata information storage unit 214, which is a memory.
  • the object driver 210 accepts an instruction for reading and writing an object that is data to be stored, by using the HTTP protocol or the like.
  • the HTTP server 211 receives an instruction for reading and writing an object using a REST API command or the like from an external device.
  • the request processing unit 212 then processes the received command.
  • the request processing unit 212 performs a process to retrieve data from or write data into the storage nodes 300 and the like configuring the storage cluster via the execution unit 240, by using the metadata information such as a metadata tree held by the metadata information storage unit 214.
  • the request processing unit 212 may acquire necessary metadata information from the storage nodes 300 and the like configuring the storage cluster.
  • the metadata log writing unit 213 stores a metadata log indicating the content of metadata such as a content address into the metadata information storage unit 214 when, for example, the request processing unit 212 writes data into the storage cluster via the execution unit 240.
  • the metadata log writing unit 213 may store a metadata log and the like corresponding to data deletion into the metadata information storage unit 214.
  • the metadata log writing unit 213 may store the metadata log into, in addition to the metadata information storage unit 214, the storage nodes 300 and the like configuring the storage cluster.
  • the object driver 210 can include an acquisition unit 215, a storing unit 216 and a sub metadata tree generation unit 217 in addition to the above configuration.
  • each of the processing units may be implemented by execution of a program stored in the storage device by the operation device.
  • the acquisition unit 215 acquires chunks obtained by dividing an object.
  • the chunks obtained by dividing the object are, for example, evenly distributed to the respective object drivers 210 by the execution unit 240 so that a process to generate a sub metadata tree is performed evenly by the respective object drivers 210. Therefore, the acquisition unit 215 acquires chunks evenly distributed to each of the object drivers 210 so that a process to generate a sub metadata tree is performed evenly by the respective object drivers 210.
  • the storing unit 216 stores chunks acquired by the acquisition unit 215 into the storage nodes 300 configuring the storage cluster via the execution unit 240.
  • the storing unit 216 may make the chunks redundant by a method as described in Patent Document 1 and then store in a distributed manner.
  • the storing unit 216 may perform a deduplication process and then store the chunks.
  • the metadata log writing unit 213 stores a metadata log, indicating the content of metadata containing a content address of the location where the storing unit 216 stored the chunks, into the metadata information storage unit 214. That is to say, the metadata log is stored mainly in the metadata information storage unit 214, which is a memory.
  • the sub metadata tree generation unit 217 generates a sub metadata tree based on the metadata log stored in the metadata information storage unit 214. That is to say, the sub metadata tree generation unit 217 generates a sub metadata tree based on the metadata log stored in the memory, corresponding to the chunks acquired by the acquisition unit 215.
  • the sub metadata tree generation unit 217 generates a sub metadata tree having a tree structure such as a B+ tree, based on the metadata log stored in the metadata information storage unit 214.
  • the sub metadata tree generation unit 217 causes the leaf nodes to hold metadata, and also causes the internal nodes to hold reference information to another node.
  • the sub metadata tree generation unit 217 generates a new leaf node every time a size that the leaf node can hold is exceeded. Consequently, the sub metadata tree generation unit 217 stores metadata in the leaf nodes so that the size of the leaf node is within a predetermined range.
  • the sub metadata tree generation unit 217 generates a sub metadata tree so that the lengths of paths from the root node to the respective leaf nodes become equal, for example.
  • the sub metadata tree generation unit 217 generates a sub metadata tree based on the metadata log stored in the metadata information storage unit 214. Moreover, the sub metadata tree generation unit 217 stores the generated sub metadata tree into the metadata information storage unit 214. The sub metadata tree generation unit 217 may store the generated sub metadata tree into any storage node 300 in the storage cluster.
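  • the leaf-packing step described above can be sketched in Go as follows; the types (LogEntry, Leaf) and the byte-size accounting are assumptions made for illustration, and the construction of internal nodes above the leaves is omitted.

```go
package metatree

// LogEntry is a hypothetical metadata-log record: a key and its
// serialized metadata (content address, object name, and so on).
type LogEntry struct {
	Key      string
	Metadata []byte
}

// Leaf is one terminal node of the sub metadata tree under construction.
type Leaf struct {
	Entries []LogEntry
	bytes   int
}

// buildLeaves packs log entries, assumed to be sorted by key, into leaves.
// A new leaf is started whenever adding the next entry would exceed
// maxLeafBytes, so every leaf stays within the configured size range.
func buildLeaves(log []LogEntry, maxLeafBytes int) []*Leaf {
	var leaves []*Leaf
	cur := &Leaf{}
	for _, e := range log {
		sz := len(e.Key) + len(e.Metadata)
		if cur.bytes > 0 && cur.bytes+sz > maxLeafBytes {
			leaves = append(leaves, cur)
			cur = &Leaf{}
		}
		cur.Entries = append(cur.Entries, e)
		cur.bytes += sz
	}
	if cur.bytes > 0 {
		leaves = append(leaves, cur)
	}
	return leaves
}
```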
  • the execution unit 240 includes a division and distribution unit 241 and a metadata tree generation unit 242 as components that are characteristic of this example embodiment.
  • Upon acquisition of an object to be stored from the object driver 210, the division and distribution unit 241 divides the acquired object into a plurality of chunks. Moreover, the division and distribution unit 241 distributes the chunks obtained by division to the respective object drivers 210. The size of the chunks obtained by division by the division and distribution unit 241 may be variable.
  • the division and distribution unit 241 distributes the chunks obtained by division to the respective object drivers 210 so that the process to generate a sub metadata tree is evenly performed by the respective object drivers 210.
  • the division and distribution unit 241 calculates the hash values of the respective chunks. Then, the division and distribution unit 241 distributes the chunks as evenly as possible to the respective object drivers 210 by using the calculated hash values. For example, by determining a distribution target hash value range for each of the object drivers 210 and distributing chunks whose calculated hash values are within the range, the division and distribution unit 241 distributes the chunks as evenly as possible to the respective object drivers 210. In other words, the division and distribution unit 241 can distribute a corresponding chunk to the object driver specified in accordance with a range to which a hash value belongs.
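  • a minimal Go sketch of such a range-based assignment follows, assuming SHA-1 hashes and equally sized ranges; the embodiment only requires that chunks be assigned according to the range their hash value falls into, so the names and choices here are illustrative.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

// driverForChunk picks the object driver responsible for a chunk by
// dividing the hash space into numDrivers equal ranges and mapping the
// chunk's hash into one of them.
func driverForChunk(chunk []byte, numDrivers int) int {
	sum := sha1.Sum(chunk)
	// Use the first 8 bytes of the hash as a position in the hash space.
	pos := binary.BigEndian.Uint64(sum[:8])
	rangeSize := ^uint64(0)/uint64(numDrivers) + 1
	return int(pos / rangeSize)
}

func main() {
	chunks := [][]byte{[]byte("chunk-a"), []byte("chunk-b"), []byte("chunk-c")}
	for _, c := range chunks {
		fmt.Printf("%q -> driver %d\n", c, driverForChunk(c, 4))
	}
}
```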
  • the metadata tree generation unit 242 merges sub metadata trees generated by the respective object drivers 210 to generate one metadata tree based on the plurality of sub metadata trees. Moreover, the metadata tree generation unit 242 stores the generated metadata tree into the metadata information storage unit 214. The metadata tree generation unit 242 may store the generated metadata tree into the storage cluster.
  • Fig. 7 shows an example of processing in merging two sub metadata trees, sub metadata tree 1 and sub metadata tree 2.
  • the metadata tree generation unit 242 extracts and manipulates leaf nodes that are the ends of the two sub metadata trees to merge the two sub metadata trees.
  • the metadata tree generation unit 242 stores metadata contained in the first leaf node (leaf node on leftmost edge) of the sub metadata tree 2 until the size of the last leaf node (leaf node on rightmost edge) of the sub metadata tree 1 exceeds the upper limit.
  • the metadata tree generation unit 242 recalculates the reference information to the respective leaf nodes and the like. Consequently, the two sub metadata trees, the sub metadata tree 1 and the sub metadata tree 2, are merged.
  • the leaf nodes other than the leaf nodes at the ends to be manipulated continue to hold the same metadata as before merging.
  • the metadata tree generation unit 242 can merge a plurality of sub metadata trees by repeatedly performing the process to merge a pair of sub metadata trees.
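  • the boundary-leaf manipulation described above can be sketched as follows; only the step of moving metadata from the first leaf of the second tree into the last leaf of the first tree is shown, and the recalculation of reference information in the internal nodes is omitted. The names and the entry-count size limit are illustrative assumptions.

```go
package metatree

// entry and leaf are simplified stand-ins for the structures used above;
// sizes are counted in number of entries rather than bytes to keep the
// example short.
type entry struct{ Key string }

type leaf struct{ Entries []entry }

// mergeBoundaryLeaves moves entries from the first leaf of the right tree
// into the last leaf of the left tree until the latter reaches maxEntries.
// All other leaves keep exactly the metadata they held before the merge;
// recomputing the internal nodes that reference these leaves is omitted.
func mergeBoundaryLeaves(lastOfLeft, firstOfRight *leaf, maxEntries int) {
	for len(lastOfLeft.Entries) < maxEntries && len(firstOfRight.Entries) > 0 {
		lastOfLeft.Entries = append(lastOfLeft.Entries, firstOfRight.Entries[0])
		firstOfRight.Entries = firstOfRight.Entries[1:]
	}
}
```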
  • the example of the configurations of the object driver 210 and the execution unit 240 has been described above.
  • the configurations of the object driver 210 and the execution unit 240 are not limited to those illustrated in this example embodiment.
  • at least part of the functions of the division and distribution unit 241 and the metadata tree generation unit 242 included by the execution unit 240 may be included by at least some of the object drivers 210. That is to say, any object driver 210 may divide an object and distribute chunks to the other object drivers 210.
  • the execution unit 240 may include at least part of the function included by the object driver 210.
  • the storage node 300 is a server computer including a storage device for storing data. As described above, in this example embodiment, a plurality of storage nodes 300 configure a storage cluster.
  • in the storage node 300, chunks obtained by dividing an object are stored.
  • the storage node 300 may have a deduplication function to avoid duplicately storing a chunk having already been stored.
  • metadata information such as a metadata log, a sub metadata tree and a metadata tree may be stored in the storage node 300.
  • the storage node 300 can transmit the stored metadata information to the processing node 200 in accordance with an instruction from the processing node 200.
  • the execution unit 240 acquires an object to be stored via the object driver 210. Then, the division and distribution unit 241 divides the object into a plurality of chunks (step S101). The size of the chunk may be variable.
  • the division and distribution unit 241 distributes the chunks obtained by division to a plurality of object drivers 210 (step S102). For example, the division and distribution unit 241 may distribute the chunks to the object drivers 210 as evenly as possible based on the hash values of the chunks.
  • Upon acquisition of a chunk, the object driver 210 stores the acquired chunk into the storage cluster. Moreover, the metadata log writing unit 213 stores a metadata log, indicating the content of metadata such as a chunk storage location, into the metadata information storage unit 214 (step S201). That is to say, the metadata log is stored mainly in the metadata information storage unit 214, which is a memory.
  • the sub metadata tree generation unit 217 generates a sub metadata tree based on the metadata log stored in the metadata information storage unit 214 (step S202). That is to say, the sub metadata tree generation unit 217 generates a sub metadata tree based on the metadata log corresponding to the chunk acquired by the acquisition unit 215.
  • the metadata tree generation unit 242 merges the sub metadata trees generated by the respective object drivers 210 to generate one metadata tree based on the plurality of sub metadata trees (step S103).
  • the processing node 200 includes the sub metadata tree generation unit 217 and the metadata tree generation unit 242.
  • the metadata tree generation unit 242 can generate a metadata tree by merging the sub metadata trees generated by the respective sub metadata tree generation units 217. As a result, it is possible to quickly generate a metadata tree.
  • the processing node 200 includes the division and distribution unit 241.
  • the division and distribution unit 241 can distribute the chunks obtained by division to the respective object drivers 210 so that a process to generate a sub metadata tree is evenly performed by the respective object drivers 210. As a result, it is possible to perform efficient generation of sub metadata trees, and it is possible to easily merge into a metadata tree.
  • the processing node 200 described in this example embodiment stores a metadata log and the like in the metadata information storage unit 214 that is a memory. By performing such in-memory processing, it is possible to realize more efficient generation of sub metadata trees and generation of a metadata tree.
  • a processing apparatus 400 that is an information processing apparatus generating a metadata tree by merging a plurality of sub metadata trees will be described.
  • Fig. 9 shows an example of a hardware configuration of the processing apparatus 400.
  • Referring to Fig. 9, the processing apparatus 400 has, as an example, a hardware configuration as shown below including: a CPU (Central Processing Unit) 401 (operation device), a ROM (Read Only Memory) 402 (storage device), a RAM (Random Access Memory) 403 (storage device), programs 404 loaded to the RAM 403, a storage device 405 for storing the programs 404, a drive device 406 reading from and writing to a recording medium 410 outside the information processing apparatus, a communication interface 407 connected to a communication network 411 outside the information processing apparatus, an input/output interface 408 inputting and outputting data, and a bus 409 connecting the respective components.
  • the processing apparatus 400 can implement functions of an acquisition unit 421 and a generation unit 422 shown in Fig. 10 by acquisition and execution of the programs 404 by the CPU 401.
  • the programs 404 are, for example, stored in the storage device 405 or the ROM 402 in advance, and loaded to the RAM 403 or the like and executed by the CPU 401 as necessary.
  • the programs 404 may be supplied to the CPU 401 via the communication network 411, or may be stored in the recording medium 410 in advance and retrieved by the drive device 406 and supplied to the CPU 401.
  • Fig. 9 shows an example of the hardware configuration of the processing apparatus 400.
  • the hardware configuration of the processing apparatus 400 is not limited to the above example.
  • the processing apparatus 400 may be configured by part of the above configuration, for example, excluding the drive device 406.
  • the acquisition unit 421 acquires chunks distributed based on the hash values of chunks obtained by dividing an object. For example, the acquisition unit 421 acquires chunks distributed by the same object driver or the other object driver.
  • the generation unit 422 generates a sub metadata tree having a tree structure holding metadata based on the metadata log of the chunks acquired by the acquisition unit 421. By merging the sub metadata tree with another sub metadata tree, a metadata tree holding the metadata of the object is configured. That is to say, the sub metadata tree has a structure containing part of the metadata contained in the metadata tree.
  • the processing apparatus 400 includes the acquisition unit 421 and the generation unit 422.
  • the generation unit 422 can generate a sub metadata tree having a tree structure holding metadata based on the metadata log of chunks acquired by the acquisition unit 421.
  • the generation unit 422 can generate a metadata tree based on sub metadata trees generated in a distributed manner, and it is possible to efficiently generate a metadata tree.
  • a program as another aspect of the present invention is a program for causing an information processing apparatus such as the processing apparatus 400 to realize processes to: acquire chunks obtained by dividing an object, distributed based on hash values of the chunks; generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunks; and configure a metadata tree holding metadata of the object by merging the sub metadata tree with another sub metadata tree.
  • a processing method executed by an information processing apparatus such as the processing apparatus 400 described above is a method including, by the information processing apparatus such as the processing apparatus 400: acquiring chunks obtained by dividing an object, distributed based on hash values of the chunks; generating a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunks; and configuring a metadata tree holding metadata of the object by merging the sub metadata tree with another sub metadata tree.
  • the invention of a program, or a computer-readable recording medium having a program recorded thereon, or a processing method configured as described above also exerts the same action and effect as the processing apparatus 400 described above, and can therefore achieve the abovementioned object of the present invention.
  • Backup is one of the common use cases of object storage, and not only due to its growing popularity.
  • Object storage interfaces were designed to move large data over long distances; moreover, their object locking functionality provides native ransomware protection, making object storage a suitable solution for protecting data. That is why backup applications often support object storage interfaces.
  • Deduplication is a data reduction technique that is widely used in backup storage.
  • the highest deduplication efficiency is achieved by backup appliances that provide block-level, in-line deduplication.
  • Such an appliance can reduce the storage footprint even if the backup application has its own internal deduplication.
  • a backup appliance can deduplicate data written by different applications.
  • Object storage backup brings a new challenge, because backup applications leverage object storage by writing data in smaller objects (e.g. 1 MB - 64 MB). Uploading data in smaller parts allows quick recovery from network failures, upload parallelization, and initial deduplication, but generates an amount of metadata that has not been seen before.
  • Object storage has been in demand in recent years. In 2017 it was predicted that over 30% of data center capacity would be in object stores [1]. In 2021 Amazon announced that they store over 100 trillion objects in their Amazon S3 service [2], nine years after they announced the first trillion [3].
  • MinIO, which is a popular open-source object storage, has over 500 million Docker pulls and over 12,000 Slack channel members [4].
  • Data deduplication is a data reduction technique that can reduce particular data by as much as 10-30x [5], thus it is available in a wide range of systems [6-8].
  • the capacity savings come with additional costs, such as fingerprint computation for duplicate elimination, data fragmentation, and metadata overhead, depending on the selection of the chunking algorithm or the desired chunk size [9,10]. Since backups contain data that repeats over time, they can achieve a high deduplication ratio [11].
  • backups are the major use-case for deduplication appliances.
  • Our contribution consists of (a) an analysis of usage patterns in 600+ storage systems with global deduplication, (b) an analysis of object storage interface operations, (c) a design of a system with an object storage interface and deduplication, (d) a design of ObjectMetadataTree that allows efficient and parallel processing of object storage metadata, and (e) an evaluation of our implementation.
  • Section 2 describes the background of the problem and the state of the art.
  • Section 3 describes the elements of the object storage API that were the most important in our research.
  • Section 4 contains our analysis of the foregoing use cases of backup appliances.
  • Section 5 describes our model of a deduplication storage system and how the object interface fits into it.
  • Section 6 describes the algorithms used for metadata management, and Section 7 describes the distributed version of metadata management.
  • Section 8 contains the evaluation of our implementation. Section 9 contains conclusions.
  • Deduplication is a data reduction technique that has many applications such as primary storage [45-47] and increasing lifespan of SSDs [48].
  • our focus, however, is on secondary storage and on-premise deduplication appliances, which achieve the highest deduplication efficiency.
  • Inline deduplication with small chunks is a key feature of deduplication appliances, but imposes requirements on a system design.
  • Duplicate elimination requires chunking of data, as well as the computation and validation of a fingerprint [53].
  • a fixed-size or variable-size chunking algorithm can be used and extended with application-specific knowledge [9].
  • the chunk size has to be small (typically in the 1-64 KB range) to provide a high deduplication ratio [18].
  • Each chunk has its related metadata - at least an entry in the fingerprint index and one reference for each duplicate elimination. The latter increases with high deduplication ratios, making metadata a significant part of the whole storage, especially if the metadata is replicated for resiliency.
  • small chunk deduplication causes data fragmentation.
  • Large objects and files are divided into small blocks which are distributed across the whole system with miscellaneous locality, depending on pattern of written data.
  • Amazon Web Services initiated popularity of object storage when Amazon S3 was introduced in 2006 [20].
  • storage with the same or similar interface is provided by hyperscalers [21,22], other public clouds [23], on-premise enterprise storage systems [24] and open-source solutions [25, 26].
  • Such interfaces are used in very diverse applications, e.g. video services [27], social media [28], games [29], and of course backups [30-32].
  • the Google Storage XML API provides commands for uploading a file in multiple parts that differ from those of Amazon S3, requires different request headers, and has some other dissimilarities [33]. That is why object interfaces in on-premise solutions are often described as S3-compatible or Swift-compatible to emphasize that they implement a large subset of the commands provided by these popular interfaces.
  • Amazon S3 offers 3,500-5,500 requests per prefix per second, 100 Gb/s single-instance throughput, and 100-200 ms small-file latency, starting at $21 per terabyte per month [34] (the final price varies with the number of requests, data lifecycle, and more [35]).
  • there are also offerings such as Amazon Glacier that sacrifice latency to decrease the preservation cost [36], as well as the opposite - systems offering improved performance at additional cost [37].
  • DedupeSwift [65] is an implementation of deduplication for OpenStack Swift that reaches a read/write bandwidth of 10.54-25.51 MB/s with RAID5 HDDs and an SSD for the fingerprint cache.
  • deduplication in Ceph, according to the documentation [6], is a post-process deduplication, so it cannot reach better deduplication write throughput than the underlying storage and might not be the best fit for backup workloads [66-69].
  • Data Domain Cloud Tier is a feature that allows expanding the capacity of a backup appliance by moving its data to cloud object storage.
  • in that solution, the backup appliance manages two separate deduplication domains - an active tier with local storage and a cloud tier that stores data in the cloud.
  • in both Cloud Tier and ObjDedup the data ends up deduplicated in object storage, but in our solution the backup appliance is an object storage provider, not a client.
  • Our approach allows storing backups from multiple sources through a common object storage interface and deduplicating them with data provided by other interfaces.
  • Can’t we all get along [9] is a study about increasing the performance of a filesystem with deduplication by enriching an HDD-based system with SSD drives.
  • the performance of nonsequential accesses was greatly increased to allow operations such as Instant Restore. As in our research, the goal was to limit the utilization of HDD drives, but to support different use cases.
  • a backup appliance should effectively implement multiple interfaces to reach the highest level of global deduplication, so the paper is complementary to our work.
  • Objects and buckets are two building blocks of object storage.
  • the object itself is a data stream with associated metadata.
  • This metadata contains a key of size up to 1 KB uniquely identifying the object within a bucket, information about the object content (length, language, MD5 digest, etc.), up to 2 KB of user custom metadata, and some other fields related to ACLs, encryption, and more.
  • the object hierarchy is flat: each object is identified by its key and belongs to a bucket, but a bucket cannot be stored in a bucket.
  • the interface is built on the HTTP protocol using REST principles, so objects are uploaded by PUTs, deleted with DELETEs, and so on.
  • the interface can be very extensive, and object storage operations can support features that have no counterpart in POSIX filesystems.
  • Amazon S3 currently has almost 100 commands: object management such as PutObject, GetObject, DeleteObject, ListObjects; security related such as PutBucketEncryption, PutObjectAcl, PutBucketOwnershipControls; commands allowing uploading an object in multiple parts such as CreateMultipartUpload, UploadPart, CompleteMultipartUpload; and others supporting tiering, replication, and more.
  • ListObjects can return at most 1000 objects. If the selected bucket or prefix has more than 1000 objects, many ListObjects requests must be sent, each specifying the beginning of the listing with a marker parameter.
  • Multipart upload (MPU) is an operation that allows uploading an object using multiple HTTP requests. It is recommended for larger objects, because it makes it possible to handle connection failures efficiently and to multiply throughput. MPU starts with CreateMultipartUpload, which can be called even if the final size of the object is unknown, and is finished with CompleteMultipartUpload when all parts are uploaded. It can also be interrupted with AbortMultipartUpload, which deletes all previously written parts. An object uploaded using the multipart upload operation can consist of at most 10,000 parts and its size is limited to 5 TB. Importantly, when the MPU is finished, each part of the object can still be accessed separately if its number is specified in the GetObject request.
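  • the sequence of requests behind a multipart upload can be sketched in Go as follows; this is only an illustration of the initiate/upload/complete flow against a placeholder endpoint, with request signing, error handling, and retries deliberately omitted, so it is not a working client.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"net/http"
)

// Placeholder endpoint, bucket and key; a real client would also sign
// every request (e.g. AWS Signature V4).
const objectURL = "http://localhost:9000/my-bucket/my-object"

type initiateResult struct {
	UploadId string `xml:"UploadId"`
}

func main() {
	client := &http.Client{}

	// 1. CreateMultipartUpload: POST ?uploads returns an upload id.
	resp, _ := client.Post(objectURL+"?uploads", "application/octet-stream", nil)
	body, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	var init initiateResult
	xml.Unmarshal(body, &init)

	// 2. UploadPart: PUT ?partNumber=N&uploadId=... for each part.
	etags := make([]string, 0, 2)
	for n, part := range [][]byte{[]byte("part one "), []byte("part two")} {
		u := fmt.Sprintf("%s?partNumber=%d&uploadId=%s", objectURL, n+1, init.UploadId)
		req, _ := http.NewRequest(http.MethodPut, u, bytes.NewReader(part))
		r, _ := client.Do(req)
		etags = append(etags, r.Header.Get("ETag"))
		r.Body.Close()
	}

	// 3. CompleteMultipartUpload: POST ?uploadId=... with the part list.
	var b bytes.Buffer
	b.WriteString("<CompleteMultipartUpload>")
	for i, etag := range etags {
		fmt.Fprintf(&b, "<Part><PartNumber>%d</PartNumber><ETag>%s</ETag></Part>", i+1, etag)
	}
	b.WriteString("</CompleteMultipartUpload>")
	resp, _ = client.Post(objectURL+"?uploadId="+init.UploadId, "application/xml", &b)
	resp.Body.Close()
	fmt.Println("multipart upload completed:", resp.Status)
}
```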
  • Object versioning allows storing multiple versions of the same object. If a bucket has versioning enabled, then each PUT/POST/COPY operation creates a new version of the object. Furthermore, a default DELETE creates a delete marker instead of permanently deleting the object. Versions and delete markers can be reviewed using the ListObjectVersions operation, because ListObjects returns only the newest version of an object, or no object if a delete marker is its newest version. Each version and delete marker can be deleted permanently if the object’s versionId is included in the DeleteObject call.
  • the object can be stored in write-once-read-many (WORM) model to provide ransomware protection.
  • an object version can have a configured retention period, which is a time during which the version cannot be deleted or overwritten.
  • a legal hold can be placed on an object version, which prevents its removal or overwrite as long as the legal hold is not removed by the user. Both mechanisms can be used at the same time.
  • Since backup appliances are used primarily for backups, their usage patterns result from user backup policies.
  • To implement a new interface, we examined how customers use the system with the current interfaces. The examined data shows how much data is added to and removed from the system every week, independently of which interface was used to write the data. We expect that with the new object storage interface the behavior will be similar, although this is a topic for future research once object storage interfaces are commonly used in backup appliances.
  • Our deduplication system, like many others [57], stores data in immutable blocks that can be deduplicated. References to blocks can be kept in other blocks, so data can be organized in a tree-shaped structure (see fig. 16). Each file or object has its node, which has references to the trees that contain the data (DataTrees). Such a node can be referenced from a metadata structure (in the case of objects it is the ObjectMetadataTree described in section 6). Our system is content addressable, so the address of each block is determined based on its data.
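  • a minimal Go sketch of such content-addressable, immutable blocks follows; here the address is simply a SHA-1 over the block content and child addresses, which is an assumption made for illustration rather than the actual address format.

```go
package blocks

import "crypto/sha1"

// BlockAddress identifies an immutable block; in this sketch it is the
// SHA-1 of the block's content (content addressing).
type BlockAddress [sha1.Size]byte

// Block is either a data block or an internal block that references other
// blocks, so objects and files can be organized as trees of blocks.
type Block struct {
	Data     []byte
	Children []BlockAddress
}

// Address derives the block's address from its serialized content, so
// identical blocks always get identical addresses and can be deduplicated.
func (b *Block) Address() BlockAddress {
	h := sha1.New()
	h.Write(b.Data)
	for _, c := range b.Children {
		h.Write(c[:])
	}
	var addr BlockAddress
	copy(addr[:], h.Sum(nil))
	return addr
}
```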
  • Blocks that have no live references are eventually deleted.
  • Deleting objects in systems with inline deduplication requires additional effort to prevent the situation in which new writes reference data that is being deleted.
  • the system implements a multiphase process based on a garbage collection technique such as reference counting or mark-and-sweep [58-60].
  • the process uses epochs as in [58]. That enforces an additional requirement on our object API implementation - we need to keep the invariant that in epoch T the client does not keep addresses obtained in epochs T-2 and earlier. In practice, it means that the driver must be able to finish all already started operations within a few minutes.
  • Using the epochs, the system is able to limit the number of re-calculated counters only to blocks that were changed recently and their children.
  • deduplication can increase the latency of read and write operations. For that reason, the system allows sending priority requests that have preference in queues and, in the case of writes, are written to media with lower latency. Priority requests are used to provide lower latency for operations such as writes of small files.
  • DriverProcesses is a group of GenericDriverProcesses that load plugins for different interfaces and use them with a GenericDriverImplementation that reuses operations that are common, such as writing data with deduplication, reading data, and ensuring QoS.
  • StorageCluster is responsible for distributing data across devices, handling hardware errors, and performing garbage collection. StorageCluster is also a source of information about block existence for deduplication.
  • StorageCluster can write a block and return its BlockAddress, or read a block when a BlockAddress is provided. It can also answer whether a block is already stored when a hash of the block is given.
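  • the three StorageCluster operations mentioned above can be summarized as the following hypothetical Go interface; the method names and signatures are illustrative, not the actual API.

```go
package cluster

// BlockAddress and BlockHash are opaque identifiers; their concrete
// layout is an implementation detail of the storage cluster.
type (
	BlockAddress []byte
	BlockHash    []byte
)

// StorageCluster captures the operations described above: writing a block
// (which returns its address), reading a block by address, and answering
// whether a block with a given hash is already stored, which is what the
// deduplication logic relies on.
type StorageCluster interface {
	WriteBlock(data []byte) (BlockAddress, error)
	ReadBlock(addr BlockAddress) ([]byte, error)
	IsBlockStored(hash BlockHash) (bool, error)
}
```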
  • Metadata management should not be a bottleneck if possible, even considering the increased number of metadata operations due to deduplication and the user load characteristics.
  • Metadata structures use the same immutable storage as data.
  • the blocks that keep references to deleted objects are removed quickly, so the space can be reclaimed after the next space reclamation process.
  • User requests have low latency, but only a fraction of metadata-related operations are performed as priority requests.
  • the impact of metadata management on StorageCluster is as low as possible, because we want to leave as many resources as possible to operations not related to object storage.
  • Requirement 3 was the reason why we did not use a static-sorted-table-based log-structured merge tree. Such a structure has an excellent amortized cost of insertions when entries are immutable, but requires keeping not-yet-removed entries for a long time. Instead we propose an algorithm based on an immutable tree that is similar to a B+-tree and is effectively rewritten in short time intervals.
  • ObjectDriver handles REST API commands and processes them in the RequestHandler. Operations require metadata, a subset of which is kept by the InMemoryObjectMetadataStore. If particular metadata is not available in memory, then it is read from StorageCluster. ObjectDriver metadata is organized in the ObjectMetadataTree, which is discussed in section 6.2. New writes of metadata are written by the ObjectMetadataLogHandler to the ObjectMetadataLog, which keeps them both persistently in StorageCluster and in memory. Insertions to the ObjectMetadataLog are priority requests, so the latency of user requests is as low as possible.
  • ObjectMetadataTree persistently keeps all metadata required to support Object API operations and references to data trees. It is organized like a B+-tree, so internal nodes keep only references to other nodes and separators, and leaves keep all other metadata.
  • three kinds of structures are kept in the ObjectMetadataTree - BucketMetadata, ObjectMetadata, and MpuMetadata. The structures are kept in an order designated by the metadata structure content, especially its type, bucket name, object key, and object version (as shown in figure 19).
  • each node (besides the rightmost ones) has a size between 1/2 S and S, where S is the configured maximum size of a node.
  • Metadata Merge (section 6.4) operation is the only operation that modifies ObjectMetadataTree.
  • ObjectMetadataLog is a structure that receives all recently written metadata. Every metadata entry is written as a single priority request, so a response to a user request can be sent quickly (if there are many concurrent requests, they can be grouped to decrease the number of priority requests).
  • the ObjectMetadataLog size is small enough to keep its content cached in memory, so its content is read from StorageCluster only in case of a failure (the failover can be conducted by a different ObjectDriver).
  • unlike the ObjectMetadataTree, it also keeps information about which data should be removed, as deleteEntries that record which objects should be deleted later during the merge. Entries in the ObjectMetadataLog are kept ordered by the time of their requests - this is important because the object storage interface allows empty deletes, so the situation in which a PUT came before a DELETE must be distinguishable from the one in which the DELETE came before the PUT (as presented in figure 20).
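  • a minimal sketch of such log entries follows, assuming illustrative field names; the point is that puts and deleteEntries for the same key stay distinguishable because entries are kept in request-time order.

```go
package metalog

import "time"

// OpType distinguishes writes from deletes recorded in the log.
type OpType int

const (
	OpPut    OpType = iota
	OpDelete        // a deleteEntry: the object is removed later, during merge
)

// LogEntry is one record of the metadata log, kept ordered by request
// time so that "PUT then DELETE" and "DELETE then PUT" for the same key
// remain distinguishable.
type LogEntry struct {
	RequestTime    time.Time
	Op             OpType
	Bucket, Key    string
	ContentAddress []byte // set for puts only
}
```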
  • MetadataMerge is the operation of rewriting the ObjectMetadataTree by applying all changes from the ObjectMetadataLog. Logically we rewrite the tree, but rewriting all data after filling the ObjectMetadataLog is infeasible, so nodes from the existing ObjectMetadataTree are re-used as much as possible.
  • the MetadataMerge traverses the ObjectMetadataTree in a DFS manner. For each node, a decision is taken whether the node with its whole subtree can be re-used in the new version of the tree or whether any entry from the ObjectMetadataLog needs to be applied. If a change in a subtree is required, then we walk down to the relevant leaf node and apply the changes relevant to it. If after the changes the node size is not in [1/2 S; S], the node is split into two, or the next node to the right is read in order to create a node or nodes of valid size. Regardless of the change in the number of children, all internal nodes on the path from the modified leaf to the root require modification, because blocks are immutable and the address of the leaf has changed. If the size of an internal node is outside [1/2 S; S], the procedure is to split it in the same way as leaves. When the size of the root exceeds the limit, a new root node above it is added.
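  • the reuse-or-descend decision can be sketched as follows; this simplified Go version applies only inserts (no deletes), tracks no node sizes, and omits splitting, so it illustrates only the core idea that untouched subtrees are reused verbatim in the new version of the tree. All names are assumptions.

```go
package merge

import "sort"

// node is a simplified immutable tree node: internal nodes have children
// and the low key of each child, leaves carry sorted keys only.
type node struct {
	lows     []string // low key of each child (internal nodes)
	children []*node
	keys     []string // leaf payload (metadata omitted)
}

// applyLog returns a new version of the tree with the insert keys applied.
// Subtrees that no insert touches are reused as-is; splitting of overfull
// nodes is omitted here.
func applyLog(n *node, inserts []string) *node {
	if len(inserts) == 0 {
		return n // reuse the whole subtree unchanged
	}
	if n.children == nil { // leaf: rewrite it with the changes applied
		merged := append(append([]string{}, n.keys...), inserts...)
		sort.Strings(merged)
		return &node{keys: merged}
	}
	// Internal node: route each insert to the child whose range covers it,
	// then rebuild only the children that actually changed.
	out := &node{lows: append([]string{}, n.lows...)}
	for i, child := range n.children {
		var sub []string
		for _, k := range inserts {
			lowOK := i == 0 || k >= n.lows[i]
			highOK := i+1 == len(n.children) || k < n.lows[i+1]
			if lowOK && highOK {
				sub = append(sub, k)
			}
		}
		out.children = append(out.children, applyLog(child, sub))
	}
	return out
}
```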
  • a single MPU delete entry can affect more than one node, because it deletes all MPU parts (up to 10,000 elements). In such a case, both leaves (and their ancestors) that contain the start and end of the MPU range need modification, and all leaves in between are deleted.
  • a change to the ObjectMetadataTree can affect multiple consecutive leaves. For example, if the ObjectMetadataLog contains many consecutive objects or MPU parts, then multiple consecutive leaves will be created. Typically, in such a situation we generate all leaves with a filling of around 3/4 S, so some additional inserts to a leaf are possible without modifying its siblings. However, in the case of two MPU parts (with part numbers a and b) of the same object, the only objects that can be placed between them are other MPU parts of the same object with particular numbers (in the range (a; b)). We can use that information to predict how much spare space, if any, is needed in a leaf.
  • the algorithm can be run in parallel, and at most h subsequent read requests are required to be able to issue a prefetch of all nodes. Our implementation iterates through nodes in the same order as the merge, but keeps several thousand nodes ahead, so memory usage is limited.
  • Metadata inserts/updates can lead to leaf node overfill and thus require splitting the node and possibly its ancestor nodes too. A split does not require reading nodes other than the node being split. Therefore, for an insert/update we prefetch only nodes whose key ranges include the inserted key.
  • a metadata delete can lead to a merge, or a merge and split, of a leaf node and possibly its ancestor nodes with the first succeeding nodes at the same levels. Therefore we prefetch only nodes whose key ranges include the deleted key and their first right siblings, except for the siblings of leaf nodes.
  • the right sibling of a leaf is not prefetched even though it may be needed to create a leaf of a proper size.
  • the leaf can be read later if needed without interrupting the tree walk, because the sizes of the leaf and its right sibling are stored in their parent, so all further steps of the tree walk can be done even if that particular node is not finished yet. With this optimization the algorithm never reads more leaves than necessary.
  • a single MPU removal can lead to the removal of multiple nodes and the update of no more than three nodes at each level of the ObjectMetadataTree. It is enough to prefetch the nodes that intersect with the start and end of the MPU parts, together with their ancestors and the right siblings of the MPU end path. All of the nodes in between will be removed.
  • the prefetch algorithm allows parallel reading (only h subsequent read requests are required), so the amount of operations is more important than the height of the tree.
  • the ObjectMetadataTree prefetch algorithm reads 2 * (h - 1) * n internal nodes (or slightly more if there are MPU deletes) and 2 * n leaves, and the content of each read node is written.
  • the writes can be processed as non-priority requests and they are grouped inside StorageCluster, so the reads determine the merge performance.
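  • as a purely illustrative application of that formula: with a tree of height h = 9 and n = 1,000 operations, the prefetch issues about 2 * (9 - 1) * 1,000 = 16,000 internal node reads and 2 * 1,000 = 2,000 leaf reads (actual counts depend on key locality, MPU deletes, and any caching of internal nodes).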
  • each ObjectDriver needs to store metadata of 1.68 × 10^9 objects.
  • all metadata take over 1.7 TB.
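  • taken together, those two figures correspond to roughly 1.7 TB / 1.68 × 10^9 ≈ 1 KB of metadata per object.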
  • Assuming the current architecture of deduplication appliances, storing the ObjectMetadataTree in RAM or SSD is infeasible. In particular, even if there are some flash drives in deduplication appliances (which is not always the case), their capacity may already be used for other purposes. However, caching some (or even all) levels of internal nodes in RAM can be done to reduce the required IOs.
  • the solution described in section 6 performs MetadataMerge using a single ObjectDriver, which can quickly become a bottleneck in a large cluster. There are no dependencies between the metadata of different buckets, so each bucket can be handled by a different ObjectDriver and have its own ObjectMetadataTree, but such an approach does not scale performance within a single bucket.
  • the first phase of distributed merge generates many disjunctive SubObjectMetadataTrees.
  • the idea is that the space of keys kept in ObjectMetadataTree is divided into many ranges and for each range a separate SubObjectMetadataTrees is generated by different ObjectDriver.
  • Each SubObjectMetadataTree is created by merging all relevant entries from all ObjectMetadataLogs with a subset of nodes from Object- MetadataTrees. That means each driver reads all ObjectMetadataLogs, but since it happens in exactly the same time our StorageCluster can cache the data efficiently.
  • each SubObjectMetadataTrees should contain a similar number of nodes that require rewriting during merge.
  • the operation of merging two trees reads only h1 +h2 nodes (where h1, h2 are height of first and second trees). Because height of metadata tree is low (up to 9, see section 5.7) so number of nodes processed by SubObjectMetadataTrees merge is low enough to be processed by single driver within negligible time compared to first phase. This is valid even if there are many (e.g. 20000) trees to be merged.
  • the space of keys can be divided into ranges based on the current load in a system. If there are no requests, each driver is responsible for an equal number of ObjectMetadataTree keys. If any range of keys gets more requests than a driver can handle, the ranges are recalculated based on traffic reports sent to a distinguished driver, and multiple drivers can handle the same range using hashing. In that way, asking multiple drivers during listing is required only if the particular range has recently received a lot of requests. Note that the traffic report gathering and range calculation may introduce a slight delay, but as explained in section 3.4 this is fair.
  • MinIO client, which can copy data from a filesystem or object storage to a filesystem or object storage.
  • the source was a file system directory that contained files with short names (8-32 chars) and small content (8-32 chars).
  • the same directory was copied to the same destination twice, so in the second test all data (including filenames) was a duplicate.
  • the filenames and content were deliberately that small, to show that competitors fall below the requirements even in the simplest possible scenario.
  • ObjectDriver can handle 4.6 - 8.35 times more operations than the fastest filesystem.
  • the result was consistent with our expectations, because filesystems with inline deduplication are complex and it is very hard to handle small files effectively.
  • the results, which in most cases are below 100 OP/s, are consistent with existing publications [61].
  • the MinIO client copies the files to a temporary location and renames them afterwards to prevent listing of inconsistent files. This means that in fact two file operations (create and rename) are needed.
  • COSBench is a framework that allows performance testing of object storage and is recognized by the object storage community [63].
  • COSBench does not support a distinction between load with duplicates and non-duplicates, but we modified it so that the content of each object is generated deterministically from the object name in a way that the file does not contain internal duplicates.
  • object keys were created with COSBench uniform generator.
  • MinIO and RadosGW, which do not provide inline deduplication.
  • the experiment was conducted using one server with 12 hard drives (details in table 2).
  • MinIO and RadosGW were set up using CentOS 7.
  • if MetadataMerge takes too long, the performance of ObjectDriver drops. For example, if in a single merge we change 50000 entries and the merge takes 100 s, then we cannot constantly perform more than 500 PUT/s. In this section we show details of MetadataMerge performance under various circumstances.
  • Figure 26 shows how extensive key and metadata sizes affect the time of merge. If a key has 1 KB but its entropy is low (denoted largeKeyCompressable), the data that reaches the StorageCluster is compressed, so there is hardly any impact from the increased key size.
  • Data prefetching is an absolute necessity to perform merge in a reasonable time. Waiting for reads one after another increases merge time to tens of minutes even with a very small number of objects (like 23 minutes with only 600k objects in the system, see Figure 27). Additionally, the prefetch should not read too much unneeded data.
  • Figure 28 compares OMT Prefetch with a simple algorithm that prefetches all internal nodes in the seq-delay scenario from section 8.3.1. The total number of internal nodes increases quickly, but only a fraction of them really needs to be read.
  • each ObjectDriver can write its own ObjectMetadataLog, so the number of entries processed in each merge can be increased without changing the maximal size of each ObjectMetadataLog per ObjectDriver.
  • Figure 29 shows how the time of merge changes if the number of ObjectDrivers and StorageCluster nodes (one per driver) increases with the number of entries to be processed.
  • object keys were generated randomly, as in section 8.3.1.
  • ObjDedup can handle more requests per second than a solution that uses a filesystem with deduplication underneath, and in comparison with storage without deduplication it offers not only reduced capacity usage but also higher throughput when writing duplicate data.
  • a processing apparatus comprising an object driver including: an acquisition unit configured to acquire a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and a generation unit configured to generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the chunk acquired by the acquisition unit, wherein the sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
  • the processing apparatus according to Supplementary Note 1 comprising a distribution unit configured to divide the object into a plurality of chunks and, based on the hash value of the chunk obtained by the division, distribute the chunk to any of a plurality of object drivers.
  • a processing method executed by an information processing apparatus comprising: acquiring a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and generating a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunk, wherein the sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
  • a non-transitory computer-readable recording medium having a program recorded thereon, the program comprising instructions for causing an information processing apparatus to realize processes to: acquire a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunk, wherein the sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
  • the program disclosed in the exemplary embodiments and supplementary notes is stored in a storage device or recorded on a computer-readable recording medium.
  • the recording medium is a portable medium such as a flexible disk, an optical disk, a magneto-optical disk and a semiconductor memory.
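Purely as an illustration of the range-based partitioning of the key space mentioned in the list above (dividing the keys kept in the ObjectMetadataTree into ranges handled by different drivers), the following Python sketch assigns keys to drivers by contiguous ranges. The class name, boundary values and driver identifiers are assumptions made for illustration only, not part of the described system.

```python
import bisect

class RangePartitioner:
    """Assigns object-metadata keys to drivers by contiguous key ranges.

    In the described system the boundaries would be derived from traffic
    reports; here they are fixed, illustrative values.
    """
    def __init__(self, boundaries, drivers):
        # boundaries[i] is the first key handled by drivers[i + 1]
        assert len(boundaries) == len(drivers) - 1
        self.boundaries = boundaries
        self.drivers = drivers

    def driver_for(self, key):
        index = bisect.bisect_right(self.boundaries, key)
        return self.drivers[index]

# Example: three drivers splitting the key space at "g" and "p".
partitioner = RangePartitioner(["g", "p"], ["driver-0", "driver-1", "driver-2"])
for key in ["apple", "grape", "zebra"]:
    print(key, "->", partitioner.driver_for(key))
```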

Abstract

A processing apparatus includes an object driver including an acquisition unit and a generation unit. The acquisition unit is configured to acquire a chunk distributed based on a hash value of the chunk. The chunk is obtained by division of an object. The generation unit is configured to generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the chunk acquired by the acquisition unit. The sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.

Description

PROCESSING APPARATUS
The present invention relates to a processing apparatus, a processing method, and a recording medium.
A technique used for a storage apparatus and the like for storing data is known.
For example, Patent Document 1 describes a storage apparatus that includes: a data dividing unit configured to divide data to be written into a plurality of block data; a block detecting unit configured to confirm whether or not each of the block data obtained by division by the data dividing unit is already stored in a storage device; and a data writing unit configured to store each of the block data obtained by division by the data dividing unit into the storage device and, in the case of storing other block data having the same content as the block data already stored in the storage device into the storage device, refer to the block data already stored in the storage device as the other block data. According to Patent Document 1, the block detecting unit detects a common rate that represents the rate of a common part between a plurality of consecutive block data forming a predetermined range in the data to be written among the block data obtained by division by the data dividing unit and a plurality of block data of a predetermined range already stored consecutively in the storage device. Moreover, the data writing unit newly stores the block data obtained by division by the data dividing unit into the storage device in accordance with the common rate detected by the block detecting unit.
Further, a related technique is described in, for example, Patent Document 2. For example, Patent Document 2 discloses a system that includes a first file storage having a file system, a second file storage accepting an object request from the first file storage, and an object storage for storing a file or directory transferred as an object. According to Patent Document 2, when the file system is updated in response to a request by a client, the first file storage detects the updated file or directory of the file system. Then, the first file storage aggregates a plurality of difference data of the detected file or directory and update information of the difference data into a difference aggregation object, and transfers the difference aggregation object to the second file storage. Then, the second file storage receives the difference aggregation object, and reflects each of the difference data to the object of the object storage in accordance with the difference data update information of the difference aggregation object.
Patent Document 1: Japanese Unexamined Patent Application Publication (Translation of PCT Application) No. 2013-541055
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2021-144288
At the time of reading or writing data in the object storage and the like, metadata containing information indicating the storage location of the data may be referred to. Such metadata is generated at the time of storing data, and it takes time to generate metadata, which may become a bottleneck in storage performance. Thus, there is a problem that it may be difficult to efficiently generate metadata.
Accordingly, an object of the present invention is to provide a processing apparatus, a processing method, and a recording medium that solve the abovementioned problem.
In order to achieve the object, a processing apparatus as an aspect of the present disclosure includes an object driver including an acquisition unit and a generation unit. The acquisition unit is configured to acquire a chunk distributed based on a hash value of the chunk. The chunk is obtained by division of an object. The generation unit is configured to generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the chunk acquired by the acquisition unit. The sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
Further, a processing method as another aspect of the present disclosure is executed by an information processing apparatus. The processing method includes: acquiring a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and generating a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunk. The sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
Further, a recording medium as another aspect of the present disclosure is a non-transitory computer-readable recording medium having a program recorded thereon. The program includes instructions for causing an information processing apparatus to realize processes to: acquire a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunk. The sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
With the configurations as described above, the abovementioned problem can be solved.
Fig. 1 is a view showing an example of a configuration of a storage system in a first example embodiment of the present disclosure;
Fig. 2 is a view showing an example of a configuration of a processing node;
Fig. 3 is a view showing an example of processing by an object driver and a processing unit;
Fig. 4 is a view showing an example of a configuration of the object driver that is characteristic of this example embodiment;
Fig. 5 is a view showing an example of a configuration of the processing unit that is characteristic of this example embodiment;
Fig. 6 is a view showing an example of metadata tree generation;
Fig. 7 is a view showing an example of metadata tree generation;
Fig. 8 is a flowchart showing an example of an operation of the processing node;
Fig. 9 is a view showing an example of a hardware configuration of a processing apparatus in a second example embodiment of the present disclosure;
Fig. 10 is a block diagram showing an example of a configuration of the processing apparatus;
Fig. 11 is a view showing an example of a sample of capacity changes over a week;
Fig. 12 is a view showing an example of the number of samples with a weekly writes-to-deletes ratio;
Fig. 13 is a view showing an example of a peak increase of capacity as a percentage of current system capacity;
Fig. 14 is a view showing an example of a peak decrease of capacity as a percentage of current system capacity;
Fig. 15 is a view showing an example of system capacity usage;
Fig. 16 is a view showing an example of blocks organization;
Fig. 17 is a view showing an example of a write path architecture;
Fig. 18 is a view showing an example of an object driver architecture;
Fig. 19 is a view showing an example of an ObjectMetadataTree;
Fig. 20 is a view showing an example of an ObjectMetadataLog;
Fig. 21 is a view showing an example of a distributed merge;
Fig. 22 is a view showing an example of merging two SubObjectMetadataTrees;
Fig. 23 is a view showing an example of a number of OPS;
Fig. 24 is a view showing an example of write throughput;
Fig. 25 is a view showing an example of the time of an ObjectMetadataTree merge;
Fig. 26 is a view showing an example of the time of an ObjectMetadataTree merge depending on key and metadata size;
Fig. 27 is a view showing an example of merge time (random keys) with and without prefetch;
Fig. 28 is a view showing an example of merge time (sequential keys) - AllInternal vs OMT Prefetch;
Fig. 29 is a view showing an example of distributed merge time.
First Example Embodiment
A first example embodiment of the present disclosure will be described with reference to Figs. 1 to 7. Fig. 1 is a view showing an example of a configuration of a storage system 100. Fig. 2 is a view showing an example of a configuration of a processing node 200. Fig. 3 is a view showing an example of processing by an object driver 210 and an execution unit 240. Fig. 4 is a view showing an example of a configuration of the object driver 210 that is characteristic of this example embodiment. Fig. 5 is a view showing an example of a configuration of the execution unit 240 that is characteristic of this example embodiment. Figs. 6 and 7 are views showing examples of metadata tree generation. Fig. 8 is a flowchart showing an example of an operation of the processing node 200.
In the first example embodiment of the present disclosure, a storage system 100 in which a plurality of object drivers 210 generate sub metadata trees based on their respective metadata logs, and a metadata tree is generated by merging the generated sub metadata trees, will be described. As will be described later, when acquiring an object, which is a set of data to be stored into a storage node 300, from an external device or the like, the storage system 100 divides the acquired object into a plurality of chunks based on any criterion, and also calculates the hash values of the respective chunks obtained by division. Moreover, the storage system 100 distributes the chunks obtained by division to a plurality of object drivers 210 based on the calculated hash values. For example, the storage system 100 distributes the chunks to the object drivers 210 based on the calculated hash values so that a process to generate a sub metadata tree is evenly performed by the respective object drivers 210. Each of the object drivers 210 generates a sub metadata tree for the object driver 210 based on the log of metadata indicating a chunk storage location and the like, stored in the object driver 210. The storage system 100 can efficiently generate a metadata tree by merging the sub metadata trees generated by the respective object drivers 210.
Metadata generated in this example embodiment contains information necessary for executing a storage and reproduction operation process corresponding to an instruction from an external device or the like, for example, at the time of retrieving data. For example, the metadata can contain a content address indicating the storage location of data such as a chunk and an object name. The metadata may contain information other than the information illustrated above, such as information indicating the type of the metadata and information indicating the name of a bucket including the object.
Further, the metadata tree has a structure that holds metadata necessary to support the storage and reproduction operation process, for example, at the time of retrieving data. For example, the metadata tree has a tree structure such as a B+ tree. As an example, in the metadata tree, leaf nodes as terminal nodes hold metadata, and internal nodes other than the leaf nodes, such as a root node and nodes on the route, hold reference information to the other node. For example, in the metadata tree, the size of each of the nodes is determined in advance to be within a predetermined range (may be any range). Moreover, the metadata tree is generated so that the lengths of paths from the root node to the respective leaf nodes are equal, for example. The sub metadata tree may have the same structure as the metadata tree, for example, holding metadata corresponding to a partial range of the entire range of the hash values calculated from the chunks obtained by division of the object.
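For concreteness, a minimal Python sketch of the node layout implied above (leaf nodes holding metadata entries, internal nodes holding only reference information to other nodes, as in a B+ tree) is shown below. The class names, the string-typed keys and the tiny example tree are assumptions for illustration only, not the actual data structures of the system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class LeafNode:
    # A terminal node: each entry pairs a key with its metadata,
    # e.g. a content address of a stored chunk or object.
    entries: List[Tuple[str, str]] = field(default_factory=list)

@dataclass
class InternalNode:
    # A non-terminal node: keys[i] separates children[i] and children[i + 1],
    # so internal nodes hold only references to other nodes.
    keys: List[str] = field(default_factory=list)
    children: List[Union["InternalNode", "LeafNode"]] = field(default_factory=list)

def lookup(node: Union[InternalNode, LeafNode], key: str):
    """Walk from the root down to the leaf that may hold `key`."""
    while isinstance(node, InternalNode):
        index = sum(1 for k in node.keys if k <= key)
        node = node.children[index]
    for k, metadata in node.entries:
        if k == key:
            return metadata
    return None

# Tiny two-level example: one root referencing two leaves.
root = InternalNode(keys=["m"], children=[
    LeafNode(entries=[("a", "addr-a"), ("f", "addr-f")]),
    LeafNode(entries=[("m", "addr-m"), ("z", "addr-z")]),
])
print(lookup(root, "f"))  # -> addr-f
```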
Fig. 1 shows an example of a configuration of the storage system 100. Referring to Fig. 1, the storage system 100 has a configuration in which a plurality of server computers are connected so as to be able to communicate with each other. Specifically, the storage system 100 includes a processing node 200 that is a server computer controlling the storage and reproduction operation process, and a storage node 300 that is a server computer including a storage device for storing data. For example, the processing node 200 is an information processing apparatus that functions as a gateway to the storage system 100 and provides a data access API (Application Programming Interface). The storage node 300 is an information processing apparatus that holds data acquired via the processing node 200. The number of processing nodes 200 and the number of storage nodes 300 are not limited to those illustrated in Fig. 1. Moreover, the storage system 100 may have, for example, a hybrid node that combines the function of the processing node 200 and the function of the storage node 300, instead of the processing node 200 or in addition to the processing node 200.
For example, the storage system 100 in this example embodiment may be a content address storage system that divides data to be stored such as an object into a plurality of chunks and stores the chunks into a plurality of storage devices and also identifies a storage location where the chunk or the like is stored by using a unique content address set in accordance with the content of the stored data. The storage system 100 may make the chunks obtained by division redundant and store into the storage devices in a distributed manner. In other words, the storage location of each data stored in the storage system 100 can be identified by using the content address. For example, a hash value contained by the content address of each data such as the chunk is calculated based on the content of the data by using a hash function such as 160-bit SHA1. Moreover, the storage system 100 may have a deduplication function or the like that avoids duplicately storing data such as the chunk having already been stored.
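A content address as described here can be thought of as a digest computed from the data content and then used to locate the stored data. The sketch below uses SHA-1, as mentioned above, and a plain dictionary standing in for the storage nodes; it is a simplified illustration under those assumptions, not the actual storage mechanism.

```python
import hashlib

store = {}  # stands in for the storage cluster: content address -> chunk bytes

def put_chunk(chunk: bytes) -> str:
    """Store a chunk under its content address, skipping duplicates."""
    address = hashlib.sha1(chunk).hexdigest()
    if address not in store:       # deduplication: identical content
        store[address] = chunk     # is stored only once
    return address

address = put_chunk(b"example chunk data")
assert store[address] == b"example chunk data"
print(address)
```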
The storage system 100 is not necessarily limited to including the processing node 200 and the storage node 300 as illustrated in Fig. 1, and may have any configuration. For example, the storage system 100 may be configured by a single information processing apparatus that has the function of the processing node 200 and the function of the storage node 300.
The processing node 200 (processing apparatus) is a computer that controls the storage and reproduction operation process in the storage system 100. Fig. 2 shows an example of a configuration of the processing node 200. Referring to Fig. 2, for example, the processing node 200 includes an object driver 210 that controls a process relating to an object, a file system driver 220 that controls a process relating to a file system, a VTL (virtual tape library) driver 230 that controls a process relating to a VTL, and an execution unit 240. For example, the processing node 200 includes an operation device such as a CPU (Central Processing Unit) and a storage device such as a hard disk drive and a memory. For example, the processing node 200 can implement the abovementioned functions and processing units of the respective functions by execution of a program stored in the storage device by the operation device. The processing node 200 may include, instead of the CPU, a GPU (Graphic Processing Unit), a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), an FPU (Floating point number Processing Unit), a PPU (Physics Processing Unit), a TPU (Tensor Processing Unit), a quantum processor, a microcontroller, or a combination thereof. Besides, the processing node 200 may have one or more of each of the components described above.
The object driver 210, the file system driver 220 and the VTL driver 230 perform, in accordance with a corresponding instruction from outside, a storage and reproduction operation process such as writing and reading data into and from the respective storage nodes 300 configuring a storage cluster via the execution unit 240. The storage system 100 in this example embodiment has a characteristic in the object driver 210 and the execution unit 240 particularly among the abovementioned components. Below, the object driver 210 and the execution unit 240 will be described in more detail.
Fig. 3 is a view for describing processing by the object driver 210 and the execution unit 240. As illustrated in Fig. 3, for example, the object driver 210 includes an HTTP server 211, a request processing unit 212, and a metadata log writing unit 213. Moreover, the object driver 210 includes a metadata information storage unit 214 such as a memory. Metadata information such as a metadata log that is the log of metadata, a sub metadata tree and a metadata tree that will be described later are stored majorly in the metadata information storage unit 214 that is a memory.
For example, the object driver 210 accepts an instruction for reading and writing an object that is data to be stored, by using the HTTP protocol or the like. As an example, the HTTP server 211 receives an instruction for reading and writing an object using a RESTAPI command or the like from an external device. The request processing unit 212 then processes the received command. For example, the request processing unit 212 performs a process to retrieve data from or write data into the storage nodes 300 and the like configuring the storage cluster via the execution unit 240, by using the metadata information such as a metadata tree held by the metadata information storage unit 214. In a case where necessary metadata information is not held in the metadata information storage unit 214, the request processing unit 212 may acquire necessary metadata information from the storage nodes 300 and the like configuring the storage cluster. The metadata log writing unit 213 stores a metadata log indicating the content of metadata such as a content address into the metadata information storage unit 214 when, for example, the request processing unit 212 writes data into the storage cluster via the execution unit 240. The metadata writing unit 213 may store a metadata log and the like corresponding to data deletion into the metadata information storage unit 214. The metadata log writing unit 213 may store the metadata log into, in addition to the metadata information storage unit 214, the storage nodes 300 and the like configuring the storage cluster.
More specifically, as described above, chunks obtained by dividing an object are distributed to the object drivers 210. For example, as illustrated in Fig. 4, the object driver 210 can include an acquisition unit 215, a storing unit 216 and a sub metadata tree generation unit 217 in addition to the above configuration. As described above, each of the processing units may be implemented by execution of a program stored in the storage device by the operation device.
The acquisition unit 215 acquires chunks obtained by dividing an object. As will be described later, the chunks obtained by dividing the object are, for example, evenly distributed to the respective object drivers 210 by the execution unit 240 so that a process to generate a sub metadata tree is performed evenly by the respective object drivers 210. Therefore, the acquisition unit 215 acquires chunks evenly distributed to each of the object drivers 210 so that a process to generate a sub metadata tree is performed evenly by the respective object drivers 210.
The storing unit 216 stores chunks acquired by the acquisition unit 215 into the storage nodes 300 configuring the storage cluster via the execution unit 240. The storing unit 216 may make the chunks redundant by a method as described in Patent Document 1 and then store in a distributed manner. Moreover, the storing unit 216 may perform a deduplication process and then store the chunks.
For example, when the storing unit 216 stores a chunk, the metadata log writing unit 213 stores a metadata log indicating the content of metadata containing a content address indicating a location where the storing unit 216 stores into the metadata information storage unit 214. That is to say, the metadata log is stored majorly on the metadata information storage unit 214 that is a memory.
The sub metadata tree generation unit 217 generates a sub metadata tree based on the metadata log stored in the metadata information storage unit 214. That is to say, the sub metadata tree generation unit 217 generates a sub metadata tree based on the metadata log stored in the memory, corresponding to the chunks acquired by the acquisition unit 215.
For example, the sub metadata tree generation unit 217 generates a sub metadata tree having a tree structure such as a B+ tree, based on the metadata log stored in the metadata information storage unit 214. For example, the sub metadata tree generation unit 217 causes the leaf nodes to hold metadata, and also causes the internal nodes to hold reference information to another node. As an example, the sub metadata tree generation unit 217 generates a new leaf node every time a size that the leaf node can hold is exceeded. Consequently, the sub metadata tree generation unit 217 stores metadata in the leaf nodes so that the size of the leaf node is within a predetermined range. As a result, in the sub metadata tree, the sizes of the respective leaf nodes are within a predetermined range defined in advance except a last generated leaf node. Moreover, the sub metadata tree generation unit 217 generates a sub metadata tree so that the lengths of paths from the root node to the respective leaf nodes become equal, for example.
For example, by a method as described above, the sub metadata tree generation unit 217 generates a sub metadata tree based on the metadata log stored in the metadata information storage unit 214. Moreover, the sub metadata tree generation unit 217 stores the generated sub metadata tree into the metadata information storage unit 214. The sub metadata tree generation unit 217 may store the generated sub metadata tree into any storage node 300 in the storage cluster.
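As a rough sketch of the leaf-filling rule described above (a new leaf is started whenever the size a leaf can hold is exceeded, so all leaves except possibly the last one stay within the predetermined range), the following Python fragment packs sorted metadata log entries into leaves. The function name, the capacity constant and the entry format are illustrative assumptions; building the internal nodes above the leaves is omitted.

```python
def build_leaves(metadata_log, leaf_capacity=4):
    """Pack sorted (key, metadata) log entries into leaves of bounded size.

    All leaves except possibly the last one are filled to `leaf_capacity`,
    mirroring the rule described in the text.
    """
    entries = sorted(metadata_log)  # the log entries are keyed by object/chunk key
    leaves, current = [], []
    for entry in entries:
        if len(current) == leaf_capacity:
            leaves.append(current)
            current = []
        current.append(entry)
    if current:
        leaves.append(current)
    return leaves

log = [("obj-%02d" % i, "addr-%02d" % i) for i in range(10)]
print([len(leaf) for leaf in build_leaves(log)])  # -> [4, 4, 2]
```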
Further, referring to Fig. 5, the execution unit 240 includes a division and distribution unit 241 and a metadata tree generation unit 242 as components that are characteristic of this example embodiment.
Upon acquisition of an object to be stored from the object driver 210, the division and distribution unit 241 divides the acquired object into a plurality of chunks. Moreover, the division and distribution unit 241 distributes the chunks obtained by division to the respective object drivers 210. The size of the chunks obtained by division by the division and distribution unit 241 may be variable.
Further, the division and distribution unit 241 distributes the chunks obtained by division to the respective object drivers 210 so that the process to generate a sub metadata tree is evenly performed by the respective object drivers 210. As an example, the division and distribution unit 241 calculates the hash values of the respective chunks. Then, the division and distribution unit 241 distributes the chunks as evenly as possible to the respective object drivers 210 by using the calculated hash values. For example, by determining a distribution target hash value range for each of the object drivers 210 and distributing chunks whose calculated hash values are within the range, the division and distribution unit 241 distributes the chunks as evenly as possible to the respective object drivers 210. In other words, the division and distribution unit 241 can distribute a corresponding chunk to the object driver specified in accordance with a range to which a hash value belongs.
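The hash-range distribution described here could look roughly like the sketch below, in which each object driver is responsible for an equal slice of the hash value space. The number of drivers, the use of the first bytes of the digest and the function name are assumptions made for illustration, not the system's actual assignment rule.

```python
import hashlib

NUM_DRIVERS = 4  # assumed number of object drivers

def driver_for_chunk(chunk: bytes) -> int:
    """Map a chunk to an object driver by the range its hash value falls into."""
    digest = hashlib.sha1(chunk).digest()
    value = int.from_bytes(digest[:8], "big")      # hash value of the chunk
    range_width = 2 ** 64 // NUM_DRIVERS           # equal-sized hash ranges
    return min(value // range_width, NUM_DRIVERS - 1)

chunks = [b"chunk-a", b"chunk-b", b"chunk-c", b"chunk-d"]
print([driver_for_chunk(c) for c in chunks])
```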
As shown in Fig. 6, the metadata tree generation unit 242 merges sub metadata trees generated by the respective object drivers 210 to generate one metadata tree based on the plurality of sub metadata trees. Moreover, the metadata tree generation unit 242 stores the generated metadata tree into the metadata information storage unit 214. The metadata tree generation unit 242 may store the generated metadata tree into the storage cluster.
For example, Fig. 7 shows an example of processing in merging two sub metadata trees, sub metadata tree 1 and sub metadata tree 2. Referring to Fig. 7, the metadata tree generation unit 242 extracts and manipulates leaf nodes that are the ends of the two sub metadata trees to merge the two sub metadata trees. As an example, the metadata tree generation unit 242 stores metadata contained in the first leaf node (leaf node on leftmost edge) of the sub metadata tree 2 until the size of the last leaf node (leaf node on rightmost edge) of the sub metadata tree 1 exceeds the upper limit. Thus, in the case illustrated in Fig. 7, based on the leaf node where metadata j is stored of the sub metadata tree 1 and the leaf node where metadata k, l, m, and n are stored of the sub metadata tree 2, a leaf node where metadata j, k, and l are stored and a leaf node where metadata m and n are stored are generated. After the above process, the metadata tree generation unit 242 recalculates the reference information to the respective leaf nodes and the like. Consequently, the two sub metadata trees, the sub metadata tree 1 and the sub metadata tree 2, are merged. In the case of the above method, the leaf nodes other than the leaf nodes at the ends to be manipulated continue to hold the same metadata as before merging. The metadata tree generation unit 242 can merge a plurality of sub metadata trees by repeatedly performing the process to merge a pair of sub metadata trees.
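The boundary-leaf manipulation described above can be sketched as follows, assuming a leaf capacity of three entries so that the result matches the example with metadata j to n. Only the two leaves at the ends are touched; recalculating the parents' reference information, and handling the case where the right leaf becomes empty, are omitted. The function name and entry representation are illustrative assumptions.

```python
def merge_boundary_leaves(last_leaf_left, first_leaf_right, capacity=3):
    """Redistribute entries between the last leaf of the left tree and the
    first leaf of the right tree; all other leaves keep their metadata."""
    combined = last_leaf_left + first_leaf_right
    return combined[:capacity], combined[capacity:]

left, right = merge_boundary_leaves(["j"], ["k", "l", "m", "n"])
print(left, right)  # -> ['j', 'k', 'l'] ['m', 'n'], as in the example of Fig. 7
```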
The example of the configurations of the object driver 210 and the execution unit 240 has been described above. The configurations of the object driver 210 and the execution unit 240 are not limited to those illustrated in this example embodiment. For example, at least part of the functions of the division and distribution unit 241 and the metadata tree generation unit 242 included by the execution unit 240 may be included by at least some of the object drivers 210. That is to say, any object driver 210 may divide an object and distribute chunks to the other object drivers 210. The execution unit 240 may include at least part of the function included by the object driver 210.
The storage node 300 is a server computer including a storage device for storing data. As described above, in this example embodiment, a plurality of storage nodes 300 configure a storage cluster.
In the storage node 300, chunks obtained by dividing an object are stored. The storage node 300 may have a deduplication function to avoid duplicately storing a chunk having already been stored. Moreover, metadata information such as a metadata log, a sub metadata tree and a metadata tree may be stored in the storage node 300. The storage node 300 can transmit the stored metadata information to the processing node 200 in accordance with an instruction from the processing node 200.
The example of the configuration of the storage system 100 has been described above. Subsequently, referring to Fig. 8, an example of an operation of the processing node 200 will be described.
The execution unit 240 acquires an object to be stored via the object driver 210. Then, the division and distribution unit 241 divides the object into a plurality of chunks (step S101). The size of the chunk may be variable.
The division and distribution unit 241 distributes the chunks obtained by division to a plurality of object drivers 210 (step S102). For example, the division and distribution unit 241 may distribute the chunks to the object drivers 210 as evenly as possible based on the hash values of the chunks.
Upon acquisition of the chunk, the object driver 210 stores the acquired chunk into a storage cluster. Moreover, the metadata log writing unit 213 stores a metadata log indicating the content of metadata indicating a chunk storage location and the like into the metadata information storage unit 214 (step S201). That is to say, the metadata log is stored majorly in the metadata information storage unit 214 that is a memory.
The sub metadata tree generation unit 217 generates a sub metadata tree based on the metadata log stored in the metadata information storage unit 214 (step S202). That is to say, the sub metadata tree generation unit 217 generates a sub metadata tree based on the metadata log corresponding to the chunk acquired by the acquisition unit 215.
The metadata tree generation unit 242 merges the sub metadata trees generated by the respective object drivers 210 to generate one metadata tree based on the plurality of sub metadata trees (step S103).
Thus, the processing node 200 includes the sub metadata tree generation unit 217 and the metadata tree generation unit 242. With such a configuration, the metadata tree generation unit 242 can generate a metadata tree by merging the sub metadata trees generated by the respective sub metadata tree generation unit 217. As a result, it is possible to quickly generate a metadata tree.
Further, the processing node 200 includes the division and distribution unit 241. With such a configuration, the division and distribution unit 241 can distribute the chunks obtained by division to the respective object drivers 210 so that a process to generate a sub metadata tree is evenly performed by the respective object drivers 210. As a result, it is possible to perform efficient generation of sub metadata trees, and it is possible to easily merge into a metadata tree.
Further, the processing node 200 described in this example embodiment stores a metadata log and the like in the metadata information storage unit 214 that is a memory. By performing such in-memory processing, it is possible to realize more efficient generation of sub metadata trees and generation of a metadata tree.
Second Example Embodiment
In the second example embodiment of the present disclosure, a processing apparatus 400 that is an information processing apparatus generating a metadata tree by merging a plurality of sub metadata trees will be described. Fig. 9 shows an example of a hardware configuration of the processing apparatus 400. Referring to Fig. 9, the processing apparatus 400 has, as an example, a hardware configuration as shown below including:
a CPU (Central Processing Unit) 401 (operation device),
a ROM (Read Only Memory) 402 (storage device),
a RAM (Random Access Memory) 403 (storage device),
programs 404 loaded to the RAM 403,
a storage device 405 for storing the programs 404,
a drive device 406 reading from and writing to a recording medium 410 outside the information processing apparatus,
a communication interface 407 connected to a communication network 411 outside the information processing apparatus,
an input/output interface 408 inputting and outputting data, and
a bus 409 connecting the respective components.
Further, the processing apparatus 400 can implement functions of an acquisition unit 421 and a generation unit 422 shown in Fig. 10 by acquisition and execution of the programs 404 by the CPU 401. The programs 404 are, for example, stored in the storage device 405 or the ROM 402 in advance, and loaded to the RAM 403 or the like and executed by the CPU 401 as necessary. Alternatively, the programs 404 may be supplied to the CPU 401 via the communication network 411, or may be stored in the recording medium 410 in advance and retrieved by the drive device 406 and supplied to the CPU 401.
Fig. 9 shows an example of the hardware configuration of the processing apparatus 400. The hardware configuration of the processing apparatus 400 is not limited to the above example. For example, the processing apparatus 400 may be configured by part of the above configuration, for example, excluding the drive device 406.
The acquisition unit 421 acquires chunks distributed based on the hash values of chunks obtained by dividing an object. For example, the acquisition unit 421 acquires chunks distributed by the same object driver or the other object driver.
The generation unit 422 generates a sub metadata tree having a tree structure holding metadata based on the metadata log of the chunks acquired by the acquisition unit 421. By merging the sub metadata tree with another sub metadata tree, a metadata tree holding the metadata of the object is configured. That is to say, the sub metadata tree has a structure containing part of the metadata contained in the metadata tree.
Thus, the processing apparatus 400 includes the acquisition unit 421 and the generation unit 422. With such a configuration, the generation unit 422 can generate a sub metadata tree having a tree structure holding metadata based on the metadata log of chunks acquired by the acquisition unit 421. As a result, it is possible to generate a metadata tree based on sub metadata trees generated in a distributed manner, and it is possible to efficiently generate a metadata tree.
The processing apparatus 400 described above can be implemented by installation of a predetermined program in an information processing apparatus such as the processing apparatus 400. Specifically, a program as another aspect of the present invention is a program for causing an information processing apparatus such as the processing apparatus 400 to realize processes to: acquire chunks obtained by dividing an object, distributed based on hash values of the chunks; generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunks; and configure a metadata tree holding metadata of the object by merging the sub metadata tree with another sub metadata tree.
Further, a processing method executed by an information processing apparatus such as the processing apparatus 400 described above is a method including, by the information processing apparatus such as the processing apparatus 400: acquiring chunks obtained by dividing an object, distributed based on hash values of the chunks; generating a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunks; and configuring a metadata tree holding metadata of the object by merging the sub metadata tree with another sub metadata tree.
The invention of a program or a computer-readable recording medium having a program thereon or a processing method configured as described above also exerts the same action and effect as the processing apparatus 400 described above, and can therefore achieve the abovementioned object of the present invention.
Third Example Embodiment
In a third example embodiment of the present disclosure, a more specific example of processing described in the first and second example embodiments and an example of the effect will be described in the form of a paper.
Abstract
Backup is one of the common use cases of object storage, not least due to its growing popularity. Object storage interfaces were designed to move large data over long distances; moreover, the object locking functionality provides native ransomware protection, making object storage a suitable solution for protecting data. That is why backup applications often support object storage interfaces.
Deduplication is a data reduction technique that is widely used in backup storage. The highest deduplication efficiency is achieved by backup appliances that provide block-level, in-line deduplication. Such an appliance can reduce the storage footprint even if the backup application has its own internal deduplication. Moreover, a backup appliance can deduplicate data written with different applications.
Thus far, backup appliances were designed to handle workloads consisting of huge backup images. Object storage backup brings up a new challenge, because backup applications leverage object storage by writing data in smaller objects (e.g. 1 MB - 64 MB). Uploading data in smaller parts allows quick recovery from network failures, upload parallelization and initial deduplication, but generates an amount of metadata that has not been seen before.
We present ObjDedup, which is an efficient implementation of an object storage interface that can be used in a backup appliance with block-level in-line deduplication. In our approach we store object metadata in a novel ObjectMetadataTree to limit resource consumption when creating numerous objects. In comparison to object storage solutions without in-line deduplication we provide 1.8-3.93x higher write throughput for duplicate data. In comparison to a solution with object storage set up on a backup appliance's filesystem we were able to process 5.26-11.34x more PUT/s.
1 Introduction
Object storage has been in demand in recent years. In 2017 there was a prediction that over 30% of data center capacity would be in object stores [1]. In 2021 Amazon announced that they store over 100 trillion objects in their Amazon S3 service [2], nine years after they announced the first trillion [3]. MinIO, which is a popular open-source object storage, has over 500 million Docker pulls and over 12,000 Slack channel members [4].
Data deduplication is a data reduction technique that can reduce particular data as much as 10-30x [5], thus it is available in a wide range of systems [6-8]. The capacity savings come with additional costs, such as fingerprinting for duplicate elimination, data fragmentation, and metadata overhead, depending on the selection of the chunking algorithm or the desired chunk size [9,10]. Since backups contain data that repeats over time, they can achieve a high deduplication ratio [11]. Currently backups are the major use case for deduplication appliances.
The popularity of object storage has not gone unnoticed in the market of backup and recovery solutions. Most of the leading backup applications integrate with Amazon S3 and similar interfaces [12-14]. The main advantages of object storage in terms of backup are easy movement of data to off-site storage and strong ransomware protection through so-called object locks. However, in the case of object storage backups the data is often stored as many objects of 1 MB - 64 MB size [15-17], which is challenging for deduplication storage that used to serve very large files (a study from 2012 showed a mean file size of 286 GB [18]). However, the pattern is coherent with the general use case of object storage - Amazon recommends uploading objects over 100 MB in multiple parts to improve throughput and recover from network issues quickly [19].
We present a study of implementing and evaluating an object storage interface in a distributed system with global deduplication of data written using different interfaces (e.g. filesystem, VTL, Veritas NetBackupTM OpenStorage). Our studies focused on providing object storage that is well suited for backup application integration based on our experience with other interfaces, but we present a general solution that can be applied with various applications. As we explain in more detail in further sections, the greatest challenge was efficient management of object metadata, so the study mostly deals with this topic. Our contribution consists of (a) an analysis of usage patterns in 600+ storage systems with global deduplication, (b) an analysis of object storage interface operations, (c) a design of a system with an object storage interface and deduplication, (d) a design of the ObjectMetadataTree that allows efficient and parallel processing of object storage metadata, and (e) an evaluation of our implementation.
The rest of the paper is organized in the following way. Section 2 describes the background of the problem and the state of the art. Section 3 describes the elements of the object storage API that were the most important in our research. Section 4 contains our analysis of foregoing use cases of backup appliances. Section 5 describes our model of a deduplication storage system and how the object interface fits in it. Section 6 describes algorithms used for metadata management and Section 7 describes the distributed version of metadata management. Section 8 contains the evaluation of our implementation. Section 9 contains conclusions.
2 Background and Related Work
2.1 Deduplication storage
Deduplication is a data reduction technique that has many applications such as primary storage [45-47] and increasing the lifespan of SSDs [48]. In our study we focus on secondary storage and on-premise deduplication appliances that achieve the highest deduplication efficiency.
Currently there are several deduplication appliances and backup storage is their primary use case. In these systems all or most of the capacity is stored on hard drives [49-52] for cost-effectiveness. The deduplication capacity of these systems varies from hundreds of terabytes to hundreds of petabytes depending on system size.
Inline deduplication with small chunks is a key feature of deduplication appliances, but imposes requirements on the system design. Duplicate elimination requires chunking of data, as well as computation and validation of fingerprints [53]. Depending on the data type, a fixed-size or variable-size chunking algorithm can be used and extended with application-specific knowledge [9].
The chunk size has to be small (typically in the 1-64 KB range) to provide a high deduplication ratio [18]. Each chunk has its related metadata - at least an entry in the fingerprint index and one reference for each duplicate elimination. The latter increases with high deduplication ratios, making metadata a significant part of the whole storage, especially if the metadata are replicated for resiliency.
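To make the per-chunk metadata concrete, the sketch below pairs fixed-size chunking with a fingerprint index that counts references per duplicate-eliminated chunk. The 8 KB chunk size, the use of SHA-1 and the plain-dictionary index are illustrative assumptions only, not the appliance's actual chunking algorithm or index structure.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # 8 KB, assumed; within the 1-64 KB range mentioned above

fingerprint_index = {}  # fingerprint -> reference count (per-chunk metadata)

def ingest(data: bytes):
    """Split data into fixed-size chunks and update the fingerprint index."""
    stored, deduplicated = 0, 0
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        if fingerprint in fingerprint_index:
            deduplicated += 1     # duplicate: only a reference is added
        else:
            stored += 1           # new chunk: data goes to disk
        fingerprint_index[fingerprint] = fingerprint_index.get(fingerprint, 0) + 1
    return stored, deduplicated

print(ingest(b"A" * 32 * 1024))  # 4 identical chunks -> (1, 3)
```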
Moreover, small chunk deduplication causes data fragmentation. Large objects and files are divided into small blocks which are distributed across the whole system with miscellaneous locality, depending on the pattern of written data. There are algorithms that reduce the impact of fragmentation on the performance of a system with deduplication [54-56], but they also cost additional CPU cycles and memory, and the final latency is still high.
2.2 Object Storage API
Amazon Web Services initiated the popularity of object storage when Amazon S3 was introduced in 2006 [20]. Currently storage with the same or a similar interface is provided by hyperscalers [21,22], other public clouds [23], on-premise enterprise storage systems [24] and open-source solutions [25, 26]. Such interfaces are used in very diverse applications, e.g. video services [27], social media [28], games [29], and of course backups [30-32].
A multitude of object storage API suppliers caused a differentiation in the interfaces provided by different companies. For example, the Google Storage XML API provides commands different from Amazon S3's to upload a file in multiple parts, requires different request headers and has some other dissimilarities [33]. That is why object interfaces in on-premise solutions are often described as S3-compatible and Swift-compatible to emphasize that they implement a large subset of the commands provided by these popular interfaces.
In terms of cost and performance the object storage interface has versatile applications. The standard Amazon S3 offers 3,500-5,500 requests per prefix per second, 100 Gb/s single-instance throughput, and 100-200 ms small-file latency starting at $21 per terabyte per month [34] (the final price varies with the number of requests, data lifecycle and more [35]). There are other products, such as Amazon Glacier, sacrificing latency to decrease the preservation cost [36], as well as the opposite - systems offering improved performance at additional cost [37].
2.3 Related work
There are no publications discussing the implementation of an object storage interface in backup appliances. There has been research on deduplicated object storage in on-premise systems; however, we haven't found any study that targets the backup use case or elaborates how to deal with the increased metadata footprint in object storage with deduplication. DedupeSwift [65] is an implementation of deduplication for OpenStack Swift that reaches a read/write bandwidth of 10.54-25.51 MB/s with RAID5 HDDs and an SSD for the fingerprint cache. There are several publications about deduplication in Ceph, but the feature is still under development. The leading approach that is mentioned in Ceph's documentation [6] is post-process deduplication, so it cannot reach better deduplication write throughput than the underlying storage and might not be the best for backup workloads. There are other publications on the topic of deduplication in Ceph [66] [67] [68] [69], but they focus on Block Device performance and do not show results with RadosGW or elaborate how their deduplication impacts the object interface.
Data Domain Cloud Tier [64] is a feature that allows expanding the capacity of a backup appliance by moving its data to cloud object storage. In that solution the backup appliance manages two separate deduplication domains - an active tier with local storage and a cloud tier that stores data in the cloud. In both Cloud Tier and ObjDedup the data ends up deduplicated in object storage, but in our solution the backup appliance is an object storage provider, not a client. Our approach allows storing backups from multiple sources through a common object storage interface and deduplicating them with data provided by other interfaces.
Can't we all get along [9] is a study about increasing the performance of a filesystem with deduplication by enriching an HDD-based system with SSD drives. The performance of nonsequential accesses was increased significantly to allow operations such as Instant Restore. As in our research, the goal was to limit utilization of HDD drives, but to support different use cases. We state that a backup appliance should effectively implement multiple interfaces to reach the highest level of global deduplication, so the paper is complementary to our work.
3 Object Storage API details
Objects and buckets are the two building blocks of object storage. The object itself is a data stream with associated metadata. These metadata contain a key of size up to 1 KB uniquely identifying the object within a bucket, information about the object content (length, language, MD5 digest...), up to 2 KB of user custom metadata and some other fields related to ACLs, encryption and more. Unlike in filesystems, the object hierarchy is flat: each object is identified by its key and belongs to a bucket, but a bucket cannot be stored in a bucket.
The interface is built on the HTTP protocol using REST principles, so objects are uploaded by PUTs, deleted with DELETEs and so on. The interface can be very extensive and object storage operations can support features that have no counterpart in POSIX filesystems. For example, Amazon S3 currently has almost 100 commands: object management such as PutObject, GetObject, DeleteObject, ListObjects; security related such as PutBucketEncryption, PutObjectAcl, PutBucketOwnershipControls; commands allowing uploading an object in multiple parts such as CreateMultipartUpload, UploadPart, CompleteMultipartUpload; and others supporting tiering, replication and more. The most common features that had an impact on the design of our solution are described in more detail in the following sections.
The object storage interfaces of different vendors are constantly evolving and backup applications also change the set of used commands with time. For example, some backup applications started to use object storage specifically for its ObjectLocks very recently [38, 39]. We also observed that some applications do not use multipart uploads, but chunk backup data into smaller objects by themselves.
3.1 Object listing
Despite the fact that the bucket hierarchy is flat, objects can be managed in a manner similar to a hierarchical directory structure, because many commands operate on prefixes of object keys. First of all, ListObjects can filter objects based on a key prefix and group them based on a selected delimiter. Listing objects with delimiter "/" and prefix "mydirectory/" gives results corresponding to calling "ls" in "mydirectory" on a filesystem. However, in filesystems the character representing the delimiter is designated by the operating system (typically slash or backslash), whereas each ListObjects command can use different prefixes and delimiters. Prefixes are also sometimes used to indicate how data should be stored, for example it is common to scale performance per prefix.
ListObjects can return at most 1000 objects. If the selected bucket or prefix has more than 1000 objects, many ListObjects requests must be sent, specifying the beginning of the listing with a marker parameter.
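The prefix/delimiter grouping and marker-based pagination can be mimicked over a flat key set as in the Python sketch below. The 1000-object page size follows the limit mentioned above, while the function itself is an illustrative simplification of the real API rather than a faithful implementation of it.

```python
def list_objects(keys, prefix="", delimiter="", marker="", max_keys=1000):
    """Simplified ListObjects over a flat, sorted set of object keys."""
    contents, common_prefixes = [], []
    for key in sorted(keys):
        if len(contents) + len(common_prefixes) >= max_keys:
            break                                  # a further page starts after a marker
        if key <= marker or not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Keys sharing the same "directory" collapse into one common prefix.
            common = prefix + rest.split(delimiter, 1)[0] + delimiter
            if common not in common_prefixes:
                common_prefixes.append(common)
        else:
            contents.append(key)
    return contents, common_prefixes

keys = ["mydirectory/a.txt", "mydirectory/sub/b.txt", "other/c.txt"]
print(list_objects(keys, prefix="mydirectory/", delimiter="/"))
# -> (['mydirectory/a.txt'], ['mydirectory/sub/'])
```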
3.2 Multipart upload
Multi Part Upload (MPU) is an operation that allows uploading an object using multiple HTTP requests. It is recommended for larger objects, because it makes it possible to handle connection failures efficiently and multiply throughput. MPU starts with CreateMultipartUpload, which can be called even if the final size of the object is unknown, and is finished with CompleteMultipartUpload when all parts are uploaded. It can also be interrupted with AbortMultipartUpload, which deletes all previously written parts. An object uploaded using the multipart upload operation can consist of at most 10,000 parts and its size is limited to 5 TB. What is important, when the MPU is finished, each part of the object can still be accessed separately if its number is specified in the GetObject request.
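A minimal server-side sketch of the multipart upload lifecycle described above (create, upload parts, complete or abort) might look as follows. The in-memory dictionaries, identifiers and function names are illustrative assumptions; they only trace the sequence of operations, not any real service's implementation.

```python
import uuid

uploads = {}   # upload id -> {part number: part data}
objects = {}   # object key -> assembled object bytes

def create_multipart_upload():
    upload_id = uuid.uuid4().hex       # final object size may still be unknown
    uploads[upload_id] = {}
    return upload_id

def upload_part(upload_id, part_number, data: bytes):
    assert 1 <= part_number <= 10000   # per-object part limit noted above
    uploads[upload_id][part_number] = data

def complete_multipart_upload(upload_id, key):
    parts = uploads.pop(upload_id)
    objects[key] = b"".join(parts[n] for n in sorted(parts))

def abort_multipart_upload(upload_id):
    uploads.pop(upload_id, None)       # discard all previously written parts

uid = create_multipart_upload()
upload_part(uid, 1, b"hello ")
upload_part(uid, 2, b"world")
complete_multipart_upload(uid, "greeting.txt")
print(objects["greeting.txt"])  # -> b'hello world'
```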
3.3 Object versioning and locking
Object versioning allows storing multiple versions of the same object. If a bucket has versioning enabled then each PUT/POST/COPY operation creates a new version of the object. Furthermore, a default DELETE creates a delete marker instead of permanently deleting the object. Versions and delete markers can be reviewed using the ListObjectVersions operation, because ListObjects returns only the newest version of an object, or no object if a delete marker is its newest version. Each version and delete marker can be deleted permanently if the object's versionId is included in the DeleteObject call.
The object can be stored in a write-once-read-many (WORM) model to provide ransomware protection. First, an object version can have a configured retention period, which is a time in which the version cannot be deleted or overwritten. Alternatively, a legal hold can be placed on an object version, which prevents its removal or overwrite as long as the legal hold is not removed by the user. Both mechanisms can be used at the same time.
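As a small illustration of the WORM rules above, the check below allows deletion of an object version only when its retention period has passed and no legal hold is in place. The dictionary field names are assumptions made for this sketch, not the actual API fields.

```python
from datetime import datetime, timezone

def can_delete(version: dict) -> bool:
    """A version is deletable only if retention has expired and no legal hold is set."""
    now = datetime.now(timezone.utc)
    retention_over = version.get("retain_until") is None or version["retain_until"] <= now
    return retention_over and not version.get("legal_hold", False)

locked = {"retain_until": datetime(2100, 1, 1, tzinfo=timezone.utc), "legal_hold": False}
print(can_delete(locked))  # -> False: still within its retention period
```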
3.4 Consistency and scaling
For years, Amazon S3 had eventual consistency for read-after-write, so recently put objects might not be included in a ListObjects reply, or a GET request for them might return 404. The model was changed to strong consistency in 2020 [40], which made it similar to other object storage services that provide strong consistency [41, 42].
Providing strong consistency with high availability and scalable performance is challenging. That is why object storage services often need time to reorganize if the number of requests per bucket or prefix increases dramatically in a very short period [43, 44].
4 Backup data pattern analysis
Since backup appliances are used primarily for backups, their usage patterns result from user backup policies. To implement a new interface, we examined how customers use the system with the current interfaces. The examined data shows how much data is added to and removed from a system every week, independently of which interface was used to write the data. We expect that with the new object storage interface the behavior will be similar, although this is a topic for future research once object storage interfaces are commonly used in backup appliances.
We conducted the analysis to confirm that in backup appliances a huge amount of data is added and removed frequently. In our research we used the results to confirm that a huge number of metadata changes needs to be addressed by our metadata management method, but this data can be fundamental for other backup appliance research as well. Generally, backups are kept for their retention period, so a fraction of system capacity is regularly both written and removed. Typically, the total amount of data in each system increases gradually.
We examined the usage of 600+ systems, specifically 13000+ samples, each representing the capacity change over a single week. Analysis of weekly cycles is reasonable, because backups are often done on daily or weekly schedules. Moreover, the garbage-collection process in deduplication storage takes at least hours, so it is common to run it only once or twice a week, or at most once a day.
First, we examined how capacity usage changes within each sample. We calculated the maximal capacity increase and decrease (figures 11 and 12) for raw capacity (the amount that really goes to disk) and effective capacity (the amount before deduplication). We calculated their ratio as the quotient of the maximal capacity increase and the maximal capacity decrease. Typically, the ratio is in the 0.66-1.5 range, which means that a similar amount of data is written and deleted; however, other scenarios are also frequent. Despite the fact that in general more data is written than deleted, in 36%-40% of samples deletes exceeded writes.
A less expected observation is that there is a significant difference between the characteristics of raw and effective capacity if the increases and decreases are compared to the maximal system capacity in a given sample. For raw capacity, in most samples the changes were at most 10% of maximal system capacity, so the system is rather stable. However, for effective capacity the differences are more varied - the majority of samples had a difference over 10% of max capacity, and 23% over 20% of max capacity. These results are fundamental for our research, because object-related metadata needs to be kept even if object data is mostly deduplicated. That means the system needs to add and remove a large number of object metadata entries every week, especially when the backup application writes backups in smaller objects.
As the data on system capacity usage shows (figure 15), in most cases the system has a significant amount of free space. However, in 16% of cases the capacity usage was over 80%. In such a case quick space reclamation (within hours) may be required to handle all new load (as figures 13 and 14 show, increments over 20% can happen). The requirement for fairly quick space reclamation affected our design.
5 System Model
5.1 Deduplication system model
Our deduplication system, like many others [57], stores data in immutable blocks that can be deduplicated. References to blocks can be kept in other blocks, so data can be organized in a tree-shaped structure (see fig. 16). Each file or object has its node, which holds references to the trees that contain the data (DataTrees). Such a node can be referenced from a metadata structure (in the case of objects it is the ObjectMetadataTree described in section 6). Our system is content addressable, so the address of each block is determined based on its data.
Blocks that have no live references are eventually deleted. Deleting objects in systems with inline deduplication requires additional effort to prevent a situation in which new writes reference data that is being deleted. Typically, the system implements a multiphase process that uses a garbage collection technique such as reference counting or mark-and-sweep [58-60]. In our system we synchronize user operations using epochs as in [58]. That enforces an additional requirement on our object API implementation - we need to keep the invariant that in epoch T the client does not keep addresses obtained in epochs T-2 and earlier. In practice, it means that a driver must be able to finish all already started operations within a few minutes. Using the epochs, the system is able to limit the number of re-calculated counters to only blocks that were changed recently and their children.
As mentioned in section 2.1, deduplication can increase the latency of read and write operations. For that reason, the system allows sending priority requests that have preference in queues and, in the case of writes, are written to media with lower latency. Priority requests are used to provide lower latency for operations such as writes of small files.
5.2 Base System Architecture
In our research we focused on a system model that is common in enterprise backup appliances. The system size can vary from a few to thousands of nodes. Data can be written using multiple interfaces, so the storage layer is separated from the driver layer which implements the interfaces. As shown in Figure 17, we divided the system into two parts: DriverProcesses and StorageCluster. The ratio of DriverProcesses to nodes in StorageCluster can vary depending on the requirements for overall system performance.
DriverProcesses is a group of GenericDriverProcesses that load plugins for different interfaces and use them with a GenericDriverImplementation that reuses common operations such as writing data with deduplication, reading data, and ensuring QoS. StorageCluster is responsible for distributing data across devices, handling hardware errors, and performing garbage collection. StorageCluster is also the source of information about block existence for deduplication.
The communication between DriverProcesses and StorageCluster is done using a block interface. StorageCluster can write a block and return its BlockAddress, or read a block when a BlockAddress is provided. It can also answer whether a block is already stored when a hash of the block is given.
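A minimal sketch of this block interface is given below; the type and method names are hypothetical, introduced only to make the division of responsibilities concrete.

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass(frozen=True)
class BlockAddress:
    digest: bytes   # content-derived address, e.g. a hash of the block data

class BlockStore(Protocol):
    """The three calls DriverProcesses issue against StorageCluster."""

    def write_block(self, data: bytes, priority: bool = False) -> BlockAddress:
        """Store an immutable block and return its content-derived address."""
        ...

    def read_block(self, address: BlockAddress) -> bytes:
        """Return the block stored under `address`."""
        ...

    def block_exists(self, digest: bytes) -> Optional[BlockAddress]:
        """Answer a deduplication query: is a block with this hash already stored?"""
        ...
```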
6 Object Driver metadata
To implement an effective object storage interface in the system described in section 5, we required a metadata structure with the following characteristics:
1. Metadata management is not a bottleneck if possible, even considering the increased number of metadata operations due to deduplication and user load characteristics.
2. Metadata structures use the same immutable storage as data.
3. The blocks that keep references to deleted objects are removed quickly, so the space can be reclaimed after the next space reclamation process.
4. User requests have low latency, but only a fraction of metadata-related operations are performed as priority requests.
5. The impact of metadata management on StorageCluster is as low as possible, because we want to leave as many resources as possible to operations not related to object storage.
Requirement 3 was the reason why we did not use a log-structured merge tree based on static sorted tables. Such a structure has an excellent amortized cost of insertions when entries are immutable, but entries that are logically deleted are kept for a long time before they are physically removed. Instead, we propose an algorithm based on an immutable tree that is similar to a B+-tree and is effectively rewritten at short time intervals.
6.1 Metadata architecture
The high-level architecture of metadata management in ObjectDriver is shown in fig. 18. ObjectDriver handles REST API commands and processes them in RequestHandler. Operations require metadata, a subset of which is kept by InMemoryObjectMetadataStore. If a particular piece of metadata is not available in memory, then it is read from StorageCluster. ObjectDriver metadata is organized in ObjectMetadataTree, which is discussed in section 6.2. New writes of metadata are written by ObjectMetadataLogHandler to ObjectMetadataLog, which keeps them both persistently in StorageCluster and in memory. Insertions to ObjectMetadataLog are priority requests, so the latency of user requests is as low as possible.
When ObjectMetadataLog reaches its size limit, a new log is created and the old one is merged into ObjectMetadataTree. The performance of the merge determines the number of requests that can be handled in a given time frame, because we need to flush ObjectMetadataLog in order not to exceed memory limits. If the merges are too slow, the user load needs to be stopped; that is why a large part of our research and evaluation is related to the efficiency of the merge. Storage with deduplication has high read latency, thus prefetching is done during the merge.
6.2 Objects Metadata Tree
ObjectMetadataTree persistently keeps all metadata required to support Object API operations, as well as references to data trees. It is organized like a B+-tree, so internal nodes keep only references to other nodes and separators, and leaves keep all other metadata. There are a few different types of metadata structures kept in ObjectMetadataTree - BucketMetadata, ObjectMetadata and MpuMetadata. The structures are kept in an order determined by the metadata structure content, especially its type, bucket name, object key and object version (as shown in figure 19).
In ObjectMetadataTree, each node (besides the rightmost ones) has a size between 1/2 S and S, where S is the configured maximum size of each node. To keep the tree balanced, all paths from the root to leaves, except the rightmost one, have the same length. Both invariants are maintained by the MetadataMerge operation (section 6.4), which is the only operation that modifies ObjectMetadataTree.
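A small sketch of the ordering key and the size invariant follows; the concrete field types, the encoding of structure types and the value of S are assumptions made for illustration.

```python
from typing import NamedTuple

S = 64 * 1024   # assumed maximum node size; not the product's real configuration value

class MetadataSortKey(NamedTuple):
    """Entries are ordered by structure type, then bucket, object key and version,
    so BucketMetadata, ObjectMetadata and MpuMetadata entries stay clustered."""
    struct_type: int    # e.g. 0 = BucketMetadata, 1 = ObjectMetadata, 2 = MpuMetadata
    bucket: str
    object_key: str
    version: str

def node_size_valid(node_size: int, on_rightmost_edge: bool) -> bool:
    """Every node except those on the rightmost edge must have size in [S/2, S]."""
    return on_rightmost_edge or (S / 2 <= node_size <= S)
```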
6.3 ObjectMetadataLog
ObjectMetadataLog is a structure that collects all recently written metadata. Every metadata entry is written as a single priority request, so a response to the user request can be sent quickly (if there are many concurrent requests, they can be grouped to decrease the number of priority requests). The ObjectMetadataLog size is small enough to keep its content cached in memory, so its content is read from StorageCluster only in case of a failure (the failover can be conducted by a different ObjectDriver). Unlike ObjectMetadataTree, it also keeps information about which data should be removed, as deleteEntries that indicate which objects should be deleted later during the merge. Entries in ObjectMetadataLog are kept ordered by the time of their requests - this is important because the object storage interface allows empty deletes, so the situation in which a PUT came before a DELETE must be distinguishable from one in which the DELETE came before the PUT (as presented in figure 20).
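The sketch below illustrates the ordering requirement with a hypothetical in-memory representation of the log; persistence as priority requests and failover are omitted.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional

_seq = count()   # per-driver, monotonically increasing request sequence

@dataclass
class LogEntry:
    seq: int                    # request order: PUT-then-DELETE differs from DELETE-then-PUT
    key: str
    metadata: Optional[bytes]   # None marks a deleteEntry

class ObjectMetadataLog:
    def __init__(self, size_limit: int = 10_000):
        self.entries: List[LogEntry] = []
        self.size_limit = size_limit

    def append_put(self, key: str, metadata: bytes) -> None:
        # in the real system the entry is also persisted in StorageCluster as a priority request
        self.entries.append(LogEntry(next(_seq), key, metadata))

    def append_delete(self, key: str) -> None:
        # empty deletes are allowed by the interface, so they are always logged
        self.entries.append(LogEntry(next(_seq), key, None))

    def needs_merge(self) -> bool:
        return len(self.entries) >= self.size_limit
```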
6.4 Metadata Merge
MetadataMerge is the operation of rewriting ObjectMetadataTree by applying all changes from ObjectMetadataLog. Logically we rewrite the tree, but rewriting all data after filling ObjectMetadataLog is infeasible, so nodes from the existing ObjectMetadataTree are re-used as much as possible.
The MetadataMerge traverses ObjectMetadataTree in a DFS manner. For each node a decision is made whether the node with its whole subtree can be re-used in the new version of the tree, or whether any entry from ObjectMetadataLog needs to be applied to it. If a change in a sub-tree is required, then we walk down to the relevant leaf node and apply the changes relevant to it. If after the changes the node size is not in [1/2 S, S], the node is split into two, or the next node to the right is read in order to create a node or nodes of valid size. Regardless of the change in the number of children, all internal nodes on the path from the modified leaf to the root require modification, because blocks are immutable and the address of the leaf changes. If the size of an internal node is outside [1/2 S, S], the procedure is to split it in the same way as leaves. When the size of the root exceeds the limit, a new root node is added above it.
A single MPU delete entry can affect more than one node, because it deletes all MPU parts (up to 10,000 elements). In such a case both leaves (and their ancestors) that contain the start and end of the MPU range need modification, and all leaves in between are deleted.
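To make the leaf-level part of the merge concrete, the sketch below applies log changes to one leaf and re-cuts an overfull run of entries into nodes of valid size; it is a simplification (node capacity is counted in entries, parents and the rightmost-edge exception are not modeled), not the actual MetadataMerge code.

```python
from typing import Dict, List, Optional, Tuple

S = 8   # toy node capacity expressed in number of entries, for illustration only

def apply_log_to_leaf(leaf: List[Tuple[str, bytes]],
                      changes: Dict[str, Optional[bytes]]) -> List[Tuple[str, bytes]]:
    """Apply inserts/updates (value) and deletes (None) from the log to one leaf."""
    merged = {key: value for key, value in leaf}
    for key, value in changes.items():
        if value is None:
            merged.pop(key, None)    # deleting a non-existent key is simply ignored
        else:
            merged[key] = value
    return sorted(merged.items())

def split_into_valid_nodes(entries: List[Tuple[str, bytes]]) -> List[List[Tuple[str, bytes]]]:
    """Cut a run of entries into leaves of size [S/2, S]; the real merge instead borrows
    from the right sibling when a leaf would end up too small."""
    leaves, i = [], 0
    while i < len(entries):
        take = min(S, len(entries) - i)
        if 0 < len(entries) - i - take < S // 2:   # avoid leaving an undersized remainder
            take = (len(entries) - i + 1) // 2
        leaves.append(entries[i:i + take])
        i += take
    return leaves
```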
6.5 Leaf space reservation
Application of changes in ObjectMetadataTree can affect multiple consecutive leaves. For example, if ObjectMetadataLog contains many consecutive objects or MPU parts, then multiple consecutive leaves will be created. Typically, in such a situation we generate all leaves filled to around 3/4 S, so some additional inserts to a leaf are possible without modifying its siblings. However, in the case of two MPU parts (with part numbers a and b) of the same object, the only entries that can be placed between them are other MPU parts of the same object with particular numbers (in the range (a, b)). We can use that information to predict how much spare space, if any, is needed in a leaf.
6.6 ObjectMetadataTree Prefetch
A merge that iterates through ObjectMetadataTree and starts a new read request to StorageCluster each time a node intersects with any entry from ObjectMetadataLog would take an unacceptably long time, hence a prefetch is needed. We present a prefetch algorithm (referred to as OMT Prefetch) that reads from StorageCluster at most 2h(n - m) + 3hm = h(2n + m) nodes of ObjectMetadataTree, where h is the height of ObjectMetadataTree, n is the number of all entries in ObjectMetadataLog and m is the number of MPU delete entries in ObjectMetadataLog. The algorithm can be run in parallel, and at most h sequential read requests are required to be able to issue prefetches of all nodes. Our implementation iterates through nodes in the same order as the merge, but we keep it several thousand nodes ahead, so memory usage is limited.
We distinguish three types of operations: metadata inserts/updates, metadata deletes, and MPU deletes which remove metadata of multiple MPU parts. For each type of operation we need to prefetch a different set of nodes. A metadata insert/update can lead to leaf node overfill and thus require splitting the node and possibly its ancestor nodes too. A split does not require reading nodes other than the node being split. Therefore for insert/update we prefetch only nodes whose key ranges include the inserted key. A metadata delete can lead to a merge, or a merge and split, of a leaf node and possibly its ancestor nodes with the first succeeding nodes at the same levels. Therefore we prefetch only nodes whose key ranges include the deleted key and their first right siblings, except for siblings of leaf nodes. The right sibling of a leaf is not prefetched even though it may be needed to create a leaf of proper size. The leaf can be read later if needed without interrupting the tree walk, because the sizes of the leaf and its right sibling are stored in their parent, so all further steps of the tree walk can be done even if that particular node is not finished yet. With this optimization the algorithm never reads more leaves than necessary.
A single MPU removal can lead to the removal of multiple nodes and updates of no more than three nodes at each level of ObjectMetadataTree. It is enough to prefetch the nodes that intersect with the start and end of the MPU parts range, together with their ancestors and the right siblings of the MPU end path. All of the nodes in between will be removed.
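The per-operation prefetch choices can be summarized as in the sketch below, where a node is represented only by an opaque identifier and the caller supplies the root-to-leaf path(s) of the affected key(s); this is a schematic restatement of the rules above, not the implementation.

```python
from typing import List, Set

def prefetch_for_insert(path: List[str]) -> Set[str]:
    """Insert/update: a split never reads beyond the nodes on the key's root-to-leaf path."""
    return set(path)

def prefetch_for_delete(path: List[str], right_siblings: List[str]) -> Set[str]:
    """Delete: internal nodes may pull from their first right sibling; the leaf's sibling
    is skipped here and read lazily later, since its size is already known from the parent."""
    return set(path) | set(right_siblings[:-1])

def prefetch_for_mpu_delete(start_path: List[str], end_path: List[str],
                            end_right_siblings: List[str]) -> Set[str]:
    """MPU delete: both ends of the part range plus right siblings along the end path;
    nodes strictly between the two paths will be dropped, so they are never read."""
    return set(start_path) | set(end_path) | set(end_right_siblings)
```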
6.7 ObjectMetadataTree dimensions
The height of the tree is equal to h = ⌈log_d(n_o)⌉, where d is the degree of branching and n_o is the total number of objects. Even with 20000 StorageCluster nodes that have 12 HDDs of 14 TB, an average object size of 10 MB, a deduplication ratio of 20:1 and compression of 5:1, there are 3.36 * 10^13 objects, which means that with d = 64 at most nine levels of tree are required.
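One way to arrive at the nine-level figure is to assume that nodes may be as little as half full (minimum fanout d/2); the small back-of-the-envelope check below, with the numbers from the example above, is only an illustration of the bound.

```python
import math

d = 64                                    # configured branching degree
nodes, hdds, hdd_tb = 20_000, 12, 14      # StorageCluster size from the example above
dedup, compression = 20, 5
object_size = 10 * 10 ** 6                # 10 MB

raw_bytes = nodes * hdds * hdd_tb * 10 ** 12
logical_bytes = raw_bytes * dedup * compression
n_objects = logical_bytes // object_size                 # ~3.36e13 objects

fully_packed = math.ceil(math.log(n_objects, d))         # 8 levels with full nodes
half_full = math.ceil(math.log(n_objects, d // 2))       # 9 levels with minimum fanout
print(n_objects, fully_packed, half_full)
```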
The prefetch algorithm allows parallel reading (only h sequential read requests are required), so the number of operations is more important than the height of the tree. In the worst case, the ObjectMetadataTree Prefetch algorithm reads 2(h - 1)n internal nodes (or slightly more if there are MPU deletes) and 2n leaves, and the content of each read node is written. The writes can be processed as non-priority requests and they are grouped inside StorageCluster, so the reads determine the merge performance. The size of internal nodes is determined by the degree of branching and the size of keys, which can be up to 1 KB each. With d = 64 this means internal nodes are up to 64 KB, and in practice even less, because whole blocks are compressed by StorageCluster and large keys typically have a huge common part that compresses well.
We wanted to keep the branching factor relatively small, not only to reduce the size of internal blocks, but also due to the space reclamation algorithm, which requires modification of a file with reference counters for every child of a written node. If the fanout is big, then the number of nodes that require counter modification is huge, which significantly affects deletion time.
With an average object size of 10 MB, a deduplication ratio of 20:1 and compression of 5:1, each ObjectDriver needs to store metadata of 1.68 * 10^9 objects. Considering a maximal key length of 1 KB, all metadata takes over 1.7 TB. Considering 1 MB objects and an additional 2 KB of user metadata per object, it is over 50 TB. Assuming the current architecture of deduplication appliances, storing the ObjectMetadataTree in RAM or SSD is infeasible. In particular, even if there are some flash drives in deduplication appliances (which is not always the case), their capacity may already be used for other purposes. However, caching some (or even all) levels of internal nodes in RAM can be done to reduce the required IOs.
7 Distributed ObjectMetadataTree
The solution described in section 6 performs MetadataMerge using a single ObjectDriver, which can quickly become a bottleneck in a large cluster. There are no dependencies between the metadata of different buckets, so each bucket can be handled by a different ObjectDriver and have its own ObjectMetadataTree, but such an approach does not scale performance within a single bucket.
We present an algorithm for processing a single ObjectMetadataTree using multiple ObjectDrivers. Each of the ObjectDrivers can write its own ObjectMetadataLog. The logs are merged into ObjectMetadataTree in two phases (as shown in figure 21). In the first phase many disjoint trees called SubObjectMetadataTrees are generated, each one covering a given subrange of object keys. Then all these trees are merged using our SubObjectMetadataTrees merge algorithm explained in section 7.2.
7.1 Object key space division
The first phase of the distributed merge generates many disjoint SubObjectMetadataTrees. The idea is that the space of keys kept in ObjectMetadataTree is divided into many ranges, and for each range a separate SubObjectMetadataTree is generated by a different ObjectDriver. Each SubObjectMetadataTree is created by merging all relevant entries from all ObjectMetadataLogs with a subset of nodes from ObjectMetadataTree. That means each driver reads all ObjectMetadataLogs, but since this happens at exactly the same time, our StorageCluster can cache the data efficiently.
To distribute work evenly, each SubObjectMetadataTree should contain a similar number of nodes that require rewriting during the merge. We propose to first distribute objects to drivers randomly (for example by a hash function) and store them in the ObjectMetadataLogs of each ObjectDriver. Then, a single chosen driver calculates ranges for all drivers based on the content of its own ObjectMetadataLog. The ranges are broadcast to each driver, which creates its SubObjectMetadataTree by performing a single-ObjectDriver merge of the relevant entries from all ObjectMetadataLogs and a part of ObjectMetadataTree. Note that the distribution of objects to many ObjectDrivers impacts ListObjects, which is discussed in section 7.3. A sketch of this placement and range calculation is shown below.
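In the following sketch, the hash function, the sampling strategy and the equal-count split of the sample are illustrative assumptions rather than the exact scheme used in the product.

```python
import hashlib
from bisect import bisect_right
from typing import List, Sequence

def owning_driver(object_key: str, n_drivers: int) -> int:
    """Place an object on an ObjectDriver by a hash of its key (random, even spread)."""
    digest = hashlib.sha256(object_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_drivers

def ranges_from_sample(sample_keys: Sequence[str], n_drivers: int) -> List[str]:
    """The chosen driver derives per-driver key-range boundaries from its own (random)
    sample of the load, so each SubObjectMetadataTree gets a similar amount of work.
    The boundaries are then broadcast to all drivers."""
    ordered = sorted(sample_keys)
    step = max(1, len(ordered) // n_drivers)
    return [ordered[i] for i in range(step, len(ordered), step)][: n_drivers - 1]

def subtree_for_key(key: str, boundaries: List[str]) -> int:
    """Index of the SubObjectMetadataTree a key belongs to during the distributed merge."""
    return bisect_right(boundaries, key)
```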
7.2 SubObjectMetadataTrees merge
Our distributed algorithm generates multiple SubObjectMetadataTrees and efficiently merges them into one. Each of the SubObjectMetadataTrees covers a disjoint range of object keys. Nodes on the rightmost edge of each tree can have a size below 1/2 S; moreover, the heights of the trees may vary, so referencing all of the SubObjectMetadataTrees from a single root may not produce a new valid ObjectMetadataTree. However, for any two consecutive SubObjectMetadataTrees a new valid tree can be generated, and only the rightmost edge of the first tree and the leftmost edge of the second tree need to be read during such a construction.
First, read the rightmost edge of the first tree. According to the tree invariant, all other nodes of the first tree have correct sizes. In the next step, read the leftmost edge of the second tree. Starting from the lowest node in the leftmost edge of the second tree, add each entry from that node to the relevant node from the rightmost edge of the first tree. If the size of any node is exceeded, create a new node and add a reference to it on the level above (if the size of the node above is exceeded, repeat the process). The key observation is that if there are two nodes and at least one of them (the one from the second tree) has a valid size, then one or two valid nodes can be created in a way that leaves no leftovers. So in the end the only nodes of size less than the limit 1/2 S are: the node that contains entries from the root of the second tree, the rightmost edge of the second tree that was spliced without reading, and, if the first tree was higher, the upper nodes from the rightmost edge of the first tree. The operation is presented in Figure 22.
The operation of merging two trees reads only h1 + h2 nodes (where h1 and h2 are the heights of the first and second trees). Because the height of the metadata tree is low (up to 9, see section 6.7), the number of nodes processed by the SubObjectMetadataTrees merge is low enough to be processed by a single driver within negligible time compared to the first phase. This is valid even if there are many (e.g. 20000) trees to be merged.
7.3 ListObjects with many drivers
When objects are distributed by a hash function, all metadata operations for a given object are always conducted by the same ObjectDriver. Therefore the freshest metadata of the objects to be listed are spread across the InMemoryObjectMetadataStores of different ObjectDrivers. In the case of MPUs the whole operation is also handled by a single driver, which is not a problem because an MPU can have at most 10K parts. However, ListObjects is an operation that requires information about many objects. There is a single ObjectMetadataTree, so persistent data can be retrieved effectively by a bunch of operations from a single driver, and if many consecutive ListObjects requests are sent (because of the limit of 1000 returned objects), then StorageCluster can cache the repeating path in ObjectMetadataTree. However, the latest state of the listed objects can be available in ObjectMetadataLog and InMemoryObjectMetadataStore, and since objects are hashed, all ObjectDrivers need to be asked about the objects.
To prevent flooding all ObjectDrivers during ListObjects, the space of keys can be divided into ranges based on the current load in a system. If there are no requests, each driver is responsible for an equal number of ObjectMetadataTree keys. If any range of keys gets more requests than a driver can handle, the ranges are recalculated based on traffic reports sent to a distinguished driver, and multiple drivers can handle the same range using the hashing. In that way, asking multiple drivers during listing is required only if the particular range has recently received a lot of requests. Note that the traffic report gathering and range calculation may introduce a slight delay, but as explained in section 3.4, such a reorganization delay is acceptable.
7.4 Failover handling
If a process with an ObjectDriver fails (for example due to a hardware error), the failover procedure is started. All data was persistently stored in StorageCluster, but the content of its InMemoryObjectMetadataStore needs to be recovered from ObjectMetadataLog. The process also needs to perform MetadataMerge operations for the failed node in order to complete the ongoing merge. The node that handles the failover can have additional keys to manage, so the new key partitioning should take this change into account.
8 Evaluation
8.1 Files/Objects per second
In the first experiment we compared our object storage with deduplication and existing filesystems with deduplication in terms of the number of new files/objects that can be stored per second. As a baseline we selected three enterprise filesystems with inline deduplication (vendors are not disclosed). The filesystems were designed to store data on HDDs, but some of them have an option of adding faster storage, for example for a writeback journal. To make the solutions comparable, we conducted experiments in two setups - one that uses an HDD drive for all kinds of storage and one with a fast NVMe SSD for all kinds of storage. The detailed environment is described in table 1.
Figure JPOXMLDOC01-appb-T000001
The experiments were conducted using the MinIO client, which can copy data from a filesystem or object storage to a filesystem or object storage. The source was a filesystem directory that contained files with short names (8-32 chars) and small content (8-32 chars). The same directory was copied to the same destination twice, so in the second test all data (including filenames) was a duplicate. The filenames and content were deliberately that small, to show that competitors fall below requirements even in the simplest possible scenario. For each filesystem two experiments were done: with the filesystem mounted over NFS and with MinIO configured on top of the filesystem (to test the cases in which an S3 interface is a necessity).
As shown in figure 23, ObjectDriver can handle 4.6 - 8.35 times more operations than the fastest filesystem. The result was consistent with our expectations, because filesystems with inline deduplication are complex and it is very hard to handle small files effectively. The results, which in most cases are below 100 OP/s, are consistent with existing publications [61]. Moreover, the MinIO client copies the files to a temporary location and renames them afterwards to prevent listing of inconsistent files. It means that in fact two file operations (create and rename) are needed.
In comparison with the MinIO-on-top-of-a-filesystem solution, ObjectDriver can handle 5.26 - 11.34 times more puts. The result proves that our dedicated object storage implementation is superior to a solution composed of two elements.
The market of filesystems with inline deduplication is huge, so there are many solutions that we were unable to verify, but we have not found any benchmarks that showed a solution more efficient than the products we tested. We are also aware that there are protocols dedicated to particular backup applications (such as Veritas NetBackup OpenStorage) that generally have higher performance than NFS, but only a handful of backup applications support such an interface, so unlike the filesystem and object storage interfaces they cannot be used with many different backup applications.
8.2 Backup Throughput
In this experiment we used COSBench [62], a framework for performance testing of object storage that is recognized by the object storage community [63]. COSBench does not support a distinction between loads with duplicates and non-duplicates, so we modified it to generate the content of each object deterministically from the object name in a way that the object does not contain internal duplicates. In all tests object keys were created with the COSBench uniform generator. A sketch of the idea behind such a generator follows.
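The snippet below is one way such deterministic, duplicate-free content generation can look; it conveys the idea rather than our actual COSBench modification.

```python
import hashlib

def object_content(name: str, size: int) -> bytes:
    """Derive `size` bytes from the object name, so reruns of the same workload produce
    duplicates, while a block counter avoids duplicates inside a single object."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{name}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])
```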
As a baseline we used MinIO and RadosGW, which do not provide inline deduplication. As described in detail in section 2.3, we have not found a single object storage with inline deduplication that could be used in experiments or has comparable results published. The experiment was conducted using one server with 12 hard drives (details in table 2). Both MinIO and RadosGW were set up using CentOS 7. MinIO used XFS as the underlying filesystem and RadosGW used BlueStore, which are recommended for high performance.
Figure JPOXMLDOC01-appb-T000002
As shown in figure 24 with a logarithmic scale on the y-axis, in the case of a duplicate load ObjectDriver has 1.8 - 3.83 times higher throughput, which is expected, because in the case of duplicate writes inline deduplication throughput can overcome the limits of HDDs. In the case of non-duplicates, ObjectDriver achieves at most 488-595 MB/s, which is close to the limit of the block interface of StorageCluster with one Storage Node. For objects up to 32 MB the non-duplicate performance is comparable to or even better than MinIO/Ceph, but with larger objects MinIO is faster, because it does not chunk files into small blocks for deduplication.
8.3 ObjectMetadataTree details
If the MetadataMerge takes too long, the performance of ObjectDriver drops. For example, if in a single merge we change 50000 entries and the merge takes 100 s, then we cannot sustain more than 500 PUT/s. In this section we show details of MetadataMerge performance under various circumstances.
All ObjectMetadataTree experiments were conducted using a two-node setup in which one server (node 1) hosted both the objectApiDriver and a StorageCluster node, and node 2 hosted only a StorageCluster node (hardware details in Table 3).
Figure JPOXMLDOC01-appb-T000003
8.3.1 ObjectMetadataTree merge
In the first experiment we verified how the number of objects in ObjectMetadataTree affects the time of the merge operation. A merge was triggered after each next 50K objects were added using PUT commands called in parallel by 1000 workers. We verified two types of loads. In the sequential workload, each object key consists of a constant part (50 characters) ended with two concatenated natural numbers: an ID of the worker and a sequential number of the object. In the random workload, each object key is a random UUID. The cluster was not handling load other than the PUTs, so we included results of two additional experiments in which read operations on the StorageCluster side were delayed by 150 ms, which is a typical latency observed in our system in production use cases.
As shown in Figure 25, in the sequential workload the merge takes up to 10 seconds, even with the introduced delay. Despite the fact that the data is written to 1000 different parts of the tree, the data in each part is written sequentially, so many objects are added to new leaf nodes and only a limited number of internal nodes require rewriting.
In the case of the random load, the merge time plot looks more like a linear than a logarithmic function due to the following two reasons. First, the number of internal nodes is small (only about 10K with 25M objects), so almost all of them need to be rewritten in each merge when 50K random objects are written. With an increasing number of internal nodes, the number of nodes not modified by the merge will increase as well, so the curve will start to flatten even more. The second effect is a decreasing efficiency of StorageCluster internal caching - in a larger tree there is less data locality. In the Rand and Rand-delay tests, storage nodes kept the merge tree using 9+3 erasure codes, so a read of a block needs 9 disk accesses. For comparison, the Rand2 and Rand2-delay tests were using 3+9 codes, which prevented disks from being overwhelmed. There are erasure coding techniques that reduce the number of IOs required during reads, but they are not implemented in our StorageCluster. The other option is to store all ObjectMetadataTree internal nodes (or at least a big fraction of them) in memory to decrease the number of IOs for internal node reads.
8.3.2 Impact of key and metadata size
In section 8.3.1 we described experiments with a limited key size. Amazon S3 allows a key size up to 1 KB and metadata attached to an object up to 2 KB. In our system we store full keys in each node of ObjectMetadataTree, so the key size affects the amount of data that needs to be written. Object metadata is not needed in internal nodes, but it can be stored in leaves, so no additional IO is needed to perform a HEAD operation.
Figure 26 shows how extensive key and metadata sizes affect merge time. If a key has 1 KB but its entropy is low (noted as largeKeyCompressable), the data that reaches StorageCluster is compressed, so there is hardly any impact of the increased key size. We consider compressible keys a common case, because even if the key is long and consists of randomly generated UUIDs, objects within one leaf should share a common part of it. When the content of a key is completely random for each object and the key has 1 KB, then the disks have more load and the effectiveness of the StorageCluster cache decreases, so the merge takes almost twice as long. The effect increases if an additional 2 KB of uncompressed metadata is stored in leaves. That is why storing object metadata together with object data should be considered if the object metadata is huge.
8.3.3 Prefetch algorithms
Data prefetching is an absolute necessity to perform the merge in a reasonable time. Waiting for reads one after another increases the merge time to tens of minutes even with a very small number of objects (for example 23 minutes with only 600k objects in the system, see Figure 27). Additionally, the prefetch should not read too much unneeded data. Figure 28 compares OMT Prefetch with a simple algorithm that prefetches all internal nodes in the seq-delay scenario from section 8.3.1. The total number of internal nodes increases quickly, but only a fraction of them really needs to be read.
8.4 Distributed merge
In the distributed setup, each ObjectDriver can write its own ObjectMetadataLog, so the number of entries processed in each merge can be increased without changing the maximal size of each ObjectMetadataLog per ObjectDriver. Figure 29 shows how the merge time changes if the number of ObjectDrivers and StorageCluster nodes (one per driver) increases with the number of entries to be processed. In the experiment object keys were generated randomly, as in section 8.3.1.
Despite the overhead of merge distribution, 4 nodes are able to merge 200k objects into a tree of the same size faster than 1 node merges 50k objects. That is because with more objects in a single merge, the ratio of changed internal blocks to leaf blocks decreases.
Of course, with an increased number of ObjectDrivers and StorageCluster nodes, the number of objects in the whole system can increase. Taking into account linear system growth, there is some loss resulting from the load distribution overhead and the increased height of subtrees. For example, a 1-server merge with 25M objects in the system takes 42.56 s, 50M objects with 2 servers takes 50.713 s, and 100M objects with 4 servers takes 56.78 s. However, the test uses random reads that are distributed across the whole ObjectMetadataTree, so if a backup job affects only a subset of the tree of constant size, such a job can be done more than 4 times faster using 4 ObjectDrivers.
9 Conclusions
Writing backups with an object interface is becoming a common feature of backup applications. As our research shows, a backup appliance can efficiently implement an object storage interface. ObjDedup can handle more requests per second than a solution that uses a filesystem with deduplication underneath, and in comparison with storage without deduplication it offers not only reduced capacity usage, but also higher throughput when writing duplicate data.
The trend to upload backups as not very large objects challenges backup appliances that store most or all of their data on hard drives. With our metadata structure, the following scenarios result in performance below our expectations: storing backups in too small objects, attaching very large keys and metadata to every object that do not compress with their neighbors, and writing objects without any locality in their keys. We have not yet seen a backup application that writes in such a malicious pattern, but the market of object storage backup is still evolving. We hope that our research will put object backups on the right track.
<Supplementary Notes>
The whole or part of the exemplary embodiments disclosed above can be described as the following supplementary notes. Below, the overview of a processing apparatus and so on according to the present invention will be described. However, the present invention is not limited to the following configurations.
(Supplementary Note 1)
A processing apparatus comprising an object driver including:
an acquisition unit configured to acquire a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and
a generation unit configured to generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the chunk acquired by the acquisition unit,
wherein the sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
(Supplementary Note 2)
The processing apparatus according to Supplementary Note 1, comprising
a distribution unit configured to divide the object into a plurality of chunks and, based on the hash value of the chunk obtained by the division, distribute the chunk to any of a plurality of object drivers.
(Supplementary Note 3)
The processing apparatus according to Supplementary Note 2, wherein
the distribution unit is configured to, based on the hash value of the chunk obtained by the division, distribute the chunk to any of the plurality of object drivers so that a process to generate the sub metadata tree is performed evenly by each of the plurality of object drivers.
(Supplementary Note 4)
The processing apparatus according to Supplementary Note 2, wherein
the distribution unit is configured to distribute the chunk corresponding to the specified object driver, the object driver being specified in accordance with a range to which the hash value of the chunk obtained by the division belongs.
(Supplementary Note 5)
The processing apparatus according to Supplementary Note 1, comprising
a metadata tree generation unit configured to generate the metadata tree by merging the sub metadata trees generated by a plurality of object drivers.
(Supplementary Note 6)
The processing apparatus according to Supplementary Note 5, wherein
the metadata tree generation unit is configured to repeatedly perform a process to merge a set of sub metadata trees to merge a plurality of sub metadata trees and generate the metadata tree.
(Supplementary Note 7)
The processing apparatus according to Supplementary Note 1, wherein
the generation unit is configured to generate the sub metadata tree based on the metadata log stored in a memory.
(Supplementary Note 8)
The processing apparatus according to Supplementary Note 1, comprising
a writing unit configured to store the metadata log indicating a content of metadata of the chunk acquired by the acquisition unit into a memory,
wherein the generation unit is configured to generate the sub metadata tree based on the metadata log stored into the memory by the writing unit.
(Supplementary Note 9)
A processing method executed by an information processing apparatus, the processing method comprising:
acquiring a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and
generating a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunk,
wherein the sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
(Supplementary Note 10)
A non-transitory computer-readable recording medium having a program recorded thereon, the program comprising instructions for causing an information processing apparatus to realize processes to:
acquire a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and
generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunk,
wherein the sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
The program disclosed in the exemplary embodiments and supplementary notes is stored in a storage device or recorded on a computer-readable recording medium. For example, the recording medium is a portable medium such as a flexible disk, an optical disk, a magneto-optical disk and a semiconductor memory.
Although the present invention has been described above referring to the exemplary embodiments, the present invention is not limited to the exemplary embodiments described above. The configurations and details of the present invention can be changed and modified in various manners that can be understood by one skilled in the art within the scope of the present invention.
100 STORAGE SYSTEM
200 PROCESSING NODE
210 OBJECT DRIVER
211 HTTP SERVER
212 REQUEST PROCESSING UNIT
213 METADATA LOG WRITING UNIT
214 METADATA INFORMATION STORAGE UNIT
215 ACQUISITION UNIT
216 STORING UNIT
217 SUB METADATA TREE GENERATION UNIT
220 FILE SYSTEM DRIVER
230 VTL DRIVER
240 EXECUTION UNIT
241 DIVISION AND DISTRIBUTION UNIT
242 METADATA TREE GENERATION UNIT
300 STORAGE NODE
401 CPU
402 ROM
403 RAM
404 programs
405 storage device
406 drive device
407 communication interface
408 input/output interface
409 bus
410 recording medium
411 communication network
421 ACQUISITION UNIT
422 GENERATION UNIT

Claims (10)

  1. A processing apparatus comprising an object driver including:
    an acquisition unit configured to acquire a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and
    a generation unit configured to generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the chunk acquired by the acquisition unit,
    wherein the sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
  2. The processing apparatus according to Claim 1, comprising
    a distribution unit configured to divide the object into a plurality of chunks and, based on the hash value of the chunk obtained by the division, distribute the chunk to any of a plurality of object drivers.
  3. The processing apparatus according to Claim 2, wherein
    the distribution unit is configured to, based on the hash value of the chunk obtained by the division, distribute the chunk to any of the plurality of object drivers so that a process to generate the sub metadata tree is performed evenly by each of the plurality of object drivers.
  4. The processing apparatus according to Claim 2, wherein
    the distribution unit is configured to distribute the chunk corresponding to the specified object driver, the object driver being specified in accordance with a range to which the hash value of the chunk obtained by the division belongs.
  5. The processing apparatus according to Claim 1, comprising
    a metadata tree generation unit configured to generate the metadata tree by merging the sub metadata trees generated by a plurality of object drivers.
  6. The processing apparatus according to Claim 5, wherein
    the metadata tree generation unit is configured to repeatedly perform a process to merge a set of sub metadata trees to merge a plurality of sub metadata trees and generate the metadata tree.
  7. The processing apparatus according to Claim 1, wherein
    the generation unit is configured to generate the sub metadata tree based on the metadata log stored in a memory.
  8. The processing apparatus according to Claim 1, comprising
    a writing unit configured to store the metadata log indicating a content of metadata of the chunk acquired by the acquisition unit into a memory,
    wherein the generation unit is configured to generate the sub metadata tree based on the metadata log stored into the memory by the writing unit.
  9. A processing method executed by an information processing apparatus, the processing method comprising:
    acquiring a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and
    generating a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunk,
    wherein the sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
  10. A non-transitory computer-readable recording medium having a program recorded thereon, the program comprising instructions for causing an information processing apparatus to realize processes to:
    acquire a chunk distributed based on a hash value of the chunk, the chunk being obtained by division of an object; and
    generate a sub metadata tree having a tree structure holding metadata based on a metadata log of the acquired chunk,
    wherein the sub metadata tree is merged with another sub metadata tree to configure a metadata tree holding metadata of the object.
PCT/JP2022/032921 2021-09-02 2022-09-01 Processing apparatus Ceased WO2023033100A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163240114P 2021-09-02 2021-09-02
US63/240,114 2021-09-02

Publications (1)

Publication Number Publication Date
WO2023033100A1 true WO2023033100A1 (en) 2023-03-09

Family

ID=85412327

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/032921 Ceased WO2023033100A1 (en) 2021-09-02 2022-09-01 Processing apparatus

Country Status (1)

Country Link
WO (1) WO2023033100A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8756249B1 (en) * 2011-08-23 2014-06-17 Emc Corporation Method and apparatus for efficiently searching data in a storage system
US20150095347A1 (en) * 2013-10-02 2015-04-02 Netapp, Inc. Extent hashing technique for distributed storage architecture
US20200012619A1 (en) * 2018-07-03 2020-01-09 Cohesity, Inc. Using a storage system to optimize and maintain the metadata associated with a plurality of small files


Similar Documents

Publication Publication Date Title
JP5671615B2 (en) Map Reduce Instant Distributed File System
US10705919B2 (en) Data backup using metadata mapping
Liu et al. Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS
JP5500257B2 (en) Storage system
US11321192B2 (en) Restoration of specified content from an archive
EP2342634B1 (en) Partition management in a partitioned, scalable, and available structured storage
Mei et al. LSM-tree managed storage for large-scale key-value store
US10769025B2 (en) Indexing a relationship structure of a filesystem
KR20220137632A (en) Data management system and control method
US20190361850A1 (en) Information processing system and information processing apparatus
US10387271B2 (en) File system storage in cloud using data and metadata merkle trees
EP3215940A2 (en) Data management system
US20200241757A1 (en) Staggered full partitioned snapshots
US20230394010A1 (en) File system metadata deduplication
US12061525B2 (en) Object store data management container with integrated snapshot difference interface for cataloging snapshots while resident in object store
CN113795827B (en) Garbage Collection for Deduplication Cloud Tiering
Carstoiu et al. Hadoop hbase-0.20. 2 performance evaluation
Renner et al. Addressing hadoop's small file problem with an appendable archive file format
Zhao et al. An end-to-end high-performance deduplication scheme for Docker registries and Docker container storage systems
Xu et al. YuruBackup: a space-efficient and highly scalable incremental backup system in the cloud
US9594635B2 (en) Systems and methods for sequential resilvering
US7685186B2 (en) Optimized and robust in-place data transformation
Jackowski et al. Objdedup: High-throughput object storage layer for backup systems with block-level deduplication
WO2023033100A1 (en) Processing apparatus
Krzyzanowski Distributed File Systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22864682

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22864682

Country of ref document: EP

Kind code of ref document: A1