WO2018165957A1

WO2018165957A1 - Log-appended-structured storage management with byte-level accessibility

Info

Publication number: WO2018165957A1
Application number: PCT/CN2017/076999
Authority: WO
Inventors: Jinglei REN; Anirudh Badam; Thomas Moscibroda
Original assignee: Microsoft Technology Licensing LLC
Current assignee: Microsoft Technology Licensing LLC
Priority date: 2017-03-16
Filing date: 2017-03-16
Publication date: 2018-09-20
Anticipated expiration: 2019-09-16

Abstract

Implementations of the subject matter described herein provide a non-volatile memory (NVM) based storage management solution. In this solution, a log-appended-structured NVM based storage system is provided. In such log-appended-structured storage system, the plurality of data blocks are sequentially appended and having non-fixed block sizes. The storage system further includes a mapping structure including a plurality of address mappings, and each of the address mappings associates a home address (accessible to an application) with a log address (a physical address of the data block) for a data block. Furthermore, the storage system includes a storage manager, which is configured to control various operations of the storage system. By utilizing such log-appended-structured NVM with its associated address mapping mechanism, the issue of fragmentation can be significantly reduced with high efficiency and low cost. Meanwhile, with the aid of the mapping structure, byte-level look-up granularity can be achieved, which renders byte-level addressability.

Description

LOG-APPENDED-STRUCTURED STORAGE MANAGEMENT WITH BYTE-LEVEL ACCESSIBILITY

BACKGROUND

Emerging non-volatile memory (NVM) unlocks the performance potential of applications by enabling storage of data in the main memory (traditionally， such data can only be stored in disks) in a long term. However， storage of data in NVM brings several critical problems to memory space management， one of which is the memory fragmentation. That is， unused spaces cannot be allocated with other data， which will cause significant wastage under certain workloads. Since data will be held for a long term in NVM， the fragmentation may become especially severe..

SUMMARY

Implementations of the subject matter described herein provide a “log-structured” NVM and associated storage management solution with byte addressability. Generally speaking， the whole NVM is organized in the form of logs， which act as a virtual home space at the same time. Such log-appended-structured storage system eliminates the dichotomy of memory data between home space and log space， and unifies those two spaces， so that applications view the NVM in a same way as the traditional home space. In such log-appended-structured NVM， the plurality of data blocks are sequentially appended and having non-fixed block sizes. The storage system further includes a mapping structure including a plurality of address mappings， and each of the address mappings associates a home address (accessible to an application) with a log address (a physical address of the data block) for a data block. Furthermore， the storage system includes a storage manager， which is configured to control various operations of the storage system.

It is to be understood that the Summary is not intended to identify key or essential features of implementations of the subject matter described herein， nor is it intended to be used to limit the scope of the subject matter described herein. Other features of the subject matter described herein will become easily comprehensible through the description below.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other objectives， features and advantages of the subject matter described herein will become more apparent through more detailed depiction of example implementations of the subject matter described herein in conjunction with the accompanying drawings， wherein in the example implementations of the subject matter described herein， same reference numerals usually represent same components.

FIG. 1 illustrates a schematic of the storage system according to an implementation of the subject matter described herein；

FIG. 2 illustrates a typical tree structure that represents the mapping table as shown in FIG. 1；

FIG. 3 illustrates a flow chart of a process performed by the storage manager according to an implementation of the subject matter described herein；

FIG. 4 illustrates a schematic two-layer mapping structure according to an implementation of the subject matter described herein；

FIG. 5 shows a flow chart illustrating the an example method for accessing a node in the tree when there is a cache equipped with the storage system， according to an implementation of the subject matter described herein；

FIG. 6 shows a flow chart of an example method for writing data into NVM of the storage system according to an implementation of the subject matter described herein； and

FIG. 7 illustrates a block diagram of an example implementation of the storage manager in which one or more implementations of the subject matter described herein may be implemented.

DETAILED DESCRIPTION

The subject matter described herein will now be discussed with reference to several example implementations. It should be understood these implementations are discussed only for the purpose of enabling those skilled persons in the art to better understand and thus implement the subject matter described herein， rather than suggesting any limitations on the scope of the subject matter.

As used herein， the term “includes” and its variants are to be read as open terms that mean “includes， but is not limited to. ” The term “based on” is to be read as “based at least in part on. ” The term “one implementation” and “an implementation” are to be read as “at least one implementation. ” The term “another implementation” is to be read as “at least one other implementation. ” The terms “first， ” “second， ” and the like may refer to different or same objects. Other definitions， explicit and implicit， may be included below. A definition of a term is consistent throughout the description unless the context clearly indicates otherwise.

Most current storage management systems adopt the legacy memory management mechanism that is designed for dynamic random access memory (DRAM) . Typically， such DRAM-oriented storage management solutions divide the whole physical memory space into blocks of different pre-defined sizes， and a minimal-sized block is chosen to serve an allocation request. This method brings excessive memory fragments. However， in conventional DRAM-oriented systems， data is not intended to be persistently stored in a DRAM， thus simply powering the DRAM off may easily solve the fragmentation issue existed in DRAM.

Now with the use of NVM-oriented storage system， data can be stored persistently in NVM. In other words， the data stored in NVM would not go away by simply powering the NVM off. In this case， how to deal with the issue of external fragmentation existed in NVM would become challenging.

In general， there are two sources of fragmentation. One is called “internal” fragmentation. For example， supposing that an allocator aligns any NVM allocation to fixed 64 Byte (64B) . At present， for example， if 65B of NVM is requested， it shall effectively allocate 128B. In this case， an “internal” fragmentation of 63B is created consequently. The other one is called “external” fragmentation. Again， supposing that a 64B data block is freed or discarded but the resulting freed space is still surrounded by data blocks in use. In this case， the freed space cannot serve any data allocation request beyond 64B. As discussed above， such external fragmentation would become even severe if allocated block sizes vary.

Fragmentation can be reduced by garbage collection that is capable of changing allocated addresses by moving allocated data. However， such a process involves time-consuming object reference analysis and process pauses. Further， this technique is typically used in a managed language runtime， so no byte-addressability is provided to programs， which means a program cannot access an arbitrary address.

In order to at least partially solve the above and other potential problems， a novel log-appended-structured NVM-based storage system associated with its storage management mechanism are proposed herein. By utilizing such log-appended-structured NVM and its associated address mappings， the issue of fragmentation can be effectively avoided or significantly reduced. Meanwhile， byte-level look-up granularity can be achieved， which renders byte-level addressability. In addition， efficient memory management could be achieved， which further avoids redundant bus traffic and write wear.

FIG. 1 illustrates a schematic of the storage system 1000 according to an implementation of the subject matter described herein. As illustrated in FIG. 1， the storage system 1000 includes a non-volatile storage (NVM) 100. As shown， the NVM 100 is configured to store a plurality of data blocks (D₁， D₂， ...， D_n， collectively referred to as “D” ) that are sequentially appended (or logged) and having non-fixed block sizes. That is， a newly allocated data block will be always “compactly” appended to the end of the log， and the block size of each appended data block may have an arbitrary block size， even tens of bytes， per request.

It is to be noted that although the NVM 100 as shown in FIG. 1 includes an array of NVM chunks (represented by several stripes) ， in other examples， the storage 100 may only include a single NVM chunk.

It is also to be noted that although the plurality of data blocks (D₁， D₂， ...， D_n) as shown in FIG. 1 are sequentially appended (starting from the first NVM chunk) in an order of those data blocks being stored into the NVM. However， a plurality of data blocks that are sequentially appended in each chunk， but not following an order of storing should also be covered herein.

It is further to be noted that the term “compactly” herein does not have to be understood as “immediately. ” That is， a small gap may exist between any of two neighboring appended data blocks per request. Such small gap (for example， 1B) could be for example the header of the data block， and thus usually has much smaller length compared to the length of the data block.

Those sequentially appended data blocks as shown in FIG. 1 forms a data structure which can be called “log-appended-structure. ” Traditionally， persistent transactional memory (PTM) always builds a logging system atop NVM in order to maintain crash consistency of data. In this case， NVM needs to hold not only the home space of data that is “viewable” by applications， but also a separate log space hidden from the application as the physical space of data.

On the contrary， in accordance with implementations of the subject matter described herein， the sequentially organized data structure in NVM 100 as shown in FIG. 1 eliminates the “dichotomy” of NVM data introduced by the traditional memory management and PTM. In other words， the log-appended-structured NVM-based storage 100 according to the subject matter described herein unifies the home space and the log space. As illustrated in FIG. 1， the whole NVM itself is physically organized in a form of logs， and at the same time the NVM also acts as the home space exposed to the application.

In light of the inventor’s discovery and observations， conventional NVM containing a home space and a separate log space， as discussed above， cannot perform the defragmentation. On the contrary， physically log-appended-structured NVM according to various implementations of the subject matter described herein can effectively solve the “internal” fragmentation by nature， due to the fact that the data allocation now becomes an compactly append to the end of a log. Furthermore， the “external” fragmentation in the log-appended-structured NVM would probably still need to be solved. This is because some of previously appended data blocks might not be needed anymore and thus will be freed or discarded， as what may occur， for example， in the conventional DRAM-oriented system.

Therefore， a defragmentation procedure becomes necessary for a NVM based system. During the defragmentation procedure， moving allocated data and merging (or coalescing) free spaces can “absorb” external fragmentation without stopping the process. As a result， some data blocks in use might be moved to some other space (or addresses) . In this case， when any of those moved data blocks needs to be accessed， the address mapping is necessary for locating the physical address where the data blocks are currently stored， which further makes the performance or efficiency of the address mapping becomes crucial， especially when the defragmentation is performed.

Compared with conventional solid-state disks (SSDs) ， much faster NVM-based storage system can effectively solve the dominated latency issue (such as the latency from the SSDs) in existing log-appended-structured systems. Hence， optimization to the mapping structure now can be expected to result in a further improved performance of the address mapping.

Still with reference to FIG. 1， the storage system 1000 further includes a mapping structure 200. The mapping structure 200 as shown including a plurality of address mappings， and each of the address mappings associates a home address 202 with a log address 204 for one data block D. Herein， the home address indicates an address that is accessible to an application， and the log address indicates a physical address of the data block stored in the NVM 100. As an example， home address “0x789” is mapped to the log address “0x1aa” for the data block D₃ stored in the NVM 100， where “0x” indicates hexadecimal. The correspondence between data block D₃ and the home address “0x789” (and its log address “0x1aa” ) is indicated by the arrow 150.

It is to be noted that in this and other examples throughout the context of the present disclosure， the home addresses or log addresses all indicate the staring addresses unless the context clearly indicates otherwise. As further illustrated in FIG. 1， the third column of the mapping table 200 indicates the length 206 for each allocated data block. The starting address and the length together defines an area (or address scope) in the home space and in the log space.

It is also to be understood that although the mapping structure 200 as shown in FIG. 1 is in the form of a mapping table (for example， a hash table) ， this is merely for ease of discussion without suggesting any limitations as to the scope of the subject matter described herein. In some other implementations， the mapping structure 200 may take other forms， such as a fully balanced binary tree structure or a skip-list data structure， which will be further described later.

As discussed above， since the log-appended-structured data blocks stored in NVM 100 allows for arbitrary block size， such log-appended-structured NVM 100 containing byte-level data blocks “intrinsically” enables the mapping structure 200 as shown in FIG. 1 to support byte-level accessibility/addressability， or in other words， to have byte-level look-up granularity.

However， such log-appended-structured system， on the other hand， targets an arbitrary starting address with any length. In some cases， such log-appended-structured data structure with byte-level addressability may face a unique challenge of address mapping overhead. In light of the inventor’s discovery and observations， other log-appended-structured systems having basic data elements with fixed (or predefined) block size can simply use， for example， a hash table or a bloom filter for the mapping， thus other systems may not face this challenge described as above.

This address mapping overhead may become even severe， when data blocks are moved frequently in order to perform the defragmentation (that is， the consolidation of fragments) . This is because， during the procedure of defragmentation， one or more data blocks proximate to or immediately surrounding a fragment will be moved， which in turn leads to a change of mapping associating the home address with the log address， that is mapping the same home address to an updated/new log address.

It is to be noted that in a conventional Hash table， sometimes the (starting) home address that is to be accessed by the application is different from the created (starting) home address. In this case， if the granularity (or length) of the home address is also unknown， it would cause a failed access operation， because the application cannot even access a home address that is closest to the to-be-accessed home address. It might be possible to predefine a certain granularity of the created home address， for example， to make the granularity of the home addresses small enough， such as 8B， so that the application may fast locate the to-be-accessed home address or at least the home address which is located closest to the home address that is to be accessed. However， such hash table with small or ultra-small block size for use in a large NVM would be oversized. Therefore， in order to achieve efficient and low cost mappings associated with the log-appended-structured NVM described as above， in some implementations， the mapping structure 200 can be organized as a tree structure consisting of a plurality of chained nodes covering the whole address space of the NVM 100. Each of nodes includes home address， length for a data block， and log address that the home address is mapped to. For example， logically， one node in the tree holds a pair {home address， length} denoting an area in the home space， and the log address that the area is mapped to.

FIG. 2 illustrates a typical tree structure 200 that represents the mapping table 200 as shown in FIG. 1. As shown， node 201 with the home address “0x456” is selected as the root node of the tree structure， with node 202 with the home address “0x123” and node 203 with the home address “0x789” immediately downstream to root node as intermediate nodes. For ease of illustration， only two levels of the tree structure 200 are shown in FIG. 2， and further downstream nodes (or lower levels of nodes or children nodes) are omitted from FIG. 2.

As an example， the root node 201 includes the home address “0x456， ” the length “140B” for the data block D₂， and the log address “0x11e， ” and node 202 includes the home address “0x123， ” the length “30B” for data block D₁， and the log address “0x100， ” and node 203 includes information about the home address “0x789， ” the length “70B” for data block D₃， and the log address “0x1aa. ”

It can be seen from FIGs. 1 and 2 that due to the fact data blocks stored in the log-appended-structured NVM-based storage system are sequentially appended， the log addresses for those data blocks are contiguous. In other words， the (starting) log address and the length for a data block both determine the (starting) log address for the sequentially appended data block. For example， node 202 with the log address “0x100” and the length “30B” for the data block D₁ determines the log address “0x11e” for data block D₂ that is sequentially appended to D₁ (0x100 + 30B ＝ 0x11e) . Similarly， node 201 with the log address “0x11e” and the length “140B” for a data block D₂ determines the log address “0x1aa” for the data block D₃ that is sequentially appended to D₂ (0x1 1e + 140B ＝ 0x1aa) .

In some implementations， the tree structure 200 is based on a fully balanced tree. For example， as shown in FIG. 2， the fully balanced tree could be a binary tree where each node has at most two children nodes， which sometimes are referred to as the left child node and the right child node. In the example shown in FIG. 2， node 201 has only two children nodes with node 202 as the left child node and 203 as the right child node. In this case， with the height/depth of the tree 200 equal to n， the time complexity of an operation on the tree 200 is O (log n) . It is to be noted that such a fully balanced binary tree as shown in FIG. 2 is only an example mapping structure， and any other mapping structures， such as skip list， are also possible.

Referring back to FIG. 1， the storage system 1000 according to an implementation of the subject matter described herein further includes a storage manager 700 which is configured to control various operations of the storage system 1000. For example， as discussed above， the storage manager 700 can be configured to control the data blocks to be allocated in a form of a log-appended-structured data blocks， or in other words， to control the plurality of data blocks having non-fixed sizes (that is， arbitrary size) to be sequentially appended to the end of the logs. The storage manager 700 may further be configured to manage the change of the address mapping caused by the defragmentation operation.

FIG. 3 illustrates a flow chart of a process 300 performed by the storage manager 700 according to an implementation of the subject matter described herein. The process 300 generally illustrates how the storage manager 700 responds to the idle fragments.

At 320， an idle fragment in the NVM 100 as shown in FIG. 1 is detected. Again， a plurality of data blocks that are sequentially appended and have non-fixed sizes are stored in NVM 100. As discussed above， such idle fragment is caused by removal of one or more data blocks from the NVM 100. As illustrated in FIG. 1， fragments (or freed space) represented by the shaded

areas

102 and 104 are caused by the removal or discard of data blocks D₂ and D₄， respectively. As further illustrated in FIG. 1， the fragment 102 is immediately surrounded by two data blocks D₁ and D₃ in use， and similarly， the fragment 104 is surrounded by two data blocks D₃ and D₅ in use.

As discussed above， in order to reduce the external fragmentation， in other words， consolidate fragments to form a larger freed space for a future data allocation， one or more data blocks proximate to or immediately next to the fragments might need to be moved. Therefore， at 340， when the idle fragment is detected， one or more second data blocks proximate to the idle fragment is caused to be moved from a first physical address to a second physical address in the NVM 100. For example， when the

idle fragments

102 and 104 are detected， the data block D₃ surrounded by

fragments

102 and 104 might need to be moved in order to form a larger freed space for a further data allocation.

At 360， when the second data block is moved， an address mapping for the second data block to associate the second address with a logical address (that is， home address) of the second data block accessible to an application is updated. For example， if the data block D₃ is moved from its original physical address “0x1aa” to a new physical address “0x2000， ” the address mapping for D₃ to associate the logical address with the physical address will be changed/updated， accordingly.

After the change/update of the address mapping for the data block D₃， in some implementations， when an access (read/write) request to the data block D₃ is received， the data block D₃ will be accessed at its new log address “0x2000” via the updated address mapping.

However， utilizing such a single tree structure might still be costly. As known， the average cost of an operation on a tree structure is proportional to the tree height. In this case， when a single tree structure 200 as shown in FIG. 2 is utilized for a large NVM， the height of the tree can be huge， especially when the block size is small (for example， tens of bytes) while the NVM is large. Therefore， in order to further improve the mapping efficiency and reduce the cost， four tree structure optimization mechanisms are further proposed as blow.

Skip-list

In some implementations， the single tree structure 200 as shown in FIG. 2 may be replaced by a skip-list. A skip-list is a type of data structure that is based on a probabilistically balanced data structure. Contrary to the conventional balanced tree (as described with respect to FIG. 2) ， such skip-list has a “express lane” within an ordered sequence of elements and does not need a complex rebalancing operation which may be costly and largely impair parallelism or concurrent operations.

Such skip-list is crucial especially for multi-threaded scenarios. A typical balanced tree requires a readers-writer lock to protect concurrent operations. That is， a readers-writer lock allows concurrent access for read-only operations， while write operations require exclusive access. For example， a look-up that typically serves a memory load has to obtain a read lock in a shared manner， while an update that typically serves a memory store has to hold an exclusive write lock. Such lock contention， however， can deteriorate the throughput of multiple threads applications.

In contrast， by leveraging skip-lists， the locking for read-only operations is eliminated. Furthermore， an update of the skip list involves only simple insert operations on the singly linked lists. In this way， only an exclusive lock for updates is needed， but all look-up operations can execute in parallel. By avoiding such lock contention， the transaction throughput with multiple treads can be improved.

Two-layer Mapping

In some implementations， instead of using a single tree， a partitioned tree approach can be used， so that the number of objects in each partition is limited to an acceptable value. In this case， the whole home space can be divided into a plurality of sub-home spaces (hereafter also referred to as partitions or chunks) ， and each partition holds a respective tree structure for further address look-ups. Such mapping structure is also referred to as two-layer mapping structure.

FIG. 4 illustrates a schematic two-layer mapping structure 400. As shown， in the first layer， the whole address space is first divided into many static fixed-length partitions (or chunks) (collectively referred to as 410) with starting addresses of 0x0， 0x8000， 0x10000， ...， 0x28000， respectively， so that data can be first routed to a specific partition where the address locates in O (1) time which may only cost as low as one CPU cycle. As an example， the first layer can simply take the form of a hash table.

In the second layer， the address can be finally accessed via the corresponding “local” tree structure held in each partition (collectively referred to as “420” ) . Due to the fact that the whole address space has been divided into many smaller spaces， or in other words， a huge tree is now split into numerous small local trees， the average number of nodes in each local tree can be much smaller than that of a huge tree coveting the whole home space.

As an example， considering a 10 GB home space and supposing random updates are performed over the whole space in aligned 128B， this behavior results in every node mapping a 128B area. When a single fully balanced binary tree as shown in FIG. 2 is adopted， the height of such a tree containing all these mappings of the whole space is over 26. On the contrary， when the two-layer mapping as shown in FIG. 4 is adopted with the each partition 410 of the first layer equal to 32KB， the height of each local tree is only 8. In this way， the average cost can be reduced， and meanwhile the transaction throughput can be improved.

Node Cache

Alternatively， or in addition， the storage system 1000 may further include a cache (also called the node cache) configured for storing a node that is recently or frequently accessed. In this case， when an address is to be accessed， the node cache will be first searched. In some implementations， the cache node stores recently accessed home-space addresses and pointers to their nodes in the trees.

FIG. 5 shows a flow chart illustrating the process 500 of accessing a node in the tree when there is a cache equipped with the storage system 1000. As illustrated in FIG. 5， at 520， when it is determined that a first node is cached， the first node will be directly accessed via the cache. Specifically， if the first node is cached， the corresponding pointer will directly point to the first node in the tree that contains the requested address mapping. Otherwise， at 540， when it is determined that the first data block is not cached， the node will still need to be accessed via the mapping structure 200. For example， if the first node is not cached， a full tree look-up is still necessary and the resulting node is then added to the node cache. By caching recently visited nodes， many full tree lookup paths starting from the root node can be avoided.

In some implementations， once a node is cached， all addresses within its mapped area tend to be a cache hit. A plain hash table does not support such feature， as cached addresses are randomly distributed. For example， conventionally， supposing the node for a 70B area starting at 0x1000 is cached， an access to， for example， the address 0x1008 is usually not a hit， because it falls into a different bucket and loses the chance to be checked with this node. On the contrary， in our case， all addresses within a 70B scope to the same bucket will become a cache hit， so that nearby addresses can be checked with chained tree nodes that may cover them.

Group Update

Merging tree nodes is another way to further reduce the number of nodes and thereby the height of the tree. In particular， when two sibling nodes contain contiguous home addresses and are mapped to contiguous log addresses， they can be merged to one node. For example， if a node has a home address of “0x500” and block length of 70B， and another sibling node has a home address of “0x546” and block length of 30B， these two nodes can be considered as two sibling nodes containing contiguous home address (0x500+70B＝0x546) . As a result， these two sibling nodes can be subsequently merged to one node having a home address of “0x500” and block length of “100B. ”

Accordingly， in some implementations， the storage manager 700 is further configured to first buffer a plurality of data blocks. If it is detected that two or more data blocks have contiguous target home addresses as discussed above， then the two or more data blocks are grouped. Then the grouped data blocks are appended into the storage space at a time.

FIG. 6 is a flowchart of an example method 600 for writing data into NVM 100 of the storage system in accordance with an example implementation of the subject matter described herein. As illustrated in FIG. 6， at 620， a plurality of data blocks that are to be written into the non-volatile storage 100 is buffered. At 640， when at least two data blocks of the plurality of data blocks having contiguous logical addresses are detected， the at least two data blocks are grouped. At 660， the grouped at least two data blocks is appended into the non-volatile storage 100. At 680， the tree structure is updated.

In some implementations， within each NVM transaction， all writes can be buffered into NVM 100， and for those with contiguous home addresses on transaction commit， they can be combined into a group. The group of combined writes is then appended to the end of logs. In this way， the mapping structure (for example， the tree 200) only needs to be updated once， rather than multiple times. This allows for coalescing overwrites to the same address， which reduces NVM bandwidth consumption and wear as well.

Example Device

Hereinafter， an example implementation of the storage manager 700 is shown in FIG. 7. In this example， the storage manager 700 is in a form of a general-purpose computing device. Components of the storage manager 700 may include， but are not limited to， one or more processors or processing units 710， a memory 1320， one or more input devices 730， one or more output devices 740， storage 750， and one or more communication units 760. The processing unit 710 may be a real or a virtual processor and is capable of performing various processes in accordance with a program stored in the memory 720. In a multi-processing system， multiple processing units execute computer-executable instructions to increase processing power.

The storage manager 700 typically includes a variety of machine readable medium. Such medium may be any available medium that is accessible by the computing system/server 1000， including volatile and non-volatile medium， removable and non-removable medium. The memory 720 may be volatile memory (e.g.， registers， cache， a random-access memory (RAM) ) ， non-volatile memory (e.g.， a read only memory (ROM) ， an electrically erasable programmable read only memory (EEPROM) ， a flash memory) ， or some combination thereof. The storage 750 may be removable or non-removable， and may include machine readable medium such as flash drives， magnetic disks or any other medium which can be used to store information and which can be accessed within the storage manager 700.

The storage manager 700 may further include other removable/non-removable， volatile/non-volatile computing system storage medium. Although not shown in FIG. 7， a disk driver for reading from or writing to a removable， non-volatile disk (e.g.， a “floppy disk” ) ， and an optical disk driver for reading from or writing to a removable， non-volatile optical disk can be provided. The memory 720 may include at least one program product having a set (e.g.， at least one) of program modules that are configured to carry out the functions of various implementations of the subject matter described herein.

A program/utility tool 722 having a set (at least one) of the program modules 724 may be stored in， for example， the memory 720. Such program modules 724 include， but are not limited to， an operating system， one or more applications， other program modules， and program data. Each or a certain combination of these examples may include an implementation of a networking environment. The program modules 724 generally carry out the functions and/or methodologies of implementations of the subject matter described herein， for example， the method 300 and method 500.

The input unit (s) 730 may be one or more of various different input devices. For example， the input unit (s) 730 may include a user device such as a mouse， keyboard， trackball， a pointing stick， etc. The input unit (s) 730 may implement one or more natural user interface techniques， such as speech recognition or touch and stylus recognition. As other examples， the input unit (s) 730 may include a scanning device， a network adapter， or another device that provides input to the storage manager 700. The output unit (s) 740 may be a display， printer， speaker， network adapter， or another device that provides output from the storage manager 700. The input unit (s) 730 and output unit (s) 740 may be incorporated in a single system or device， such as a touch screen or a virtual reality system.

The communication unit (s) 760 enables communication over communication medium to another computing entity. Additionally， functionality of the components of the storage manager 700 may be implemented in a single computing machine or in multiple computing machines that are able to communicate over communication connections. Thus， the storage manager 700 may operate in a networked environment using logical connections to one or more other servers， network personal computers (PCs) ， or another common network node. By way of example， and not limitation， communication media include wired or wireless networking techniques.

The storage manager 700 may also communicate， as required， with one or more external devices (not shown) such as a storage device， a display device， and the like， one or more devices that enable a user to interact with the storage manager 700， and/or any device (e.g.， network card， a modem， etc. ) that enables the storage manager 700 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (s) (not shown) .

The functionally described herein can be performed， at least in part， by one or more hardware logic components. For example， and without limitation， illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs) ， Application-specific Integrated Circuits (ASICs) ， Application-specific Standard Products (ASSPs) ， System-on-a-chip systems (SOCs) ， Complex Programmable Logic Devices (CPLDs) ， and the like.

Program code for carrying out methods of the subject matter described herein may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer， special purpose computer， or other programmable data processing apparatus， such that the program codes， when executed by the processor or controller， cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine， partly on the machine， as a stand-alone software package， partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of present disclosure， a machine readable medium may be any tangible medium that may contain， or store a program for use by or in connection with an instruction execution system， apparatus， or device. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. A machine readable medium may include but not limited to an electronic， magnetic， optical， electromagnetic， infrared， or semiconductor system， apparatus， or device， or any suitable combination of the foregoing. More specific examples of the machine readable storage medium would include an electrical connection having one or more wires， a portable computer diskette， a hard disk， a random access memory (RAM) ， a read-only memory (ROM) ， an erasable programmable read-only memory (EPROM or Flash memory) ， an optical fiber， a portable compact disc read-only memory (CD-ROM) ， an optical storage device， a magnetic storage device， or any suitable combination of the foregoing.

Further， while operations are depicted in a particular order， this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order， or that all illustrated operations be performed， to achieve desirable results. In certain circumstances， multitasking and parallel processing may be advantageous. Likewise， while several specific implementation details are contained in the above discussions， these should not be construed as limitations on the scope of the subject matter described herein， but rather as descriptions of features that may be specific to particular implementations. Certain features that are described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely， various features that are described in the context of a single implementation may also be implemented in multiple implementations separately or in any suitable sub-combination.

Example Implementations

Hereinafter， some example implementations of the subject matter described herein will be listed.

In some implementations， a storage system is provided. The storage system comprises： a non-volatile storage configured to store a plurality of data blocks in a storage space； a storage manager configured to cause the plurality of data blocks to be sequentially appended and have non-fixed block sizes； and a mapping structure including a plurality of address mappings， each of the address mappings associating a logical address with a physical address for one of the data blocks， the logical address indicating an address accessible to an application， the storage manager being further configured to defragment the plurality of data blocks by moving at least part of the plurality of data blocks to merge idle fragments.

In some implementations， the defragmenting the plurality of data blocks comprises： detecting an idle fragment in the non-volatile storage， the idle fragment being caused by removal of a first data block from the non-volatile storage； in response to detecting the idle fragment， causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address； and in response to the second data block being moved， updating the address mapping for the second data block.

In some implementations， the mapping structure comprises a tree structure including of a plurality of nodes， and each of the plurality of nodes includes a logical address， length for a data block， and a physical address that the logical address is mapped to.

In some implementations， the tree structure includes a fully balanced tree.

In some implementations， the mapping structure includes a skip-list that is based on a probabilistically balanced data structure.

In some implementations， the storage space includes a plurality of subspaces， and each of the subspaces is associated with a group of logical addresses， and wherein each of the subspaces is provided with a tree structure to associate a logical address being within a respective group of logical addresses with a physical address.

In some implementations， the storage system further comprises： a cache configured to store a node that is recently accessed.

In some implementations， a storage manager is provided. The storage manager comprising： a processing unit； and a memory coupled to the processing unit and storing instructions for execution by the processing unit， the instructions， when executed by the processing unit， causing the device to perform acts comprising： detecting an idle fragment in a non-volatile storage that stores a plurality of data blocks that are sequentially appended and have non-fixed sizes in a storage space， the idle fragment being caused by removal of a first data block from the non-volatile storage； in response to detecting the idle fragment， causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address in the non-volatile storage； and in response to the second data block being moved， updating an address mapping for the second data block to associate the second address with a logical address of the second data block accessible to an application.

In some implementations， the acts further comprises： sequentially appending the plurality of data blocks having non-fixed sizes into the storage space.

In some implementations， the acts further comprises： in response to receiving an access request to the second data block， causing the second data block to be accessed via the tree structure.

In some implementations， the tree structure includes a fully balanced tree.

In some implementations， the storage space includes a plurality of subspaces， each of the subspaces being associated with a group of logical addresses， and wherein each of the subspaces is provided with a tree structure to associate a logical address being within a respective group of logical addresses with a physical address.

In some implementations， the causing the second data block to be accessed via the tree structure comprises： determining a subspace that the second data block is associated with； and causing the second data block to be accessed via the tree structure associated with the subspace.

In some implementations， the causing the second data block to be accessed via the tree structure comprises： in response to determining that a first node for the second data block is stored in a cache， causing the second data block to be accessed via the cache； and in response to determining that the first node is not stored in the cache， causing the second data block to be accessed via the tree structure.

In some implementations， the acts further comprises： buffeting a plurality of data blocks that are to be written into the non-volatile storage； in response to detecting at least two data blocks of the plurality of data blocks having contiguous logical addresses， grouping the at least two data blocks； appending the grouped at least two data blocks into the non-volatile storage； and updating the tree structure.

In some implementations， the updating the tree structure further comprises： merging at least two nodes having the contiguous logical addresses as one node.

In some implementations， a computer-implemented method is provided. The method comprises： detecting an idle fragment in a non-volatile storage that stores a plurality of data blocks that are sequentially appended and have non-fixed sizes in a storage space， the idle fragment being caused by removal of a first data block from the storage； in response to detecting the idle fragment， causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address in the non-volatile storage； and in response to the second data block being moved， updating an address mapping for the second data block to associate the second address with a logical address of the second data block accessible to an application.

In some implementations， the method further comprises： sequentially appending the plurality of data blocks having non-fixed sizes into the storage space.

Claims

A storage system comprising：

a non-volatile storage configured to store a plurality of data blocks in a storage space；

a storage manager configured to cause the plurality of data blocks to be sequentially appended and have non-fixed block sizes； and

a mapping structure including a plurality of address mappings， each of the address mappings associating a logical address with a physical address for one of the data blocks， the logical address indicating an address accessible to an application，

the storage manager being further configured to defrag ment the plurality of data blocks by moving at least part of the plurality of data blocks to merge idle fragments.
The storage system of claim 1， wherein the defragmenting the plurality of data blocks comprises：

detecting an idle fragment in the non-volatile storage， the idle fragment being caused by removal of a first data block from the non-volatile storage；

in response to detecting the idle fragment， causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address； and

in response to the second data block being moved， updating the address mapping for the second data block.
The storage system of claim 1， wherein the mapping structure comprises a tree structure including of a plurality of nodes， and each of the plurality of nodes includes a logical address， length for a data block， and a physical address that the logical address is mapped to.
The storage system of claim 3， wherein the tree structure includes a fully balanced tree.
The storage system of claim 1， wherein the mapping structure includes a skip-list that is based on a probabilistically balanced data structure.
The storage system of claim 1， wherein the storage space includes a plurality of subspaces， and each of the subspaces is associated with a group of logical addresses， and wherein each of the subspaces is provided with a tree structure to associate a logical address being within a respective group of logical addresses with a physical address.
The storage system of claim 3， further comprising：

a cache configured to store a node that is recently accessed.
A storage manager comprising：

a processing unit； and

a memory coupled to the processing unit and storing instructions for execution by the processing unit， the instructions， when executed by the processing unit， causing the device to perform acts comprising：

detecting an idle fragment in a non-volatile storage that stores a plurality of data blocks that are sequentially appended and have non-fixed sizes in a storage space， the idle fragment being caused by removal of a first data block from the non-volatile storage；

in response to detecting the idle fragment， causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address in the non-volatile storage； and

in response to the second data block being moved， updating an address mapping for the second data block to associate the second address with a logical address of the second data block accessible to an application.
The device of claim 8， wherein the acts further comprises：

sequentially appending the plurality of data blocks having non-fixed sizes into the storage space.
The device of claim 8， wherein the mapping structure comprises a tree structure including of a plurality of nodes， and each of the plurality of nodes includes a logical address， length for a data block， and a physical address that the logical address is mapped to.
The device of claim 10， wherein the acts further comprises：

in response to receiving an access request to the second data block， causing the second data block to be accessed via the tree structure.
The device of claim 11， wherein the tree structure includes a fully balanced tree.
The device of claim 8， wherein the mapping structure includes a skip-list that is based on a probabilistically balanced data structure.
The device of claim 11， wherein the storage space includes a plurality of subspaces， each of the subspaces being associated with a group of logical addresses， and wherein each of the subspaces is provided with a tree structure to associate a logical address being within a respective group of logical addresses with a physical address.
The device of claim 14， wherein the causing the second data block to be accessed via the tree structure comprises：

determining a subspace that the second data block is associated with； and

causing the second data block to be accessed via the tree structure associated with the subspace.
The device of claim 11， wherein the causing the second data block to be accessed via the tree structure comprises：

in response to determining that a first node for the second data block is stored in a cache， causing the second data block to be accessed via the cache； and

in response to determining that the first node is not stored in the cache， causing the second data block to be accessed via the tree structure.
The device of claim 10， wherein the acts further comprises：

buffering a plurality of data blocks that are to be written into the non-volatile storage；

in response to detecting at least two data blocks of the plurality of data blocks having contiguous logical addresses， grouping the at least two data blocks；

appending the grouped at least two data blocks into the non-volatile storage； and

updating the tree structure.
The device of claim 17， wherein the updating the tree structure further comprises：

merging at least two nodes having the contiguous logical addresses as one node.
A computer-implemented method comprising：

detecting an idle fragment in a non-volatile storage that stores a plurality of data blocks that are sequentially appended and have non-fixed sizes in a storage space， the idle fragment being caused by removal of a first data block from the storage；

in response to detecting the idle fragment， causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address in the non-volatile storage； and

in response to the second data block being moved， updating an address mapping for the second data block to associate the second address with a logical address of the second data block accessible to an application.
The method of claim 19， wherein the mapping structure comprises a tree structure including of a plurality of nodes， and each of the plurality of nodes includes a logical address， length for a data block， and a physical address that the logical address is mapped to.