WO2018165957A1 - Log-appended-structured storage management with byte-level accessibility - Google Patents
- Publication number
- WO2018165957A1 (PCT Application No. PCT/CN2017/076999)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- address
- data block
- data
- storage
- data blocks
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0223—User address space allocation, e.g. contiguous or non contiguous base addressing
- G06F12/023—Free address space management
- G06F12/0238—Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
- G06F12/0246—Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/72—Details relating to flash memory management
- G06F2212/7201—Logical to physical mapping or translation of blocks or pages
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/72—Details relating to flash memory management
- G06F2212/7205—Cleaning, compaction, garbage collection, erase control
Definitions
- NVM non-volatile memory
- main memory: traditionally, such data could only be stored on disks
- storage of data in NVM brings several critical problems to memory space management, one of which is memory fragmentation. That is, unused spaces cannot be allocated to other data, which causes significant waste under certain workloads. Since data will be held in NVM for a long term, the fragmentation may become especially severe.
- Implementations of the subject matter described herein provide a “log-structured” NVM and associated storage management solution with byte addressability.
- the whole NVM is organized in the form of logs, which act as a virtual home space at the same time.
- Such a log-appended-structured storage system eliminates the dichotomy of memory data between home space and log space, and unifies those two spaces, so that applications view the NVM in the same way as the traditional home space.
- the plurality of data blocks are sequentially appended and have non-fixed block sizes.
- the storage system further includes a mapping structure including a plurality of address mappings, and each of the address mappings associates a home address (accessible to an application) with a log address (a physical address of the data block) for a data block. Furthermore, the storage system includes a storage manager, which is configured to control various operations of the storage system.
- FIG. 1 illustrates a schematic of the storage system according to an implementation of the subject matter described herein;
- FIG. 2 illustrates a typical tree structure that represents the mapping table as shown in FIG. 1;
- FIG. 3 illustrates a flow chart of a process performed by the storage manager according to an implementation of the subject matter described herein;
- FIG. 4 illustrates a schematic two-layer mapping structure according to an implementation of the subject matter described herein;
- FIG. 5 shows a flow chart illustrating an example method for accessing a node in the tree when there is a cache equipped with the storage system, according to an implementation of the subject matter described herein;
- FIG. 6 shows a flow chart of an example method for writing data into NVM of the storage system according to an implementation of the subject matter described herein;
- FIG. 7 illustrates a block diagram of an example implementation of the storage manager in which one or more implementations of the subject matter described herein may be implemented.
- the term “includes” and its variants are to be read as open terms that mean “includes, but is not limited to. ”
- the term “based on” is to be read as “based at least in part on. ”
- the term “one implementation” and “an implementation” are to be read as “at least one implementation. ”
- the term “another implementation” is to be read as “at least one other implementation. ”
- the terms “first, ” “second, ” and the like may refer to different or same objects. Other definitions, explicit and implicit, may be included below. A definition of a term is consistent throughout the description unless the context clearly indicates otherwise.
- DRAM dynamic random access memory
- NVM-oriented storage system: data can be stored persistently in NVM.
- the data stored in NVM would not go away by simply powering the NVM off.
- how to deal with the issue of external fragmentation existing in NVM becomes challenging.
- fragmentation: there are two sources of fragmentation.
- One is called “internal” fragmentation.
- 64B 64 Byte
- if 65B of NVM is requested, the allocator will effectively allocate 128B.
- an “internal” fragment of 63B is created as a consequence.
- the other one is called “external” fragmentation.
- supposing that a 64B data block is freed or discarded but the resulting freed space is still surrounded by data blocks in use. In this case, the freed space cannot serve any data allocation request beyond 64B.
- such external fragmentation would become even more severe if allocated block sizes vary.
- Fragmentation can be reduced by garbage collection that is capable of changing allocated addresses by moving allocated data.
- Such a process involves time-consuming object reference analysis and process pauses.
- this technique is typically used in a managed language runtime, so no byte-addressability is provided to programs, which means a program cannot access an arbitrary address.
- a novel log-appended-structured NVM-based storage system and its associated storage management mechanism are proposed herein.
- the issue of fragmentation can be effectively avoided or significantly reduced.
- byte-level look-up granularity can be achieved, which renders byte-level addressability.
- efficient memory management could be achieved, which further avoids redundant bus traffic and write wear.
- FIG. 1 illustrates a schematic of the storage system 1000 according to an implementation of the subject matter described herein.
- the storage system 1000 includes a non-volatile storage (NVM) 100.
- the NVM 100 is configured to store a plurality of data blocks (D1, D2, ..., Dn, collectively referred to as “D”) that are sequentially appended (or logged) and have non-fixed block sizes. That is, a newly allocated data block will always be “compactly” appended to the end of the log, and each appended data block may have an arbitrary block size, even tens of bytes, per request.
- although the NVM 100 as shown in FIG. 1 includes an array of NVM chunks (represented by several stripes), in other examples the storage 100 may include only a single NVM chunk.
- a small gap may exist between any of two neighboring appended data blocks per request.
- Such a small gap (for example, 1B) could be, for example, the header of the data block, and is thus usually much shorter than the data block itself.
- Those sequentially appended data blocks as shown in FIG. 1 form a data structure which can be called a “log-appended-structure.”
- PTM persistent transactional memory
- NVM needs to hold not only the home space of data that is “viewable” by applications, but also a separate log space hidden from the application as the physical space of data.
- the sequentially organized data structure in NVM 100 as shown in FIG. 1 eliminates the “dichotomy” of NVM data introduced by the traditional memory management and PTM.
- the log-appended-structured NVM-based storage 100 according to the subject matter described herein unifies the home space and the log space.
- the whole NVM itself is physically organized in a form of logs, and at the same time the NVM also acts as the home space exposed to the application.
- a defragmentation procedure becomes necessary for an NVM-based system.
- moving allocated data and merging (or coalescing) free spaces can “absorb” external fragmentation without stopping the process.
- some data blocks in use might be moved to some other space (or addresses) .
- the address mapping is necessary for locating the physical address where the data blocks are currently stored, which makes the performance and efficiency of the address mapping crucial, especially when defragmentation is performed.
- a much faster NVM-based storage system can effectively address the dominant latency issue (such as the latency from the SSDs) in existing log-appended-structured systems.
- optimization of the mapping structure can now be expected to further improve the performance of the address mapping.
- the storage system 1000 further includes a mapping structure 200.
- the mapping structure 200 as shown includes a plurality of address mappings, and each of the address mappings associates a home address 202 with a log address 204 for one data block D.
- the home address indicates an address that is accessible to an application
- the log address indicates a physical address of the data block stored in the NVM 100.
- home address “0x789” is mapped to the log address “0x1aa” for the data block D3 stored in the NVM 100, where “0x” indicates hexadecimal.
- the correspondence between data block D3 and the home address “0x789” (and its log address “0x1aa”) is indicated by the arrow 150.
- the home addresses or log addresses all indicate the starting addresses unless the context clearly indicates otherwise.
- the third column of the mapping table 200 indicates the length 206 for each allocated data block. The starting address and the length together define an area (or address scope) in the home space and in the log space.
- mapping structure 200 as shown in FIG. 1 is in the form of a mapping table (for example, a hash table) , this is merely for ease of discussion without suggesting any limitations as to the scope of the subject matter described herein.
- the mapping structure 200 may take other forms, such as a fully balanced binary tree structure or a skip-list data structure, which will be further described later.
- log-appended-structured data blocks stored in NVM 100 allow for arbitrary block sizes
- log-appended-structured NVM 100 containing byte-level data blocks “intrinsically” enables the mapping structure 200 as shown in FIG. 1 to support byte-level accessibility/addressability, or in other words, to have byte-level look-up granularity.
- log-appended-structured system targets an arbitrary starting address with any length.
- log-appended-structured data structure with byte-level addressability may face a unique challenge of address mapping overhead.
- other log-appended-structured systems having basic data elements with fixed (or predefined) block sizes can simply use, for example, a hash table or a bloom filter for the mapping, and thus may not face the challenge described above.
- This address mapping overhead may become even more severe when data blocks are moved frequently in order to perform the defragmentation (that is, the consolidation of fragments). This is because, during the defragmentation procedure, one or more data blocks proximate to or immediately surrounding a fragment will be moved, which in turn leads to a change of the mapping associating the home address with the log address, that is, mapping the same home address to an updated/new log address.
- the mapping structure 200 can be organized as a tree structure consisting of a plurality of chained nodes covering the whole address space of the NVM 100.
- Each node includes a home address, a length for a data block, and the log address that the home address is mapped to.
- one node in the tree holds a pair ⁇ home address, length ⁇ denoting an area in the home space, and the log address that the area is mapped to.
- FIG. 2 illustrates a typical tree structure 200 that represents the mapping table 200 as shown in FIG. 1.
- node 201 with the home address “0x456” is selected as the root node of the tree structure, with node 202 with the home address “0x123” and node 203 with the home address “0x789” immediately downstream of the root node as intermediate nodes.
- the root node 201 includes the home address “0x456,” the length “140B” for the data block D2, and the log address “0x11e;” node 202 includes the home address “0x123,” the length “30B” for data block D1, and the log address “0x100;” and node 203 includes the home address “0x789,” the length “70B” for data block D3, and the log address “0x1aa.”
- the log addresses for those data blocks are contiguous.
- the (starting) log address and the length for a data block together determine the (starting) log address for the sequentially appended data block.
- the tree structure 200 is based on a fully balanced tree.
- the fully balanced tree could be a binary tree where each node has at most two children nodes, which sometimes are referred to as the left child node and the right child node.
- node 201 has only two children nodes with node 202 as the left child node and 203 as the right child node.
- the time complexity of an operation on the tree 200 is O (log n) .
- the storage system 1000 further includes a storage manager 700 which is configured to control various operations of the storage system 1000.
- the storage manager 700 can be configured to control the data blocks to be allocated in the form of log-appended-structured data blocks, or in other words, to control the plurality of data blocks having non-fixed (that is, arbitrary) sizes to be sequentially appended to the end of the logs.
- the storage manager 700 may further be configured to manage the change of the address mapping caused by the defragmentation operation.
- FIG. 3 illustrates a flow chart of a process 300 performed by the storage manager 700 according to an implementation of the subject matter described herein.
- the process 300 generally illustrates how the storage manager 700 responds to the idle fragments.
- an idle fragment in the NVM 100 as shown in FIG. 1 is detected.
- a plurality of data blocks that are sequentially appended and have non-fixed sizes are stored in NVM 100.
- such idle fragment is caused by removal of one or more data blocks from the NVM 100.
- fragments (or freed space) represented by the shaded areas 102 and 104 are caused by the removal or discard of data blocks D2 and D4, respectively.
- the fragment 102 is immediately surrounded by two data blocks D1 and D3 in use, and similarly, the fragment 104 is surrounded by two data blocks D3 and D5 in use.
- one or more data blocks proximate to or immediately next to the fragments might need to be moved. Therefore, at 340, when the idle fragment is detected, one or more second data blocks proximate to the idle fragment are caused to be moved from a first physical address to a second physical address in the NVM 100. For example, when the idle fragments 102 and 104 are detected, the data block D3 surrounded by fragments 102 and 104 might need to be moved in order to form a larger freed space for a further data allocation.
- an address mapping for the second data block to associate the second physical address with a logical address (that is, home address) of the second data block accessible to an application is updated. For example, if the data block D3 is moved from its original physical address “0x1aa” to a new physical address “0x2000,” the address mapping for D3 to associate the logical address with the physical address will be changed/updated accordingly.
- After the change/update of the address mapping for the data block D3, in some implementations, when an access (read/write) request to the data block D3 is received, the data block D3 will be accessed at its new log address “0x2000” via the updated address mapping.
- the average cost of an operation on a tree structure is proportional to the tree height.
- the height of the tree can be huge, especially when the block size is small (for example, tens of bytes) while the NVM is large. Therefore, in order to further improve the mapping efficiency and reduce the cost, four tree structure optimization mechanisms are further proposed below.
- the single tree structure 200 as shown in FIG. 2 may be replaced by a skip-list.
- a skip-list is a type of data structure that is based on a probabilistically balanced data structure. In contrast to the conventional balanced tree (as described with respect to FIG. 2), such a skip-list has an “express lane” within an ordered sequence of elements and does not need a complex rebalancing operation, which may be costly and largely impair parallelism or concurrent operations.
- a typical balanced tree requires a readers-writer lock to protect concurrent operations. That is, a readers-writer lock allows concurrent access for read-only operations, while write operations require exclusive access. For example, a look-up that typically serves a memory load has to obtain a read lock in a shared manner, while an update that typically serves a memory store has to hold an exclusive write lock. Such lock contention, however, can deteriorate the throughput of multi-threaded applications.
- a partitioned tree approach can be used, so that the number of objects in each partition is limited to an acceptable value.
- the whole home space can be divided into a plurality of sub-home spaces (hereafter also referred to as partitions or chunks) , and each partition holds a respective tree structure for further address look-ups.
- Such mapping structure is also referred to as two-layer mapping structure.
- FIG. 4 illustrates a schematic two-layer mapping structure 400.
- the whole address space is first divided into many static fixed-length partitions (or chunks) (collectively referred to as 410) with starting addresses of 0x0, 0x8000, 0x10000, ..., 0x28000, respectively, so that data can be first routed to a specific partition where the address locates in O (1) time which may only cost as low as one CPU cycle.
- the first layer can simply take the form of a hash table.
- the address can be finally accessed via the corresponding “local” tree structure held in each partition (collectively referred to as “420”). Due to the fact that the whole address space has been divided into many smaller spaces, or in other words, a huge tree is now split into numerous small local trees, the average number of nodes in each local tree can be much smaller than that of a huge tree covering the whole home space.
- the storage system 1000 may further include a cache (also called the node cache) configured for storing a node that is recently or frequently accessed. In this case, when an address is to be accessed, the node cache will be searched first. In some implementations, the node cache stores recently accessed home-space addresses and pointers to their nodes in the trees.
- FIG. 5 shows a flow chart illustrating the process 500 of accessing a node in the tree when there is a cache equipped with the storage system 1000.
- the first node will be directly accessed via the cache. Specifically, if the first node is cached, the corresponding pointer will directly point to the first node in the tree that contains the requested address mapping. Otherwise, at 540, when it is determined that the first node is not cached, the node will still need to be accessed via the mapping structure 200. For example, if the first node is not cached, a full tree look-up is still necessary and the resulting node is then added to the node cache. By caching recently visited nodes, many full tree look-up paths starting from the root node can be avoided.
- all addresses within its mapped area tend to be a cache hit.
- a plain hash table does not support such a feature, as cached addresses are randomly distributed. For example, conventionally, supposing the node for a 70B area starting at 0x1000 is cached, an access to, for example, the address 0x1008 is usually not a hit, because it falls into a different bucket and loses the chance to be checked against this node. On the contrary, in our case, all addresses within the 70B scope that map to the same bucket will become a cache hit, so that nearby addresses can be checked against the chained tree nodes that may cover them.
- Merging tree nodes is another way to further reduce the number of nodes and thereby the height of the tree.
- the storage manager 700 is further configured to first buffer a plurality of data blocks. If it is detected that two or more data blocks have contiguous target home addresses as discussed above, then the two or more data blocks are grouped. Then the grouped data blocks are appended into the storage space at a time.
- FIG. 6 is a flowchart of an example method 600 for writing data into NVM 100 of the storage system in accordance with an example implementation of the subject matter described herein.
- a plurality of data blocks that are to be written into the non-volatile storage 100 is buffered.
- the at least two data blocks are grouped.
- the grouped at least two data blocks are appended into the non-volatile storage 100.
- the tree structure is updated.
- all writes can be buffered into NVM 100, and for those with contiguous home addresses on transaction commit, they can be combined into a group. The group of combined writes is then appended to the end of logs.
- the mapping structure (for example, the tree 200)
- the storage manager 700 is in a form of a general-purpose computing device.
- Components of the storage manager 700 may include, but are not limited to, one or more processors or processing units 710, a memory 720, one or more input devices 730, one or more output devices 740, storage 750, and one or more communication units 760.
- the processing unit 710 may be a real or a virtual processor and is capable of performing various processes in accordance with a program stored in the memory 720. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power.
- the storage manager 700 typically includes a variety of machine readable media. Such media may be any available media that are accessible by the storage manager 700, including volatile and non-volatile media, removable and non-removable media.
- the memory 720 may be volatile memory (e.g., registers, cache, a random-access memory (RAM) ) , non-volatile memory (e.g., a read only memory (ROM) , an electrically erasable programmable read only memory (EEPROM) , a flash memory) , or some combination thereof.
- the storage 750 may be removable or non-removable, and may include machine readable medium such as flash drives, magnetic disks or any other medium which can be used to store information and which can be accessed within the storage manager 700.
- the storage manager 700 may further include other removable/non-removable, volatile/non-volatile computing system storage medium.
- a disk drive for reading from or writing to a removable, non-volatile disk (e.g., a “floppy disk”) can be provided.
- an optical disk drive for reading from or writing to a removable, non-volatile optical disk can be provided.
- the memory 720 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of various implementations of the subject matter described herein.
- a program/utility tool 722 having a set (at least one) of the program modules 724 may be stored in, for example, the memory 720.
- Such program modules 724 include, but are not limited to, an operating system, one or more applications, other program modules, and program data. Each or a certain combination of these examples may include an implementation of a networking environment.
- the program modules 724 generally carry out the functions and/or methodologies of implementations of the subject matter described herein, for example, the method 300 and method 500.
- the functionality described herein can be performed, at least in part, by one or more hardware logic components.
- illustrative types of hardware logic components include Field-Programmable Gate Arrays (FPGAs) , Application-specific Integrated Circuits (ASICs) , Application-specific Standard Products (ASSPs) , System-on-a-chip systems (SOCs) , Complex Programmable Logic Devices (CPLDs) , and the like.
- Program code for carrying out methods of the subject matter described herein may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
- the program code may execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
- a machine readable medium may be any tangible medium that may contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- the machine readable medium may be a machine readable signal medium or a machine readable storage medium.
- a machine readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
- More specific examples of the machine readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
- a storage system comprises: a non-volatile storage configured to store a plurality of data blocks in a storage space; a storage manager configured to cause the plurality of data blocks to be sequentially appended and have non-fixed block sizes; and a mapping structure including a plurality of address mappings, each of the address mappings associating a logical address with a physical address for one of the data blocks, the logical address indicating an address accessible to an application, the storage manager being further configured to defragment the plurality of data blocks by moving at least part of the plurality of data blocks to merge idle fragments.
- the defragmenting the plurality of data blocks comprises: detecting an idle fragment in the non-volatile storage, the idle fragment being caused by removal of a first data block from the non-volatile storage; in response to detecting the idle fragment, causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address; and in response to the second data block being moved, updating the address mapping for the second data block.
- the mapping structure comprises a tree structure including a plurality of nodes, and each of the plurality of nodes includes a logical address, a length for a data block, and a physical address that the logical address is mapped to.
- the tree structure includes a fully balanced tree.
- the mapping structure includes a skip-list that is based on a probabilistically balanced data structure.
- the storage space includes a plurality of subspaces, and each of the subspaces is associated with a group of logical addresses, and wherein each of the subspaces is provided with a tree structure to associate a logical address being within a respective group of logical addresses with a physical address.
- the storage system further comprises: a cache configured to store a node that is recently accessed.
- a storage manager comprising: a processing unit; and a memory coupled to the processing unit and storing instructions for execution by the processing unit, the instructions, when executed by the processing unit, causing the device to perform acts comprising: detecting an idle fragment in a non-volatile storage that stores a plurality of data blocks that are sequentially appended and have non-fixed sizes in a storage space, the idle fragment being caused by removal of a first data block from the non-volatile storage; in response to detecting the idle fragment, causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address in the non-volatile storage; and in response to the second data block being moved, updating an address mapping for the second data block to associate the second physical address with a logical address of the second data block accessible to an application.
- the acts further comprise: sequentially appending the plurality of data blocks having non-fixed sizes into the storage space.
- the mapping structure comprises a tree structure including a plurality of nodes, and each of the plurality of nodes includes a logical address, a length for a data block, and a physical address that the logical address is mapped to.
- the acts further comprise: in response to receiving an access request to the second data block, causing the second data block to be accessed via the tree structure.
- the tree structure includes a fully balanced tree.
- the mapping structure includes a skip-list that is based on a probabilistically balanced data structure.
- the storage space includes a plurality of subspaces, each of the subspaces being associated with a group of logical addresses, and wherein each of the subspaces is provided with a tree structure to associate a logical address being within a respective group of logical addresses with a physical address.
- the causing the second data block to be accessed via the tree structure comprises: determining a subspace that the second data block is associated with; and causing the second data block to be accessed via the tree structure associated with the subspace.
- the causing the second data block to be accessed via the tree structure comprises: in response to determining that a first node for the second data block is stored in a cache, causing the second data block to be accessed via the cache; and in response to determining that the first node is not stored in the cache, causing the second data block to be accessed via the tree structure.
- the acts further comprise: buffering a plurality of data blocks that are to be written into the non-volatile storage; in response to detecting at least two data blocks of the plurality of data blocks having contiguous logical addresses, grouping the at least two data blocks; appending the grouped at least two data blocks into the non-volatile storage; and updating the tree structure.
- the updating the tree structure further comprises: merging at least two nodes having the contiguous logical addresses as one node.
- a computer-implemented method comprises: detecting an idle fragment in a non-volatile storage that stores a plurality of data blocks that are sequentially appended and have non-fixed sizes in a storage space, the idle fragment being caused by removal of a first data block from the storage; in response to detecting the idle fragment, causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address in the non-volatile storage; and in response to the second data block being moved, updating an address mapping for the second data block to associate the second physical address with a logical address of the second data block accessible to an application.
- the method further comprises: sequentially appending the plurality of data blocks having non-fixed sizes into the storage space.
- the mapping structure comprises a tree structure including a plurality of nodes, and each of the plurality of nodes includes a logical address, a length for a data block, and a physical address that the logical address is mapped to.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Implementations of the subject matter described herein provide a non-volatile memory (NVM) based storage management solution. In this solution, a log-appended-structured NVM based storage system is provided. In such a log-appended-structured storage system, the plurality of data blocks are sequentially appended and have non-fixed block sizes. The storage system further includes a mapping structure including a plurality of address mappings, and each of the address mappings associates a home address (accessible to an application) with a log address (a physical address of the data block) for a data block. Furthermore, the storage system includes a storage manager, which is configured to control various operations of the storage system. By utilizing such a log-appended-structured NVM with its associated address mapping mechanism, the issue of fragmentation can be significantly reduced with high efficiency and low cost. Meanwhile, with the aid of the mapping structure, byte-level look-up granularity can be achieved, which renders byte-level addressability.
Description
Emerging non-volatile memory (NVM) unlocks the performance potential of applications by enabling long-term storage of data in the main memory (traditionally, such data could only be stored on disks). However, storage of data in NVM brings several critical problems to memory space management, one of which is memory fragmentation. That is, unused spaces cannot be allocated to other data, which causes significant waste under certain workloads. Since data will be held in NVM for a long term, the fragmentation may become especially severe.
SUMMARY
Implementations of the subject matter described herein provide a “log-structured” NVM and associated storage management solution with byte addressability. Generally speaking, the whole NVM is organized in the form of logs, which act as a virtual home space at the same time. Such a log-appended-structured storage system eliminates the dichotomy of memory data between home space and log space, and unifies those two spaces, so that applications view the NVM in the same way as the traditional home space. In such a log-appended-structured NVM, the plurality of data blocks are sequentially appended and have non-fixed block sizes. The storage system further includes a mapping structure including a plurality of address mappings, and each of the address mappings associates a home address (accessible to an application) with a log address (a physical address of the data block) for a data block. Furthermore, the storage system includes a storage manager, which is configured to control various operations of the storage system.
It is to be understood that the Summary is not intended to identify key or essential features of implementations of the subject matter described herein, nor is it intended to be used to limit the scope of the subject matter described herein. Other features of the subject matter described herein will become easily comprehensible through the description below.
The above and other objectives, features and advantages of the subject matter described herein will become more apparent through more detailed depiction of example
implementations of the subject matter described herein in conjunction with the accompanying drawings, wherein in the example implementations of the subject matter described herein, same reference numerals usually represent same components.
FIG. 1 illustrates a schematic of the storage system according to an implementation of the subject matter described herein;
FIG. 2 illustrates a typical tree structure that represents the mapping table as shown in FIG. 1;
FIG. 3 illustrates a flow chart of a process performed by the storage manager according to an implementation of the subject matter described herein;
FIG. 4 illustrates a schematic two-layer mapping structure according to an implementation of the subject matter described herein;
FIG. 5 shows a flow chart illustrating an example method for accessing a node in the tree when there is a cache equipped with the storage system, according to an implementation of the subject matter described herein;
FIG. 6 shows a flow chart of an example method for writing data into NVM of the storage system according to an implementation of the subject matter described herein; and
FIG. 7 illustrates a block diagram of an example implementation of the storage manager in which one or more implementations of the subject matter described herein may be implemented.
The subject matter described herein will now be discussed with reference to several example implementations. It should be understood these implementations are discussed only for the purpose of enabling those skilled persons in the art to better understand and thus implement the subject matter described herein, rather than suggesting any limitations on the scope of the subject matter.
As used herein, the term “includes” and its variants are to be read as open terms that mean “includes, but is not limited to. ” The term “based on” is to be read as “based at least in part on. ” The term “one implementation” and “an implementation” are to be read as “at least one implementation. ” The term “another implementation” is to be read as “at least one other implementation. ” The terms “first, ” “second, ” and the like may refer to
different or same objects. Other definitions, explicit and implicit, may be included below. A definition of a term is consistent throughout the description unless the context clearly indicates otherwise.
Most current storage management systems adopt the legacy memory management mechanism that is designed for dynamic random access memory (DRAM). Typically, such DRAM-oriented storage management solutions divide the whole physical memory space into blocks of different pre-defined sizes, and a minimal-sized block is chosen to serve an allocation request. This method brings excessive memory fragments. However, in conventional DRAM-oriented systems, data is not intended to be persistently stored in a DRAM, so simply powering the DRAM off may easily resolve the fragmentation issue existing in DRAM.
Now, with the use of an NVM-oriented storage system, data can be stored persistently in NVM. In other words, the data stored in NVM would not go away by simply powering the NVM off. In this case, how to deal with the issue of external fragmentation existing in NVM becomes challenging.
In general, there are two sources of fragmentation. One is called “internal” fragmentation. For example, suppose that an allocator aligns any NVM allocation to a fixed 64 Bytes (64B). If, for example, 65B of NVM is requested, the allocator will effectively allocate 128B. In this case, an “internal” fragmentation of 63B is created as a consequence. The other one is called “external” fragmentation. Again, suppose that a 64B data block is freed or discarded but the resulting freed space is still surrounded by data blocks in use. In this case, the freed space cannot serve any data allocation request beyond 64B. As discussed above, such external fragmentation would become even more severe if allocated block sizes vary.
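To make the arithmetic above concrete, the following C sketch computes the allocation a fixed-64B-aligned allocator would make for the 65B request and the internal fragment that results; the ALIGN_UNIT constant and the helper name are illustrative only.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed allocation granularity used only for illustration. */
#define ALIGN_UNIT 64u

/* Round a request up to the allocator's alignment unit. */
static uint64_t aligned_size(uint64_t requested)
{
    return (requested + ALIGN_UNIT - 1) / ALIGN_UNIT * ALIGN_UNIT;
}

int main(void)
{
    uint64_t requested = 65;                       /* the 65B request from the text */
    uint64_t allocated = aligned_size(requested);  /* effectively 128B              */

    printf("requested=%llu allocated=%llu internal fragment=%llu\n",
           (unsigned long long)requested,
           (unsigned long long)allocated,
           (unsigned long long)(allocated - requested));   /* 63B of internal waste */
    return 0;
}
```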
Fragmentation can be reduced by garbage collection that is capable of changing allocated addresses by moving allocated data. However, such a process involves time-consuming object reference analysis and process pauses. Further, this technique is typically used in a managed language runtime, so no byte-addressability is provided to programs, which means a program cannot access an arbitrary address.
In order to at least partially solve the above and other potential problems, a novel log-appended-structured NVM-based storage system and its associated storage management mechanism are proposed herein. By utilizing such log-appended-structured
NVM and its associated address mappings, the issue of fragmentation can be effectively avoided or significantly reduced. Meanwhile, byte-level look-up granularity can be achieved, which renders byte-level addressability. In addition, efficient memory management could be achieved, which further avoids redundant bus traffic and write wear.
FIG. 1 illustrates a schematic of the storage system 1000 according to an implementation of the subject matter described herein. As illustrated in FIG. 1, the storage system 1000 includes a non-volatile storage (NVM) 100. As shown, the NVM 100 is configured to store a plurality of data blocks (D1, D2, ..., Dn, collectively referred to as “D”) that are sequentially appended (or logged) and have non-fixed block sizes. That is, a newly allocated data block will always be “compactly” appended to the end of the log, and each appended data block may have an arbitrary block size, even tens of bytes, per request.
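As a rough illustration of this compact append behavior, the C sketch below appends arbitrarily sized blocks to the tail of a flat NVM-backed region; the structure and function names (log_region, log_append) are hypothetical, and durability details such as cache-line flushes are deliberately omitted.

```c
#include <stdint.h>
#include <string.h>

/* Minimal append-only allocator over a flat NVM region (illustrative only). */
struct log_region {
    uint8_t  *base;      /* start of the NVM-backed region        */
    uint64_t  capacity;  /* total bytes available                 */
    uint64_t  tail;      /* offset of the current end of the log  */
};

/* Append a data block of arbitrary size to the end of the log and return its
 * log (physical) offset, or UINT64_MAX if the region is full. */
static uint64_t log_append(struct log_region *r, const void *data, uint64_t len)
{
    if (r->tail + len > r->capacity)
        return UINT64_MAX;                 /* no free space left at the tail  */
    uint64_t log_off = r->tail;
    memcpy(r->base + log_off, data, len);  /* blocks are packed back to back  */
    r->tail += len;                        /* the next block starts right after */
    return log_off;
}
```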
It is to be noted that although the NVM 100 as shown in FIG. 1 includes an array of NVM chunks (represented by several stripes) , in other examples, the storage 100 may only include a single NVM chunk.
It is also to be noted that the plurality of data blocks (D1, D2, ..., Dn) as shown in FIG. 1 are sequentially appended (starting from the first NVM chunk) in the order in which those data blocks are stored into the NVM. However, a plurality of data blocks that are sequentially appended in each chunk without following the order of storing should also be covered herein.
It is further to be noted that the term “compactly” herein does not have to be understood as “immediately.” That is, a small gap may exist between any two neighboring appended data blocks per request. Such a small gap (for example, 1B) could be, for example, the header of the data block, and is thus usually much shorter than the data block itself.
Those sequentially appended data blocks as shown in FIG. 1 form a data structure which can be called a “log-appended-structure.” Traditionally, persistent transactional memory (PTM) always builds a logging system atop NVM in order to maintain crash consistency of data. In this case, NVM needs to hold not only the home space of data that is “viewable” by applications, but also a separate log space hidden from the application as the physical space of data.
On the contrary, in accordance with implementations of the subject matter described herein, the sequentially organized data structure in NVM 100 as shown in FIG. 1 eliminates the “dichotomy” of NVM data introduced by the traditional memory management and PTM. In other words, the log-appended-structured NVM-based storage 100 according to the subject matter described herein unifies the home space and the log space. As illustrated in FIG. 1, the whole NVM itself is physically organized in a form of logs, and at the same time the NVM also acts as the home space exposed to the application.
In light of the inventor’s discovery and observations, conventional NVM containing a home space and a separate log space, as discussed above, cannot perform the defragmentation. On the contrary, the physically log-appended-structured NVM according to various implementations of the subject matter described herein can effectively solve the “internal” fragmentation by nature, due to the fact that a data allocation now becomes a compact append to the end of a log. Furthermore, the “external” fragmentation in the log-appended-structured NVM would probably still need to be solved. This is because some of the previously appended data blocks might not be needed anymore and thus will be freed or discarded, as may occur, for example, in the conventional DRAM-oriented system.
Therefore, a defragmentation procedure becomes necessary for an NVM-based system. During the defragmentation procedure, moving allocated data and merging (or coalescing) free spaces can “absorb” external fragmentation without stopping the process. As a result, some data blocks in use might be moved to some other space (or addresses). In this case, when any of those moved data blocks needs to be accessed, the address mapping is necessary for locating the physical address where the data blocks are currently stored, which makes the performance and efficiency of the address mapping crucial, especially when the defragmentation is performed.
Compared with conventional solid-state disks (SSDs), a much faster NVM-based storage system can effectively address the dominant latency issue (such as the latency from the SSDs) in existing log-appended-structured systems. Hence, optimization of the mapping structure can now be expected to further improve the performance of the address mapping.
Still with reference to FIG. 1, the storage system 1000 further includes a mapping structure 200. The mapping structure 200 as shown includes a plurality of address mappings, and each of the address mappings associates a home address 202 with a log address 204 for one data block D. Herein, the home address indicates an address that is accessible to an application, and the log address indicates a physical address of the data block stored in the NVM 100. As an example, home address “0x789” is mapped to the log address “0x1aa” for the data block D3 stored in the NVM 100, where “0x” indicates hexadecimal. The correspondence between data block D3 and the home address “0x789” (and its log address “0x1aa”) is indicated by the arrow 150.
It is to be noted that in this and other examples throughout the context of the present disclosure, the home addresses or log addresses all indicate the starting addresses unless the context clearly indicates otherwise. As further illustrated in FIG. 1, the third column of the mapping table 200 indicates the length 206 for each allocated data block. The starting address and the length together define an area (or address scope) in the home space and in the log space.
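A minimal sketch of one such address mapping and of byte-level translation within its area is shown below, assuming the {home address, log address, length} record described above; the struct and function names are illustrative. For the data block D3 (home 0x789, log 0x1aa, length 70B), translating home address 0x791 would yield log address 0x1b2.

```c
#include <stdbool.h>
#include <stdint.h>

/* One address mapping of FIG. 1: a home-space area [home_addr, home_addr + length)
 * mapped onto a log (physical) area starting at log_addr. Names are illustrative. */
struct addr_mapping {
    uint64_t home_addr;  /* starting address visible to the application  */
    uint64_t log_addr;   /* starting physical address inside the NVM log */
    uint64_t length;     /* size of the allocated data block in bytes    */
};

/* Byte-level translation: any address inside the mapped area resolves to the
 * corresponding byte in the log, not only the starting address. */
static bool translate(const struct addr_mapping *m, uint64_t home, uint64_t *log_out)
{
    if (home < m->home_addr || home >= m->home_addr + m->length)
        return false;                              /* outside this mapped area */
    *log_out = m->log_addr + (home - m->home_addr);
    return true;
}
```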
It is also to be understood that although the mapping structure 200 as shown in FIG. 1 is in the form of a mapping table (for example, a hash table) , this is merely for ease of discussion without suggesting any limitations as to the scope of the subject matter described herein. In some other implementations, the mapping structure 200 may take other forms, such as a fully balanced binary tree structure or a skip-list data structure, which will be further described later.
As discussed above, since the log-appended-structured data blocks stored in NVM 100 allow for arbitrary block sizes, such a log-appended-structured NVM 100 containing byte-level data blocks “intrinsically” enables the mapping structure 200 as shown in FIG. 1 to support byte-level accessibility/addressability, or in other words, to have byte-level look-up granularity.
However, such a log-appended-structured system, on the other hand, targets an arbitrary starting address with any length. In some cases, such a log-appended-structured data structure with byte-level addressability may face a unique challenge of address mapping overhead. In light of the inventor’s discovery and observations, other log-appended-structured systems having basic data elements with fixed (or predefined) block sizes can simply use, for example, a hash table or a bloom filter for the mapping, and thus may not face the challenge described above.
This address mapping overhead may become even more severe when data blocks are moved frequently in order to perform the defragmentation (that is, the consolidation of fragments). This is because, during the procedure of defragmentation, one or more data blocks proximate to or immediately surrounding a fragment will be moved, which in turn leads to a change of the mapping associating the home address with the log address, that is, mapping the same home address to an updated/new log address.
It is to be noted that in a conventional hash table, the (starting) home address that is to be accessed by the application is sometimes different from the created (starting) home address. In this case, if the granularity (or length) of the home address is also unknown, a failed access operation results, because the application cannot even access a home address that is closest to the to-be-accessed home address. It might be possible to predefine a certain granularity of the created home address, for example, to make the granularity of the home addresses small enough, such as 8B, so that the application may quickly locate the to-be-accessed home address, or at least the home address that is located closest to the home address to be accessed. However, such a hash table with a small or ultra-small block size would be oversized for use in a large NVM. Therefore, in order to achieve efficient and low-cost mappings associated with the log-appended-structured NVM described above, in some implementations, the mapping structure 200 can be organized as a tree structure consisting of a plurality of chained nodes covering the whole address space of the NVM 100. Each node includes a home address, a length for a data block, and the log address that the home address is mapped to. For example, logically, one node in the tree holds a pair {home address, length} denoting an area in the home space, and the log address that the area is mapped to.
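A simplified sketch of such a node and of a range-aware look-up (a floor search on the starting home address followed by a length check) is given below; balancing metadata is omitted and all names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Tree node holding the {home address, length} pair and the mapped log
 * address (illustrative layout; rebalancing fields are omitted). */
struct map_node {
    uint64_t home_addr;
    uint64_t length;
    uint64_t log_addr;
    struct map_node *left, *right;   /* ordered by home_addr */
};

/* Find the node whose area [home_addr, home_addr + length) covers the
 * requested home-space address; returns NULL if the address is unmapped. */
static struct map_node *lookup(struct map_node *root, uint64_t home)
{
    struct map_node *candidate = NULL;
    while (root != NULL) {
        if (home < root->home_addr) {
            root = root->left;                 /* target starts before this node */
        } else {
            candidate = root;                  /* best "floor" node seen so far  */
            root = root->right;
        }
    }
    if (candidate && home < candidate->home_addr + candidate->length)
        return candidate;                      /* covered by the candidate's area */
    return NULL;
}
```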
FIG. 2 illustrates a typical tree structure 200 that represents the mapping table 200 as shown in FIG. 1. As shown, node 201 with the home address “0x456” is selected as the root node of the tree structure, with node 202 with the home address “0x123” and node 203 with the home address “0x789” immediately downstream of the root node as intermediate nodes. For ease of illustration, only two levels of the tree structure 200 are shown in FIG. 2, and further downstream nodes (or lower levels of nodes, or children nodes) are omitted from FIG. 2.
As an example, the root node 201 includes the home address “0x456, ” the length “140B” for the data block D2, and the log address “0x11e, ” and node 202 includes the home address “0x123, ” the length “30B” for data block D1, and the log address “0x100, ” and node
203 includes information about the home address “0x789, ” the length “70B” for data block D3, and the log address “0x1aa. ”
It can be seen from FIGs. 1 and 2 that, due to the fact that data blocks stored in the log-appended-structured NVM-based storage system are sequentially appended, the log addresses for those data blocks are contiguous. In other words, the (starting) log address and the length for a data block together determine the (starting) log address for the sequentially appended data block. For example, node 202 with the log address “0x100” and the length “30B” for the data block D1 determines the log address “0x11e” for data block D2 that is sequentially appended to D1 (0x100 + 30B = 0x11e). Similarly, node 201 with the log address “0x11e” and the length “140B” for data block D2 determines the log address “0x1aa” for the data block D3 that is sequentially appended to D2 (0x11e + 140B = 0x1aa).
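The derivation of the next starting log address can be expressed directly; the tiny C check below reproduces the two additions from FIG. 2 (the helper name is illustrative).

```c
#include <assert.h>
#include <stdint.h>

/* Because appended blocks are contiguous in the log, the next starting log
 * address follows from the previous node; values are taken from FIG. 2. */
static uint64_t next_log_addr(uint64_t log_addr, uint64_t length)
{
    return log_addr + length;
}

int main(void)
{
    assert(next_log_addr(0x100, 30)  == 0x11e);   /* D1 (30B)  -> D2 */
    assert(next_log_addr(0x11e, 140) == 0x1aa);   /* D2 (140B) -> D3 */
    return 0;
}
```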
In some implementations, the tree structure 200 is based on a fully balanced tree. For example, as shown in FIG. 2, the fully balanced tree could be a binary tree where each node has at most two child nodes, which are sometimes referred to as the left child node and the right child node. In the example shown in FIG. 2, node 201 has two child nodes, with node 202 as the left child node and node 203 as the right child node. In this case, with n nodes in the tree 200, the time complexity of an operation on the tree 200 is O(log n). It is to be noted that such a fully balanced binary tree as shown in FIG. 2 is only an example mapping structure, and any other mapping structures, such as a skip list, are also possible.
Referring back to FIG. 1, the storage system 1000 according to an implementation of the subject matter described herein further includes a storage manager 700 which is configured to control various operations of the storage system 1000. For example, as discussed above, the storage manager 700 can be configured to control the data blocks to be allocated in the form of log-appended-structured data blocks, or in other words, to control the plurality of data blocks having non-fixed (that is, arbitrary) sizes to be sequentially appended to the end of the logs. The storage manager 700 may further be configured to manage the change of the address mapping caused by the defragmentation operation.
FIG. 3 illustrates a flow chart of a process 300 performed by the storage manager 700 according to an implementation of the subject matter described herein. The process 300 generally illustrates how the storage manager 700 responds to the idle fragments.
At 320, an idle fragment in the NVM 100 as shown in FIG. 1 is detected. Again,
a plurality of data blocks that are sequentially appended and have non-fixed sizes are stored in NVM 100. As discussed above, such idle fragment is caused by removal of one or more data blocks from the NVM 100. As illustrated in FIG. 1, fragments (or freed space) represented by the shaded areas 102 and 104 are caused by the removal or discard of data blocks D2 and D4, respectively. As further illustrated in FIG. 1, the fragment 102 is immediately surrounded by two data blocks D1 and D3 in use, and similarly, the fragment 104 is surrounded by two data blocks D3 and D5 in use.
As discussed above, in order to reduce the external fragmentation, in other words, to consolidate fragments to form a larger freed space for a future data allocation, one or more data blocks proximate to or immediately next to the fragments might need to be moved. Therefore, at 340, when the idle fragment is detected, one or more second data blocks proximate to the idle fragment are caused to be moved from a first physical address to a second physical address in the NVM 100. For example, when the idle fragments 102 and 104 are detected, the data block D3 surrounded by fragments 102 and 104 might need to be moved in order to form a larger freed space for a further data allocation.
At 360, when the second data block is moved, an address mapping for the second data block to associate the second physical address with a logical address (that is, home address) of the second data block accessible to an application is updated. For example, if the data block D3 is moved from its original physical address “0x1aa” to a new physical address “0x2000,” the address mapping for D3 to associate the logical address with the physical address will be changed/updated accordingly.
After the change/update of the address mapping for the data block D3, in some implementations, when an access (read/write) request to the data block D3 is received, the data block D3 will be accessed at its new log address “0x2000” via the updated address mapping.
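A minimal sketch of the move-and-remap step of process 300 might look as follows, reusing the mapping record introduced earlier; nvm_base, the function name, and the choice of memmove are illustrative, and crash-consistency ordering (persisting the copy before the remap) is deliberately left out of this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Mapping record as sketched earlier (names are illustrative). */
struct addr_mapping {
    uint64_t home_addr;
    uint64_t log_addr;
    uint64_t length;
};

/* Move one in-use block next to an idle fragment from its current log address
 * to a new log address, then update its mapping in place. */
static void relocate_block(uint8_t *nvm_base, struct addr_mapping *m,
                           uint64_t new_log_addr)
{
    /* Step 340: copy the block's bytes to the new physical location. */
    memmove(nvm_base + new_log_addr, nvm_base + m->log_addr, m->length);

    /* Step 360: the home address (e.g. 0x789 for D3) stays the same; only the
     * log address changes, e.g. from 0x1aa to 0x2000. */
    m->log_addr = new_log_addr;
}
```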
However, utilizing such a single tree structure might still be costly. As is known, the average cost of an operation on a tree structure is proportional to the tree height. In this case, when a single tree structure 200 as shown in FIG. 2 is utilized for a large NVM, the height of the tree can be huge, especially when the block size is small (for example, tens of bytes) while the NVM is large. Therefore, in order to further improve the mapping efficiency and reduce the cost, four optimization mechanisms for the tree structure are further proposed below.
Skip-list
In some implementations, the single tree structure 200 as shown in FIG. 2 may be replaced by a skip-list. A skip-list is a probabilistically balanced data structure. Contrary to the conventional balanced tree (as described with respect to FIG. 2), such a skip-list provides “express lanes” within an ordered sequence of elements and does not need complex rebalancing operations, which may be costly and may largely impair parallelism or concurrent operations.
Such a skip-list is especially important in multi-threaded scenarios. A typical balanced tree requires a readers-writer lock to protect concurrent operations. That is, a readers-writer lock allows concurrent access for read-only operations, while write operations require exclusive access. For example, a look-up that typically serves a memory load has to obtain a read lock in a shared manner, while an update that typically serves a memory store has to hold an exclusive write lock. Such lock contention, however, can deteriorate the throughput of multi-threaded applications.
In contrast, by leveraging skip-lists, the locking for read-only operations is eliminated. Furthermore, an update of the skip-list involves only simple insert operations on the singly linked lists. In this way, only an exclusive lock for updates is needed, and all look-up operations can execute in parallel. By avoiding such lock contention, the transaction throughput with multiple threads can be improved.
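The following single-threaded C++ sketch illustrates the skip-list idea only, namely a sorted bottom list with higher-level “express lanes” and insertions that require no rebalancing; the concurrency control discussed above (an exclusive lock for writers, lock-free look-ups) is deliberately omitted, and the class layout, level count, and randomization are assumptions of this sketch.

```cpp
// Sketch of a skip-list keyed by home address: look-ups descend from the
// top "express lane" to the bottom list; inserts only splice pointers.
#include <cstdint>
#include <cstdlib>
#include <vector>

struct SkipNode {
    uint64_t home_addr;
    uint64_t log_addr;
    std::vector<SkipNode*> next;  // one forward pointer per level
    SkipNode(uint64_t h, uint64_t l, int levels)
        : home_addr(h), log_addr(l), next(levels, nullptr) {}
};

class SkipList {
    static constexpr int kMaxLevel = 16;
    SkipNode head_{0, 0, kMaxLevel};  // sentinel head, present on every level

    static int random_level() {  // probabilistic balancing: level i with probability 2^-i
        int lvl = 1;
        while (lvl < kMaxLevel && (std::rand() & 1)) ++lvl;
        return lvl;
    }

public:
    // Return the node with the greatest home address <= addr, or nullptr.
    SkipNode* find_le(uint64_t addr) {
        SkipNode* x = &head_;
        for (int lvl = kMaxLevel - 1; lvl >= 0; --lvl)
            while (x->next[lvl] && x->next[lvl]->home_addr <= addr)
                x = x->next[lvl];
        return x == &head_ ? nullptr : x;
    }

    // Insert a mapping with simple pointer splices and no rebalancing.
    void insert(uint64_t home_addr, uint64_t log_addr) {
        SkipNode* update[kMaxLevel];
        SkipNode* x = &head_;
        for (int lvl = kMaxLevel - 1; lvl >= 0; --lvl) {
            while (x->next[lvl] && x->next[lvl]->home_addr < home_addr)
                x = x->next[lvl];
            update[lvl] = x;
        }
        const int levels = random_level();
        SkipNode* n = new SkipNode(home_addr, log_addr, levels);
        for (int lvl = 0; lvl < levels; ++lvl) {
            n->next[lvl] = update[lvl]->next[lvl];
            update[lvl]->next[lvl] = n;
        }
    }
};
```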
Two-layer Mapping
In some implementations, instead of using a single tree, a partitioned tree approach can be used, so that the number of objects in each partition is limited to an acceptable value. In this case, the whole home space can be divided into a plurality of sub-home spaces (hereafter also referred to as partitions or chunks) , and each partition holds a respective tree structure for further address look-ups. Such mapping structure is also referred to as two-layer mapping structure.
FIG. 4 illustrates a schematic two-layer mapping structure 400. As shown, in the first layer, the whole address space is first divided into many static fixed-length partitions (or chunks) (collectively referred to as 410) with starting addresses of 0x0, 0x8000, 0x10000, ..., 0x28000, respectively, so that an address can first be routed in O(1) time to the specific partition in which it is located, which may cost as little as one CPU cycle. As an example, the first layer can simply take the form of a hash table.
In the second layer, the address can finally be accessed via the corresponding “local” tree structure held in each partition (collectively referred to as “420”). Due to the fact that the whole address space has been divided into many smaller spaces, or in other words, a huge tree has now been split into numerous small local trees, the average number of nodes in each local tree can be much smaller than that of a huge tree covering the whole home space.
As an example, consider a 10 GB home space and suppose random updates are performed over the whole space in aligned 128B units, so that every node maps a 128B area. When a single fully balanced binary tree as shown in FIG. 2 is adopted, the height of such a tree containing all the mappings of the whole space is over 26. On the contrary, when the two-layer mapping as shown in FIG. 4 is adopted with each partition 410 of the first layer equal to 32KB, the height of each local tree is only 8. In this way, the average cost can be reduced, and meanwhile the transaction throughput can be improved.
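As a hedged sketch of the two-layer look-up, the following C++ fragment routes a home address to a fixed-length partition with a single divide/index (the O(1) first layer) and then searches that partition's much smaller local tree; it reuses the MappingTree and translate helpers assumed in the earlier sketch, the 32 KB partition size follows the example above, and the structure names are illustrative.

```cpp
// Sketch of the two-layer mapping of FIG. 4: first-layer routing in O(1),
// second-layer look-up in a small per-partition tree. For simplicity, this
// sketch assumes a block never straddles a partition boundary.
#include <cstdint>
#include <optional>
#include <vector>

constexpr uint64_t kPartitionSize = 32 * 1024;  // 32 KB, as in the example above

struct TwoLayerMap {
    std::vector<MappingTree> partitions;  // one local tree per partition (layer 2)

    explicit TwoLayerMap(uint64_t home_space_bytes)
        : partitions(home_space_bytes / kPartitionSize) {}

    // Layer 1: constant-time routing to the partition covering the address.
    MappingTree& route(uint64_t home_addr) {
        return partitions[home_addr / kPartitionSize];
    }

    // Layer 2: O(log m) search within the local tree, where m is small
    // (e.g., 32 KB / 128 B = 256 nodes, i.e., a tree of height about 8).
    std::optional<uint64_t> lookup(uint64_t home_addr) {
        return translate(route(home_addr), home_addr);
    }
};
```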
Node Cache
Alternatively, or in addition, the storage system 1000 may further include a cache (also called the node cache) configured to store a node that is recently or frequently accessed. In this case, when an address is to be accessed, the node cache will be searched first. In some implementations, the node cache stores recently accessed home-space addresses and pointers to their nodes in the trees.
FIG. 5 shows a flow chart illustrating the process 500 of accessing a node in the tree when the storage system 1000 is equipped with such a cache. As illustrated in FIG. 5, at 520, when it is determined that a first node is cached, the first node will be accessed directly via the cache. Specifically, if the first node is cached, the corresponding pointer will directly point to the first node in the tree that contains the requested address mapping. Otherwise, at 540, when it is determined that the first node is not cached, the node will still need to be accessed via the mapping structure 200. For example, if the first node is not cached, a full tree look-up is still necessary and the resulting node is then added to the node cache. By caching recently visited nodes, many full tree look-up paths starting from the root node can be avoided.
In some implementations, once a node is cached, all addresses within its mapped area tend to be a cache hit. A plain hash table does not support such a feature, because cached addresses are randomly distributed. For example, conventionally, supposing the node for a 70B area starting at 0x1000 is cached, an access to, for example, the address 0x1008 is usually not a hit, because it falls into a different bucket and loses the chance to be checked against this node. On the contrary, in this case, all addresses within the 70B scope map to the same bucket and become cache hits, so that nearby addresses can be checked against the chained tree nodes that may cover them.
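The following C++ fragment is a deliberately simplified sketch of such a node cache: each entry remembers the home-address range mapped by a cached tree node, so any address falling inside a cached range is a hit without a full tree walk. The table size, the linear probe, and the round-robin replacement are assumptions of this sketch, and the exact bucketing scheme described above is not reproduced.

```cpp
// Sketch of a node cache: cached entries cover address ranges, so an access
// to any byte inside a cached node's mapped area avoids the full tree look-up.
#include <array>
#include <cstddef>
#include <cstdint>

struct CachedNode {
    bool valid = false;
    uint64_t home_start = 0;        // start of the node's mapped area
    uint64_t length = 0;            // length of the mapped area
    const Mapping* node = nullptr;  // pointer to the tree node (earlier sketch)
};

class NodeCache {
    static constexpr std::size_t kSlots = 64;
    std::array<CachedNode, kSlots> slots_{};
    std::size_t next_victim_ = 0;   // simple round-robin replacement

public:
    const Mapping* lookup(uint64_t home_addr) const {
        for (const CachedNode& c : slots_)
            if (c.valid && home_addr >= c.home_start &&
                home_addr < c.home_start + c.length)
                return c.node;      // hit anywhere within the cached node's range
        return nullptr;             // miss: the caller falls back to the tree
    }

    void insert(uint64_t home_start, uint64_t length, const Mapping* node) {
        slots_[next_victim_] = CachedNode{true, home_start, length, node};
        next_victim_ = (next_victim_ + 1) % kSlots;
    }
};
```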
Group Update
Merging tree nodes is another way to further reduce the number of nodes and thereby the height of the tree. In particular, when two sibling nodes contain contiguous home addresses and are mapped to contiguous log addresses, they can be merged into one node. For example, if a node has a home address of “0x500” and a block length of 70B, and another sibling node has a home address of “0x546” and a block length of 30B, these two nodes can be considered as two sibling nodes containing contiguous home addresses (0x500 + 70B = 0x546). As a result, these two sibling nodes can subsequently be merged into one node having a home address of “0x500” and a block length of 100B.
Accordingly, in some implementations, the storage manager 700 is further configured to first buffer a plurality of data blocks. If it is detected that two or more data blocks have contiguous target home addresses as discussed above, the two or more data blocks are grouped. The grouped data blocks are then appended into the storage space at one time.
FIG. 6 is a flowchart of an example method 600 for writing data into the NVM 100 of the storage system in accordance with an example implementation of the subject matter described herein. As illustrated in FIG. 6, at 620, a plurality of data blocks that are to be written into the non-volatile storage 100 are buffered. At 640, when at least two data blocks of the plurality of data blocks having contiguous logical addresses are detected, the at least two data blocks are grouped. At 660, the grouped at least two data blocks are appended into the non-volatile storage 100. At 680, the tree structure is updated.
In some implementations, within each NVM transaction, all writes to the NVM 100 can be buffered, and on transaction commit those with contiguous home addresses can be combined into a group. The group of combined writes is then appended to the end of the logs. In this way, the mapping structure (for example, the tree 200) only needs to be updated once, rather than multiple times. This also allows overwrites to the same address to be coalesced, which reduces NVM bandwidth consumption and wear as well.
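As a non-authoritative sketch of this group-update path, the following C++ fragment sorts the buffered writes by home address, merges runs of contiguous writes into one block, and performs a single append (and thus a single mapping update) per merged group; it reuses the LogSpace type assumed earlier, the names are illustrative, and the transaction commit machinery and overwrite coalescing are omitted.

```cpp
// Sketch of the group update of FIG. 6: contiguous buffered writes
// (e.g., 0x500 + 70B == 0x546) are merged and appended together.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct PendingWrite {
    uint64_t home_addr;
    uint64_t length;
};

void commit_group(LogSpace& log, std::vector<PendingWrite> writes) {
    std::sort(writes.begin(), writes.end(),
              [](const PendingWrite& a, const PendingWrite& b) {
                  return a.home_addr < b.home_addr;
              });
    std::size_t i = 0;
    while (i < writes.size()) {
        uint64_t start = writes[i].home_addr;
        uint64_t length = writes[i].length;
        // Extend the group while the next buffered write is contiguous.
        while (i + 1 < writes.size() &&
               writes[i + 1].home_addr == start + length) {
            length += writes[i + 1].length;
            ++i;
        }
        ++i;
        // One append to the log tail and one mapping update for the group.
        log.append(start, length);
    }
}
```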
Example Device
Hereinafter, an example implementation of the storage manager 700 is shown in FIG. 7. In this example, the storage manager 700 is in the form of a general-purpose computing device. Components of the storage manager 700 may include, but are not limited to, one or more processors or processing units 710, a memory 720, one or more input devices 730, one or more output devices 740, storage 750, and one or more communication units 760. The processing unit 710 may be a real or a virtual processor and is capable of performing various processes in accordance with a program stored in the memory 720. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power.
The storage manager 700 typically includes a variety of machine readable media. Such media may be any available media that are accessible by the storage manager 700, including volatile and non-volatile media, removable and non-removable media. The memory 720 may be volatile memory (e.g., registers, cache, a random-access memory (RAM)), non-volatile memory (e.g., a read only memory (ROM), an electrically erasable programmable read only memory (EEPROM), a flash memory), or some combination thereof. The storage 750 may be removable or non-removable, and may include machine readable media such as flash drives, magnetic disks or any other media which can be used to store information and which can be accessed within the storage manager 700.
The storage manager 700 may further include other removable/non-removable, volatile/non-volatile computing system storage media. Although not shown in FIG. 7, a disk drive for reading from or writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk can be provided. The memory 720 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of various implementations of the subject matter described herein.
A program/utility tool 722 having a set (at least one) of the program modules 724 may be stored in, for example, the memory 720. Such program modules 724 include, but are not limited to, an operating system, one or more applications, other program modules, and program data. Each or a certain combination of these examples may include an
implementation of a networking environment. The program modules 724 generally carry out the functions and/or methodologies of implementations of the subject matter described herein, for example, the method 300 and method 500.
The input unit (s) 730 may be one or more of various different input devices. For example, the input unit (s) 730 may include a user device such as a mouse, keyboard, trackball, a pointing stick, etc. The input unit (s) 730 may implement one or more natural user interface techniques, such as speech recognition or touch and stylus recognition. As other examples, the input unit (s) 730 may include a scanning device, a network adapter, or another device that provides input to the storage manager 700. The output unit (s) 740 may be a display, printer, speaker, network adapter, or another device that provides output from the storage manager 700. The input unit (s) 730 and output unit (s) 740 may be incorporated in a single system or device, such as a touch screen or a virtual reality system.
The communication unit (s) 760 enables communication over communication medium to another computing entity. Additionally, functionality of the components of the storage manager 700 may be implemented in a single computing machine or in multiple computing machines that are able to communicate over communication connections. Thus, the storage manager 700 may operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs) , or another common network node. By way of example, and not limitation, communication media include wired or wireless networking techniques.
The storage manager 700 may also communicate, as required, with one or more external devices (not shown) such as a storage device, a display device, and the like, one or more devices that enable a user to interact with the storage manager 700, and/or any device (e.g., network card, a modem, etc. ) that enables the storage manager 700 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (s) (not shown) .
The functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
Program code for carrying out methods of the subject matter described herein may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present disclosure, a machine readable medium may be any tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. A machine readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are contained in the above discussions, these should not be construed as limitations on the scope of the subject matter described herein, but rather as descriptions of features that may be specific to particular implementations. Certain features that are described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation may also be implemented in multiple implementations separately or in any suitable sub-combination.
Example Implementations
Hereinafter, some example implementations of the subject matter described herein will be listed.
In some implementations, a storage system is provided. The storage system comprises: a non-volatile storage configured to store a plurality of data blocks in a storage space; a storage manager configured to cause the plurality of data blocks to be sequentially appended and have non-fixed block sizes; and a mapping structure including a plurality of address mappings, each of the address mappings associating a logical address with a physical address for one of the data blocks, the logical address indicating an address accessible to an application, the storage manager being further configured to defragment the plurality of data blocks by moving at least part of the plurality of data blocks to merge idle fragments.
In some implementations, the defragmenting the plurality of data blocks comprises: detecting an idle fragment in the non-volatile storage, the idle fragment being caused by removal of a first data block from the non-volatile storage; in response to detecting the idle fragment, causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address; and in response to the second data block being moved, updating the address mapping for the second data block.
In some implementations, the mapping structure comprises a tree structure including a plurality of nodes, and each of the plurality of nodes includes a logical address, a length for a data block, and a physical address that the logical address is mapped to.
In some implementations, the tree structure includes a fully balanced tree.
In some implementations, the mapping structure includes a skip-list that is based on a probabilistically balanced data structure.
In some implementations, the storage space includes a plurality of subspaces, and each of the subspaces is associated with a group of logical addresses, and wherein each of the subspaces is provided with a tree structure to associate a logical address being within a respective group of logical addresses with a physical address.
In some implementations, the storage system further comprises: a cache configured to store a node that is recently accessed.
In some implementations, a storage manager is provided. The storage manager comprising: a processing unit; and a memory coupled to the processing unit and storing instructions for execution by the processing unit, the instructions, when executed by the processing unit, causing the device to perform acts comprising: detecting an idle fragment in a non-volatile storage that stores a plurality of data blocks that are sequentially appended and have non-fixed sizes in a storage space, the idle fragment being caused by removal of a first data block from the non-volatile storage; in response to detecting the idle fragment, causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address in the non-volatile storage; and in response to the second data block being moved, updating an address mapping for the second data block to associate the second address with a logical address of the second data block accessible to an application.
In some implementations, the acts further comprise: sequentially appending the plurality of data blocks having non-fixed sizes into the storage space.
In some implementations, the mapping structure comprises a tree structure including a plurality of nodes, and each of the plurality of nodes includes a logical address, a length for a data block, and a physical address that the logical address is mapped to.
In some implementations, the acts further comprise: in response to receiving an access request to the second data block, causing the second data block to be accessed via the tree structure.
In some implementations, the tree structure includes a fully balanced tree.
In some implementations, the mapping structure includes a skip-list that is based on a probabilistically balanced data structure.
In some implementations, the storage space includes a plurality of subspaces, each of the subspaces being associated with a group of logical addresses, and wherein each of the subspaces is provided with a tree structure to associate a logical address being within a respective group of logical addresses with a physical address.
In some implementations, the causing the second data block to be accessed via the tree structure comprises: determining a subspace that the second data block is associated with; and causing the second data block to be accessed via the tree structure associated with
the subspace.
In some implementations, the causing the second data block to be accessed via the tree structure comprises: in response to determining that a first node for the second data block is stored in a cache, causing the second data block to be accessed via the cache; and in response to determining that the first node is not stored in the cache, causing the second data block to be accessed via the tree structure.
In some implementations, the acts further comprise: buffering a plurality of data blocks that are to be written into the non-volatile storage; in response to detecting at least two data blocks of the plurality of data blocks having contiguous logical addresses, grouping the at least two data blocks; appending the grouped at least two data blocks into the non-volatile storage; and updating the tree structure.
In some implementations, the updating the tree structure further comprises: merging at least two nodes having the contiguous logical addresses as one node.
In some implementations, a computer-implemented method is provided. The method comprises: detecting an idle fragment in a non-volatile storage that stores a plurality of data blocks that are sequentially appended and have non-fixed sizes in a storage space, the idle fragment being caused by removal of a first data block from the storage; in response to detecting the idle fragment, causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address in the non-volatile storage; and in response to the second data block being moved, updating an address mapping for the second data block to associate the second address with a logical address of the second data block accessible to an application.
In some implementations, the method further comprises: sequentially appending the plurality of data blocks having non-fixed sizes into the storage space.
In some implementations, the mapping structure comprises a tree structure including a plurality of nodes, and each of the plurality of nodes includes a logical address, a length for a data block, and a physical address that the logical address is mapped to.
Claims (20)
- A storage system comprising: a non-volatile storage configured to store a plurality of data blocks in a storage space; a storage manager configured to cause the plurality of data blocks to be sequentially appended and have non-fixed block sizes; and a mapping structure including a plurality of address mappings, each of the address mappings associating a logical address with a physical address for one of the data blocks, the logical address indicating an address accessible to an application, the storage manager being further configured to defragment the plurality of data blocks by moving at least part of the plurality of data blocks to merge idle fragments.
- The storage system of claim 1, wherein the defragmenting the plurality of data blocks comprises: detecting an idle fragment in the non-volatile storage, the idle fragment being caused by removal of a first data block from the non-volatile storage; in response to detecting the idle fragment, causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address; and in response to the second data block being moved, updating the address mapping for the second data block.
- The storage system of claim 1, wherein the mapping structure comprises a tree structure including a plurality of nodes, and each of the plurality of nodes includes a logical address, a length for a data block, and a physical address that the logical address is mapped to.
- The storage system of claim 3, wherein the tree structure includes a fully balanced tree.
- The storage system of claim 1, wherein the mapping structure includes a skip-list that is based on a probabilistically balanced data structure.
- The storage system of claim 1, wherein the storage space includes a plurality of subspaces, and each of the subspaces is associated with a group of logical addresses, and wherein each of the subspaces is provided with a tree structure to associate a logical address being within a respective group of logical addresses with a physical address.
- The storage system of claim 3, further comprising: a cache configured to store a node that is recently accessed.
- A storage manager comprising: a processing unit; and a memory coupled to the processing unit and storing instructions for execution by the processing unit, the instructions, when executed by the processing unit, causing the device to perform acts comprising: detecting an idle fragment in a non-volatile storage that stores a plurality of data blocks that are sequentially appended and have non-fixed sizes in a storage space, the idle fragment being caused by removal of a first data block from the non-volatile storage; in response to detecting the idle fragment, causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address in the non-volatile storage; and in response to the second data block being moved, updating an address mapping for the second data block to associate the second address with a logical address of the second data block accessible to an application.
- The device of claim 8, wherein the acts further comprise: sequentially appending the plurality of data blocks having non-fixed sizes into the storage space.
- The device of claim 8, wherein the mapping structure comprises a tree structure including a plurality of nodes, and each of the plurality of nodes includes a logical address, a length for a data block, and a physical address that the logical address is mapped to.
- The device of claim 10, wherein the acts further comprise: in response to receiving an access request to the second data block, causing the second data block to be accessed via the tree structure.
- The device of claim 11, wherein the tree structure includes a fully balanced tree.
- The device of claim 8, wherein the mapping structure includes a skip-list that is based on a probabilistically balanced data structure.
- The device of claim 11, wherein the storage space includes a plurality of subspaces, each of the subspaces being associated with a group of logical addresses, and wherein each of the subspaces is provided with a tree structure to associate a logical address being within a respective group of logical addresses with a physical address.
- The device of claim 14, wherein the causing the second data block to be accessed via the tree structure comprises: determining a subspace that the second data block is associated with; and causing the second data block to be accessed via the tree structure associated with the subspace.
- The device of claim 11, wherein the causing the second data block to be accessed via the tree structure comprises: in response to determining that a first node for the second data block is stored in a cache, causing the second data block to be accessed via the cache; and in response to determining that the first node is not stored in the cache, causing the second data block to be accessed via the tree structure.
- The device of claim 10, wherein the acts further comprise: buffering a plurality of data blocks that are to be written into the non-volatile storage; in response to detecting at least two data blocks of the plurality of data blocks having contiguous logical addresses, grouping the at least two data blocks; appending the grouped at least two data blocks into the non-volatile storage; and updating the tree structure.
- The device of claim 17, wherein the updating the tree structure further comprises: merging at least two nodes having the contiguous logical addresses as one node.
- A computer-implemented method comprising: detecting an idle fragment in a non-volatile storage that stores a plurality of data blocks that are sequentially appended and have non-fixed sizes in a storage space, the idle fragment being caused by removal of a first data block from the storage; in response to detecting the idle fragment, causing a second data block proximate to the idle fragment to be moved from a first physical address to a second physical address in the non-volatile storage; and in response to the second data block being moved, updating an address mapping for the second data block to associate the second address with a logical address of the second data block accessible to an application.
- The method of claim 19, wherein the mapping structure comprises a tree structure including a plurality of nodes, and each of the plurality of nodes includes a logical address, a length for a data block, and a physical address that the logical address is mapped to.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2017/076999 WO2018165957A1 (en) | 2017-03-16 | 2017-03-16 | Log-appended-structured storage management with byte-level accessibility |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2018165957A1 true WO2018165957A1 (en) | 2018-09-20 |
Family
ID=63522756
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2017/076999 Ceased WO2018165957A1 (en) | 2017-03-16 | 2017-03-16 | Log-appended-structured storage management with byte-level accessibility |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2018165957A1 (en) |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1243317A (en) * | 1998-07-28 | 2000-02-02 | 索尼公司 | Non-volatile memory, recording apparatus and recording method |
| US6282605B1 (en) * | 1999-04-26 | 2001-08-28 | Moore Computer Consultants, Inc. | File system for non-volatile computer memory |
| JP2004094429A (en) * | 2002-08-30 | 2004-03-25 | Toshiba Corp | Disk array device and raid level changing method in the device |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112817526A (en) * | 2021-01-19 | 2021-05-18 | 杭州和利时自动化有限公司 | Data storage method, device and medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP6205650B2 (en) | Method and apparatus utilizing non-uniform hash function to place records in non-uniform access memory | |
| US9471500B2 (en) | Bucketized multi-index low-memory data structures | |
| US9846642B2 (en) | Efficient key collision handling | |
| CN106354745B (en) | Method for providing an interface of a computer device and computer device | |
| JP7724318B2 (en) | Hardware-based memory compression | |
| US20170286507A1 (en) | Database search system and database search method | |
| US20150067283A1 (en) | Image Deduplication of Guest Virtual Machines | |
| US10296250B2 (en) | Method and apparatus for improving performance of sequential logging in a storage device | |
| JP5902323B2 (en) | Method and apparatus for arranging content-derived data in memory | |
| US10366011B1 (en) | Content-based deduplicated storage having multilevel data cache | |
| US11061827B2 (en) | Metadata representation for enabling partial page duplication | |
| US20150324281A1 (en) | System and method of implementing an object storage device on a computer main memory system | |
| US8688948B2 (en) | Flexible memory controller for autonomous mapping of memory | |
| US11747998B1 (en) | Indexing technique for large scale distributed key-value systems | |
| CN116795736A (en) | Data pre-reading method, device, electronic equipment and storage medium | |
| KR20210144656A (en) | How to allocate virtual pages to non-contiguous backup physical subpages | |
| WO2018165957A1 (en) | Log-appended-structured storage management with byte-level accessibility | |
| CN110832473A (en) | Log structure management system and method | |
| US12086411B2 (en) | Storage control apparatus and method for a database storing data in pages | |
| KR102697447B1 (en) | Half-match deduplication | |
| CN117762323A (en) | Data processing method and device | |
| CN120386744A (en) | Storage system and operating method thereof | |
| CN117931066A (en) | Disk space allocation method, device, electronic equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17901072; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 17901072; Country of ref document: EP; Kind code of ref document: A1 |