WO2023141378A1 - Fast skip-list scan and insert - Google Patents
- Publication number
- WO2023141378A1 (PCT/US2023/060371)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- key
- skip list
- record
- records
- prefix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
Definitions
- This linked hierarchy is implemented using varying heights of pointer towers such that, within a given tower, pointers may be arranged based on the number of skipped-over records. This ability to skip over records when the skip list is traversed may allow a given record to be located more quickly than scanning through the records sequentially.
- Fig. 1 is a block diagram illustrating one embodiment of a database system that uses a skip list within a buffer data structure to process concurrent database transactions.
- Fig. 2 is a block diagram illustrating one embodiment of a record chain within the buffer data structure.
- Fig. 3 is a block diagram illustrating one embodiment of a skip list within the buffer data structure.
- Fig. 4 is a block diagram illustrating one embodiment of a slow skip-list scan.
- Fig. 5 is a flow diagram illustrating one embodiment of a fast skip-list scan.
- Figs. 6A and 6B are block diagrams illustrating embodiments of key space partitionings using key prefixes to facilitate the fast skip-list scan.
- Figs. 7A and 7B are block diagrams illustrating embodiments of anchor record identifications during the fast skip-list scan.
- Fig. 8 is a block diagram illustrating one embodiment of “climbing the mountain” during the fast skip-list scan.
- Figs. 9A and 9B are block diagrams illustrating one embodiment of walking a skip list backwards for record insertion.
- Figs. 10A-10C are flow diagrams illustrating embodiments of methods related to scanning a skip list.
- Fig. 11 is a block diagram illustrating one embodiment of an exemplary multi-tenant system.
- Fig. 12 is a block diagram illustrating one embodiment of an exemplary computer system for implementing various systems described herein.
- skip lists may be used to maintain large quantities of information that is frequently manipulated.
- a database system may use a buffer data structure to store data of active database transactions until the database transactions can be committed and their data flushed to a persistent storage of the database system.
- the buffer data structure may include a skip list data structure that enables efficient storage and lookup of transaction records in key order. As this database system may process a high volume of transactions in parallel, efficient scanning of the skip list can be important for database performance.
- database system 10 includes a transaction manager 104, buffer data structure 106, and a database 108.
- buffer data structure 106 includes multiple record chains 110, hash table 120, active transaction list 130, and skip list 140.
- Record chains 110 include key-value records 112.
- Hash table 120 includes a hash function 122 and an array of hash buckets 124, each including a latch 126. (As used herein, the terms “latch,” “lock,” and “semaphore” are used generally to refer to a data structure that controls access to a resource shared among multiple potential consumers.)
- Transaction manager 104 also includes a scan engine 150.
- buffer data structure 106 may include more (or fewer) structures. Although some structures 110, 120, 130, and 140 are depicted separately for illustration purposes, in some embodiments, structures 110, 120, 130, and/or 140 may be intertwined. For example, skip list 140 may be implemented by adding pointers within key-value records 112 in record chains 110.
- operations on buffer data structure 106 may be performed by transaction manager 104, such as adding key-value records 112 to record chains 110, facilitating acquisition of hash-bucket latches 126 for transactions 102, making modifications to active transaction list 130 and skip list 140, etc.
- committed transaction data is asynchronously flushed from buffer data structure 106 to the persistent storage of database 108. That is, rather than perform a flush for each transaction 102’s data upon its commitment, a flush is performed periodically for multiple committed transactions 102.
- transaction manager 104 initiates a flush to database 108 in response to buffer data structure 106 satisfying a particular size threshold.
- Database 108 may correspond to any suitable form of database implementation.
- database 108 is a non-relational database that is implemented using a log-structured merge (LSM) tree for persistent storage.
- layers of the LSM tree may be distributed across multiple physical computer systems providing persistent storage.
- these computer systems are cluster nodes of a computer cluster that provides a cloud-based system accessible to multiple clients.
- database 108 may be part of a software as a service (SaaS) model; in other embodiments, database 108 may be directly operated by a user.
- a corresponding key-value record 112 may be created that includes the value and the key. If multiple transactions 102 attempt to write values associated with the same key, key-value records 112 may be generated for each value and linked together to form a record chain 110 corresponding to the key. For example, if a user has withdrawn a first amount from a bank account resulting in a first database transaction 102 and then a second amount resulting in a second database transaction 102, a record chain 110 corresponding to an account-balance key may have two key-value records 112 reflecting those withdrawals.
- each record 112 includes a transaction identifier (e.g., a transaction sequence number) specifying its associated transaction 102; records 112 may also be organized in a record chain 110 based on the ordering in which the transactions 102 are received.
- record chains 110 may be implemented using linked lists such that a new record 112 is inserted at the head of the linked list and migrates to the tail as newer records 112 are created and older ones are flushed to database 108.
- record chains 110 are appended to hash buckets 124 of hash table 120.
- Hash table 120, in one embodiment, is a data structure that allows constant-time lookups of record chains 110 based on a given key. That is, when a key is received, hash table 120 is indexed into by applying hash function 122 to the key to produce the appropriate index value for the hash bucket 124 corresponding to the key. The direct pointer in the hash bucket 124 may then be referenced to obtain the record chain 110. Being able to perform constant-time lookups may significantly reduce the time consumed to read key-value records 112, write records 112, or perform key probes (i.e., determining whether a key has a key-value record 112 present in buffer data structure 106).
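- As a rough illustration of this lookup path, consider the following Python sketch. It is a minimal model, not the patent's implementation; names such as `Record`, `Bucket`, and the modulo-based `bucket_index` are assumptions introduced only for this example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:                            # toy stand-in for a key-value record 112
    key: str
    value: str
    next: Optional["Record"] = None      # record-chain pointer 219

@dataclass
class Bucket:                            # toy stand-in for a hash bucket 124
    chain_head: Optional[Record] = None  # direct pointer to the record chain 110

NUM_BUCKETS = 64
buckets = [Bucket() for _ in range(NUM_BUCKETS)]

def bucket_index(key: str) -> int:
    return hash(key) % NUM_BUCKETS       # stand-in for hash function 122

def lookup(key: str) -> Optional[Record]:
    # Constant-time hop to the bucket, then a short walk down the chain.
    record = buckets[bucket_index(key)].chain_head
    while record is not None and record.key != key:
        record = record.next
    return record

buckets[bucket_index("acct-balance")].chain_head = Record("acct-balance", "100")
assert lookup("acct-balance").value == "100"
```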
- each hash bucket 124 includes a respective latch 126 that controls access to its record chain 110. Accordingly, when a transaction is attempting to read or write a value associated with a particular key, the key may be used to index into hash table 120 and acquire the latch 126 corresponding to the key’s associated hash bucket 124 before reading or writing is performed. If a latch 126 cannot be acquired for a database transaction 102, processing the database transaction 102 may be delayed until the latch 126 is released.
- latches 126 may have one of three possible states: available, shared acquired, and exclusively acquired. If no transaction 102 is currently accessing a record chain 110, its latch 126 is available for acquiring.
- latches 126 may be acquired in ascending order of buckets 124 to prevent deadlock. Although acquisition of latches 126 may be discussed primarily with respect to read and write operations, latches 126 may also be acquired when performing other operations such as defragmentation, garbage collection, flushing records 112 to the persistent store of database 108, etc. In some embodiments, latches 126 may also serve as a concurrency control mechanism for active transaction list 130 and skip list 140.
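- A minimal sketch of the ascending-order acquisition rule follows. It assumes exclusive-only latches; Python's `threading.Lock` does not model the shared-acquired state described above:

```python
import threading

NUM_BUCKETS = 64
latches = [threading.Lock() for _ in range(NUM_BUCKETS)]  # one latch 126 per bucket 124

def with_latches(bucket_ids, critical_section):
    # Acquiring in ascending bucket order means two transactions that need
    # overlapping sets of buckets can never hold latches in opposite orders,
    # which is the classic condition for deadlock.
    ordered = sorted(set(bucket_ids))
    for b in ordered:
        latches[b].acquire()
    try:
        return critical_section()
    finally:
        for b in reversed(ordered):
            latches[b].release()

with_latches([5, 2, 9], lambda: None)  # latches 2, 5, 9 taken in ascending order
```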
- Active transaction list 130 is a data structure that tracks various metadata for active transactions 102.
- the metadata for a given transaction 102 includes a transaction identifier for the transaction 102 and one or more pointers usable to access records 112 associated with the transaction 102.
- list 130 enables a transaction 102’s records 112 to be identified based on its transaction identifier, which may be helpful when, for example, determining which records 112 should be removed if the transaction 102 is being rolled back.
- the metadata may also include an indication of whether a transaction is active or committed, which may be used to determine if its records 112 can be marked for flushing to database 108.
- Skip list 140, in one embodiment, is a data structure that maintains an ordering of keys in records 112 to allow forward and reverse scanning of keys.
- database 108 may be configured such that records 112 for committed transactions 102 are flushed in ascending key order (as well as version order); skip list 140 may allow this ordering to be quickly and easily determined.
- skip list 140 includes indirect pointers for accessing records 112 of skip list 140. That is, rather than have direct pointers between records 112 (i.e., pointers specifying the memory addresses of records 112), skip list 140 includes indirect pointers to the hash buckets 124, which include the direct pointers to chains 110.
- scan engine 150 is a component of transaction manager 104 that is executable to scan skip list 140 and may implement the fast scan algorithm discussed in more detail below starting with Fig. 5. Accordingly, as noted above, scan engine 150 may divide skip list 140 into sections (referred to below as “anchor ranges”) based on prefixes of the keys such that a given section includes keys having the same prefix.
- scan engine 150 maintains an index that can be referenced to determine the relevant section for a given key prefix so that a scan can then be initiated within that section. As will be discussed below, in some embodiments, this index is integrated into hash table 120 by adding pointers in hash buckets 124 to records 112 in respective sections (referred to as “anchor records”) so that a given section can be located quickly for a given key prefix. Once a relevant section is determined for a given key prefix, scan engine 150 may sequentially scan through records, implementing a technique referred to below as “climbing the mountain.” In various embodiments, scan engine 150 may also be responsible for maintaining skip list 140, including performing record insertion (and record removal) for skip list 140.
- scan engine 150 may “walk backwards” from the location to determine what pointers in other records 112 should be updated when the record 112 is inserted as well as determine the pointers to include in the skip list tower of the record 112.
- scan engine 150 manages a thread pool of executing threads in order to implement parallel fast scans.
- record chain 110 may include a collection of key-value records 112A-112C, a collision record 220, and a lock record 230. Records 112 may further include a key 212, value 213, transaction identifier 214, commit identifier 215, purge flag 216, lock 217, skip list pointers 218, and record-chain pointer 219.
- chain 110 may include more (or fewer) records 112, 220, or 230 than shown; a given record 112 may also include more (or fewer) elements 212-219 than shown.
- record chain 110 is implemented using a linked list such that each key-value record 112 includes a pointer 219 identifying the next record 112 in the chain 110.
- the added record 112 may then include a pointer 219 to the record that was previously at the head.
- As the record 112 becomes older, it migrates toward the tail (record 112B or lock record 230 in Fig. 2) until its transaction 102 commits. Then, it may be flushed to database 108’s persistent storage and removed.
- transaction identifier 214 may identify, not only the transaction 102 to which the record 112 is associated, but also indicate the ordering in which transactions 102 were received. Accordingly, since record 112B is further from the head than record 112A, transaction ID 214B may correspond to an earlier transaction 102 than transaction ID 214A. If the transaction 102 corresponding to transaction ID 214B is to be rolled back, transaction manager 104 may locate record 112B by referencing direct pointer 202 to identify the head of chain 110 and traverse through records 112A and 220 until finding the record 112B having the corresponding transaction ID 214B. Record 112B may then be removed and pointer 222A modified to have the same address as pointer 219B.
- When a transaction 102 commits, the commit identifiers 215 for its records 112 may be set to reflect the commitment and mark the records 112 as being ready for flushing to database 108’s persistent storage. Records 112 may later be scanned by a process of transaction manager 104 to identify which records 112 have commit identifiers 215 and to determine which records 112 can be flushed to database 108.
- transaction manager 104 sets a purge flag 216 to indicate that the record 112 is ready for purging from buffer data structure 106.
- a purge engine may then read this flag 216 in order to determine whether the record 112 should be purged from buffer data structure 106.
- individual records 112 may also include their own respective locks 217 to provide additional coherency control.
- a separate lock record 230 may also be inserted into record chains 110 to create a lock tied to a particular key when there is no corresponding value.
- Skip list pointers 218, in one embodiment, are the pointers that form skip list 140.
- pointers 218 within a given record 112 may form a pointer tower that implements a linked hierarchy of data record sequences, with each successive sequence skipping over fewer records 112 than the previous sequence.
- pointers 218 are also implemented using indirect pointers through which key-value records 112 are linked together in skip list 140 without using direct pointers to the physical addresses of records 112. Instead, pointers 218 reference the hash buckets 124 that point to the record chains 110 including records 112.
- using indirect pointers greatly simplifies pointer management because only one direct pointer may be maintained for a given record 112. That is, since the location of the hash bucket 124 remains the same, the indirect pointer is not updated if a record 112 is moved, for example, to a later position in a record chain 110.
- skip list 140 may be used to maintain an ordering of keys 212 stored in records 112, which may be used to flush records 112 of committed transactions 102 in ascending key order.
- skip list pointers 218 within a record 112 form a tower 300 that points to towers 300 in other records 112.
- traversal proceeds forward along another pointer 314. This process then continues onward until a match is identified for the record 112 being searched for. An example of this traversal will be discussed below with Fig. 4.
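- The indirection can be pictured with a small sketch in which tower entries hold bucket IDs rather than record addresses (the field names here are illustrative assumptions, not the patent's layout):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Record:
    key: str
    tower: List[int] = field(default_factory=list)  # pointers 218: bucket IDs 312, not addresses

@dataclass
class Bucket:
    record: Optional[Record] = None                 # the bucket's direct pointer

buckets: List[Bucket] = [Bucket() for _ in range(32)]

def follow(bucket_id: int) -> Optional[Record]:
    # Resolving an indirect pointer costs one extra hop, but only the bucket
    # ever holds a direct pointer, so repositioning a record within its chain
    # never invalidates the tower entries that reference it.
    return buckets[bucket_id].record

rec = Record("DEBHQTT", tower=[])
buckets[19].record = rec          # bucket 19 holds the only direct pointer
assert follow(19) is rec
```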
- Turning now to Fig. 4, a block diagram of a slow scan 400 is depicted.
- slow scan 400 is a less efficient algorithm than the fast scan algorithm described herein as it can use a far greater number of memory accesses to scan for a particular key 212.
- an example skip list 140 may be constructed from records 112 sorted in order of keys 212 A-W.
- the skip list 140 includes eight levels (shown as levels 1-8) of forward pointers 314 allowing for movement in ascending key order and another level (shown as level -1) of backward pointers allowing for movement in descending key order.
- slow scan 400 is being performed to locate a record 112 having key 212 of S (or simply “record S”).
- scan 400 begins with skip-list traversal at the top of the sentinel tower 300 on the left in which a first memory access is performed to read the skip list pointer 218 at level 8, which includes a bucket ID 312 of 20.
- a second memory access is then performed to read the record 112 pointed to by bucket #20, which is a record 112 having a key K.
- the traversal continues along level 8 where a record W is read during a third memory access. Because key S is less than key W, the traversal returns to record K in a fourth memory access to read the skip list pointer 218 for the level 7, the next level down.
- this process continues for another twenty memory accesses until record R is identified as having a pointer 218 of bucket #17 to record S — not including the additional memory accesses for using indirect pointers or the multiple accesses to move down a record chain 110.
- slow scan 400 may be performed multiple times to insert multiple records 112 associated with a given transaction.
- skip list 140 may include much taller skip list towers 300 (e.g., ones having 33 levels) and be substantially wider. All these memory accesses can affect system performance. In many instances, the fast scan algorithm discussed next uses far fewer memory accesses.
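- For reference, a conventional top-down traversal of this kind looks roughly like the following sketch; it uses direct pointers for simplicity, ignoring the bucket indirection described above:

```python
import random
from typing import List, Optional

class Node:
    def __init__(self, key: str, height: int):
        self.key = key
        self.forwards: List[Optional["Node"]] = [None] * height

class SkipList:
    MAX_HEIGHT = 8                            # Fig. 4 shows eight forward levels

    def __init__(self):
        self.sentinel = Node("", self.MAX_HEIGHT)

    def slow_scan(self, key: str) -> Optional[Node]:
        # Start at the top of the sentinel tower; every pointer dereference
        # below corresponds to one of the memory accesses counted in Fig. 4.
        node = self.sentinel
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.forwards[level] is not None and node.forwards[level].key < key:
                node = node.forwards[level]
        node = node.forwards[0]
        return node if node is not None and node.key == key else None

    def insert(self, key: str) -> None:
        # A textbook insert: one full slow scan to find the predecessors.
        update = [self.sentinel] * self.MAX_HEIGHT
        node = self.sentinel
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.forwards[level] is not None and node.forwards[level].key < key:
                node = node.forwards[level]
            update[level] = node
        height = 1
        while height < self.MAX_HEIGHT and random.random() < 0.5:
            height += 1
        new = Node(key, height)
        for level in range(height):
            new.forwards[level] = update[level].forwards[level]
            update[level].forwards[level] = new

sl = SkipList()
for k in "KWSRA":
    sl.insert(k)
assert sl.slow_scan("S") is not None
```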
- fast scan 500 may be an algorithm used by scan engine 150 to more quickly scan skip list 140 in response to receiving a request to identify a relevant location for a given key.
- fast scan 500 may include more (or fewer) steps than shown. For example, step 540 may not be included if scan 500 is being performed for something other than record insertion.
- fast scan 500 begins in step 510 with scan engine 150 using a prefix of a particular key to identify an anchor range within skip list 140 including the relevant location for the particular key.
- an “anchor range” is a set/range of records in a skip list that have the same key prefix and is used to anchor where a skip list scan is initiated.
- scan engine 150 may use key prefixes of varying lengths in order to produce anchor ranges having varying widths so that a scan can be initiated from the narrowest anchor range available, as will be discussed below with Figs. 6A and 6B.
- scan engine 150 may identify the anchor range for a particular key prefix using an index that associates key prefixes to anchor ranges by including respective pointers to “anchor records” in those ranges.
- the anchor record for a particular anchor range is the record 112 having the lowest key 212 for that anchor range while still having the same prefix.
- this scan begins with scan engine 150 “climbing the mountain” from an anchor record to a record 112 having a skip list tower 300 that overshoots the relevant location.
- the anchor record may not include a tower with sufficient height to allow the relevant location to be determined using only a skip-list traversal.
- scan engine 150 may scan forward from the anchor record using the highest available tower pointers until a key-value record 112 is identified that includes a skip list tower 300 with a pointer 314 that overshoots the location (i.e., the key-value record 112 pointed to by the pointer 314 has a later key in key order than the particular key).
- the relevant location may be identified as part of this forward scan without performing step 530.
- In step 530, once an overshooting tower 300 has been identified, scan engine 150 performs a local skip list traversal from the overshooting tower 300 to identify the relevant location for the particular key.
- this local skip list traversal may be implemented in a manner similar to slow scan 400, but without starting at a sentinel tower. If scan 500 is being performed to access a record 112, scan 500 may conclude with scan engine 150 accessing the contents of the record 112 once the traversal identifies its location. If a record 112 is being inserted at the location, scan 500 may proceed to step 540 to determine how skip list 140 should be modified to facilitate the insertion.
- In step 540, scan engine 150 walks backwards from the identified location in skip list 140 to identify skip-list pointers 314 relevant to the record insertion. As will be discussed with Figs. 9A and 9B, scan engine 150 scans sequentially backwards until it has identified an overshooting pointer for each level in the tower 300 of the record 112 being inserted. Once all of these relevant pointers 314 have been identified, scan engine 150 may then insert the new record 112 at the determined location and include, in the new record 112, the tower 300 with the identified pointers 314. Scan engine 150 then updates the relevant pointers 314 in the other towers 300 to point to the newly inserted record 112.
- fast scan 500 results in far fewer memory accesses than slow scan 400 as a skip list traversal is only performed on a small subset of the total records 112 in skip list 140.
- Although additional memory accesses may be performed to facilitate this local skip list traversal, employing techniques such as climbing the mountain and walking backwards for relevant pointer identification ensures that the number of additional memory accesses, in many instances, falls well below the number needed to perform a slow scan 400.
- Turning now to Fig. 6A, a block diagram of a key space partitioning 600A is depicted.
- skip list 140 includes a set of records 112 having respective keys 212 and accessible from buckets 124 having various bucket identifiers 302. Records 112 are also arranged in key order and have respective skip list towers 300.
- key space partitioning 600A uses two different key prefixes 610 to divide the key space: a shorter key prefix 610A and a longer key prefix 610B. As can be seen, shorter key prefix 610A results in a wider anchor range 620A with a greater number of records 112.
- range 620A has an anchor record 622A in bucket 19 as key 212 “DEBHQTT” is the lowest key for “DE” key prefix 610A.
- Longer key prefix 610B results in a narrower anchor range 620B having fewer records 112 and having an anchor record 622B as key 212 “DEJKEVM” is the lowest key for the “DEJK” key prefix 610B.
- Although anchor ranges 620A and 620B are shown as having separate anchor records 622A and 622B, it is worth noting that an anchor record 622 can participate in multiple anchor ranges 620 based on different length prefixes 610 of the record 112’s key 212.
- anchor record 622A is also the anchor record for the longer “DEBH” key prefix 610.
- scan engine 150 is performing a scan for a location 624 corresponding to the key 212 “DEJKSBC” for a record 112 being inserted into skip list 140.
- scan engine 150 initially selects the longest available prefix 610 of the particular key 212 to determine whether a relevant anchor range 620 of skip list 140 exists as this can result in fewer records 112 being scanned.
- the longer key prefix 610B of “DEJK” does have a corresponding anchor range 620B, so scan engine 150 can initiate a scan from anchor record 622B.
- In response to determining that a relevant anchor range 620 does not exist in skip list 140 for the longest prefix 610, scan engine 150 selects the next longest prefix 610 of the particular key 212. In this example, scan engine 150 is able to fall back and use shorter key prefix 610A as “DE” is still a component of the key 212. Although a scan from anchor record 622A through wider anchor range 620A results in more records 112 being scanned than narrower anchor range 620B, the number of scanned records is still far fewer than those scanned when performing slow scan 400 in many instances.
- If no relevant anchor range 620 exists for any prefix 610 of a key 212 (e.g., one beginning “DFAC”), scan engine 150 can fall back further and use slow scan 400 to identify location 624. Once this record 112 is inserted, however, it can become the anchor record 622 for both the DF prefix 610 and the DFAC prefix 610, enabling faster scans for those prefixes 610 in the future.
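- The fallback order can be sketched as follows. This is a toy index keyed by prefix strings; the two prefix lengths mirror the “DE”/“DEJK” example above and are assumptions of this sketch:

```python
from typing import Dict, Optional

# Toy index from key prefix 610 to the key of its anchor record 622.
anchor_index: Dict[str, str] = {
    "DE": "DEBHQTT",     # anchor of the wider range 620A
    "DEJK": "DEJKEVM",   # anchor of the narrower range 620B
}

PREFIX_LENGTHS = (4, 2)  # longest prefix first, then fall back

def pick_anchor(key: str) -> Optional[str]:
    # Returns the anchor for the narrowest available range; None means no
    # prefix matched and the caller falls back to a slow scan 400.
    for length in PREFIX_LENGTHS:
        anchor = anchor_index.get(key[:length])
        if anchor is not None:
            return anchor
    return None

assert pick_anchor("DEJKSBC") == "DEJKEVM"  # narrow range 620B wins
assert pick_anchor("DEBZZZZ") == "DEBHQTT"  # falls back to the "DE" range 620A
assert pick_anchor("DFACXYZ") is None       # no anchor range yet: slow scan
```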
- Although partitioning 600A uses the same key prefix lengths across the entire key space, in some embodiments, different key prefix lengths may be used for different portions of the key space as will be discussed next.
- Turning now to Fig. 6B, a block diagram of another key space partitioning 600B is depicted.
- different portions of the key space may use differing anchor plans 650 for dividing up the key space using differing numbers of prefixes 610.
- an anchor plan 650A using three different prefixes 610 is used for keys 212 having an initial letter of A-L;
- an anchor plan 650B using four different prefixes 610 is used for keys 212 having an initial letter of M-S; and
- an anchor plan 650C using two different prefixes 610 is used for keys 212 having an initial letter of T-Z.
- scan engine 150 may initially access anchor plans 650 to determine which lengths of prefixes should be used for a given key 212. For example, if a scan is being performed for a key 212 having an initial letter G, scan engine 150 may determine from anchor plan 650A that an initial prefix 610 of ten characters should be used, followed by a prefix 610 of eight characters, and then a prefix of three characters.
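- One way to picture plan selection is a lookup keyed by a key's leading letter. The prefix lengths for plan 650A restate the example above; the lengths for plans 650B and 650C, and the representation itself, are assumptions of this sketch:

```python
import bisect

# Assumed encoding of anchor plans 650: (first letter covered, prefix lengths, longest first).
ANCHOR_PLANS = [
    ("A", (10, 8, 3)),      # plan 650A for keys starting A-L (lengths from the example above)
    ("M", (12, 9, 6, 2)),   # plan 650B for keys starting M-S (lengths assumed)
    ("T", (7, 4)),          # plan 650C for keys starting T-Z (lengths assumed)
]

def plan_for(key: str):
    # Binary-search the plan whose division of the key space covers this key.
    starts = [plan[0] for plan in ANCHOR_PLANS]
    index = bisect.bisect_right(starts, key[:1]) - 1
    return ANCHOR_PLANS[max(index, 0)][1]

assert plan_for("GREEN") == (10, 8, 3)  # a key starting with G uses plan 650A
```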
- scan engine 150 may use a machine learning algorithm to determine appropriate anchor plans 650 for various divisions of the key space.
- Such alterations may not be devastating for database system 10 as scan engine 150 can initially fall back to using slow scan 400 for inserting records 112 until a sufficient number of records 112 have been inserted under a new anchor plan 650 making fast scan 500 viable again.
- Because the temporary performance loss for changing anchor plans likely affects only a small number of scans, such alterations may not be a significant problem.
- Turning now to Fig. 7A, a block diagram of an anchor record identification 700 is depicted.
- scan engine 150 maintains an index that associates key prefixes 610 to key-value records 112 in respective anchor ranges 620 having the same key prefixes and accesses the index to identify a relevant anchor record 622 from where to initiate a scan.
- this index is implemented by including anchor entries 710 in hash buckets 124.
- Each anchor entry 710 further includes a bucket identifier 312 for an anchor record 622 and its corresponding key prefix 610.
- the index may be implemented differently.
- anchor entries 710 may be located elsewhere such as being appended in record chains 110, anchor entries 710 may include more (or fewer) components, a key prefix 610 may be replaced with a hash value generated from the key prefix 610 (such as depicted in Fig. 7B), the index may be implemented using a data structure other than hash table 120, etc.
- scan engine 150 implements anchor record identification 700 by applying hash function 122 to a given prefix 610 to determine the bucket identifier 312 for the relevant hash bucket 124 (e.g., bucket 124N). If a corresponding anchor entry 710 exists in the bucket 124 (meaning that a corresponding anchor range 620 exists for that prefix 610), the included bucket identifier 312 for the anchor record 622 is read to determine the hash bucket 124 relevant to the anchor record 622. This bucket identifier 312 is then used to access anchor record 622’s hash bucket 124 (e.g., bucket 124A). Scan engine 150 then traverses the pointer in the bucket 124 to the record chain 110 including the anchor record 622.
- scan engine 150 may proceed to use the next largest key prefix 610 for the particular key 212 and continually repeat this process until a relevant anchor record 622 can be identified — or proceed to perform a slow scan 400 if no anchor entry 710 can be identified for any key prefix 610 of the key 212.
- scan engine 150 creates an anchor entry 710 for an anchor range 620 when an initial record 112 gets added to the range 620 — the initial record 112 becoming the anchor record 622.
- scan engine 150 updates anchor record bucket identifier 312 in the anchor entry 710 to point to hash buckets 124 of these new anchor records 622 (or the records 112 with the next lowest keys 212). If each record 112 for a particular anchor range 620 is purged, scan engine 150 may remove the corresponding anchor entry 710 from its hash bucket 124.
- multiple anchor entries 710 may be modified for a given key 212 if its corresponding record 112 becomes the anchor record 622 for multiple prefixes 610.
- two prefixes 610 may map to the same hash bucket 124 due to a hash collision.
- scan engine 150 may compare the prefix 610 being hashed to the key prefix 610 in an anchor entry 710 in order to confirm that the anchor entry 710 is the correct one for the prefix 610 being hashed.
- a given hash bucket 124 may include multiple anchor entries 710 if two prefixes 610 are associated with the same hash bucket 124.
- an anchor entry 710 for a colliding prefix 610 may be stored in a collision record 220 pointed to by the hash bucket 124.
- other techniques may be used to handle collisions such as proceeding to use the next longest prefix 610 for a particular key 212 being scanned.
- Turning now to Fig. 7B, a block diagram illustrating an example of anchor record identification 700 is depicted.
- scan engine 150 is again attempting to determine a relevant location 624 for the key 212 “DEJKSBC.”
- scan engine 150 may initially apply hash function 122 to longer prefix 610B “DEJK” to obtain a bucket identifier 312 “31.”
- scan engine 150 may compare a hash value of “DEJK” with a hash value 702 “739813” in order to determine that the anchor entry 710 is the correct one.
- scan engine 150 may read the bucket identifier 312 “6” and then access the corresponding bucket 124, where scan engine 150 is able to access the anchor record 622 for narrower range 620B and initiate a scan. If, however, no corresponding entry 710 existed (or the hash value 702 did not match), scan engine 150 may apply hash function 122 to the shorter prefix 610A “DE” and access the corresponding entry 710 to determine the anchor record 622 for wider anchor range 620A where a scan can be initiated as will be discussed next.
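- A compact sketch of this lookup follows; the hash function `h`, the bucket count, and the registered values are stand-ins assumed for this example (neither `h` nor the constants come from the patent):

```python
from dataclasses import dataclass, field
from typing import List, Optional

NUM_BUCKETS = 64

def h(s: str) -> int:
    # Deterministic stand-in for hash function 122 (an assumption of this sketch).
    value = 0
    for ch in s:
        value = (value * 131 + ord(ch)) & 0xFFFFFFFF
    return value

@dataclass
class AnchorEntry:          # anchor entry 710
    prefix_hash: int        # hash value 702 stored in place of the full prefix
    anchor_bucket: int      # bucket identifier 312 of the anchor record's bucket

@dataclass
class Bucket:
    anchor_entries: List[AnchorEntry] = field(default_factory=list)

buckets = [Bucket() for _ in range(NUM_BUCKETS)]

def find_anchor_bucket(prefix: str) -> Optional[int]:
    # Hash the prefix to its bucket, then compare stored prefix hashes so a
    # colliding prefix is not mistaken for the one being looked up.
    for entry in buckets[h(prefix) % NUM_BUCKETS].anchor_entries:
        if entry.prefix_hash == h(prefix):
            return entry.anchor_bucket
    return None

# Register the "DEJK" range with its anchor in bucket 6 (values echoing Fig. 7B):
buckets[h("DEJK") % NUM_BUCKETS].anchor_entries.append(AnchorEntry(h("DEJK"), 6))
assert find_anchor_bucket("DEJK") == 6
assert find_anchor_bucket("DE") is None  # miss: fall back to a shorter prefix (or slow scan)
```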
- Turning now to Fig. 8, a block diagram of “climbing the mountain” 800 is depicted.
- scan engine 150 may read the tower 300 in that record 112 to determine one or more pointers 314 for scanning forward in the anchor range 620 for a relevant location 624.
- in some instances, however, the tower 300 is not sufficiently tall to perform a skip list traversal (i.e., it does not have a pointer 314 pointing to a record 112 located after location 624).
- Although scan engine 150 could scan sequentially forward through records 112, this approach is inefficient and does not leverage the logarithmic scanning afforded by skip lists.
- Instead, scan engine 150 uses climbing the mountain 800.
- this technique begins with scan engine 150 “climbing up the mountain” — meaning that the scan engine 150 scans forward from the anchor record 622 along the highest available pointers 314 in skip list 140’s towers 300 until an overshooting tower 300 (i.e., a tower 300 with a pointer 314 that points past/overshoots the location 624) can be identified. Accordingly, in the example depicted in Fig. 8, scan engine 150 may traverse the pointer 314 “17” in the anchor record 622, the pointer 314 “29” in the next record 112, and the pointer 314 “42” in the following record 112 as these are the highest available pointers in towers 300 of these records 112.
- scan engine 150 may traverse the next highest pointer 314 “15,” traverse pointer 314 “9,” and then, after overshooting, determine location 624.
- Because a local traversal 810 also affords logarithmic scanning, the process of climbing up and climbing down the mountain is an efficient approach for determining a relevant location 624 once a relevant anchor range 620 has been identified.
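- In code, the two phases look roughly like the following sketch; the direct pointers, hand-built towers, and key values are assumptions used only to make the example runnable:

```python
from typing import List, Optional

class Node:
    def __init__(self, key: str, height: int):
        self.key = key
        self.forwards: List[Optional["Node"]] = [None] * height

def climb_the_mountain(anchor: Node, key: str) -> Node:
    node = anchor
    # Climb up: follow each record's highest pointer while it still undershoots.
    while True:
        top = node.forwards[-1]
        if top is None or top.key >= key:
            break                      # this tower overshoots (or ends): climb down
        node = top
    # Climb down: a local top-down skip-list traversal from the overshooting tower.
    for level in range(len(node.forwards) - 1, -1, -1):
        while node.forwards[level] is not None and node.forwards[level].key < key:
            node = node.forwards[level]
    return node                        # last record with a key before `key`

# Hand-built anchor range with assumed keys and tower heights:
a = Node("DEBHQTT", 2); b = Node("DEJKEVM", 1); c = Node("DEJKSAA", 3); d = Node("DEJKZZZ", 1)
a.forwards = [b, c]; b.forwards = [c]; c.forwards = [d, None, None]; d.forwards = [None]
pred = climb_the_mountain(a, "DEJKSBC")
assert pred.key == "DEJKSAA"           # insertion location 624 is right after this record
```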
- Turning now to Fig. 9A, a block diagram of “walking backwards” 900 is depicted.
- When a slow scan 400 is used for an insertion, the relevant pointers 314 for facilitating the record insertion may be determined as the scan 400 is being performed.
- Because scan engine 150 may not access all relevant pointers 314 while performing climbing the mountain 800, scan engine 150 may perform walking backwards 900 to account for this.
- scan engine 150 determines the tower 300’s height using a power-of-2 backoff — meaning that skip list tower heights are assigned logarithmically, with ½ of towers being one tall (not including the lowest, reverse level), ¼ being two tall, and so on.
- scan engine 150 may determine a tower 300’s height by generating a random number and counting the number of consecutive 1s (or 0s) from the most significant bit (or least significant bit). Once the tower height is determined, scan engine 150 may scan sequentially backwards until engine 150 has identified an overshooting pointer 314 for each level in the new tower 300. Accordingly, in the example depicted in Fig. 9A, a tower height of four forward pointers 314 is being used.
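- A sketch of the height assignment follows; the 64-bit word size and the 33-level cap are assumptions, the latter echoing the 33-level towers mentioned earlier:

```python
import random

MAX_HEIGHT = 33          # assumed cap on tower height

def tower_height(rng: random.Random = random.Random()) -> int:
    # Count consecutive 1-bits of a random word: height h occurs with
    # probability 1/2**h, the power-of-2 backoff described above.
    bits = rng.getrandbits(64)
    height = 1
    while bits & 1 and height < MAX_HEIGHT:
        height += 1
        bits >>= 1
    return height

sample = [tower_height() for _ in range(10_000)]
# Expect roughly 5000 towers of height 1, 2500 of height 2, 1250 of height 3, ...
print({h: sample.count(h) for h in range(1, 6)})
```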
- scan engine 150 may perform a record insertion as discussed next.
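- A sketch of the backward walk and splice, again with direct pointers and no sentinel; in the full system the sentinel tower guarantees a predecessor exists at every level, which this toy version simply skips:

```python
from typing import List, Optional

class Node:
    def __init__(self, key: str, height: int):
        self.key = key
        self.back: Optional["Node"] = None                    # level -1 backward pointer
        self.forwards: List[Optional["Node"]] = [None] * height

def insert_after(pred: Node, key: str, height: int) -> Node:
    # Walk backwards along the lowest level, recording for each level of the
    # new tower the nearest earlier record whose tower is tall enough to
    # carry a pointer over the insertion point.
    update: List[Optional[Node]] = [None] * height
    node: Optional[Node] = pred
    while node is not None and any(entry is None for entry in update):
        for lvl in range(min(len(node.forwards), height)):
            if update[lvl] is None:
                update[lvl] = node
        node = node.back
    new = Node(key, height)
    for lvl in range(height):
        src = update[lvl]
        if src is None:
            continue  # no predecessor this tall; a real sentinel tower would cover this
        new.forwards[lvl] = src.forwards[lvl]
        src.forwards[lvl] = new
    new.back = pred
    if new.forwards[0] is not None:
        new.forwards[0].back = new
    return new

# Toy neighborhood around the insertion point, backward pointers wired up:
a = Node("DEJKEVM", 1); b = Node("DEJKSAA", 3); c = Node("DEJKZZZ", 1)
a.forwards = [b]; b.forwards = [c, None, None]; c.forwards = [None]
b.back = a; c.back = b
new = insert_after(b, "DEJKSBC", 2)
assert b.forwards[0] is new and new.forwards[0] is c and b.forwards[1] is new
```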
- Method 1010 is one embodiment of a method performed by a computing system, such as database system 10, which may be executing scan engine 150. In some instances, performance of method 1010 may improve the performance of scanning a skip list.
- accessing the index includes applying a hash function (e.g., hash function 122) to the prefix to identify a hash bucket (e.g., hash bucket 124) including a pointer (e.g., bucket identifier 312) to the initial key-value record in the particular portion and traversing (e.g., local traversal 810) the pointer to the initial key-value record in the particular portion.
- a skip list (e.g., skip list 140) is stored that maintains an ordering of keys (e.g., keys 212) for key-value records (e.g., records 112) of a database (e.g., database 108).
- the skip list maintains the ordering of keys for key-value records of database transactions awaiting commitment by the database.
- the determining includes applying a hash function (e.g., hash function 122) to the prefix to index into a hash table (e.g., hash table 120) that includes a pointer (e.g., bucket identifier 312) to an initial key-value record (e.g., anchor record 622) in the range and initiating the scan from the initial key-value record in the range.
- the skip list is divided (e.g., key space partitioning 600) into sections (e.g., anchor ranges 620) based on prefixes (e.g., prefixes 610) of the keys.
- the dividing includes maintaining a hash table (e.g., hash table 120) that associates the prefixes with pointers (e.g., anchor record bucket identifiers 312) to the sections within the skip list.
- a location (e.g., location 624) associated with a particular key is determined, by initiating a scan for the location within a section corresponding to a prefix of the particular key.
- the determining includes scanning (e.g., climbing up in Fig. 8) forward within the section until a key-value record is identified that includes a skip list tower (e.g., overshooting tower 300 in Fig. 8) with a pointer that points past the location and performing a skip list traversal (e.g., local traversal 810) from the identified key-value record to determine the location.
- method 1060 further includes scanning backwards (e.g., walking backwards 900) along a lowest level in the skip list to identify pointers to insert in a skip list tower (e.g., new tower 300 in Fig. 9B) for a key-value record being inserted into the skip list and inserting the key-value record with the skip list tower into the skip list.
- Turning now to Fig. 11, an exemplary multi-tenant database system (MTS) 1100, which may implement functionality of database system 10, is depicted.
- MTS 1100 includes a database platform 1110, an application platform 1120, and a network interface 1130 connected to a network 1140.
- Database platform 1110 includes a data storage 1112 and a set of database servers 1114A-N that interact with data storage 1112.
- application platform 1120 includes a set of application servers 1122A-N having respective environments 1124.
- MTS 1100 is connected to various user systems 1150A-N through network 1140.
- techniques of this disclosure are implemented in non-multitenant environments such as client/server environments, cloud computing environments, clustered computers, etc.
- Database platform 1110 is a combination of hardware elements and software routines that implement database services for storing and managing data of MTS 1100, including tenant data.
- database platform 1110 includes data storage 1112.
- Data storage 1112 includes a set of storage devices (e.g., solid state drives, hard disk drives, etc.) that are connected together on a network (e.g., a storage attached network (SAN)) and configured to redundantly store data to prevent data loss.
- data storage 1112 is used to implement a database 108 comprising a collection of information that is organized in a way that allows for access, storage, and manipulation of the information.
- Data storage 1112 may implement a single database, a distributed database, a collection of distributed databases, a database with redundant online or offline backups or other redundancies, etc.
- data storage 1112 may store one or more database records 112 having respective data payloads (e.g., values for fields of a database table) and metadata (e.g., a key value, timestamp, table identifier of the table associated with the record, tenant identifier of the tenant associated with the record, etc.).
- a database record 112 may correspond to a row of a table.
- a table generally contains one or more data categories that are logically arranged as columns or fields in a viewable schema.
- each record of a table may contain an instance of data for each category defined by the fields.
- a database may include a table that describes a customer with fields for basic contact information such as name, address, phone number, fax number, etc.
- a record therefore for that table may include a value for each of the fields (e.g., a name for the name field) in the table.
- Another table might describe a purchase order, including fields for information such as customer, product, sale price, date, etc.
- standard entity tables are provided for use by all tenants, such as tables for account, contact, lead and opportunity data, each containing pre-defined fields.
- MTS 1100 may store, in the same table, database records for one or more tenants — that is, tenants may share a table.
- database records in various embodiments, include a tenant identifier that indicates the owner of a database record.
- the data stored at data storage 1112 includes buffer data structure 106 and a persistent storage organized as part of a log-structured merge-tree (LSM tree).
- a database server 1114 may initially write database records into a local in-memory buffer data structure 106 before later flushing those records to the persistent storage (e.g., in data storage 1112).
- the database server 1114 may write the database records 112 into new files that are included in a “top” level of the LSM tree.
- the database records may be rewritten by database servers 1114 into new files included in lower levels as the database records are moved down the levels of the LSM tree.
- as database records age and are moved down the LSM tree they are moved to slower and slower storage devices (e.g., from a solid state drive to a hard disk drive) of data storage 1112.
- the database transaction request may instruct database server 1114 to write one or more database records for the LSM tree — database servers 1114 maintain the LSM tree implemented on database platform 1110.
- database servers 1114 implement a relational database management system (RDMS) or object oriented database management system (OODBMS) that facilitates storage and retrieval of information against data storage 1112.
- database servers 1114 may communicate with each other to facilitate the processing of transactions.
- database server 1114A may communicate with database server 1114N to determine if database server 1114N has written a database record into its in-memory buffer for a particular key.
- Application platform 1120 in various embodiments, is a combination of hardware elements and software routines that implement and execute CRM software applications as well as provide related data, code, forms, web pages and other information to and from user systems 1150 and store related data, objects, web page content, and other tenant information via database platform 1110.
- application platform 1120 communicates with database platform 1110 to store, access, and manipulate data.
- application platform 1120 may communicate with database platform 1110 via different network connections.
- one application server 1122 may be coupled via a local area network and another application server 1122 may be coupled via a direct network link.
- Transmission Control Protocol and Internet Protocol (TCP/IP) are exemplary protocols for communicating between application platform 1120 and database platform 1110; however, it will be apparent to those skilled in the art that other transport protocols may be used depending on the network interconnect used.
- Application servers 1122 are hardware elements, software routines, or a combination thereof capable of providing services of application platform 1120, including processing requests received from tenants of MTS 1100.
- Application servers 1122 in various embodiments, can spawn environments 1124 that are usable for various purposes, such as providing functionality for developers to develop, execute, and manage applications. Data may be transferred into an environment 1124 from another environment 1124 and/or from database platform 1110. In some cases, environments 1124 cannot access data from other environments 1124 unless such data is expressly shared. In some embodiments, multiple environments 1124 can be associated with a single tenant.
- Application platform 1120 may provide user systems 1150 access to multiple, different hosted (standard and/or custom) applications, including a CRM application and/or applications developed by tenants.
- application platform 1120 may manage creation of the applications, testing of the applications, storage of the applications into database objects at data storage 1112, execution of the applications in an environment 1124 (e.g., a virtual machine of a process space), or any combination thereof.
- Because application platform 1120 may add and remove application servers 1122 from a server pool at any time for any reason, there may be no server affinity for a user and/or organization to a specific application server 1122.
- an interface system (not shown) implementing a load balancing function (e.g., an F5 Big-IP load balancer) is located between the application servers 1122 and the user systems 1150 and is configured to distribute requests to the application servers 1122.
- the load balancer uses a least connections algorithm to route user requests to the application servers 1122.
- Other examples of load balancing algorithms, such as round robin and observed response time, can also be used. For example, in certain embodiments, three consecutive requests from the same user could hit three different servers 1122, and three requests from different users could hit the same server 1122.
- MTS 1100 provides security mechanisms, such as encryption, to keep each tenant’s data separate unless the data is shared. If more than one server 1114 or 1122 is used, they may be located in close proximity to one another (e.g., in a server farm located in a single building or campus), or they may be distributed at locations remote from one another (e.g., one or more servers 1114 located in city A and one or more servers 1122 located in city B). Accordingly, MTS 1100 may include one or more logically and/or physically connected servers distributed locally or across one or more geographic locations. One or more users (e.g., via user systems 1150) may interact with MTS 1100 via network 1140.
- User system 1150 may correspond to, for example, a tenant of MTS 1100, a provider (e.g., an administrator) of MTS 1100, or a third party. Each user system 1150 may be a desktop personal computer, workstation, laptop, PDA, cell phone, or any Wireless Access Protocol (WAP) enabled device or any other computing device capable of interfacing directly or indirectly to the Internet or other network connection. User system 1150 may include dedicated hardware configured to interface with MTS 1100 over network 1140.
- User system 1150 may execute a graphical user interface (GUI) corresponding to MTS 1100, an HTTP client (e.g., a browsing program, such as Microsoft’s Internet Explorer™ browser, Netscape’s Navigator™ browser, Opera’s browser, or a WAP-enabled browser in the case of a cell phone, PDA or other wireless device, or the like), or both, allowing a user (e.g., subscriber of a CRM system) of user system 1150 to access, process, and view information and pages available to it from MTS 1100 over network 1140.
- Each user system 1150 may include one or more user interface devices, such as a keyboard, a mouse, touch screen, pen or the like, for interacting with a graphical user interface (GUI) provided by the browser on a display monitor screen, LCD display, etc. in conjunction with pages, forms and other information provided by MTS 1100 or other systems or servers.
- disclosed embodiments are suitable for use with the Internet, which refers to a specific global internetwork of networks. It should be understood, however, that other networks may be used instead of the Internet, such as an intranet, an extranet, a virtual private network (VPN), a non-TCP/IP based network, any LAN or WAN or the like.
- the capacity of a particular user system 1150 might be determined by one or more permission levels associated with the current user. For example, when a salesperson is using a particular user system 1150 to interact with MTS 1100, that user system 1150 may have capacities (e.g., user privileges) allotted to that salesperson. But when an administrator is using the same user system 1150 to interact with MTS 1100, the user system 1150 may have capacities (e.g., administrative privileges) allotted to that administrator.
- MTS 1100 (and additional instances of MTSs, where more than one is present) and their components are operator configurable using application(s) that include computer code executable on processing elements.
- various operations described herein may be performed by executing program instructions stored on a non-transitory computer- readable medium and executed by processing elements.
- the program instructions may be stored on a non-volatile medium such as a hard disk, or may be stored in any other volatile or non-volatile memory medium or device as is well known, such as a ROM or RAM, or provided on any media capable of storing program code, such as a compact disk (CD) medium, digital versatile disk (DVD) medium, a floppy disk, and the like.
- program code may be transmitted and downloaded from a software source, e.g., over the Internet, or from another server, as is well known, or transmitted over any other conventional network connection as is well known (e.g., extranet, VPN, LAN, etc.) using any communication medium and protocols (e.g., TCP/IP, HTTP, HTTPS, Ethernet, etc.) as are well known.
- computer code for implementing aspects of the disclosed embodiments can be implemented in any programming language that can be executed on a server or server system such as, for example, in C, C++, HTML, Java, JavaScript, or any other scripting language, such as VBScript.
- User systems 1150 may communicate with MTS 1100 using TCP/IP and, at a higher network level, use other common Internet protocols to communicate, such as HTTP, FTP, AFS, WAP, etc.
- user system 1150 might include an HTTP client commonly referred to as a “browser” for sending and receiving HTTP messages from an HTTP server at MTS 1100.
- HTTP server might be implemented as the sole network interface between MTS 1100 and network 1140, but other techniques might be used as well or instead.
- the interface between MTS 1100 and network 1140 includes load sharing functionality, such as round-robin HTTP request distributors to balance loads and distribute incoming HTTP requests evenly over a plurality of servers.
- Computer system 1200 includes a processor subsystem 1280 that is coupled to a system memory 1220 and I/O interface(s) 1240 via an interconnect 1260 (e.g., a system bus). I/O interface(s) 1240 is coupled to one or more I/O devices 1250.
- I/O interfaces 1240 may be any of various types of interfaces configured to couple to and communicate with other devices, according to various embodiments.
- I/O interface 1240 is a bridge chip (e.g., Southbridge) from a front-side bus to one or more back-side buses.
- I/O interfaces 1240 may be coupled to one or more I/O devices 1250 via one or more corresponding buses or other interfaces.
- I/O devices 1250 include storage devices (hard drive, optical drive, removable flash drive, storage array, SAN, or their associated controller), network interface devices (e.g., to a local or wide-area network), or other devices (e.g., graphics, user interface devices, etc.).
- computer system 1200 is coupled to a network via a network interface device 1250 (e.g., configured to communicate over WiFi, Bluetooth, Ethernet, etc.).
- This disclosure may discuss potential advantages that may arise from the disclosed embodiments. Not all implementations of these embodiments will necessarily manifest any or all of the potential advantages. Whether an advantage is realized for a particular implementation depends on many factors, some of which are outside the scope of this disclosure. In fact, there are a number of reasons why an implementation that falls within the scope of the claims might not exhibit some or all of any disclosed advantages. For example, a particular implementation might include other circuitry outside the scope of the disclosure that, in conjunction with one of the disclosed embodiments, negates or diminishes one or more of the disclosed advantages. Furthermore, suboptimal design execution of a particular implementation (e.g., implementation techniques or tools) could also negate or diminish disclosed advantages.
- references to a singular form of an item (i.e., a noun or noun phrase preceded by “a,” “an,” or “the”) are, unless context clearly dictates otherwise, intended to mean “one or more.” Reference to “an item” in a claim thus does not, without accompanying context, preclude additional instances of the item.
- a “plurality” of items refers to a set of two or more of the items.
- the phrase “at least one of w, x, y, and z” thus refers to at least one element of the set [w, x, y, z], thereby covering all possible combinations in this list of elements. This phrase is not to be interpreted to require that there is at least one instance of w, at least one instance of x, at least one instance of y, and at least one instance of z.
- labels may precede nouns or noun phrases in this disclosure. Unless context provides otherwise, different labels used for a feature (e.g., “first circuit,” “second circuit,” “particular circuit,” “given circuit,” etc.) refer to different instances of the feature. Additionally, the labels “first,” “second,” and “third” when applied to a feature do not imply any type of ordering (e.g., spatial, temporal, logical, etc.), unless stated otherwise.
- a determination may be solely based on specified factors or based on the specified factors as well as other, unspecified factors.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2024543478A JP2025503111A (en) | 2022-01-24 | 2023-01-10 | Fast skip list scan and insertion |
| CN202380020415.2A CN118647987A (en) | 2022-01-24 | 2023-01-10 | Fast skip table scan and insertion |
| EP23705158.6A EP4469914A1 (en) | 2022-01-24 | 2023-01-10 | Fast skip-list scan and insert |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202263267089P | 2022-01-24 | 2022-01-24 | |
| US63/267,089 | 2022-01-24 | ||
| US17/938,078 | 2022-10-05 | ||
| US17/938,078 US20230237035A1 (en) | 2022-01-24 | 2022-10-05 | Fast Skip-List Scan and Insert |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023141378A1 true WO2023141378A1 (en) | 2023-07-27 |
Family
ID=85227029
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/060371 Ceased WO2023141378A1 (en) | 2022-01-24 | 2023-01-10 | Fast skip-list scan and insert |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2023141378A1 (en) |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020118682A1 (en) * | 2000-12-22 | 2002-08-29 | Myongsu Choe | Apparatus and method for performing high-speed IP route lookup and managing routing/forwarding tables |
- 2023-01-10: WO PCT/US2023/060371 patent/WO2023141378A1/en not_active Ceased
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020118682A1 (en) * | 2000-12-22 | 2002-08-29 | Myongsu Choe | Apparatus and method for performing high-speed IP route lookup and managing routing/forwarding tables |
Non-Patent Citations (2)
| Title |
|---|
| CHRISTOS CHRYSAFIS ET AL: "FoundationDB Record Layer: A Multi-Tenant Structured Datastore", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 14 January 2019 (2019-01-14), XP081124963, DOI: 10.1145/3299869.3314039 * |
| TIAN PAN ET AL: "Fast Content Store Lookup Using Locality-Aware Skip List in Content-Centric Networks", 2016 IEEE CONFERENCE ON COMPUTER COMMUNICATIONS WORKSHOPS (INFOCOM WKSHPS), IEEE, 10 April 2016 (2016-04-10), pages 187 - 192, XP032957285, DOI: 10.1109/INFCOMW.2016.7562069 * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12001409B2 (en) | Merges using key range data structures | |
| US12229119B2 (en) | Multiple index scans | |
| US8214411B2 (en) | Atomic deletion of database data categories | |
| US20110145209A1 (en) | Atomic deletion of database data categories | |
| US11709814B2 (en) | Building of tries over sorted keys | |
| US12282470B2 (en) | Skip-list checkpoint creation | |
| US11347713B2 (en) | Version-based table locking | |
| US12019610B2 (en) | Mechanisms for truncating tenant data | |
| US20180060362A1 (en) | Method and system for implementing distributed lobs | |
| US12086142B2 (en) | Database virtual partitioning | |
| US12164494B2 (en) | Key permission distribution | |
| US11892992B2 (en) | Unique identification management | |
| US20240320204A1 (en) | Index for multi-level data structures | |
| US20060122963A1 (en) | System and method for performing a data uniqueness check in a sorted data set | |
| US20230237035A1 (en) | Fast Skip-List Scan and Insert | |
| US11940963B2 (en) | Mechanisms for handling schema changes | |
| WO2023141378A1 (en) | Fast skip-list scan and insert | |
| US11625386B2 (en) | Fast skip list purge | |
| US12386832B1 (en) | Synthetic data generation for query plans | |
| US20250245273A1 (en) | Similarity information in skip lists | |
| US12487979B2 (en) | Subrange truncation of records |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23705158; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024543478; Country of ref document: JP |
| | WWE | Wipo information: entry into national phase | Ref document number: 202380020415.2; Country of ref document: CN |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2023705158; Country of ref document: EP; Effective date: 20240826 |