
WO2025002580A1 - Database controller for continuous data protection in a distributed database system and method thereof - Google Patents


Info

Publication number
WO2025002580A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
cdp
database controller
nodes
dataset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2023/068077
Other languages
French (fr)
Inventor
Bar David
Assaf Natanzon
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to PCT/EP2023/068077
Publication of WO2025002580A1


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G06F 11/1458 Management of the backup or restore process
    • G06F 11/1464 Management of the backup or restore process for networked environments
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F 2201/80 Database-specific techniques


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A database controller configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system (100). The distributed database system includes one or more nodes and a CDP node. Each node is configured to store one or more of the datasets (ki, vi). The database controller is further configured to download each of the datasets in the CDP set from the one or more nodes to the CDP node by determining, for each dataset, if the dataset is stored in more than one node and, if so, determining a preferred node. The database controller is further configured to download the dataset from the preferred node to the CDP node.

Description

DATABASE CONTROLLER FOR CONTINUOUS DATA PROTECTION IN A DISTRIBUTED DATABASE SYSTEM AND METHOD THEREOF
TECHNICAL FIELD
The disclosure generally relates to continuous data protection, CDP, and more particularly, the disclosure relates to a database controller configured to be utilized for the CDP in a distributed database system. The disclosure also relates to a method for a database controller configured to be utilized for the CDP in the distributed database system.
BACKGROUND
Databases commonly rely on backup services for various purposes, including disaster recovery and providing convenient solutions for development and testing environments. The backup process for such databases can be done in a straightforward manner by tracking and replicating written data. The demand for scalability, redundancy, and advanced features has given rise to distributed database systems such as Cassandra, MongoDB, and the like. The distributed database systems provide various interfaces, including key-value store, object store, Structured Query Language, SQL, and the like, to interact with the written data. Additionally, distributed database systems often offer dynamic scaling capabilities, allowing users to increase storage capacity or adjust the number of participating nodes as required.
The distributed database systems typically distribute entries, known as shards, among participating nodes to optimize resource utilization and provide consistent, linearly scaling performance to users. For instance, Cassandra, a NoSQL distributed database, achieves horizontal scalability through keyspace partitioning and data distribution across the participating nodes. Cassandra routes user data to neighbouring nodes based on the user's replication factor settings. The replication factor is a user-defined number of copies of user data, with each copy residing on a different node. For example, by setting a replication factor of 3, Cassandra ensures that the data is stored on three different nodes, protecting against the simultaneous failure of up to two nodes without any data loss. Additionally, Cassandra offers support for a quorum, which defines the number of participating nodes, based on the replication factor settings, that need to acknowledge an input/output, I/O, operation before confirming it to the user. The quorum, along with the replication factor, offers users flexibility in terms of fault tolerance and reliability. For example, if a user desires a highly durable database, they can set the replication factor and quorum to the maximum, ensuring that all data is copied to every node and that user I/O is acknowledged only after all nodes process the I/O operations, thereby impacting the performance of the database.
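To make the replication factor and quorum trade-off concrete, the short Python sketch below computes a majority quorum (floor(RF/2) + 1), which is how Cassandra-style systems typically size a quorum; the helper functions are illustrative and not part of any Cassandra API.

```python
# Illustrative sketch (not Cassandra code): how replication factor (RF)
# and quorum interact in a Cassandra-style distributed database.

def quorum_size(replication_factor: int) -> int:
    """Quorum is a strict majority of the replicas: floor(RF/2) + 1."""
    return replication_factor // 2 + 1

def max_failures_without_data_loss(replication_factor: int) -> int:
    """Data survives as long as at least one replica remains."""
    return replication_factor - 1

def max_failures_with_quorum_available(replication_factor: int) -> int:
    """Quorum reads/writes stay available while a majority of replicas is up."""
    return replication_factor - quorum_size(replication_factor)

for rf in (1, 3, 5):
    print(f"RF={rf}: quorum={quorum_size(rf)}, "
          f"survives {max_failures_without_data_loss(rf)} failures without data loss, "
          f"quorum operations tolerate {max_failures_with_quorum_available(rf)} nodes down")
# RF=3: quorum=2, survives 2 failures without data loss, quorum operations tolerate 1 node down
```

This matches the example in the text: with a replication factor of 3, the data survives the failure of up to two nodes, while quorum operations remain available with one node down.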
The distributed database systems bring forth a set of challenges in terms of backup and restoration, especially when global consistency is a desired requirement and dynamic scaling is involved. Achieving Continuous Data Protection, CDP, for distributed database systems poses these challenges. The CDP is a method for creating a copy of data with the capability to restore the data to any previous point in time. The distributed nature of these databases makes it challenging to derive a complete view from the individual parts, or shards. Naive approaches for achieving the CDP often compromise the user experience by freezing the I/O operations on the nodes involved in the CDP in order to create a globally consistent dataset.
In a distributed database system, performance and storage capacity can fluctuate due to the dynamic addition and removal of nodes. The CDP should adapt dynamically to these fluctuations to prevent resource waste or a fallback to a Snapshot Data Protection, SDP, scenario. The SDP is a method of protecting volumes of data using snapshots, typically in an incremental manner.
Additionally, factors like cloud regions, network proximity, resource consumption, and the cost of data transfer should also be considered when implementing the CDP. Replicating data to a local node and storage within the same region or network proximity is more efficient than replicating the data over a Wide Area Network, WAN. However, local replication may not always be feasible, and scaling the CDP nodes linearly with source nodes presents challenges. The source nodes refer to the nodes from which the data is backed up or replicated to other nodes.
An existing method for generating incremental snapshots in distributed database systems involves copying a directory containing a database journal on a file system, or using database-specific snapshot tools such as those provided by Cassandra and MongoDB. However, the distributed nature of the backup introduces challenges. For example, if data-dependent transactions T1 and T2 occur, with T2 happening after T1 and each being backed up from a different source node, there is a possibility that the backup contains T2 but not T1. The source nodes are responsible for providing the data that needs to be backed up. Such inconsistency in the distributed backup poses a risk when restoring from such backups, potentially leading to failures in user applications. Another existing method generates a globally consistent snapshot by performing an I/O freeze on the nodes and draining outstanding I/O operations. While this method ensures global consistency, the resulting backup may not accurately reflect the point in time at which the data was created, and the performance of ongoing operations is disrupted by the freeze and drain operations necessary to achieve global consistency.
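To make the hazard concrete, the following minimal Python simulation reproduces the T1/T2 scenario; the node names, timestamps, and journal layout are assumptions for illustration only.

```python
# Illustrative sketch of the T1/T2 backup anomaly: two data-dependent
# transactions land on different source nodes, and per-node backups taken
# independently can capture the later transaction but not the earlier one.

# Per-node journals: node -> list of (timestamp, transaction_id)
journals = {
    "node_a": [(1, "T1")],   # T1 committed first, stored on node_a
    "node_b": [(2, "T2")],   # T2 depends on T1, stored on node_b
}

# Each node is backed up at a different moment (no global coordination).
backup_cutoff = {"node_a": 0, "node_b": 2}   # node_a was backed up before T1 arrived

backup = {
    node: [tx for ts, tx in entries if ts <= backup_cutoff[node]]
    for node, entries in journals.items()
}
print(backup)  # {'node_a': [], 'node_b': ['T2']} -> the backup contains T2 but not T1
```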
Therefore, there arises a need to address the aforementioned technical problems and drawbacks of utilizing the CDP in the distributed database system.
SUMMARY
It is an object of the disclosure to provide a database controller configured to be utilized for Continuous Data Protection, CDP in a distributed database system and a method for a database controller configured to be utilized for the CDP in a distributed database system while avoiding one or more disadvantages of prior art approaches.
This object is achieved by the features of the independent claims. Further, implementation forms are apparent from the dependent claims, the description, and the figures.
According to a first aspect, there is provided a database controller configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system. The distributed database system includes one or more nodes and a CDP node. Each node is configured to store one or more of the datasets (ki, vi). The database controller is further configured to download each of the datasets in the CDP set from the one or more nodes to the CDP node. The database controller is further configured to download each of the datasets in the CDP set by determining, for each dataset, if the dataset is stored in more than one node and, if so, determining a preferred node. The database controller is further configured to download the dataset from the preferred node to the CDP node.
The database controller is configured to provide an efficient and reliable CDP in the distributed database system. The database controller minimizes data transmission requirements by fetching data from preferred nodes. By retrieving only the necessary and relevant data, the database controller reduces network traffic and improves the overall efficiency of the distributed database system. The database controller ensures global consistency for a point in time, PIT, by considering datasets from all source nodes in the distributed database system. In the distributed database system, data trickles asynchronously to the one or more nodes based on the replication factor settings. The replication factor may determine the number of copies, such as backups, stored in the distributed database system for each dataset. The database controller obtains changes from the preferred node by leveraging the asynchronous trickling of data, thereby enhancing the efficiency of the distributed database system.
Optionally, the database controller is further configured to receive a list from a source node of the one or more nodes. The list includes the CDP set, indicating which datasets have been changed and which nodes each dataset is stored on. The database controller is further configured to determine, based on the list, if the dataset is stored in more than one node. The distributed database system allows the addition or removal of source nodes without disrupting the global consistency condition; when a source node is removed, the global consistency among the remaining nodes is re-evaluated. The database controller identifies and selects underutilized nodes as the preferred nodes, ensuring optimal data retrieval by choosing the most suitable nodes, thereby achieving enhanced utilization and improved performance of the distributed database system.
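As an illustration, the sketch below shows one way a controller could consume such a list; the data shapes and function names are hypothetical rather than taken from the disclosure.

```python
# Hypothetical sketch of the controller's download loop: the source node
# reports which datasets changed and which nodes each one is stored on.

# CDP set: changed key -> list of nodes currently storing (key, value)
cdp_set = {
    "k1": ["node_a"],                 # single replica: fetch from the source node
    "k2": ["node_a", "node_b"],       # replicated: pick a preferred node
    "k3": ["node_a", "node_c"],
}

def choose_preferred_node(nodes: list[str]) -> str:
    """Placeholder policy; see the criteria-scoring sketch further below."""
    return nodes[0]

def download(key: str, node: str) -> None:
    print(f"downloading ({key}, v) from {node} to the CDP node")

for key, storing_nodes in cdp_set.items():
    if len(storing_nodes) > 1:
        download(key, choose_preferred_node(storing_nodes))
    else:
        download(key, storing_nodes[0])  # dataset only stored on the source node
```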
Optionally, the database controller is further configured to determine that the dataset is only stored in the source node and, if so, to download the dataset from the source node to the CDP node.
Optionally, the database controller is further configured to determine that a node is a preferred node by determining that the node is underutilized.
Optionally, the database controller is further configured to determine that the node is underutilized by determining that a processor and/or a memory of the node is used to a degree falling under a utilization level.
Optionally, the database controller is further configured to determine that the node is underutilized by determining that the processor and/or the memory of the node is used to a lesser degree than in the other nodes storing the same dataset.
Optionally, the database controller is further configured to determine that the node is the preferred node by determining that the node has more processing power than the other nodes storing the same dataset.
Optionally, the database controller is further configured to determine that the node is the preferred node by determining that the node has more free memory than the other nodes storing the same dataset.
Optionally, the database controller is further configured to determine that the node is the preferred node by determining that the node is a close neighbour to the CDP node.
Optionally, the database controller is further configured to determine that the node is a close neighbour to the CDP node by determining that the node has fewer network hops to the CDP node than the other nodes storing the same dataset.
Optionally, the database controller is further configured to determine that the node is the preferred node by determining that the node is in a same region as the CDP node.
Optionally, the same region is a same network region as the CDP node.
Optionally, the same region is a same geographic region as the CDP node.
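The optional criteria above (utilization, processing power, free memory, network hops, and region) can be combined in many ways; the Python sketch below scores candidate nodes with illustrative weights, which are assumptions rather than values from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    name: str
    cpu_utilization: float   # 0.0 .. 1.0
    free_memory_gb: float
    hops_to_cdp_node: int
    region: str

def preference_score(node: NodeStats, cdp_region: str) -> float:
    """Higher is better. The weights below are illustrative assumptions."""
    score = 0.0
    score += (1.0 - node.cpu_utilization) * 3.0   # favour underutilized nodes
    score += node.free_memory_gb * 0.1            # favour nodes with free memory
    score -= node.hops_to_cdp_node * 1.0          # favour close neighbours
    if node.region == cdp_region:
        score += 5.0                              # favour the same region (cheaper transfer)
    return score

def choose_preferred_node(candidates: list[NodeStats], cdp_region: str) -> NodeStats:
    return max(candidates, key=lambda n: preference_score(n, cdp_region))

candidates = [
    NodeStats("node_a", cpu_utilization=0.7, free_memory_gb=4, hops_to_cdp_node=6, region="region-1"),
    NodeStats("node_b", cpu_utilization=0.5, free_memory_gb=8, hops_to_cdp_node=1, region="region-2"),
]
print(choose_preferred_node(candidates, cdp_region="region-2").name)  # node_b
```

A user-defined criterion such as per-node transfer cost (discussed later in the detailed description) could be folded into the same score as an additional weighted term.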
According to a second aspect, there is provided a method for a database controller configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system. The distributed database system includes one or more nodes and a CDP node. Each node is configured to store one or more of the datasets (ki, vi). The method includes downloading each of the datasets in the CDP set from the one or more nodes to the CDP node. The method includes determining, for each dataset, if the dataset is stored in more than one node and, if so, determining a preferred node. The method includes downloading the dataset from the preferred node to the CDP node.
This method provides an efficient and reliable CDP in the distributed database system. This method minimizes data transmission requirements by fetching data from preferred nodes. By retrieving only the necessary and relevant data, this method reduces network traffic and improves the overall efficiency of the distributed database system. This method ensures global consistency for a point in time, PIT, by considering datasets from all source nodes in the distributed database system. In the distributed database system, the data trickles asynchronously to the one or more nodes based on the replication factor settings. The replication factor may determine the number of copies, such as backups, stored in the distributed database system for each dataset. This method obtains changes from the preferred node by leveraging the asynchronous trickling of data, thereby enhancing the efficiency of the distributed database system.
According to a third aspect, a computer program product includes program instructions for performing the method when executed by one or more processors in a database controller system.
Therefore, in contradistinction to the existing solutions, the database controller is configured to provide Continuous Data Protection, CDP, for the distributed database system. The database controller creates datasets on each of the source nodes individually and efficiently transmits them to a varying number of destination nodes.
These and other aspects of the disclosure will be apparent from the implementation(s) described below.
BRIEF DESCRIPTION OF DRAWINGS
Implementations of the disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram that illustrates a database controller configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system in accordance with an implementation of the disclosure;
FIG. 2 illustrates an exemplary implementation of a dataset assembly with different nodes configured with a database controller in accordance with an implementation of the disclosure;
FIG. 3 is a flow diagram that illustrates a method for a database controller configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system in accordance with an implementation of the disclosure; and
FIG. 4 is an illustration of a computer system (e.g., a database controller) in which the various architectures and functionalities of the various previous implementations may be implemented.
DETAILED DESCRIPTION OF THE DRAWINGS
Implementations of the disclosure provide a database controller configured to be utilized for Continuous Data Protection, CDP in a distributed database system, a method for a database controller configured to be utilized for CDP in the distributed database system, and a computer program product including program instructions for performing the method.
To make solutions of the disclosure more comprehensible for a person skilled in the art, the following implementations of the disclosure are described with reference to the accompanying drawings.
Terms such as "a first", "a second", "a third", and "a fourth" (if any) in the summary, claims, and foregoing accompanying drawings of the disclosure are used to distinguish between similar objects and are not necessarily used to describe a specific sequence or order. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the implementations of the disclosure described herein are, for example, capable of being implemented in sequences other than the sequences illustrated or described herein. Furthermore, the terms "include" and "have" and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device that includes a series of steps or units, is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
Definitions:
Distributed Database is a database, such as Cassandra and MongoDB, which uses multiple nodes for facilitating data sharding, load balancing, high availability, and other essential features.
Continuous Data Protection, CDP, is a method for creating a copy of data with a capability of restoring the data to any previous Point in Time, PIT.
Replication factor is a user-defined parameter specifying a number of copies of user data across different nodes.
Recovery Point Objective, RPO, is the maximum length of time since the last data restoration point.
Network region is a geographical proximity between the different nodes communicating in a network. The network region may imply the cost of data transfer when referring to cloud providers, such as Amazon Web Services, AWS: nodes in the same region can exchange data at lower prices than nodes in different regions.
FIG. 1 is a block diagram that illustrates a database controller 106 configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system 100 in accordance with an implementation of the disclosure. The distributed database system 100 includes one or more nodes 102A-N and a continuous data protection, CDP, node 104. Each of the one or more nodes 102A-N is configured to store one or more of the datasets (ki, vi). The database controller 106 is configured to download each of the datasets in the CDP set from the one or more nodes 102A-N to the CDP node 104 by determining, for each dataset, if the dataset is stored in more than one node and, if so, determining a preferred node. The database controller 106 is further configured to download the dataset from the preferred node to the CDP node 104.
The database controller 106 is configured to provide an efficient and reliable CDP in the distributed database system 100. The database controller 106 minimizes data transmission requirements by fetching data from preferred nodes. By retrieving only the necessary and relevant data, the database controller 106 reduces network traffic and improves the overall efficiency of the distributed database system 100. The database controller 106 ensures global consistency for a point in time, PIT, by considering datasets from all source nodes. In the distributed database system 100, the data trickles asynchronously to the one or more nodes 102A-N based on the replication factor settings. The replication factor may determine the number of copies of the datasets, such as backups, stored in the distributed database system 100 for each dataset. The database controller 106 obtains changes from the preferred node by leveraging the asynchronous trickling of data in the distributed database system 100, thereby enhancing the efficiency of the distributed database system 100.
Optionally, the database controller 106 is further configured to receive a list from a source node of the one or more nodes 102A-N, where the list includes the CDP set, indicating which datasets have been changed and which nodes each dataset is stored on. The database controller 106 is further configured to determine, based on the list, if the dataset is stored in more than one node. The distributed database system 100 allows the addition or removal of source nodes without disrupting the global consistency condition; when a source node is removed, the global consistency among the remaining nodes is re-evaluated. The database controller 106 identifies and selects underutilized nodes as the preferred nodes, ensuring optimal data retrieval by choosing the most suitable nodes, thereby achieving enhanced utilization and improved performance of the distributed database system 100.
Optionally, the database controller 106 is further configured to determine that the dataset is only stored in the source node and, if so, to download the dataset from the source node to the CDP node 104.
Optionally, the database controller 106 is further configured to determine that a node is a preferred node by determining that the node is underutilized.
Optionally, the database controller 106 is further configured to determine that the node is underutilized by determining that a processor and/or a memory of the node is used to a degree falling under a utilization level. The utilization level may be a threshold level, such as 50%, 80%, or 90%.
Optionally, the database controller 106 is further configured to determine that the node is underutilized by determining that the processor and/or the memory of the node is used to a lesser degree than in the other nodes storing the same dataset.
Optionally, the database controller 106 is further configured to determine that the node is the preferred node by determining that the node has more processing power than the other nodes storing the same dataset.
Optionally, the database controller 106 is further configured to determine that the node is the preferred node by determining that the node has more free memory than the other nodes storing the same dataset.
Optionally, the database controller 106 is further configured to determine that the node is the preferred node by determining that the node is a close neighbour to the CDP node 104.
Optionally, the database controller 106 is further configured to determine that the node is the close neighbour to the CDP node 104 by determining that the node has fewer network hops to the CDP node 104 than the other nodes storing the same dataset.
Optionally, the database controller 106 is further configured to determine that the node is the preferred node by determining that the node is in a same region as the CDP node 104.
Optionally, the same region is a same network region as the CDP node 104.
Optionally, the same region is a same geographic region as the CDP node 104.
Optionally, the database controller 106 is further configured to determine that the node is the preferred node based on user-defined criteria. The user-defined criteria may be a cost of operation per node, since the price of each byte transferred differs from node to node. The user-defined criteria may also reflect a user who wants to reduce traffic on a specific node because that node is used for complex computation or runs other applications.
FIG. 2 illustrates an exemplary implementation of a dataset assembly with one or more nodes 202A-C configured with a database controller 206 in accordance with an implementation of the disclosure. In the exemplary implementation of the dataset assembly, the one or more nodes 202A-C reside in different regions, including region 1 208A, region 2 208B, and region 3 208C. The one or more nodes 202A-C include a node A 202A, a node B 202B, a node C 202C, and a Continuous Data Protection, CDP, node 204. The node A 202A may reside in the region 1 208A, the node B 202B and the CDP node 204 may reside in the region 2 208B, and the node C 202C may reside in the region 3 208C.
The node A 202A may be a source node of the node B 202B and the node C 202C. Optionally, the node A 202A stores the datasets {(k1, v1), (k2, v2), (k3, v3)}, the node B 202B stores the dataset {(k2, v2)}, and the node C 202C stores the dataset {(k3, v3)}. The node A 202A may create the datasets from changes to the keys {k1, k2, k3}, where each key is a unique identifier that is used to identify and retrieve specific data within the dataset. The changes to the keys {k1, k2, k3} may be updates or modifications to the values of the keys. The changes may be replicated to the CDP node 204. The database controller 206 is configured to determine the node B 202B as the preferred node, as the node B 202B and the CDP node 204 reside in the same region, that is, region 2 208B.
The database controller 206 is configured to download the dataset {(k2, v2)} into a CDP set. The CDP set may be a dataset stored in the CDP node 204. The database controller 206 is configured to download the dataset {(k2, v2)} from the preferred node to the CDP node 204. The node C 202C may be an underutilized node. Thereby, the database controller 206 is configured to determine the node C 202C as the preferred node for the dataset {(k3, v3)}.
The database controller 206 is configured to download the dataset {(k3, v3)} into the CDP set. The database controller 206 is configured to download the dataset {(k3, v3)} from the preferred node to the CDP node 204.
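By way of illustration only, the FIG. 2 example may be traced in the following sketch. The replica placement and regions mirror the description above; the selection rule (a replica in the CDP node's region first, otherwise an underutilized replica) is a simplification assumed for the example.

```python
replicas = {"k2": ["nodeA", "nodeB"], "k3": ["nodeA", "nodeC"]}  # nodes holding each changed key
regions = {"nodeA": "region1", "nodeB": "region2", "nodeC": "region3"}
underutilized = {"nodeC"}
CDP_REGION = "region2"  # the CDP node 204 resides here

def pick(nodes):
    same_region = [n for n in nodes if regions[n] == CDP_REGION]
    if same_region:
        return same_region[0]                # e.g. node B for (k2, v2)
    idle = [n for n in nodes if n in underutilized]
    return idle[0] if idle else nodes[0]     # e.g. node C for (k3, v3)

for key, nodes in replicas.items():
    print(key, "->", pick(nodes))            # k2 -> nodeB, k3 -> nodeC
```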
FIG. 3 is a flow diagram that illustrates a method for a database controller configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system in accordance with an implementation of the disclosure. The distributed database system includes one or more nodes and a CDP node. Each node is configured to store one or more datasets (ki, vi) out of the one or more datasets (ki, vi). At a step 302, each of the datasets in the CDP set is downloaded from the one or more nodes to the CDP node by (i) determining, for each dataset, whether the dataset is stored in more than one node, and if so (ii) determining a preferred node. At a step 304, the dataset is downloaded from the preferred node to the CDP node.
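By way of illustration only, steps 302 and 304 may be sketched as the following loop. The determine_preferred and download stubs stand in for the selection criteria and transfer mechanism described above and are assumptions for the example.

```python
def determine_preferred(nodes):
    """Stand-in for the preferred-node criteria discussed above."""
    return nodes[0]

def download(key, node, cdp_node):
    """Stand-in for the actual transfer into the CDP node."""
    print(f"fetch {key} from {node} into {cdp_node}")

def protect(cdp_set, replica_map, source_node, cdp_node):
    for key in cdp_set:
        nodes = replica_map.get(key, [source_node])
        if len(nodes) > 1:                     # step 302: stored in more than one node?
            node = determine_preferred(nodes)  # step 302: determine the preferred node
        else:
            node = nodes[0]                    # only one node holds the dataset
        download(key, node, cdp_node)          # step 304: download to the CDP node
```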
This method provides efficient and reliable CDP in the distributed database system. This method minimizes data transmission requirements by fetching data from preferred nodes. By retrieving only the necessary and relevant data, this method reduces network traffic and improves the overall efficiency of the distributed database system. This method ensures global consistency for a point-in-time, PIT, by considering datasets from all source nodes in the distributed database system. In the distributed database system, the data trickles asynchronously to the one or more nodes based on the replication factor setting. The replication factor may determine a number of copies, such as backups, stored in the distributed database system for each dataset. This method obtains changes from the preferred node by leveraging the asynchronous trickling of data, thereby enhancing the efficiency of the distributed database system.
The algorithm for creating the CDP set involves several steps. Firstly, the source node captures changes in the keys within the CDP set, monitoring the keys individually and periodically through interception of Input/Output commands. Once the changes are detected, the source node transmits the keys along with a list that specifies the CDP set, indicating the datasets that have been changed and the nodes where each dataset is stored. Upon receiving the keys, the target node calculates the preferred node for data retrieval. Optionally, each key is associated with a list of replicated nodes. When the target node initiates replication, the target node traverses the list of replicated nodes to determine the preferred node to fetch the data corresponding to the key. The target node obtains information on how the nodes operate based on CDP settings, by transmitting a ping to check response times, or by relying on preconfigured data. Optionally, the target node can interact with an overall controlling entity. Once the preferred nodes are identified, the target node fetches data from the preferred nodes. If any data remains after fetching from the preferred nodes, the target node retrieves the leftover data from the source node, which ensures that the CDP set is complete.
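By way of illustration only, the source and target sides of this algorithm may be sketched as follows. The message shape (changed keys plus a key-to-replica-nodes map) follows the text above; the static latency table standing in for ping-based probing and the always-succeeding fetch stub are assumptions for the example.

```python
def fetch(key, node):
    """Stand-in for the actual network read; always succeeds in this sketch."""
    return f"value-of-{key}@{node}"

def source_send(changed_keys, replica_map):
    """Source node: ship the changed keys with the list of nodes holding each."""
    return {"keys": list(changed_keys),
            "replicas": {k: replica_map[k] for k in changed_keys}}

def target_receive(msg, latency_ms, source_node):
    """Target node: fetch each key from its preferred replica, then fall back
    to the source node for any leftover data so the CDP set is complete."""
    cdp_set = {}
    for key in msg["keys"]:
        nodes = msg["replicas"].get(key, [])
        preferred = min(nodes, key=lambda n: latency_ms.get(n, float("inf")),
                        default=source_node)
        value = fetch(key, preferred)
        if value is None:                      # leftover: retrieve from the source
            value = fetch(key, source_node)
        cdp_set[key] = value
    return cdp_set

msg = source_send(["k2", "k3"], {"k2": ["nodeB"], "k3": ["nodeC"]})
print(target_receive(msg, {"nodeB": 5, "nodeC": 12}, "nodeA"))
```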
FIG. 4 is an illustration of a computer system (e.g., a database controller) in which the various architectures and functionalities of the various previous implementations may be implemented. As shown, the computer system 400 includes at least one processor 404 that is connected to a bus 402, where the bus 402 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The computer system 400 also includes a memory 406.
Control logic (software) and data are stored in the memory 406, which may take a form of random-access memory (RAM). In the disclosure, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user. The computer system 400 may also include a secondary storage 410. The secondary storage 410 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, a digital versatile disk (DVD) drive, a recording device, or a universal serial bus (USB) flash memory. The removable storage drive at least one of reads from and writes to a removable storage unit in a well-known manner.
Computer programs, or computer control logic algorithms, may be stored in at least one of the memory 406 and the secondary storage 410. Such computer programs, when executed, enable the computer system 400 to perform various functions as described in the foregoing. The memory 406, the secondary storage 410, and any other storage are possible examples of computer-readable media.
In an implementation, the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 404, a graphics processor coupled to a communication interface 412, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 404 and a graphics processor, a chipset (namely, a group of integrated circuits designed to work and be sold as a unit for performing related functions), and so forth.
Furthermore, the architectures and functionalities depicted in the various previously described figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system, and so forth. For example, the computer system 400 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, or an embedded system.
Furthermore, the computer system 400 may take the form of various other devices including, but not limited to, a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, and so forth. Additionally, although not shown, the computer system 400 may be coupled to a network (for example, a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 408.
It should be understood that the arrangement of components illustrated in the described figures is exemplary and that other arrangements may be possible. It should also be understood that the various system components (and means) defined by the claims, described below, and illustrated in the various block diagrams represent components in some systems configured according to the subject matter disclosed herein. For example, one or more of these system components (and means) may be realized, in whole or in part, by at least some of the components illustrated in the arrangements shown in the described figures. In addition, while at least one of these components is implemented at least partially as an electronic hardware component, and therefore constitutes a machine, the other components may be implemented in software that, when included in an execution environment, constitutes a machine, in hardware, or in a combination of software and hardware.
Although the disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims

1. A database controller (106, 206) configured to be utilized for Continuous Data Protection, CDP, for a CDP set comprising a plurality of datasets (ki, vi) in a distributed database system (100), wherein the distributed database system (100) comprises a plurality of nodes (102A-N, 202A-C) and a CDP node (104, 204), wherein each node is configured to store one or more datasets (ki, vi) out of the plurality of datasets (ki, vi), wherein the database controller (106, 206) is further configured to download each of the datasets in the CDP set from the plurality of nodes (102A-N, 202A-C) to the CDP node (104, 204), wherein the database controller (106, 206) is characterized in that the database controller (106, 206) is further configured to download each of the datasets in the CDP set by, for each dataset, determining if the dataset is stored in more than one node, and if so determining a preferred node and downloading the dataset from the preferred node to the CDP node (104, 204).
2. The database controller (106, 206) according to claim 1, wherein the database controller (106, 206) is further configured to receive a list from a source node of the plurality of nodes (102A-N, 202A-C), wherein the list comprises the CDP set indicating datasets that have been changed and which nodes each dataset is stored on, and wherein the database controller (106, 206) is further configured to determine if the dataset is stored in more than one node based on the list.
3. The database controller (106, 206) according to claim 2, wherein the database controller (106, 206) is further configured to determine that the dataset is only stored in the source node, and if so download the dataset from the source node to the CDP node (104, 204).
4. The database controller (106, 206) according to any preceding claim, wherein the database controller (106, 206) is further configured to determine that a node is a preferred node by determining that the node is underutilized.
5. The database controller (106, 206) according to claim 4, wherein the database controller (106, 206) is further configured to determine that the node is underutilized by determining that a processor and/or a memory of the node is used to a degree falling under a utilization level.
6. The database controller (106, 206) according to claim 4 or 5, wherein the database controller (106, 206) is further configured to determine that the node is underutilized by determining that the processor and/or the memory of the node is used to a lesser degree than in the other nodes storing the same dataset.
7. The database controller (106, 206) according to any preceding claim, wherein the database controller (106, 206) is further configured to determine that the node is the preferred node by determining that the node has more processing power than the other nodes storing the same dataset.
8. The database controller (106, 206) according to any preceding claim, wherein the database controller (106, 206) is further configured to determine that the node is the preferred node by determining that the node has more free memory than the other nodes storing the same dataset.
9. The database controller (106, 206) according to any preceding claim, wherein the database controller (106, 206) is further configured to determine that the node is the preferred node by determining that the node is a close neighbour to the CDP node (104, 204).
10. The database controller (106, 206) according to claim 9, wherein the database controller (106, 206) is further configured to determine that the node is the close neighbour to the CDP node (104, 204) by determining that the node has fewer network hops to the CDP node (104, 204) than the other nodes storing the same dataset.
11. The database controller (106, 206) according to any preceding claim, wherein the database controller (106, 206) is further configured to determine that the node is the preferred node by determining that the node is in a same region as the CDP node (104, 204).
12. The database controller (106, 206) according to claim 11, wherein the same region is a same network region as the CDP node (104, 204).
13. The database controller (106, 206) according to claim 11 or 12, wherein the same region is a same geographic region as the CDP node (104, 204).
14. A method for a database controller (106, 206) configured to be utilized for Continuous Data Protection, CDP, for a CDP set comprising a plurality of datasets (ki, vi) in a distributed database system (100), wherein the distributed database system (100) comprises a plurality of nodes (102A-N, 202A-C) and a CDP node (104, 204), wherein each node is configured to store one or more datasets (ki, vi) out of the plurality of datasets (ki, vi), the method comprises downloading each of the datasets in the CDP set from the plurality of nodes (102A-N, 202A-C) to the CDP node (104, 204), wherein the database controller (106, 206) is characterized in that the method further comprises downloading each of the datasets in the CDP set by, for each dataset, determining if the dataset is stored in more than one node, and if so determining a preferred node and downloading the dataset from the preferred node to the CDP node (104, 204).
15. A computer program product comprising program instructions for performing the method according to claim 14, when executed by one or more processors in a database controller (106, 206) system.