
WO2025002580A1 - Database controller for continuous data protection in a distributed database system and method thereof - Google Patents


Info

Publication number
WO2025002580A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
cdp
database controller
nodes
dataset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2023/068077
Other languages
French (fr)
Inventor
Bar David
Assaf Natanzon
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to PCT/EP2023/068077
Publication of WO2025002580A1


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G06F 11/1458 Management of the backup or restore process
    • G06F 11/1464 Management of the backup or restore process for networked environments
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F 2201/80 Database-specific techniques


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A database controller configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system (100). The distributed database system includes one or more nodes and a CDP node. Each node is configured to store one or more of the datasets (ki, vi). The database controller is further configured to download each of the datasets in the CDP set from the one or more nodes to the CDP node by determining, for each dataset, if the dataset is stored in more than one node and, if so, determining a preferred node. The database controller is further configured to download the dataset from the preferred node to the CDP node.

Description

DATABASE CONTROLLER FOR CONTINUOUS DATA PROTECTION IN A DISTRIBUTED DATABASE SYSTEM AND METHOD THEREOF
TECHNICAL FIELD
The disclosure generally relates to continuous data protection, CDP, and more particularly, the disclosure relates to a database controller configured to be utilized for the CDP in a distributed database system. The disclosure also relates to a method for a database controller configured to be utilized for the CDP in the distributed database system.
BACKGROUND
Databases commonly rely on backup services for various purposes, including disaster recovery and providing convenient solutions for development and testing environments. The backup process for such databases can be done in a straightforward manner by tracking and replicating written data. The demand for scalability, redundancy, and advanced features has given rise to distributed database systems such as Cassandra, MongoDB, and the like. The distributed database systems provide various interfaces, including key-value store, object store, Structured Query Language, SQL, and the like, to interact with the written data. Additionally, distributed database systems often offer dynamic scaling capabilities, allowing users to increase storage capacity or adjust the number of participating nodes as required.
The distributed database systems typically distribute entries, known as shards, among participating nodes to optimize resource utilization and provide consistent, linearly scaling performance to users. For instance, Cassandra, a NoSQL distributed database, achieves horizontal scalability through keyspace partitioning and data distribution across the participating nodes. Cassandra routes user data to neighbouring nodes based on the user's replication factor settings. The replication factor is a user-defined number of copies of user data, with each copy residing on a different node. For example, by setting a replication factor of 3, Cassandra ensures that the data is stored on three different nodes, protecting against the simultaneous failure of up to two nodes without any data loss. Additionally, Cassandra offers support for a quorum, which defines the number of participating nodes, based on the replication factor settings, that need to acknowledge an input/output, I/O, operation before confirming it to the user. The quorum, along with the replication factor, offers users flexibility in terms of fault tolerance and reliability. For example, if a user desires a highly durable database, they can set the replication factor and quorum to the maximum, ensuring that all data is copied to every node and that user I/O is acknowledged only after all nodes process the I/O operations, thereby impacting the performance of the database.
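To make the replication factor and quorum trade-off concrete, the short Python sketch below computes a majority quorum (floor(RF/2) + 1), which is how Cassandra-style systems typically size a quorum; the helper functions are illustrative and not part of any Cassandra API.

```python
# Illustrative sketch (not Cassandra code): how replication factor (RF)
# and quorum interact in a Cassandra-style distributed database.

def quorum_size(replication_factor: int) -> int:
    """Quorum is a strict majority of the replicas: floor(RF/2) + 1."""
    return replication_factor // 2 + 1

def max_failures_without_data_loss(replication_factor: int) -> int:
    """Data survives as long as at least one replica remains."""
    return replication_factor - 1

def max_failures_with_quorum_available(replication_factor: int) -> int:
    """Quorum reads/writes stay available while a majority of replicas is up."""
    return replication_factor - quorum_size(replication_factor)

for rf in (1, 3, 5):
    print(f"RF={rf}: quorum={quorum_size(rf)}, "
          f"survives {max_failures_without_data_loss(rf)} failures without data loss, "
          f"quorum operations tolerate {max_failures_with_quorum_available(rf)} nodes down")
# RF=3: quorum=2, survives 2 failures without data loss, quorum operations tolerate 1 node down
```

This matches the example in the text: with a replication factor of 3, the data survives the failure of up to two nodes, while quorum operations remain available with one node down.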
The distributed database systems bring forth a set of challenges in terms of backup and restoration, especially when global consistency is a desired requirement and dynamic scaling is involved. Achieving Continuous Data Protection, CDP, for distributed database systems poses these challenges. The CDP is a method for creating a copy of data with the capability to restore the data to any previous point in time. The distributed nature of these databases makes it challenging to derive a complete view from the individual parts, or shards. Naive approaches for achieving the CDP often compromise the user experience by freezing the I/O operations on the nodes involved in the CDP in order to create a globally consistent dataset.
In a distributed database system, performance and storage capacity can fluctuate due to the dynamic addition and removal of nodes. The CDP should adapt dynamically to these fluctuations to prevent resource waste or a fallback to a Snapshot Data Protection, SDP, scenario. The SDP is a method of protecting volumes of data using snapshots, typically in an incremental manner.
Additionally, factors like cloud regions, network proximity, resource consumption, and the cost of data transfer should also be considered when implementing the CDP. Replicating data to a local node and storage within the same region or network proximity is more efficient than replicating the data over a Wide Area Network, WAN. However, local replication may not always be feasible, and scaling the CDP nodes linearly with source nodes presents challenges. The source nodes refer to the nodes from which the data is backed up or replicated to other nodes.
An existing method for generating incremental snapshots in distributed database systems involves copying a directory containing a database journal on a file system, or using database-specific snapshot tools such as those provided by Cassandra and MongoDB. However, the distributed nature of the backup introduces challenges. For example, if data-dependent transactions T1 and T2 occur, with T2 happening after T1 and each being backed up from a different source node, there is a possibility that the backup contains T2 but not T1. The source nodes are responsible for providing the data that needs to be backed up. Such inconsistency in the distributed backup poses a risk when restoring from such backups, potentially leading to failures in user applications. Another existing method generates a globally consistent snapshot by performing an I/O freeze on the nodes and draining outstanding I/O operations. While this method ensures global consistency, the resulting backup may not accurately reflect the point in time at which the data was created, and the performance of ongoing operations is disrupted by the freeze and drain operations necessary to achieve global consistency.
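To make the hazard concrete, the following minimal Python simulation reproduces the T1/T2 scenario; the node names, timestamps, and journal layout are assumptions for illustration only.

```python
# Illustrative sketch of the T1/T2 backup anomaly: two data-dependent
# transactions land on different source nodes, and per-node backups taken
# independently can capture the later transaction but not the earlier one.

# Per-node journals: node -> list of (timestamp, transaction_id)
journals = {
    "node_a": [(1, "T1")],   # T1 committed first, stored on node_a
    "node_b": [(2, "T2")],   # T2 depends on T1, stored on node_b
}

# Each node is backed up at a different moment (no global coordination).
backup_cutoff = {"node_a": 0, "node_b": 2}   # node_a was backed up before T1 arrived

backup = {
    node: [tx for ts, tx in entries if ts <= backup_cutoff[node]]
    for node, entries in journals.items()
}
print(backup)  # {'node_a': [], 'node_b': ['T2']} -> the backup contains T2 but not T1
```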
Therefore, there arises a need to address the aforementioned technical problems and drawbacks of utilizing the CDP in the distributed database system.
SUMMARY
It is an object of the disclosure to provide a database controller configured to be utilized for Continuous Data Protection, CDP in a distributed database system and a method for a database controller configured to be utilized for the CDP in a distributed database system while avoiding one or more disadvantages of prior art approaches.
This object is achieved by the features of the independent claims. Further, implementation forms are apparent from the dependent claims, the description, and the figures.
According to a first aspect, there is provided a database controller configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system. The distributed database system includes one or more nodes and a CDP node. Each node is configured to store one or more of the datasets (ki, vi). The database controller is further configured to download each of the datasets in the CDP set from the one or more nodes to the CDP node. The database controller is further configured to download each of the datasets in the CDP set by determining, for each dataset, if the dataset is stored in more than one node and, if so, determining a preferred node. The database controller is further configured to download the dataset from the preferred node to the CDP node.
The database controller is configured to provide an efficient and reliable CDP in the distributed database system. The database controller minimizes data transmission requirements by fetching data from preferred nodes. By retrieving only the necessary and relevant data, the database controller reduces network traffic and improves the overall efficiency of the distributed database system. The database controller ensures global consistency for a point in time, PIT, by considering datasets from all source nodes in the distributed database system. In the distributed database system, data trickles asynchronously to the one or more nodes based on the replication factor settings. The replication factor may determine the number of copies, such as backups, stored in the distributed database system for each dataset. The database controller obtains changes from the preferred node by leveraging the asynchronous trickling of data, thereby enhancing the efficiency of the distributed database system.
Optionally, the database controller is further configured to receive a list from a source node of the one or more nodes. The list includes the CDP set, indicating which datasets have been changed and which nodes each dataset is stored on. The database controller is further configured to determine, based on the list, if the dataset is stored in more than one node. The distributed database system allows the addition or removal of source nodes without disrupting the global consistency condition; when a source node is removed, the global consistency among the remaining nodes is re-evaluated. The database controller identifies and selects underutilized nodes as the preferred nodes, ensuring optimal data retrieval by choosing the most suitable nodes, thereby achieving enhanced utilization and improved performance of the distributed database system.
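As an illustration, the sketch below shows one way a controller could consume such a list; the data shapes and function names are hypothetical rather than taken from the disclosure.

```python
# Hypothetical sketch of the controller's download loop: the source node
# reports which datasets changed and which nodes each one is stored on.

# CDP set: changed key -> list of nodes currently storing (key, value)
cdp_set = {
    "k1": ["node_a"],                 # single replica: fetch from the source node
    "k2": ["node_a", "node_b"],       # replicated: pick a preferred node
    "k3": ["node_a", "node_c"],
}

def choose_preferred_node(nodes: list[str]) -> str:
    """Placeholder policy; see the criteria-scoring sketch further below."""
    return nodes[0]

def download(key: str, node: str) -> None:
    print(f"downloading ({key}, v) from {node} to the CDP node")

for key, storing_nodes in cdp_set.items():
    if len(storing_nodes) > 1:
        download(key, choose_preferred_node(storing_nodes))
    else:
        download(key, storing_nodes[0])  # dataset only stored on the source node
```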
Optionally, the database controller is further configured to determine that the dataset is only stored in the source node and, if so, to download the dataset from the source node to the CDP node.
Optionally, the database controller is further configured to determine that a node is a preferred node by determining that the node is underutilized.
Optionally, the database controller is further configured to determine that the node is underutilized by determining that a processor and/or a memory of the node is used to a degree falling under a utilization level.
Optionally, the database controller is further configured to determine that the node is underutilized by determining that the processor and/or the memory of the node is used to a lesser degree than in the other nodes storing the same dataset.
Optionally, the database controller is further configured to determine that the node is the preferred node by determining that the node has more processing power than the other nodes storing the same dataset.
Optionally, the database controller is further configured to determine that the node is the preferred node by determining that the node has more free memory than the other nodes storing the same dataset.
Optionally, the database controller is further configured to determine that the node is the preferred node by determining that the node is a close neighbour to the CDP node.
Optionally, the database controller is further configured to determine that the node is a close neighbour to the CDP node by determining that the node has fewer network hops to the CDP node than the other nodes storing the same dataset.
Optionally, the database controller is further configured to determine that the node is the preferred node by determining that the node is in a same region as the CDP node.
Optionally, the same region is a same network region as the CDP node.
Optionally, the same region is a same geographic region as the CDP node.
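The optional criteria above (utilization, processing power, free memory, network hops, and region) can be combined in many ways; the Python sketch below scores candidate nodes with illustrative weights, which are assumptions rather than values from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    name: str
    cpu_utilization: float   # 0.0 .. 1.0
    free_memory_gb: float
    hops_to_cdp_node: int
    region: str

def preference_score(node: NodeStats, cdp_region: str) -> float:
    """Higher is better. The weights below are illustrative assumptions."""
    score = 0.0
    score += (1.0 - node.cpu_utilization) * 3.0   # favour underutilized nodes
    score += node.free_memory_gb * 0.1            # favour nodes with free memory
    score -= node.hops_to_cdp_node * 1.0          # favour close neighbours
    if node.region == cdp_region:
        score += 5.0                              # favour the same region (cheaper transfer)
    return score

def choose_preferred_node(candidates: list[NodeStats], cdp_region: str) -> NodeStats:
    return max(candidates, key=lambda n: preference_score(n, cdp_region))

candidates = [
    NodeStats("node_a", cpu_utilization=0.7, free_memory_gb=4, hops_to_cdp_node=6, region="region-1"),
    NodeStats("node_b", cpu_utilization=0.5, free_memory_gb=8, hops_to_cdp_node=1, region="region-2"),
]
print(choose_preferred_node(candidates, cdp_region="region-2").name)  # node_b
```

A user-defined criterion such as per-node transfer cost (discussed later in the detailed description) could be folded into the same score as an additional weighted term.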
According to a second aspect, there is provided a method for a database controller configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system. The distributed database system includes one or more nodes and a CDP node. Each node is configured to store one or more of the datasets (ki, vi). The method includes downloading each of the datasets in the CDP set from the one or more nodes to the CDP node. The method includes determining, for each dataset, if the dataset is stored in more than one node and, if so, determining a preferred node. The method includes downloading the dataset from the preferred node to the CDP node.
This method provides an efficient and reliable CDP in the distributed database system. This method minimizes data transmission requirements by fetching data from preferred nodes. By retrieving only the necessary and relevant data, this method reduces network traffic and improves the overall efficiency of the distributed database system. This method ensures global consistency for a point in time, PIT, by considering datasets from all source nodes in the distributed database system. In the distributed database system, the data trickles asynchronously to the one or more nodes based on the replication factor settings. The replication factor may determine the number of copies, such as backups, stored in the distributed database system for each dataset. This method obtains changes from the preferred node by leveraging the asynchronous trickling of data, thereby enhancing the efficiency of the distributed database system.
According to a third aspect, a computer program product includes program instructions for performing the method when executed by one or more processors in a database controller system.
Therefore, in contradistinction to the existing solutions, the database controller is configured to provide Continuous Data Protection, CDP, for the distributed database system. The database controller creates datasets on each of the source nodes individually and efficiently transmits them to a varying number of destination nodes.
These and other aspects of the disclosure will be apparent from the implementation(s) described below.
BRIEF DESCRIPTION OF DRAWINGS
Implementations of the disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram that illustrates a database controller configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system in accordance with an implementation of the disclosure;
FIG. 2 illustrates an exemplary implementation of a dataset assembly with different nodes configured with a database controller in accordance with an implementation of the disclosure;
FIG. 3 is a flow diagram that illustrates a method for a database controller configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system in accordance with an implementation of the disclosure; and
FIG. 4 is an illustration of a computer system (e.g., a database controller) in which the various architectures and functionalities of the various previous implementations may be implemented.
DETAILED DESCRIPTION OF THE DRAWINGS
Implementations of the disclosure provide a database controller configured to be utilized for Continuous Data Protection, CDP in a distributed database system, a method for a database controller configured to be utilized for CDP in the distributed database system, and a computer program product including program instructions for performing the method.
To make solutions of the disclosure more comprehensible for a person skilled in the art, the following implementations of the disclosure are described with reference to the accompanying drawings.
Terms such as "a first", "a second", "a third", and "a fourth" (if any) in the summary, claims, and foregoing accompanying drawings of the disclosure are used to distinguish between similar objects and are not necessarily used to describe a specific sequence or order. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the implementations of the disclosure described herein are, for example, capable of being implemented in sequences other than the sequences illustrated or described herein. Furthermore, the terms "include" and "have" and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device that includes a series of steps or units, is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
Definitions:
Distributed Database is a database, such as Cassandra and MongoDB, which uses multiple nodes for facilitating data sharding, load balancing, high availability, and other essential features.
Continuous Data Protection, CDP, is a method for creating a copy of data with a capability of restoring the data to any previous Point in Time, PIT.
Replication factor is a user-defined parameter specifying a number of copies of user data across different nodes.
Recovery Point Objective, RPO, is the maximum length of time since the last data restoration point.
Network region is a geographical proximity between the different nodes communicating in a network. The network region may imply the cost of data transfer when referring to cloud providers, such as Amazon Web Services, AWS: nodes in the same region can exchange data at lower prices than nodes in different regions.
FIG. 1 is a block diagram that illustrates a database controller 106 configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system 100 in accordance with an implementation of the disclosure. The distributed database system 100 includes one or more nodes 102A-N and a continuous data protection, CDP, node 104. Each of the one or more nodes 102A-N is configured to store one or more of the datasets (ki, vi). The database controller 106 is configured to download each of the datasets in the CDP set from the one or more nodes 102A-N to the CDP node 104 by determining, for each dataset, if the dataset is stored in more than one node and, if so, determining a preferred node. The database controller 106 is further configured to download the dataset from the preferred node to the CDP node 104.
The database controller 106 is configured to provide an efficient and reliable CDP in the distributed database system 100. The database controller 106 minimizes data transmission requirements by fetching data from preferred nodes. By retrieving only the necessary and relevant data, the database controller 106 reduces network traffic and improves the overall efficiency of the distributed database system 100. The database controller 106 ensures global consistency for a point in time, PIT, by considering datasets from all source nodes. In the distributed database system 100, the data trickles asynchronously to the one or more nodes 102A-N based on the replication factor settings. The replication factor may determine the number of copies of the datasets, such as backups, stored in the distributed database system 100 for each dataset. The database controller 106 obtains changes from the preferred node by leveraging the asynchronous trickling of data in the distributed database system 100, thereby enhancing the efficiency of the distributed database system 100.
Optionally, the database controller 106 is further configured to receive a list from a source node of the one or more nodes 102A-N, where the list includes the CDP set, indicating which datasets have been changed and which nodes each dataset is stored on. The database controller 106 is further configured to determine, based on the list, if the dataset is stored in more than one node. The distributed database system 100 allows the addition or removal of source nodes without disrupting the global consistency condition; when a source node is removed, the global consistency among the remaining nodes is re-evaluated. The database controller 106 identifies and selects underutilized nodes as the preferred nodes, ensuring optimal data retrieval by choosing the most suitable nodes, thereby achieving enhanced utilization and improved performance of the distributed database system 100.
Optionally, the database controller 106 is further configured to determine that the dataset is only stored in the source node and, if so, to download the dataset from the source node to the CDP node 104.
Optionally, the database controller 106 is further configured to determine that a node is a preferred node by determining that the node is underutilized.
Optionally, the database controller 106 is further configured to determine that the node is underutilized by determining that a processor and/or a memory of the node is used to a degree falling under a utilization level. The utilization level may be a threshold level, such as 50%, 80%, or 90%.
Optionally, the database controller 106 is further configured to determine that the node is underutilized by determining that the processor and/or the memory of the node is used to a lesser degree than in the other nodes storing the same dataset.
Optionally, the database controller 106 is further configured to determine that the node is the preferred node by determining that the node has more processing power than the other nodes storing the same dataset.
Optionally, the database controller 106 is further configured to determine that the node is the preferred node by determining that the node has more free memory than the other nodes storing the same dataset.
Optionally, the database controller 106 is further configured to determine that the node is the preferred node by determining that the node is a close neighbour to the CDP node 104.
Optionally, the database controller 106 is further configured to determine that the node is the close neighbour to the CDP node 104 by determining that the node has fewer network hops to the CDP node 104 than the other nodes storing the same dataset.
Optionally, the database controller 106 is further configured to determine that the node is the preferred node by determining that the node is in a same region as the CDP node 104.
Optionally, the same region is a same network region as the CDP node 104.
Optionally, the same region is a same geographic region as the CDP node 104.
Optionally, the database controller 106 is further configured to determine that the node is the preferred node based on user-defined criteria. The user-defined criteria may be a cost of operation per node, since the price of each byte transferred differs from node to node. The user-defined criteria may also reflect a user who wants to reduce traffic on a specific node because that node is used for complex computation or runs other applications.
FIG. 2 illustrates an exemplary implementation of a dataset assembly with one or more nodes 202A-C configured with a database controller 206 in accordance with an implementation of the disclosure. In the exemplary implementation of the dataset assembly, the one or more nodes 202A-C reside in different regions, including region 1 208A, region 2 208B, and region 3 208C. The one or more nodes 202A-C include a node A 202A, a node B 202B, a node C 202C, and a Continuous Data Protection, CDP, node 204. The node A 202A may reside in the region 1 208A, the node B 202B and the CDP node 204 may reside in the region 2 208B, and the node C 202C may reside in the region 3 208C.
The node A 202A may be a source node of the node B 202B and the node C 202C. Optionally, the node A 202A stores the datasets {(k1, v1), (k2, v2), (k3, v3)}, the node B 202B stores the dataset {(k2, v2)}, and the node C 202C stores the dataset {(k3, v3)}. The node A 202A may create the datasets from changes to the keys {k1, k2, k3}, where each key is a unique identifier that is used to identify and retrieve specific data within the dataset. The changes to the keys {k1, k2, k3} may be updates or modifications to the values of the keys. The changes may be replicated to the CDP node 204. The database controller 206 is configured to determine the node B 202B as the preferred node, as the node B 202B and the CDP node 204 reside in the same region, that is, region 2 208B.
The database controller 206 is configured to download the dataset {(k2, v2)} into a CDP set. The CDP set may be a dataset stored in the CDP node 204. The database controller 206 is configured to download the dataset {(k2, v2)} from the preferred node to the CDP node 204. The node C 202C may be an underutilized node. Thereby, the database controller 206 is configured to determine the node C 202C as the preferred node for the dataset {(k3, v3)}.
The database controller 206 is configured to download the dataset {(k3, v3)} into the CDP set. The database controller 206 is configured to download the dataset {(k3, v3)} from the preferred node to the CDP node 204.
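By way of illustration only, the FIG. 2 example may be traced in the following sketch. The replica placement and regions mirror the description above; the selection rule (a replica in the CDP node's region first, otherwise an underutilized replica) is a simplification assumed for the example.

```python
replicas = {"k2": ["nodeA", "nodeB"], "k3": ["nodeA", "nodeC"]}  # nodes holding each changed key
regions = {"nodeA": "region1", "nodeB": "region2", "nodeC": "region3"}
underutilized = {"nodeC"}
CDP_REGION = "region2"  # the CDP node 204 resides here

def pick(nodes):
    same_region = [n for n in nodes if regions[n] == CDP_REGION]
    if same_region:
        return same_region[0]                # e.g. node B for (k2, v2)
    idle = [n for n in nodes if n in underutilized]
    return idle[0] if idle else nodes[0]     # e.g. node C for (k3, v3)

for key, nodes in replicas.items():
    print(key, "->", pick(nodes))            # k2 -> nodeB, k3 -> nodeC
```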
FIG. 3 is a flow diagram that illustrates a method for a database controller configured to be utilized for Continuous Data Protection, CDP, for a CDP set including one or more datasets (ki, vi) in a distributed database system in accordance with an implementation of the disclosure. The distributed database system includes one or more nodes and a CDP node. Each node is configured to store one or more datasets (ki, vi) out of the one or more datasets (ki, vi). At a step 302, each of the datasets in the CDP set is downloaded from the one or more nodes to the CDP node by (i) determining, for each dataset, whether the dataset is stored in more than one node, and if so (ii) determining a preferred node. At a step 304, the dataset is downloaded from the preferred node to the CDP node.
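By way of illustration only, steps 302 and 304 may be sketched as the following loop. The determine_preferred and download stubs stand in for the selection criteria and transfer mechanism described above and are assumptions for the example.

```python
def determine_preferred(nodes):
    """Stand-in for the preferred-node criteria discussed above."""
    return nodes[0]

def download(key, node, cdp_node):
    """Stand-in for the actual transfer into the CDP node."""
    print(f"fetch {key} from {node} into {cdp_node}")

def protect(cdp_set, replica_map, source_node, cdp_node):
    for key in cdp_set:
        nodes = replica_map.get(key, [source_node])
        if len(nodes) > 1:                     # step 302: stored in more than one node?
            node = determine_preferred(nodes)  # step 302: determine the preferred node
        else:
            node = nodes[0]                    # only one node holds the dataset
        download(key, node, cdp_node)          # step 304: download to the CDP node
```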
This method provides efficient and reliable CDP in the distributed database system. This method minimizes data transmission requirements by fetching data from preferred nodes. By retrieving only the necessary and relevant data, this method reduces network traffic and improves the overall efficiency of the distributed database system. This method ensures global consistency for a point-in-time, PIT, by considering datasets from all source nodes in the distributed database system. In the distributed database system, the data trickles asynchronously to the one or more nodes based on the replication factor setting. The replication factor may determine a number of copies, such as backups, stored in the distributed database system for each dataset. This method obtains changes from the preferred node by leveraging the asynchronous trickling of data, thereby enhancing the efficiency of the distributed database system.
The algorithm for creating the CDP set involves several steps. Firstly, the source node captures changes in the keys within the CDP set, monitoring the keys individually and periodically through interception of Input/Output commands. Once the changes are detected, the source node transmits the keys along with a list that specifies the CDP set, indicating the datasets that have been changed and the nodes where each dataset is stored. Upon receiving the keys, the target node calculates the preferred node for data retrieval. Optionally, each key is associated with a list of replicated nodes. When the target node initiates replication, the target node traverses the list of replicated nodes to determine the preferred node to fetch the data corresponding to the key. The target node obtains information on how the nodes operate based on CDP settings, by transmitting a ping to check response times, or by relying on preconfigured data. Optionally, the target node can interact with an overall controlling entity. Once the preferred nodes are identified, the target node fetches data from the preferred nodes. If any data remains after fetching from the preferred nodes, the target node retrieves the leftover data from the source node, which ensures that the CDP set is complete.
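By way of illustration only, the source and target sides of this algorithm may be sketched as follows. The message shape (changed keys plus a key-to-replica-nodes map) follows the text above; the static latency table standing in for ping-based probing and the always-succeeding fetch stub are assumptions for the example.

```python
def fetch(key, node):
    """Stand-in for the actual network read; always succeeds in this sketch."""
    return f"value-of-{key}@{node}"

def source_send(changed_keys, replica_map):
    """Source node: ship the changed keys with the list of nodes holding each."""
    return {"keys": list(changed_keys),
            "replicas": {k: replica_map[k] for k in changed_keys}}

def target_receive(msg, latency_ms, source_node):
    """Target node: fetch each key from its preferred replica, then fall back
    to the source node for any leftover data so the CDP set is complete."""
    cdp_set = {}
    for key in msg["keys"]:
        nodes = msg["replicas"].get(key, [])
        preferred = min(nodes, key=lambda n: latency_ms.get(n, float("inf")),
                        default=source_node)
        value = fetch(key, preferred)
        if value is None:                      # leftover: retrieve from the source
            value = fetch(key, source_node)
        cdp_set[key] = value
    return cdp_set

msg = source_send(["k2", "k3"], {"k2": ["nodeB"], "k3": ["nodeC"]})
print(target_receive(msg, {"nodeB": 5, "nodeC": 12}, "nodeA"))
```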
FIG. 4 is an illustration of a computer system (e.g., a database controller) in which the various architectures and functionalities of the various previous implementations may be implemented. As shown, the computer system 400 includes at least one processor 404 that is connected to a bus 402, where the bus 402 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The computer system 400 also includes a memory 406.
Control logic (software) and data are stored in the memory 406, which may take a form of random-access memory (RAM). In the disclosure, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user. The computer system 400 may also include a secondary storage 410. The secondary storage 410 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, a digital versatile disk (DVD) drive, a recording device, or a universal serial bus (USB) flash memory. The removable storage drive at least one of reads from and writes to a removable storage unit in a well-known manner.
Computer programs, or computer control logic algorithms, may be stored in at least one of the memory 406 and the secondary storage 410. Such computer programs, when executed, enable the computer system 400 to perform various functions as described in the foregoing. The memory 406, the secondary storage 410, and any other storage are possible examples of computer-readable media.
In an implementation, the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 404, a graphics processor coupled to a communication interface 412, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 404 and a graphics processor, a chipset (namely, a group of integrated circuits designed to work and be sold as a unit for performing related functions), and so forth.
Furthermore, the architectures and functionalities depicted in the various previously described figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system, and so forth. For example, the computer system 400 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, or an embedded system.
Furthermore, the computer system 400 may take the form of various other devices including, but not limited to, a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, and so forth. Additionally, although not shown, the computer system 400 may be coupled to a network (for example, a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 408.
It should be understood that the arrangement of components illustrated in the described figures is exemplary and that other arrangements may be possible. It should also be understood that the various system components (and means) defined by the claims, described below, and illustrated in the various block diagrams represent components in some systems configured according to the subject matter disclosed herein. For example, one or more of these system components (and means) may be realized, in whole or in part, by at least some of the components illustrated in the arrangements shown in the described figures. In addition, while at least one of these components is implemented at least partially as an electronic hardware component, and therefore constitutes a machine, the other components may be implemented in software that, when included in an execution environment, constitutes a machine, in hardware, or in a combination of software and hardware.
Although the disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims

1. A database controller (106, 206) configured to be utilized for Continuous Data Protection, CDP, for a CDP set comprising a plurality of datasets (ki, vi) in a distributed database system (100), wherein the distributed database system (100) comprises a plurality of nodes (102A-N, 202A-C) and a CDP node (104, 204), wherein each node is configured to store one or more datasets (ki, vi) out of the plurality of datasets (ki, vi), wherein the database controller (106, 206) is further configured to download each of the datasets in the CDP set from the plurality of nodes (102A-N, 202A-C) to the CDP node (104, 204), wherein the database controller (106, 206) is characterized in that the database controller (106, 206) is further configured to download each of the datasets in the CDP set by, for each dataset, determining if the dataset is stored in more than one node, and if so determining a preferred node and downloading the dataset from the preferred node to the CDP node (104, 204).
2. The database controller (106, 206) according to claim 1, wherein the database controller (106, 206) is further configured to receive a list from a source node of the plurality of nodes (102A-N, 202A-C), wherein the list comprises the CDP set indicating datasets that have been changed and which nodes each dataset is stored on, and wherein the database controller (106, 206) is further configured to determine if the dataset is stored in more than one node based on the list.
3. The database controller (106, 206) according to claim 2, wherein the database controller (106, 206) is further configured to determine that the dataset is only stored in the source node, and if so download the dataset from the source node to the CDP node (104, 204).
4. The database controller (106, 206) according to any preceding claim, wherein the database controller (106, 206) is further configured to determine that a node is a preferred node by determining that the node is underutilized.
5. The database controller (106, 206) according to claim 4, wherein the database controller (106, 206) is further configured to determine that the node is underutilized by determining that a processor and/or a memory of the node is used to a degree falling under a utilization level.
6. The database controller (106, 206) according to claim 4 or 5, wherein the database controller (106, 206) is further configured to determine that the node is underutilized by determining that the processor and/or the memory of the node is used to a lesser degree than in the other nodes storing the same dataset.
7. The database controller (106, 206) according to any preceding claim, wherein the database controller (106, 206) is further configured to determine that the node is the preferred node by determining that the node has more processing power than the other nodes storing the same dataset.
8. The database controller (106, 206) according to any preceding claim, wherein the database controller (106, 206) is further configured to determine that the node is the preferred node by determining that the node has more free memory than the other nodes storing the same dataset.
9. The database controller (106, 206) according to any preceding claim, wherein the database controller (106, 206) is further configured to determine that the node is the preferred node by determining that the node is a close neighbour to the CDP node (104, 204).
10. The database controller (106, 206) according to claim 9, wherein the database controller (106, 206) is further configured to determine that the node is the close neighbour to the CDP node (104, 204) by determining that the node has fewer network hops to the CDP node (104, 204) than the other nodes storing the same dataset.
11. The database controller (106, 206) according to any preceding claim, wherein the database controller (106, 206) is further configured to determine that the node is the preferred node by determining that the node is in a same region as the CDP node (104, 204).
12. The database controller (106, 206) according to claim 11, wherein the same region is a same network region as the CDP node (104, 204).
13. The database controller (106, 206) according to claim 11 or 12, wherein the same region is a same geographic region as the CDP node (104, 204).
14. A method for a database controller (106, 206) configured to be utilized for Continuous Data Protection, CDP, for a CDP set comprising a plurality of datasets (ki, vi) in a distributed database system (100), wherein the distributed database system (100) comprises a plurality of nodes (102A-N, 202A-C) and a CDP node (104, 204), wherein each node is configured to store one or more datasets (ki, vi) out of the plurality of datasets (ki, vi), the method comprises downloading each of the datasets in the CDP set from the plurality of nodes (102A-N, 202A-C) to the CDP node (104, 204), wherein the database controller (106, 206) is characterized in that the method further comprises downloading each of the datasets in the CDP set by, for each dataset, determining if the dataset is stored in more than one node, and if so determining a preferred node and downloading the dataset from the preferred node to the CDP node (104, 204).
15. A computer program product comprising program instructions for performing the method according to claim 14, when executed by one or more processors in a database controller (106, 206) system.