
WO2015055502A2 - Method of partitioning storage in a distributed data storage system and corresponding device - Google Patents

Method of partitioning storage in a distributed data storage system and corresponding device

Info

Publication number
WO2015055502A2
WO2015055502A2 (application PCT/EP2014/071658)
Authority
WO
WIPO (PCT)
Prior art keywords
data items
partition
data
storage system
interrelated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2014/071658
Other languages
French (fr)
Other versions
WO2015055502A3 (en)
Inventor
Erwan Le Merrer
Gilles Tredan
Yizhong Liang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thomson Licensing SAS
Original Assignee
Thomson Licensing SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thomson Licensing SAS filed Critical Thomson Licensing SAS
Publication of WO2015055502A2 publication Critical patent/WO2015055502A2/en
Publication of WO2015055502A3 publication Critical patent/WO2015055502A3/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries

Definitions

  • the present disclosure also relates to a computer program product downloadable from a communication network and/or recorded on a medium readable by computer and/or executable by a processor, comprising program code instructions for implementing the method according to claim 1.
  • The present disclosure further relates to a non-transitory computer-readable medium comprising a computer program product recorded thereon and capable of being run by a processor, including program code instructions for implementing the method according to claim 1. While not explicitly described, the present embodiments may be employed in any combination or sub-combination.
  • Figure 1 illustrates a process of graph partitioning and its consequences related to edge cut and load balancing
  • Figure 2 illustrates the partitioning method according to an example embodiment of the present disclosure
  • Figure 3 illustrates a further example embodiment
  • Figure 4 is a flow chart illustrating an example embodiment of the present disclosure
  • Figure 5 is a device implementing an example embodiment of the method according to the present disclosure
  • Figure 6 is a network implementing a distributed storage data system according to an example embodiment of the present disclosure.
  • A partitioning method and device are disclosed that take each newly created node/edge and place it "greedily" in a partition.
  • The example method and device operate over a continuous flow or 'stream' of 'events' or data items arriving, for example, at a data center, which can be seen as a continuous stream of updates to the graph.
  • "Greedy", when applied to an algorithm, means an algorithm that follows the problem-solving heuristic of making a locally optimal choice at each stage in the hope of finding a global optimum. "Greedy" heuristics may yield locally optimal solutions that approximate a global optimal solution in reasonable time.
  • The greedy partitioning makes it possible to obtain good partitioning results in real time, that is, at the time scale of the stream of events arriving at the input.
  • A growing graph is partitioned in real time, 'on the fly' or again 'on line'. This approach is to be opposed to 'off line' or 'static' partitioning.
  • the method and device of the example embodiment propose a dynamic and incremental way of handling growing graph additions that arrive at a data center, to be dynamically placed in, added to or assigned to, a graph partition.
  • the growing graph additions can be seen as a continuous stream of data or events and we can therefore use the wording 'streamed graph'.
  • The partitioning is thus done in real time, as events arrive.
  • Graph 100 can be partitioned for example as in 110, or 120 (other partitions are possible, but not shown). Partitions are illustrated as being contained within a box (rectangle), with a letter P. The number of edges that cross partitions should be reduced as much as possible, while trying to maintain equivalent sizes.
  • P1 in 110 comprises 3 edges or links that cross partitions P1 and P2. Partition P1 comprises 4 nodes, while partition P2 comprises 7 nodes.
  • P3 in 120 comprises 2 edges that cross partitions P3 and P4. Like P2, P3 comprises 7 nodes. Like P1, P4 comprises 4 nodes.
  • In FIG. 2, 210, 220 and 230 represent a graph that is partitioned in three partitions.
  • 200 represents a web application, transmitting a data item ei,j to a decisional block 201.
  • the decisional block is for example implemented by a partitioner that is responsible for the partitioning.
  • The partitioner 201 takes the decision to add nodes i and j to partition 230 and to add the edge or link ij to partition 230, because neither node i nor j exists in any of the partitions and partition 230 has capacity to receive additional nodes, depicted in partition 230 as dashed circles.
  • An edge or link can be implemented as a reference in a table, for example in a table of nodes or data items: in the entry for data item i, there is a reference to data item j.
  • the edge ij is depicted as a dashed line connecting nodes i and j.
  • the function Add link is a function hosted by the partitioner, which informs the devices that host i and j to link those nodes via an edge.
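As an illustration, such a table-based link representation might be sketched as follows; the dictionary layout and the names node_table and add_link are assumptions for this sketch, not the patent's actual implementation:

```python
# Hypothetical sketch: edges stored as references in a table of nodes.
# node_table[i] holds the set of data items that item i references.
node_table = {}

def add_link(i, j):
    """Record an edge between data items i and j (stored symmetrically)."""
    node_table.setdefault(i, set()).add(j)
    node_table.setdefault(j, set()).add(i)

add_link("i", "j")  # the entry for i now references j, and vice versa
```

In a distributed deployment, each storage device would hold the table rows for the data items it stores.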
  • This is a lightweight partitioning method, to which we will refer as the "stream-greedy strategy"; it requires the partitioner to perform membership tests and cardinality operations over a mapping table in order to find the least occupied partition.
  • The stream-greedy partitioner of the example embodiment is a one-pass partitioner.
  • The partitioner is capable of handling streamed events, building a graph and partitioning it in real time.
  • edges, once placed, are never moved from one partition to another.
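The one-pass placement rule above can be sketched as follows. This is a hedged sketch: the function name place_edge, the dictionary layout and the capacity parameter (standing in for the predetermined threshold on partition size) are all assumptions.

```python
def place_edge(partitions, location, i, j, capacity):
    """Greedily place incoming interrelated data items i and j.

    partitions: dict mapping partition id -> set of data items
    location:   dict mapping data item -> partition id
    """
    def least_loaded():
        # cardinality operation over the mapping table
        return min(partitions, key=lambda p: len(partitions[p]))

    def assign(item, pid):
        partitions[pid].add(item)
        location[item] = pid

    if i in location and j in location:
        pass  # both already placed: only the link is added (not shown)
    elif i in location or j in location:
        placed, new = (i, j) if i in location else (j, i)
        pid = location[placed]
        # co-locate with the neighbour unless its partition is full,
        # otherwise fall back to the least occupied partition
        assign(new, pid if len(partitions[pid]) < capacity else least_loaded())
    else:
        pid = least_loaded()  # neither placed: both go to the emptiest partition
        assign(i, pid)
        assign(j, pid)
```

A call adds at most two items and never relocates existing ones, matching the one-pass property of the stream-greedy strategy.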
  • the stream-greedy method is enhanced by periodically improving the cut ratio or the load balancing ratio of partitions.
  • the cut ratio is the number of edges that a partition has with other partitions, as compared with that of other partitions.
  • the load balancing ratio is the number of nodes in a partition as compared with that of other partitions.
  • Variant embodiments exist to define the cut and load balancing ratios, such as: for the cut ratio, divide the number of cut edges by the total number of edges in the whole graph; and for the load balancing ratio, divide the number of nodes in the least occupied partition by the number of nodes in the most occupied partition.
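Under the variant definitions just given, the two ratios can be computed as in this sketch (function names and data layouts are assumed):

```python
def cut_ratio(edges, location):
    """Number of cut edges divided by the total number of edges in the graph."""
    cut = sum(1 for (i, j) in edges if location[i] != location[j])
    return cut / len(edges)

def load_balance_ratio(partitions):
    """Nodes in the least occupied partition over nodes in the most occupied."""
    sizes = [len(nodes) for nodes in partitions.values()]
    return min(sizes) / max(sizes)
```

A load balance ratio of 1.0 means perfectly even partitions; a low cut ratio means few edges cross partitions.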
  • Steps 1, 2 and 4 can be executed by the partitioner or alternatively by the partitions themselves; for instance, selected nodes are moved to partitions comprising nodes with which they have the most edges.
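The cut-ratio improvement, selecting per partition the top K "worst" nodes (those with the most links to other partitions) and moving each to the partition with which it has the most connections, might be sketched as follows; the name optimize_cut and the neighbours map are assumptions of this sketch:

```python
def optimize_cut(partitions, location, neighbours, k):
    """Move each partition's k worst nodes to the partition holding
    most of their neighbours (a local, greedy improvement)."""
    def external_degree(n):
        # number of links a node has to data items in other partitions
        return sum(1 for m in neighbours[n] if location[m] != location[n])

    for pid in list(partitions):
        worst = sorted(partitions[pid], key=external_degree, reverse=True)[:k]
        for n in worst:
            # count this node's neighbours per partition, pick the densest
            counts = {}
            for m in neighbours[n]:
                counts[location[m]] = counts.get(location[m], 0) + 1
            if not counts:
                continue  # isolated node: nothing to gain by moving it
            target = max(counts, key=counts.get)
            if target != location[n]:
                partitions[location[n]].remove(n)
                partitions[target].add(n)
                location[n] = target
```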
  • The following method applies, which we call optimizeLoad:
  • Steps 1, 3 and 4 can be executed by the partitioner or alternatively by the devices storing the partitions themselves; for instance, selected nodes are moved to partitions comprising nodes with which they have the most edges.
  • the above mentioned selection method can, for example, be:
  • DEGREE: select nodes according to their number of edges with other nodes (highest first);
  • ECCENTRIC: select less central nodes (e.g. PageRank centrality, betweenness centrality, closeness centrality);
  • X-CORE: select nodes that have a number of connections with other nodes that is below a determined threshold.
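The load-balancing pass with the DEGREE selection could be sketched like this; choosing the least occupied partition as the destination is an assumption of the sketch (the text leaves the destination choice open), as are the names optimize_load and n_moves:

```python
def optimize_load(partitions, location, neighbours, n_moves):
    """Shrink the most occupied partition by moving up to n_moves of its
    highest-degree nodes (DEGREE selection) to the least occupied partition."""
    biggest = max(partitions, key=lambda p: len(partitions[p]))
    smallest = min(partitions, key=lambda p: len(partitions[p]))
    # DEGREE: nodes with the highest number of edges first
    movers = sorted(partitions[biggest],
                    key=lambda n: len(neighbours.get(n, ())),
                    reverse=True)[:n_moves]
    for n in movers:
        partitions[biggest].remove(n)
        partitions[smallest].add(n)
        location[n] = smallest
```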
  • Either one of the above described optimization methods can be executed by the partitioner on a regular basis, e.g. with a regular delay, and/or based on the evolution of the graph in terms of number of events arrived, and/or based on the evolution of the improvement of the optimization as measured after each run of the optimization methods.
  • the stream greedy strategy for partitioning is thus efficient, but periodic reconfiguration may improve the partitioning by applying one or more of the above discussed methods for reduction of cut ratio or of load balancing ratio.
  • the following section describes a further variant embodiment of the stream greedy strategy for partitioning with reduction of cut ratio and/or load balancing ratio.
  • A self-tuning improvement is proposed that is based on an average computing time of an application that exploits the data stored in the partitions. Based on this average computing time, it is possible to determine whether the average computing time is lower than it was before reconfiguration of the partitioning.
  • the optimization problem posed is then to reach a certain graph partitioning that minimizes the application's execution time.
  • Since finding a particular graph partitioning is an NP-complete problem (notably the problem of finding the best partitioning for a particular load balancing/cut tradeoff), local search optimization is relied on.
  • Snapshot(): records the current partitioning;
  • getComputeTime(): gets the average computing time over the last period;
  • commit(): keeps the current partitioning;
  • rollback(): returns to the partitioning of the snapshot;
  • buffer(): records all incoming events;
  • random(cutRatio, loadBalanceRatio): returns 'cut' or 'load' based on a random selection;
  • optimizeLoad(): optimizes partitioning for the load balancing ratio.
  • (cTimeAfter - cTimeBefore) / cTimeBefore is compared to a threshold, which can be set to avoid spurious rollback operations of the partitioning when the difference between these two computing times is not significant.
  • The method 'random' can be replaced by an alternating selection of the cutRatio or the loadBalanceRatio during each iteration of the hill climbing method.
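Putting these primitives together, one iteration of the self-tuning hill-climbing loop might be sketched as below. Buffering of incoming events during the optimization is omitted, and all names, as well as the 5% threshold, are assumptions of this sketch:

```python
import random

def hill_climb_step(state, compute_time, optimize_cut, optimize_load,
                    threshold=0.05):
    """One self-tuning iteration: snapshot, optimize, then commit or rollback.

    state is any copyable partitioning; compute_time(state) plays the role
    of getComputeTime(). Returns the partitioning that is kept.
    """
    snapshot = state.copy()          # Snapshot(): record current partitioning
    t_before = compute_time(state)   # cTimeBefore
    # random(cutRatio, loadBalanceRatio): pick the objective to optimize
    if random.choice(("cut", "load")) == "cut":
        optimize_cut(state)
    else:
        optimize_load(state)
    t_after = compute_time(state)    # cTimeAfter
    # rollback() when the relative slowdown exceeds the threshold
    if (t_after - t_before) / t_before > threshold:
        return snapshot
    return state                     # commit(): keep current partitioning
```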
  • FIG. 4 is a flow chart illustrating a particular example embodiment of the method according to the present disclosure.
  • In a step 400, variables and parameters that are used for the method are initialized.
  • Interrelated data items i and j are received from an input stream of data items.
  • In a step 402, it is verified whether the received interrelated data items i and j are stored in any of the partitions of the distributed data storage system. If so, a link between said data items i and j is added in said distributed storage system in a step 408.
  • In a step 403, it is verified whether one of the interrelated data items i or j is already stored in any partition of the distributed data storage system.
  • If so, it is verified in a step 404 whether the partition that comprises the one of said interrelated data items i or j that is already stored is full, i.e. whether the number of data items in that partition has reached a predetermined threshold. If the number is under the predetermined threshold, the data item i or j that is not already in a partition is assigned in a step 405 to the partition that comprises the other one of the interrelated data items i or j, and a link between i and j is added to the distributed data storage system in step 408.
  • Otherwise, a least full partition of the distributed storage system is determined, and the data item i or j that is not already in a partition is assigned to the determined least full partition in a step 406, after which a link is added between data items i and j in the distributed data storage system in step 408.
  • If neither of the data items i and j is already stored in any partition, a least full partition of the distributed storage system is determined, and the data items i and j are assigned to the determined least full partition in a step 407, after which a link is added between data items i and j in the distributed data storage system in step 408.
  • FIG. 5 is an example of a storage partitioner device implementing an example embodiment of the method according to the present disclosure.
  • the device 500 comprises a network interface 501, that is connected to the network (not shown) that comprises the distributed data storage system (not shown), and over which data items (e.g. i and j) are received as an input stream.
  • the device further comprises a CPU 502, or Central Processing Unit, that is capable of executing program instructions that are stored in a memory 503.
  • The memory comprises program instructions that, when executed by CPU 502, realize the method according to the example embodiment of the present disclosure.
  • The device 500 further comprises an optional I/O unit 505 that allows connection, via I/O connection 508, to input/output devices connected to the device 500, such as a keyboard (not shown), a display device (not shown), and an external storage device (not shown).
  • The device 500 further comprises a clock unit that allows, among other things, synchronized operation of the components of the device, and that serves for timing purposes for the communications transmitted and received over network connection 507 and I/O connection 508.
  • the device 500 comprises an internal communication bus 506 that allows intercommunication between the components of the device.
  • CPU 502 functions as a processing means for determining if the interrelated data items i and j are stored in any partition, and if so, for adding a link between said data items i and j in the distributed storage system.
  • The CPU further functions as a processing means for determining if one of the interrelated data items i or j is not already stored in any of the at least one partition; if so, the CPU assigns the interrelated data item i or j that is not already stored in a partition to the partition Pi of the already stored data item i or to the partition Pj of the already stored data item j, if the number of data items in the partition Pi or Pj does not exceed a predetermined threshold; otherwise the CPU assigns the one of the interrelated data items i or j that is not already stored to the partition of the distributed data storage system that comprises the lowest number of data items among any partition of the distributed data storage system, and adds a link between the interrelated data items i and j in the distributed data storage system.
  • CPU 502 functions as a processing means for determining if both of the interrelated data items i and j are not already stored in any of the at least one partition, and if so, assigns the interrelated data items i and j to a partition that comprises the lowest number of data items among any partition, and adds a link between said interrelated data items i and j in the distributed storage system.
  • Figure 6 is a distributed storage data system according to an example embodiment of the present disclosure.
  • the distributed data storage system 600 comprises three storage devices 601, 602, and 603 and a partitioner device 500, which devices are interconnected in a network 604.
  • the partitioner device 500 is for example the device 500 of figure 5.
  • the partitioner 500 is connected via its network connection 507 to network 604, and via the network 604 to storage device 601 via its network connection 608, to storage device 602 via its network connection 610, and to storage device 603 via its network connection 609.
  • Lines 605, 606 and 607 represent links between data items.
  • Data items are represented as black dots, and the lines between the dots represent interrelations between the data items.
  • Each of the storage devices 601 to 603 comprises a partition of a graph of interrelations between the data items stored in the storage devices.
  • Storage device 601 comprises a partition Px 210.
  • Storage device 602 comprises a partition Py 220, and storage device 603 comprises a partition Pz 230.
  • each of the storage devices comprises a table indexing each of the data items that it stores and comprises the links between the data items (internal and external).
  • the partitioner 500 comprises such an indexing table.
  • these embodiments are mixed so that the partitioner 500 and the storage devices 601-603 together comprise information related to the location of a data item (i.e. which data item is stored on which storage device) and the links between data items (i.e. which data item is linked to which other data item).
  • Aspects of the present principles can be embodied as a system, method or computer readable medium. Accordingly, aspects of the present principles can take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code and so forth), or an embodiment combining hardware and software aspects that can all generally be referred to herein as a "circuit", "module" or "system".
  • aspects of the present principles can take the form of a computer readable storage medium. Any combination of one or more computer readable storage medium(s) can be utilized.
  • a computer readable storage medium can take the form of a computer readable program product embodied in one or more computer readable medium(s) and having computer readable program code embodied thereon that is executable by a computer.
  • A computer readable storage medium as used herein is considered a non-transitory storage medium, given the inherent capability to store the information therein as well as the inherent capability to provide retrieval of the information therefrom.
  • a computer readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and device for efficient partitioning of storage of big data in a distributed storage system, capable of handling input streams of data. Related data items received from the input data stream are assigned to a partition if they are not already part of any partition; if one of the related data items is already in a partition, the other one is assigned to that partition if there is enough space in it. If neither of the related data items already exists in any partition, they are assigned to the partition in the distributed data storage system that has the lowest number of data items. Thus, on-line partitioning takes place as data items arrive, keeping the partitioning well balanced and limiting the creation of edges between partitions.

Description

Method of partitioning storage in a distributed data storage system and corresponding device.
1. Field
The present disclosure relates to the technical field of cloud computing and in particular to the technical field of the partitioning of graphs.
2. Background
Relations between items or people, or between items and people, can be expressed in graphs. User-movie graphs link users to the titles of movies that they have viewed; relationship graphs describe social networks; web graphs describe links between web pages. Graphs comprise nodes (or "data items") that are interconnected by edges (or "links"). For "big data" graphs, e.g. graphs that comprise data from millions of users, cloud storage and cloud computing are an appropriate solution to ensure scalability. The advantages of cloud storage and computing are that they allow virtually infinite storage and computing capacity, as storage and computing are distributed over thousands of devices connected in a network. Cloud storage allows parallelization of computing. Cloud storage of a big data graph implies that the graph is split or partitioned into so-called partitions, so that each of the storage devices only stores a number of nodes of the graph. Then, each of the storage devices can compute partial results over the graph nodes stored in its partition. However, load balancing is to be taken care of when partitioning a graph. The objectives of load balancing are twofold: each partition should have a reasonable number of graph nodes, so that each storage device gets an equal share of graph nodes, while the number of edges between partitions should remain limited, in order to reduce interdependency between partitions. Partition interdependency should be kept low, because partition interdependency requires communication between storage devices, thereby reducing the efficiency of the storage devices for computing results. If the distribution of the graph nodes over the storage devices is well balanced, the workload for each storage device for computing intermediate results over the graph nodes in its partition is evenly distributed and each storage device can work independently, thereby improving the performance of the parallel computing offered by the distributed storage.
Graph partitioning is a notorious NP-complete problem. Two objectives (the number of nodes per partition and the number of edges with other partitions) are to be taken into account, and a close-to-optimum solution is to be searched for. Searching for such an optimum-satisfying solution is difficult when new input data arrives in a continuous stream of data, as is for example the case with the advent of the Internet of Things (IoT). With IoT it is conceivable that billions of devices stream data; the same applies in already existing cases, for example a continuous stream of user-content data from millions of Video on Demand (VoD) users. Faced with such a continuous stream of input data, offline (or 'static') partitioning methods are no longer a satisfying solution. This is especially true for applications that require low latency, such as on-line recommendation web applications. As the graph structure continuously changes due to continuous node and edge creations, repartitioning from scratch is inefficient because it is too time consuming; as soon as the graph is repartitioned, the repartitioning has become stale due to the stream of new data that has arrived during the time required for the repartitioning. The document by Stanton and Kliot, "Streaming graph partitioning for large distributed graphs", in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, New York, NY, USA, ACM (2012), pp. 1222-1230, describes taking as input a stream of events (data items) and distributing those events as graph nodes to partitions, in the hope that after a long time, the local decisions made by the machine receiving the stream are good and do not cause heavy imbalance and many crossing edges between the partitions.
Once a node has been placed on a partition, it is never moved from that partition; however, if some events arrive in an order that is not favorable to the algorithm, the choice made by the receiving machine may be far from optimal, resulting in a bad (far from optimum) partitioning. According to Stanton, nodes are placed on machines and are left there despite graph evolution; as the graph evolves when data continues to arrive, the graph may become badly balanced after a while. It would thus be desirable to handle the real-time requirement that arises when one wants to handle input data streams, and to be able to repartition some poorly balanced partitions of the graph while it is evolving. There is thus a need for further optimization of graph partitioning techniques so that they are adapted to input data streams.
3. Summary
The aim of the present disclosure is that of solving at least some of the drawbacks of the prior art.
To this aim the present disclosure comprises a method for partitioning storage in a distributed data storage system, the distributed data storage system storing data items and relations between the data items, the distributed data storage system comprising a plurality of storage devices, each storage device of the plurality being arranged for storing data items in at least one storage partition, the relations being stored in the distributed data storage system as data representative of links between data items, the method receiving, from an input stream of data items, interrelated data items i and j, the method being characterized in that it is implemented by a partitioning device, and the method comprising: if it is determined that the interrelated data items i and j are stored in any of the at least one partition, adding data representative of a link between the data items i and j in the distributed data storage system; if it is determined that one of the interrelated data items i or j is not already stored in any of the at least one partition, assigning the one of the interrelated data items i or j that is not already stored to a partition Pi of the already stored data item i or to a partition Pj of the already stored data item j if (404) the number of data items in the partition Pi or Pj does not exceed a predetermined threshold, otherwise assigning the one of the interrelated data items i or j that is not already stored in any of the at least one partition to a partition that comprises a lowest number of data items among any partition, and adding data representative of a link between the interrelated data items i and j in the distributed data storage system; and if it is determined that both of the related data items i and j are not already stored in any of the at least one partition, assigning the interrelated data items i and j to a partition that comprises a lowest number of data items among any partition, and adding data representative of a link between the interrelated data items i and j in the distributed data storage system.
According to a variant embodiment, the method further comprises: for each partition in the distributed data storage system, selecting data items, referred to as top K worst nodes, that have the most links to other data items in other partitions among all partitions in the distributed data storage system; and assigning each of the top K worst nodes to a partition that has the most connections with that node. According to a variant embodiment, the method further comprises: selecting a partition having a largest number of data items among all partitions in the distributed data storage system; selecting, among the data items in the selected partition, the data items having the most links to data items in other partitions; determining a destination partition for inserting the selected data items; and transferring the selected data items to the determined destination partition for inserting.
According to a variant embodiment, the method further comprises: determining a first average computing time for computing results over data items stored in each partition; recording the current partitioning in a memory; recording the reception of the input stream of data items in a buffer; in each iteration of the method, alternately executing the steps of claim 2 or 3; determining a second average computing time for computing results over data items stored in each partition; if a difference between the first and the second average computing time exceeds a predetermined threshold, rolling back to the partitioning recorded in the memory; and inserting the data items received in the buffer into the input stream.
The present disclosure also comprises a device for partitioning storage in a distributed data storage system, the distributed data storage system storing data items and interrelations between the data items, the distributed data storage system being arranged to comprise a plurality of storage devices, each storage device of the plurality being arranged for storing data items in at least one partition, the interrelations being stored in the distributed data storage system as data representative of links between data items, the device comprising: a network interface configured for receiving interrelated data items i and j from an input stream of data items; a processor assembly configured for determining if the interrelated data items i and j are stored in any of the at least one partition, and for adding data representative of a link between the data items i and j in the distributed storage system; the processor assembly being further configured for determining if one of the interrelated data items i or j is not already stored in any of the at least one partition, and for assigning the one of the interrelated data items i or j that is not already stored to a partition Pi of the already stored data item i or to a partition Pj of the already stored data item j if the number of data items in the partition Pi or Pj does not exceed a predetermined threshold, otherwise for assigning the one of the interrelated data items i or j that is not already stored in any of the at least one partition to a partition that comprises a lowest number of data items among any partition, and for adding a link between the interrelated data items i and j in the distributed data storage system; and the processor assembly being further configured for determining if both of the interrelated data items i and j are not already stored in any of the at least one partition, for assigning the interrelated data items i and j to a partition that comprises a lowest number of data items among any partition, and for adding a link between the interrelated data items i and j in the distributed data storage system.
The present disclosure also relates to a computer program product downloadable from a communication network and/or recorded on a medium readable by computer and/or executable by a processor, comprising program code instructions for implementing the method according to claim 1.
The present disclosure further relates to a non-transitory computer-readable medium comprising a computer program product recorded thereon and capable of being run by a processor, including program code instructions for implementing the method according to claim 1. While not explicitly described, the present embodiments may be employed in any combination or sub-combination.
4. Brief description of Drawings
Other characteristics and advantages of the different described example embodiments will appear when reading the following description and the annexed drawings. The embodiments described hereafter are merely provided as examples and are not meant to be restrictive.
Figure 1 illustrates a process of graph partitioning and its consequences related to edge cut and load balancing;
Figure 2 illustrates the partitioning method according to an example embodiment of the present disclosure;
Figure 3 illustrates a further example embodiment;
Figure 4 is a flow chart illustrating an example embodiment of the present disclosure;
Figure 5 is a device implementing an example embodiment of the method according to the present disclosure;
Figure 6 is a network implementing a distributed data storage system according to an example embodiment of the present disclosure.
5. Description of embodiments
According to different embodiments there is proposed a partitioning method and device that takes each new node/edge created and places it "greedily" in a partition. The example method and device operate over a continuous flow or 'stream' of 'events' or data items arriving for example at a data center, which can be seen as a continuous stream of updates to the graph. The term "greedy", when applied to an algorithm, means an algorithm that follows a problem-solving heuristic of making a locally optimal choice at each stage in the hope of finding a global optimum. "Greedy" heuristics may yield locally optimal solutions that approximate a global optimal solution in reasonable time. In the context of the example embodiment, the greedy partitioning makes it possible to obtain good partitioning results in real time, that is, at the time scale of the stream of events arriving at the input. In other terms, according to the example embodiment, a growing graph is partitioned in real time, 'on the fly' or 'online'. This approach is to be opposed to 'offline' or 'static' partitioning. The method and device of the example embodiment propose a dynamic and incremental way of handling growing graph additions that arrive at a data center, to be dynamically placed in, added to, or assigned to a graph partition. The growing graph additions can be seen as a continuous stream of data or events, and we can therefore use the wording 'streamed graph'. The partitioning is thus done in real time, as events arrive: for example, a friend addition on Facebook, an href link added on a web page, or a VoD movie seen by a user. The greedy placement of graph edges/nodes according to the method of the example embodiment can be complemented, according to variant embodiments, with periodic repartitioning.
Good graph partitioning is often assessed by a small number of graph edges that cross partitions, see Figure 1. Graph 100 can be partitioned for example as in 110 or 120 (other partitions are possible, but not shown). Partitions are illustrated as being contained within a box (rectangle), with a letter P. The number of edges that cross partitions should be reduced as much as possible, while trying to maintain equivalent sizes. P1 in 110 comprises 3 edges or links that cross partitions P1 and P2. Partition P1 comprises 4 nodes, while partition P2 comprises 7 nodes. P3 in 120 comprises 2 edges that cross partitions P3 and P4. As P2, P3 comprises 7 nodes. As P1, P4 comprises 4 nodes. When comparing 110 with 120, one can say that the different partitionings have obtained equally balanced partitions, but they do not comprise an equal number of edges or links crossing the partitions: 120 comprises fewer crossing edges than 110. It is easy to understand that simple partitioning by placing graph nodes at random in partitions would result in many crossing edges, thus creating many inter-partition dependencies, which would slow down parallel computation, because in order to compute a result over a partition, it is necessary to obtain data from many other partitions. Good partitioning thus enforces self-contained or dense graphs within partitions, i.e. with tightly connected nodes and few edges to other partitions, which is advantageous for parallel computing, since very little data from the outside is needed when computing a result over a partition. This is especially true when a partition is stored by a same device; reducing the connections of the partition with other partitions reduces the number of network communications and data transfers required when computations are to be done over the nodes in the partition.
In Figure 2, 210, 220 and 230 represent a graph that is partitioned in three partitions. 200 represents a web application, transmitting a data item e_i,j to a decisional block 201. The decisional block is for example implemented by a partitioner that is responsible for the partitioning. The partitioner 201 takes the decision to add nodes i and j to partition 230 and to add edge or link ij to partition 230 because neither node i nor node j exists in any of the partitions and partition 230 has capacity to receive some additional nodes, depicted in partition 230 as dashed rounds. An edge or link can be implemented as a reference in a table, for example in a table of nodes or data items: in the table entry of data item i, there is a reference to data item j. The edge ij is depicted as a dashed line connecting nodes i and j. Thus, in order to be able to take a decision where to add nodes or edges, the partitioner has information of which node is in which partition. The decisions made by the partitioner can be summarized as follows:

if ∃i and ∃j then
    // Data items i and j already exist in some partition.
    Add link, whatever their respective partitions.
else if ∃i and ∄j then
    // Data item i, but not data item j, does already exist in a partition.
    Add j to the partition of i if that partition has enough capacity.
    Otherwise, place j in the least occupied partition. Add link(i, j).
else // i.e. ∄i and ∄j
    // Neither data item i nor data item j does already exist in a partition.
    Place both i and j in the least occupied partition. Add link(i, j).

where ∃i, respectively ∄i, stands for "node i exists in a partition", respectively "node i does not exist in a partition". The function Add link is a function hosted by the partitioner, which informs the devices that host i and j to link those nodes via an edge.
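The decision rules above can be sketched in Python. This is an illustrative sketch only: the data structures `partitions` (partition id to set of node ids), `location` (node id to partition id), `edges` (set of node pairs) and the `capacity` threshold are our assumptions, not names from the present disclosure, and the symmetric case where only j already exists is made explicit.

```python
def place_edge(i, j, partitions, location, edges, capacity):
    # Greedy placement of one streamed edge (i, j), following the rules above.
    def least_occupied():
        return min(partitions, key=lambda p: len(partitions[p]))

    def assign(node, preferred):
        # Prefer the partition of the already placed endpoint if it has room.
        if preferred is not None and len(partitions[preferred]) < capacity:
            target = preferred
        else:
            target = least_occupied()
        partitions[target].add(node)
        location[node] = target

    if i in location and j in location:
        pass                          # both exist: only the link is added
    elif i in location:
        assign(j, location[i])        # co-locate j with i if possible
    elif j in location:
        assign(i, location[j])        # symmetric case
    else:
        target = least_occupied()     # neither exists: least occupied partition
        partitions[target].update((i, j))
        location[i] = location[j] = target
    edges.add((i, j))                 # data representative of the link
```

A stream of four events then fills the partitions greedily: once the preferred partition reaches the capacity threshold, the next node overflows to the least occupied one.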
Computationally, this is a lightweight partitioning method, which we will refer to as the "stream-greedy strategy"; it requires the partitioner to perform membership tests and cardinality operations over a mapping table, for finding the least occupied partition. The stream-greedy partitioner of the example embodiment is a one-pass partitioner. The partitioner is capable of handling streamed events and of building a graph and partitioning it in real time. In the described partitioning method, edges, once placed, are never moved from one partition to another. According to an advantageous variant embodiment, the stream-greedy method is enhanced by periodically improving the cut ratio or the load balancing ratio of the partitions. The cut ratio is the number of edges that a partition has with other partitions, as compared with that of other partitions. The load balancing ratio is the number of nodes in a partition as compared with that of other partitions. Variant embodiments exist to define the cut and load balancing ratios, such as: for the cut ratio, the number of cut edges divided by the total number of edges in the whole graph; and for the load balancing ratio, the number of nodes in the least occupied partition divided by the number of nodes in the most occupied partition.
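As an illustration only, the two ratio variants just mentioned could be computed as follows, assuming `edges` is a set of node pairs, `location` maps a node to its partition, and `partitions` maps a partition to its set of nodes (these names are ours, not the disclosure's):

```python
def cut_ratio(edges, location):
    # Number of cut edges divided by the total number of edges in the graph.
    cut = sum(1 for (u, v) in edges if location[u] != location[v])
    return cut / len(edges)

def load_balancing_ratio(partitions):
    # Nodes in the least occupied partition divided by nodes in the most occupied one.
    sizes = [len(nodes) for nodes in partitions.values()]
    return min(sizes) / max(sizes)
```

A cut ratio close to 0 and a load balancing ratio close to 1 both indicate a good partitioning under these definitions.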
To improve the cut ratio, the following method applies, which we call optimizeCut:
1. For each partition, select the top K worst nodes (i.e. the nodes that have the most connections to other nodes in other partitions);
2. Assign these top K worst nodes to partitions that share the most neighbors (i.e. for each of these top K nodes, find a partition that has the most connections with that node);
3. Transfer the nodes to the partitions found; and
4. Update the information stored by the partitioner in order to reflect the new partitioning.
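As an illustrative sketch only, the optimizeCut steps could be implemented as below. The data structures (`partitions`, `location`, `edges`) and the tie-breaking behavior are our assumptions, not part of the disclosure; step 4 is represented by updating `location` and `partitions` in place.

```python
from collections import defaultdict

def optimize_cut(partitions, location, edges, k):
    # Build an adjacency map from the edge list.
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    def external_degree(n):
        # Links from n to data items stored in other partitions.
        return sum(1 for m in nbrs[n] if location[m] != location[n])

    # Step 1: per partition, select the top K worst nodes.
    worst = []
    for nodes in partitions.values():
        worst.extend(sorted(nodes, key=external_degree, reverse=True)[:k])

    # Steps 2-4: move each worst node to the partition holding most of its
    # neighbors, and update the partitioner's mapping accordingly.
    for n in worst:
        best = max(partitions,
                   key=lambda p: sum(1 for m in nbrs[n] if location[m] == p))
        if best != location[n]:
            partitions[location[n]].discard(n)
            partitions[best].add(n)
            location[n] = best
```

For instance, a node placed apart from all of its neighbors is pulled into the partition where those neighbors live, removing its cut edges.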
Steps 1, 2 and 4 can be executed by the partitioner or alternatively by the partitions themselves; for instance, selected nodes are moved to partitions comprising nodes with which they have the most edges. To improve the load balancing ratio, the following method applies, which we call optimizeLoad:
1. Select the partitions that have the largest number of nodes;
2. Of these partitions, select the nodes according to a selection method described hereafter;
3. transfer the nodes to a destination partition that minimizes cut and gives good load balancing, e.g. using LDG (Linear Deterministic Greedy), which in practice transfers a node to the partition where it has the most edges, weighting this choice by a linear penalty function based on the capacity of the partition; and
4. update the information stored by the partitioner in order to reflect the new partitioning.
Steps 1, 3 and 4 can be executed by the partitioner or alternatively by the devices storing the partitions themselves; for instance, selected nodes are moved to partitions comprising nodes with which they have the most edges.
The above mentioned selection method can, for example, be:
• RANDOM: select nodes in a random manner;
• DEGREE: select nodes according to their number of edges with other nodes (highest first);
• ECCENTRIC: select less central nodes (e.g. PageRank centrality, betweenness centrality, closeness centrality);
• X-CORE: select nodes that have a number of connections with other nodes that is below a determined threshold.
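By way of illustration, the DEGREE selection and an LDG-style destination choice could be sketched as follows. All names and the exact penalty form are our assumptions; LDG is described in the literature with several variants, and this sketch only shows the "neighbors weighted by a linear fullness penalty" idea mentioned above.

```python
def select_degree(nodes, nbrs, count):
    # DEGREE selection: nodes with the most edges first.
    return sorted(nodes, key=lambda n: len(nbrs[n]), reverse=True)[:count]

def ldg_destination(n, nbrs, partitions, location, capacity):
    # LDG-style choice: favor the partition holding most neighbors of n,
    # discounted by a linear penalty on how full that partition already is.
    def score(p):
        shared = sum(1 for m in nbrs[n] if location.get(m) == p)
        return shared * (1 - len(partitions[p]) / capacity)
    return max(partitions, key=score)
```

With this scoring, a nearly full partition loses out to an emptier one even when both hold a neighbor of the node being moved, which is exactly the cut/load tradeoff sought by optimizeLoad.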
Either one of the above described optimization methods can be executed by the partitioner on a regular basis, e.g. with a regular delay, and/or based on the evolution of the graph in terms of number of events arrived, and/or based on the evolution of the improvement of the optimization as measured after each run of the optimization methods. The stream greedy strategy for partitioning is thus efficient, but periodic reconfiguration may improve the partitioning by applying one or more of the above discussed methods for reduction of cut ratio or of load balancing ratio.
The following section describes a further variant embodiment of the stream-greedy strategy for partitioning with reduction of the cut ratio and/or load balancing ratio. In particular, a self-tuning improvement is proposed that is based on an average computing time of an application that exploits the data stored in the partitions. Based on this average computing time, it is possible to determine whether the average computing time is lower than before reconfiguration of the partitioning. The optimization problem posed is then to reach a graph partitioning that minimizes the application's execution time. As finding a particular graph partitioning is an NP-complete problem (notably the problem of finding the best partitioning for a particular load balancing/cut tradeoff), local search optimization is relied on. Considering the average computing time measurement and the two above-discussed improvement variants, i.e. cut ratio and load balancing ratio, we propose the following optimization method, illustrated in Figure 3. This figure shows, horizontally, improvement of the cut ratio (310) and, vertically, improvement of the load balancing ratio (300). As we are handling streamed events, the graph constantly evolves. We therefore cannot decide which optimization would have been best to apply, i.e. cut ratio or load balancing ratio, based on an evaluation of the average computing time on a "static", non-evolving graph of which the partitioning is optimized according to either one of the different optimization strategies. We thus make a random choice in either one direction, and then act as a function of the evaluated average application computing time. This is shown in figure 3, where each arrow from one point to another is a random choice. Broken arrows indicate hypothetical other choices that could have been made for hypothetical other improvements. Point 320 is the start point.
Ellipse 321 indicates a "sweet spot" that is reached after periodic optimizations. Each optimization is indicated by a dot. When no further optimization is possible in either of the two directions, the configuration has reached the sweet spot 321. A few primitives are needed to implement this self-tuning variant embodiment:
• snapshot(): records the current partitioning;
• getComputeTime(): get the average computing time over the last period;
• commit(): keep the current partitioning;
• rollback(): return to the partitioning of the snapshot;
• buffer(): record all incoming events;
• flushBuffer(): consume buffered incoming events;
• random(cutRatio, loadBalanceRatio): returns 'cut' or 'load' based on a random selection;
• optimizeCut(): optimize partitioning for the cut ratio;
• optimizeLoad(): optimize partitioning for the load balancing ratio.
With these primitives, we propose an optimization method using a blind hill-climbing approach that relies on application feedback, as illustrated by the following pseudo code fragment:

while True do
    cTimeBefore = getComputeTime();
    snapshot();
    buffer();
    if random(cutRatio, loadBalanceRatio) == cut then
        optimizeCut();
    else
        optimizeLoad();
    cTimeAfter = getComputeTime();
    if ((cTimeAfter - cTimeBefore) / cTimeBefore > threshold) then
        rollBack();
    else
        commit();
    flushBuffer();
endWhile

The value (cTimeAfter - cTimeBefore) / cTimeBefore is compared to a threshold, which can be set to avoid spurious rollback operations of the partitioning when the difference between the two measurements is not significant. Alternatively, the method 'random' can be replaced by an alternating selection of the cutRatio and the loadBalanceRatio optimizations during each iteration of the hill-climbing method.
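For illustration, one iteration of this feedback loop could be expressed as the following Python sketch, where the primitives are passed in as callables. All names, the callable-based design and the default threshold are our assumptions; the disclosure does not prescribe an implementation.

```python
def tune_once(get_compute_time, snapshot, rollback, commit,
              optimize, buffer, flush_buffer, threshold=0.05):
    # One hill-climbing iteration: try an optimization, keep it only if the
    # measured average computing time did not degrade beyond the threshold.
    t_before = get_compute_time()
    snapshot()                # remember the current partitioning
    buffer()                  # start buffering incoming events
    optimize()                # optimizeCut() or optimizeLoad(), chosen by caller
    t_after = get_compute_time()
    if (t_after - t_before) / t_before > threshold:
        rollback()            # the change hurt: restore the snapshot
    else:
        commit()              # the change helped (or is neutral): keep it
    flush_buffer()            # replay buffered events into the input stream
```

Passing the primitives as callables keeps the loop testable: a degraded computing time triggers the rollback branch, otherwise the new partitioning is committed, and buffered events are flushed in both cases.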
Figure 4 is a flow chart illustrating a particular example embodiment of the method according to the present disclosure. In a step 400, variables and parameters used by the method are initialized. In a step 401, interrelated data items i and j are received from an input stream of data items. In a step 402, it is verified whether the received interrelated data items i and j are stored in any of the partitions of the distributed data storage system. If so, a link between said data items i and j is added to said distributed storage system in a step 408. In a step 403, it is verified whether one of the interrelated data items i or j is already stored in any partition of the distributed data storage system. If that is the case, it is verified in a step 404 whether the partition that comprises the already stored one of said interrelated data items i or j is full, i.e. whether the number of data items in that partition has reached a predetermined threshold. If the number is under the predetermined threshold, the data item i or j that is not already in a partition is assigned in a step 405 to the partition that comprises the other one of the interrelated data items i or j, and a link between i and j is added to the distributed data storage system in step 408. If the number is equal to or greater than the threshold, a least full partition of the distributed storage system is determined, and the data item i or j that is not already in a partition is assigned to the determined least full partition in a step 406, after which a link is added between data items i and j in the distributed data storage system in step 408. Finally, if in step 403 it is determined that neither i nor j is already in any partition, a least full partition of the distributed storage system is determined, and the data items i and j are assigned to the determined least full partition in a step 407, after which a link is added between data items i and j in the distributed data storage system in step 408.
After step 408, the method returns to step 401, where it is ready to receive new data items from the input stream. Figure 5 is an example of a storage partitioner device implementing an example embodiment of the method according to the present disclosure. The device 500 comprises a network interface 501, which is connected to the network (not shown) that comprises the distributed data storage system (not shown), and over which data items (e.g. i and j) are received as an input stream. The device further comprises a CPU 502, or Central Processing Unit, that is capable of executing program instructions stored in a memory 503. The memory comprises program instructions that, when executed by CPU 502, realize the method according to the example embodiment of the present disclosure. The device 500 further comprises an optional I/O unit 505 that allows connecting, via I/O connection 508, input/output devices to the device 500, such as a keyboard (not shown), a display device (not shown), and an external storage device (not shown). The device 500 further comprises a clock unit that, among others, allows synchronized operation of the components of the device, and that serves for timing purposes for the communications transmitted and received over network connection 507 and I/O connection 508. Finally, the device 500 comprises an internal communication bus 506 that allows intercommunication between the components of the device. CPU 502 functions as a processing means for determining if the interrelated data items i and j are stored in any partition, and if so, for adding a link between said data items i and j in the distributed storage system.
The CPU further functions as a processing means for determining if one of the interrelated data items i or j is not already stored in any of the at least one partition, and if so, the CPU assigns the interrelated data item i or j that is not already stored in a partition to a partition Pi of the already stored data item i or to a partition Pj of the already stored data item j, if the number of data items in the partition Pi or Pj does not exceed a predetermined threshold; otherwise the CPU assigns the one of the interrelated data items i or j that is not already stored to a partition of the distributed data storage system that comprises a lowest number of data items among any partition of the distributed data storage system, and adds a link between the interrelated data items i and j in the distributed data storage system. Finally, CPU 502 functions as a processing means for determining if both of the interrelated data items i and j are not already stored in any of the at least one partition, and if so, assigns the interrelated data items i and j to a partition that comprises a lowest number of data items among any partition, and adds a link between said interrelated data items i and j in the distributed storage system. Figure 6 illustrates a distributed data storage system according to an example embodiment of the present disclosure. The distributed data storage system 600 comprises three storage devices 601, 602, and 603 and a partitioner device 500, which devices are interconnected in a network 604. The partitioner device 500 is for example the device 500 of figure 5. The partitioner 500 is connected via its network connection 507 to network 604, and via the network 604 to storage device 601 via its network connection 608, to storage device 602 via its network connection 610, and to storage device 603 via its network connection 609. Lines 605, 606 and 607 represent links between data items.
Data items are represented as black dots, and the lines between the dots represent interrelations between the data items. Each of the storage devices 601 to 603 comprises a partition of a graph of interrelations between the data items stored in the storage devices. Storage device 601 comprises a partition Px 210. Storage device 602 comprises a partition Py 220, and storage device 603 comprises a partition Pz 230. The partitions 210, 220 and 230 and the interrelations between the data items in these partitions have already been discussed within the context of figure 2. The links 605, 606 and 607 between data items in different partitions ("external" links) and the links between the data items within each partition ("internal" links) are not physical connections but logical connections, and are for example implemented as pointers in a table or as URLs. According to a particular example embodiment of the present disclosure, each of the storage devices comprises a table indexing each of the data items that it stores and comprising the links between the data items (internal and external). According to a variant embodiment, the partitioner 500 comprises such an indexing table. According to a further variant embodiment, these embodiments are mixed so that the partitioner 500 and the storage devices 601-603 together comprise information related to the location of a data item (i.e. which data item is stored on which storage device) and the links between data items (i.e. which data item is linked to which other data item). As will be appreciated by one skilled in the art, aspects of the present principles can be embodied as a system, method or computer readable medium.
Accordingly, aspects of the present principles can take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code and so forth), or an embodiment combining hardware and software aspects that can all generally be referred to herein as a "circuit", "module" or "system". Furthermore, aspects of the present principles can take the form of a computer readable storage medium. Any combination of one or more computer readable storage medium(s) can be utilized.
Thus, for example, it will be appreciated by those skilled in the art that the block diagrams presented herein represent conceptual views of illustrative system components and/or circuitry embodying the principles of the present disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable storage media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
A computer readable storage medium can take the form of a computer readable program product embodied in one or more computer readable medium(s) and having computer readable program code embodied thereon that is executable by a computer. A computer readable storage medium as used herein is considered a non-transitory storage medium given the inherent capability to store the information therein as well as the inherent capability to provide retrieval of the information therefrom. A computer readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. It is to be appreciated that the following, while providing more specific examples of computer readable storage mediums to which the present principles can be applied, is merely an illustrative and not exhaustive listing as is readily appreciated by one of ordinary skill in the art: a portable computer diskette; a hard disk; a read-only memory (ROM); an erasable programmable read-only memory (EPROM or Flash memory); a portable compact disc read-only memory (CD-ROM); an optical storage device; a magnetic storage device; or any suitable combination of the foregoing.

Claims

1. A method of partitioning storage in a distributed data storage system (600), the distributed data storage system storing data items and relations between said data items, the distributed data storage system comprising a plurality of storage devices (601,602,603), each storage device of said plurality being arranged for storing data items in at least one storage partition (210,220,230), said relations being stored in said distributed data storage system as data representative of links (605,606,607) between data items, said method receiving (401), from an input stream of data items, interrelated data items i and j, the method being characterized in that it is implemented by a partitioning device (500), and in that the method comprises:
if (402) said interrelated data items i and j are stored in any of said at least one partition, adding (408) data representative of a link between said data items i and j in said distributed data storage system;
if (403) one of said interrelated data items i or j is not already stored in any of said at least one partition, assigning (405) said one of said interrelated data items i or j that is not already stored to a partition Pi of the already stored data item i or to a partition Pj of the already stored data item j if (404) the number of data items in said partition Pi or Pj does not exceed a predetermined threshold, otherwise assigning (406) said one of said interrelated data items i or j that is not already stored in any of said at least one partition to a partition that comprises a lowest number of data items among any partition, and adding (408) data representative of a link between said interrelated data items i and j in said distributed data storage system; and
if both of said interrelated data items i and j are not already stored in any of said at least one partition, assigning (407) said interrelated data items i and j to a partition that comprises a lowest number of data items among any partition, and adding (408) data representative of a link between said interrelated data items i and j in said distributed data storage system.
2. The method of partitioning storage in a distributed data storage system according to claim 1, wherein the method further comprises: for each partition in said distributed data storage system, selecting data items, referred to as top K worst nodes, that have the most links to other data items in other partitions among all partitions in said distributed data storage system; and
assigning each of the top K worst nodes to a partition that has the most connections with that node.
3. The method of partitioning storage in a distributed data storage system according to claim 1 or 2, wherein the method further comprises:
selecting a partition having a largest number of data items among all partitions in said distributed data storage system;
selecting, among the data items in the selected partition, the data items having the most links to data items in other partitions;
determining a destination partition for inserting said selected data items; and transferring said selected data items to said determined destination partition for inserting.
4. The method of partitioning storage in a distributed data storage system according to claim 1, wherein the method further comprises:
determining a first average computing time for computing results over data items stored in each partition;
recording the current partitioning in a memory;
recording the reception of input stream of data items in a buffer;
in each iteration of the method, alternatively executing the steps of claim 2 or 3;
determining a second average computing time for computing results over data items stored in each partition;
if a difference between said first and said second average computing time exceeds a predetermined threshold, rolling back to the partitioning recorded in said memory;
inserting the data items received in said buffer into said input stream.
5. A device (500) for partitioning storage in a distributed data storage system, the distributed data storage system storing data items and interrelations between said data items, the distributed data storage system being arranged to comprise a plurality of storage devices, each storage device of said plurality being arranged for storing data items in at least one partition, said interrelations being stored in said distributed data storage system as data representative of links between data items, wherein said device comprises:
a network interface (501) configured for receiving interrelated data items i and j from an input stream of data items;
a processor assembly (502) configured for determining if said interrelated data items i and j are stored in any of said at least one partition, and for adding data representative of a link between said data items i and j in said distributed storage system;
the processor assembly being further configured for determining if one of said interrelated data items i or j is not already stored in any of said at least one partition, and for assigning said one of said interrelated data items i or j that is not already stored to a partition Pi of the already stored data item i or to a partition Pj of the already stored data item j if the number of data items in said partition Pi or Pj does not exceed a predetermined threshold, otherwise for assigning (406) said one of said interrelated data items i or j that is not already stored in any of said at least one partition to a partition that comprises a lowest number of data items among any partition, and for adding a link between said interrelated data items i and j in said distributed data storage system; and
the processor assembly being further configured for determining if both of said interrelated data items i and j are not already stored in any of said at least one partition, for assigning said interrelated data items i and j to a partition that comprises a lowest number of data items among any partition, and for adding a link between said interrelated data items i and j in said distributed data storage system.
6. A computer program product downloadable from a communication network and/or recorded on a medium readable by computer and/or executable by a processor, comprising program code instructions for implementing the method according to claim 1.
7. A non-transitory computer-readable medium comprising a computer program product recorded thereon and capable of being run by a processor, including program code instructions for implementing the method according to claim 1.
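The partition-assignment rule recited in claim 5 can be sketched as follows. This is an illustrative Python sketch only, not the claimed implementation: class and method names, the `capacity` parameter, and the in-memory data structures are assumptions introduced for clarity. When interrelated data items i and j arrive from the input stream, an unassigned item is co-located with its already-stored counterpart if that partition has not exceeded the threshold; otherwise (or when both items are new) the partition comprising the lowest number of data items is used, and the link between i and j is recorded.

```python
class Partitioner:
    """Greedy streaming partitioner sketching the logic of claim 5 (names assumed)."""

    def __init__(self, num_partitions, capacity):
        self.capacity = capacity                        # predetermined threshold on partition size
        self.partitions = [set() for _ in range(num_partitions)]
        self.location = {}                              # data item -> partition index
        self.links = set()                              # data representative of links between items

    def _least_loaded(self):
        # Index of the partition that comprises the lowest number of data items.
        return min(range(len(self.partitions)),
                   key=lambda p: len(self.partitions[p]))

    def _assign(self, item, p):
        self.partitions[p].add(item)
        self.location[item] = p

    def add_edge(self, i, j):
        """Process one pair of interrelated data items i and j from the input stream."""
        stored_i, stored_j = i in self.location, j in self.location
        if stored_i and stored_j:
            pass                                        # both already stored: only add the link
        elif stored_i or stored_j:
            stored, new = (i, j) if stored_i else (j, i)
            p = self.location[stored]
            if len(self.partitions[p]) < self.capacity:
                self._assign(new, p)                    # co-locate with the stored neighbour
            else:
                self._assign(new, self._least_loaded()) # partition full: use least-loaded one
        else:
            p = self._least_loaded()                    # both items new: least-loaded partition
            self._assign(i, p)
            self._assign(j, p)
        self.links.add((i, j))                          # record the interrelation in the store
```

As a usage illustration, with three partitions and a capacity of two, a pair of new items lands together in one partition, while a third item linked to a full partition spills over to the least-loaded one, so that related items are grouped whenever the size threshold permits.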
PCT/EP2014/071658 2013-10-18 2014-10-09 Method of partitioning storage in a distributed data storage system and corresponding device Ceased WO2015055502A2 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
EP13306438 2013-10-18
EP13306438.6 2013-10-18
EP14305601 2014-04-24
EP14305601.8 2014-04-24

Publications (2)

Publication Number Publication Date
WO2015055502A2 true WO2015055502A2 (en) 2015-04-23
WO2015055502A3 WO2015055502A3 (en) 2015-06-18

Family

ID=51663209

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2014/071658 Ceased WO2015055502A2 (en) 2013-10-18 2014-10-09 Method of partitioning storage in a distributed data storage system and corresponding device

Country Status (1)

Country Link
WO (1) WO2015055502A2 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10025804B2 (en) 2014-05-04 2018-07-17 Veritas Technologies Llc Systems and methods for aggregating information-asset metadata from multiple disparate data-management systems
US10073864B1 (en) 2014-05-04 2018-09-11 Veritas Technologies Llc Systems and methods for automated aggregation of information-source metadata
US10078668B1 (en) 2014-05-04 2018-09-18 Veritas Technologies Llc Systems and methods for utilizing information-asset metadata aggregated from multiple disparate data-management systems
US10635645B1 (en) 2014-05-04 2020-04-28 Veritas Technologies Llc Systems and methods for maintaining aggregate tables in databases
US10817510B1 (en) 2014-05-04 2020-10-27 Veritas Technologies Llc Systems and methods for navigating through a hierarchy of nodes stored in a database
CN113946283A (en) * 2020-07-16 2022-01-18 美光科技公司 Partial region memory unit handling in a partition namespace of a memory device
CN113946283B (en) * 2020-07-16 2024-04-12 美光科技公司 Partial region memory unit handling in a partition namespace of a memory device
CN114581221A (en) * 2022-05-05 2022-06-03 支付宝(杭州)信息技术有限公司 Distributed computing system and computer device

Also Published As

Publication number Publication date
WO2015055502A3 (en) 2015-06-18

Similar Documents

Publication Publication Date Title
Mondal et al. Managing large dynamic graphs efficiently
US20160350146A1 (en) Optimized hadoop task scheduler in an optimally placed virtualized hadoop cluster using network cost optimizations
US10628417B2 (en) Physical planning of database queries using partial solutions
Du et al. Scientific workflows in iot environments: A data placement strategy based on heterogeneous edge-cloud computing
US9367344B2 (en) Optimized assignments and/or generation virtual machine for reducer tasks
Eskandari et al. T3-scheduler: A topology and traffic aware two-level scheduler for stream processing systems in a heterogeneous cluster
Alfarrarjeh et al. Scalable spatial crowdsourcing: A study of distributed algorithms
WO2019179250A1 (en) Scheduling method, scheduler, storage medium, and system
Mayer et al. Graph: Heterogeneity-aware graph computation with adaptive partitioning
Zhou et al. On achieving efficient data transfer for graph processing in geo-distributed datacenters
CN110119405B (en) Distributed parallel database resource management method
CN103530182A (en) Working scheduling method and device
Achary et al. Dynamic job scheduling using ant colony optimization for mobile cloud computing
Sreenivas et al. Load balancing techniques: Major challenge in Cloud Computing-a systematic review
CN103918239A (en) Load balancing method, device, system and computer readable medium
You et al. Scalable load balancing in cluster storage systems
WO2015055502A2 (en) Method of partitioning storage in a distributed data storage system and corresponding device
Zacheilas et al. Dynamic load balancing techniques for distributed complex event processing systems
Wang et al. A branch-and-price framework for optimal virtual network embedding
CN106155785B (en) A kind of data migration method across data center's cloud computing system
CN111506431A (en) Method for optimizing perception load performance of cloud server under energy consumption constraint
Shabeera et al. Optimising virtual machine allocation in MapReduce cloud for improved data locality
Breitgand et al. Network aware virtual machine and image placement in a cloud
Xia et al. Data locality-aware big data query evaluation in distributed clouds
CN106933882B (en) Big data increment calculation method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14781591

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14781591

Country of ref document: EP

Kind code of ref document: A2