WO2013153029A1 - Method and system for managing and processing data in a distributed computing platform - Google Patents
- Publication number
- WO2013153029A1 (PCT application PCT/EP2013/057302)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- key
- nodes
- operations
- reduce
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/184—Distributed file systems implemented as replicated file system
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
Method and system for managing and processing data in a distributed
computing platform
Field of the art
The present invention generally relates, in a first aspect, to a method for managing and processing data in a distributed computing platform, and more specifically to a method for optimizing the performance of data management in distributed computing platforms.
A second aspect of the invention relates to a system arranged for implementing the method of the first aspect.
Prior State of the Art
With the huge amount of data generated and available nowadays, new applications (or revisited old ones) are appearing to process and take advantage of all this information. But these new applications cannot run on a single machine, so solutions look for scalability by dividing the problem among a number of machines.
MapReduce, or map&reduce, is a programming model for expressing distributed computations on massive datasets and an execution framework for large- scale data processing. It allows distributing very large data problems on a cluster of machines.
MapReduce (MR) is a paradigm evolved from functional programming and applied to distributed systems. It was presented in 2004 by Google [1]. It is meant for processing problems whose solution can be expressed in commutative and associative functions.
Basically, MR offers an abstraction for processing large datasets on a set of machines, configured in a cluster.
All data of these datasets is stored, processed and distributed in the form of key-value pairs, where both the key and the value can be of any data type.
From the field of functional programming, it is proven that any problem whose solution can be expressed in terms of commutative and associative functions can be expressed with two types of functions: map (also named map in the MR paradigm) and fold (named reduce in the MR paradigm). Any job must be expressed as a sequence of these functions. These functions have a restriction: they operate on some input data and produce a result without side effects, i.e. without modifying either the input data or any global state. This restriction is the key point that allows easy parallelization.
Given a list of elements, map takes as an argument a function f (that takes a single argument) and applies it to all elements in the list (the top part of Figure 1). Fold takes as arguments a function g (that takes two arguments) and an initial value: g is first applied to the initial value and the first item in the list, and the result is stored in an intermediate variable. This intermediate variable and the next item in the list serve as the arguments to a second application of g, the result of which is again stored in the intermediate variable. This process repeats until all items in the list have been consumed; fold then returns the final value of the intermediate variable. Typically, map and fold are used in combination. The output of one function is the input of the next one (as functional programming avoids state and mutable data, all the computation must progress by passing results from one function to the next one), and functions of this type can be cascaded until the job is finished.
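As a minimal illustration (not part of the patent text), the following standalone C++ snippet expresses the same map and fold pattern with the standard library; the particular functions f(x) = x*x and g(acc, x) = acc + x are arbitrary choices made for the example.

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        // map: apply f(x) = x * x to every element of the list
        std::vector<int> input = {1, 2, 3, 4, 5};
        std::vector<int> squared(input.size());
        std::transform(input.begin(), input.end(), squared.begin(),
                       [](int x) { return x * x; });

        // fold: combine the mapped elements with g(acc, x) = acc + x,
        // starting from the initial value 0
        int sum = std::accumulate(squared.begin(), squared.end(), 0,
                                  [](int acc, int x) { return acc + x; });

        std::cout << "sum of squares = " << sum << std::endl;  // prints 55
        return 0;
    }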
In the map type of function, a user-specified computation is applied over all input records in a dataset. As the result depends only on the input data, the task can be split among any number of instances (the mappers), each of them working on a subset of the input data, and can be distributed among any number of machines. These operations occur in parallel. Every key-value pair in the input data is processed, and each one can produce zero, one or multiple key-value pairs, with the same or different information. The mappers yield intermediate output that is then passed to the reduce functions.
The reduce phase has the function of aggregating the results disseminated in the map phase. In order to do so, all the results from all the mappers are sorted by the key element of the key-value pair, and the operation is distributed among a number of instances (the reducers, also running in parallel among the available machines). The platform guarantees that all the key-value pairs with the same key are presented to the same reducer. This phase thus has the possibility of aggregating the information emitted in the map phase.
The job to be processed can be divided into any number of implementations of these two-phase cycles.
The platform provides the framework to execute these operations distributed in parallel over a number of CPUs. The only point of synchronization is at the output of the map phase, where all the key-values must be available to be sorted and redistributed.
This way, the developer only has to care about the implementation (according to the limitations of the paradigm) of the map and reduce functions, and the platform hides the complexity of data distribution and synchronization. Basically, the developer can access the combined resources (CPU, disk, memory) of the whole cluster in a transparent way. The utility of the paradigm arises when dealing with big data problems, where a single machine does not have enough memory to handle all the data, or where its local disk would not be big and fast enough to cope with all the data.
The entire process can be presented in a simple, typical example: word frequency computing in a large set of documents. A simple word count algorithm in MapReduce is shown in Figure 2. This algorithm counts the number of occurrences of every word in a text collection. Input key-value pairs take the form of (docid, doc) pairs stored on the distributed file system, where the former is a unique identifier for the document, and the latter is the content of the document. The mapper takes an input key-value pair, tokenizes the document, and emits an intermediate key-value pair for every word. The key would be a string (the word itself) while the value is the count of the occurrences of the word (an integer). In an initial approximation, it will be a "1" (denoting that the word has been seen once). The MapReduce execution framework guarantees that all values associated with the same key are brought together in the reducer. Therefore, the reducer simply needs to sum up all counts (ones) associated with each word, and to emit final key-value pairs with the word as the key, and the count as the value.
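As a purely illustrative sketch (not code from the patent or from any particular framework), the following single-process C++ program mimics the map, shuffle and reduce steps of the word count example; the grouping with std::map stands in for the framework's sort-and-shuffle stage.

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // map: (docid, doc) -> list of (word, 1) intermediate pairs
    std::vector<std::pair<std::string, int>> word_count_map(const std::string& doc) {
        std::vector<std::pair<std::string, int>> out;
        std::istringstream tokens(doc);
        std::string word;
        while (tokens >> word) out.emplace_back(word, 1);
        return out;
    }

    // reduce: (word, [counts]) -> total count for that word
    int word_count_reduce(const std::vector<int>& counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    int main() {
        std::vector<std::string> docs = {"to be or not to be", "to do or not to do"};

        // "shuffle": the framework groups all intermediate pairs by key
        std::map<std::string, std::vector<int>> grouped;
        for (const auto& doc : docs)
            for (const auto& kv : word_count_map(doc))
                grouped[kv.first].push_back(kv.second);

        for (const auto& entry : grouped)
            std::cout << entry.first << " -> " << word_count_reduce(entry.second) << "\n";
        return 0;
    }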
This paradigm has had a number of different implementations: the one already presented by Google (patent US 7,650,331); the open source project Apache Hadoop, which is the most prominent and widely used implementation; and a number of implementations of the same concept: Sector/Sphere, Disco (written in Erlang), and Twister (a lightweight MapReduce framework). Microsoft also developed a framework for parallel computing, Dryad, which is a kind of superset of MapReduce.
These implementations have to solve a number of problems (task scheduling, scalability, fault tolerance...). One of these problems is how to ensure that every task will have the input data available as soon as it is needed, without making network and disk input/output the system bottleneck (a difficulty inherent in big-data problems).
Most of these implementations (Google, Hadoop, Sphere, Dryad...) rely for data management on a distributed file system. Data files are split into large chunks (e.g. 64 MB), and these chunks are stored and replicated in a number of data nodes. Tables keep track of how data files are split and where the chunk replicas reside. When scheduling a task, the distributed file system can be queried about the nodes where the input data is stored, and then one of those (or one close to any of them) is selected to execute the operation. This way, network traffic is reduced.
There is a master node (named GFS master in Google's proposal, and NameNode in Hadoop's terminology) that keeps track of the location (the chunkserver in GFS, or DataNode in HDFS) where every chunk of every file is stored. When reading or writing a file to the distributed file system, the application must ask the master node about the server where every chunk lives or will be stored.
The sequence of requests for both cases, as presented in Hadoop documentation, is presented in Figure 4 and Figure 5.
Problems with existing solutions:
The present invention will concentrate on Hadoop and its HDFS, because it is the most used MapReduce implementation and, being open source, it offers a lot of documentation and information, and it is also easy to benchmark, evaluate and compare its performance. The main virtues and problems of this solution are similar to those of other platforms that depend on a distributed file system, particularly the Google MapReduce proposal [1] that Hadoop is based on.
The Hadoop documentation itself [2] presents some types of applications where HDFS is not very well suited:
• Low-latency data access: HDFS is optimized to deliver a high throughput of data, and this may be at the expense of latency. Worse still, in many situations, response time cannot be reduced by increasing the number of machines in the cluster.
• Lots of small files: Since the NameNode (where the distribution of file blocks is kept, Figure 4, Figure 5) holds file system metadata in memory, the limit to the number of files in a file system is governed by the amount of memory on the NameNode.
• When appending more data to a dataset, there is no guarantee that all the blocks will reside in the same host, so it is not possible to guarantee data locality for subsequent operations.
Hadoop is mainly focused on batch processing, and in this scenario these drawbacks are not very limiting (but they could be when trying to apply the MapReduce paradigm to other types of applications).
Commercial MapReduce solutions are mainly meant for very large clusters, where fault tolerance is a must. The architecture of the whole platform must be flexible enough to switch and reconfigure when there is a machine failure, and for this strategy the distributed file system may be the right tool.
The cost to be paid for this flexibility is a suboptimal distribution of data and tasks, with an impact on global performance. There is no guarantee that an available slot for a new task will be found in any of the machines where the input data replicas are stored, so an extra data movement could be needed to launch the task.
Furthermore, in these frameworks, the intermediate data produced after a map operation is first stored in the local disk of the nodes of the cluster (Figure 6). Then the master scheduler copies these files to the distributed file system, and performs the shuffling and sorting if needed. This strategy simplifies the management, particularly the handling of run-time failures, but incurs extra data movement.
All these features provide the fault tolerance requirement, but at a cost in performance (in the scenario of large clusters, this cost is greatly compensated by the possibility of handling thousands of commodity nodes, without problems because of machine or disk failures).
However, there is a medium size cluster niche where, while keeping fault tolerance with less robust methods, a perfect distribution of input data and operations can be achieved so the data always live on the same machine. There is also a niche for streaming (in contrast to batch) operations, where low latency is a must. The invention proposed solves these problems and can be gainfully applied to these scenarios.
Another space for improvement comes from analyzing the operational restrictions of this implementation, taking into account the needs of a particular business.
Although (as it has been presented) any job that can be described in terms of commutative and associative operations can be implemented according to the MapReduce paradigm, its strict definition can make some operations very difficult to implement. For example, a rather common operation when dealing with databases or tables is a JOIN clause that combines records by using values from tuples with a common key. This type of operation should be implemented as a reduce operation, but as a reduce operation receives only a single dataset of key-value pairs, that implementation cannot be as straightforward as desirable.
In the standard implementation (Hadoop), there are three strategies to work with relational JOINs [3]. In order to simplify, the explanation will assume only two tables to be joined:
• reduce-side join: The basic idea behind this solution is to repartition the two datasets by the join key. In the mapper, a composite key is created consisting of the join key and an id identifying the value, different for each of the tables being joined. The sort order of the keys must then be defined to sort first by the join key, and then by the other element of the composite key, grouping the values by their table of origin (a sketch of this composite-key ordering is given after this list). At the reducer, it is guaranteed that all values from one table will arrive before the values of the other, so the former can be kept in memory to combine with the latter. The approach, besides requiring an extra map to create the new intermediary key, is not very efficient, since it requires shuffling both datasets across the network.
• map-side join: In this solution, there is a map over one of the datasets (usually the larger one) and inside this function, the mapper reads the corresponding part of the other dataset to perform the merge join. Both datasets have to be partitioned and sorted in the same way beforehand. No reducer is required. It is more efficient than a reduce-side join, since there is no need to shuffle the datasets over the network. The efficiency of this solution depends on how easy it is to have both datasets partitioned and sorted with the same functions. In some problems, this will come naturally from the same sequence of MapReduce operations, but if it does not, it will also require an identity mapper and reducer to repartition the data.
• memory-backed join: This approach consists of loading the smaller dataset into memory in every mapper, populating an associative array to facilitate random access to tuples based on the join key. Mappers are then applied to the other (larger) dataset, and for each input key-value pair, the mapper probes the in-memory dataset to see if there is a tuple with the same join key. If there is, the join is performed. The problem with this approach is the need for the smaller dataset to fit into memory. If it does not, then it must be partitioned, either directly, or with a distributed key-value store.
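As an illustrative sketch only (plain C++, not Hadoop code), the following shows the composite-key ordering that a reduce-side join relies on; the key values and the convention that table id 0 (Users) sorts before table id 1 (CDRs) are assumptions made for the example.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Composite key for a reduce-side join: the join key plus a table id
    // (0 = Users table, 1 = CDRs table in this hypothetical example).
    struct CompositeKey {
        std::string join_key;
        int table_id;
    };

    // Sort first by the join key, then by the table id, so that for every
    // join key all records from table 0 reach the reducer before those of table 1.
    bool operator<(const CompositeKey& a, const CompositeKey& b) {
        if (a.join_key != b.join_key) return a.join_key < b.join_key;
        return a.table_id < b.table_id;
    }

    int main() {
        std::vector<CompositeKey> keys = {
            {"user42", 1}, {"user7", 0}, {"user42", 0}, {"user7", 1}};
        std::sort(keys.begin(), keys.end());
        for (const auto& k : keys)
            std::cout << k.join_key << " (table " << k.table_id << ")\n";
        return 0;
    }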
A complete analysis of JOIN operations in MapReduce platforms [4] shows that, although there are a number of strategies that can improve JOIN performance, it also reveals the limitations of the MapReduce programming model for implementing joins.
Description of the Invention
It is necessary to offer an alternative to the state of the art which covers the gaps found therein, particularly the lack of proposals which really allow optimizing the performance of data management in a distributed computing platform.
To that end, the present invention provides, in a first aspect, a method for managing and processing data in a distributed computing platform, said distributed computing platform comprising at least one cluster with a plurality of nodes with computing capacity, the method performing at least map and reduce operations on said data.
Contrary to the known proposals, in the method of the first aspect of the invention, each of said plurality of nodes has said data locally available and map, reduce, parsing and/or storing operations are performed on data accessed locally without accessing a remote or centralised storage unit.
Other embodiments of the method of the first aspect of the invention are described according to appended claims 2 to 9, and in a subsequent section related to the detailed description of several embodiments.
A second aspect of the present invention generally comprises a distributed computing platform system for managing and processing data, said distributed computing platform comprising at least one cluster with a plurality of nodes with computing capacity for performing at least map and reduce operations on said data, said plurality of nodes comprise local storage means and are configured to perform map, reduce, parsing and/or storing operations on data retrieved from the local storage means of the respective node, without accessing a remote or centralised storage unit.
Another embodiment of the second aspect of the invention is described according to appended claim 11, and in a subsequent section related to the detailed description of several embodiments.
Brief Description of the Drawings
The previous and other advantages and features will be more fully understood from the following detailed description of embodiments, with reference to the attached
drawings, which must be considered in an illustrative and non-limiting manner, in which:
Figure 1 shows an example of a functional programming diagram, with map (f) and fold (g) functions.
Figure 2 shows an example of the word count algorithm implementation in MapReduce.
Figure 3 shows the GFS architecture.
Figure 4 shows the sequences of requests for flow reading a file in HDFS.
Figure 5 shows the sequences of requests for flow writing a file in HDFS.
Figure 6 shows an example of the Hadoop data distribution.
Figure 7 shows the logical flow of data management, according to an embodiment of the present invention.
Figure 8 shows the data distribution diagram, according to an embodiment of the present invention.
Figure 9 shows the logical data flow at the output of a distributed operation. Information is sent immediately to the destination worker, instead of being stored locally as in the Hadoop implementation.
Figure 10 shows an example of the key-values consolidation.
Figure 11 shows an example of the local data file structure.
Figure 12 shows an example of reading the key-values with associated hash group HGN on a dataset with three files in the worker local disk.
Figure 13 shows an example of the implementation of a multiple-input reduce operation.
Detailed Description of Several Embodiments
The present invention proposes a method to organize and manage the internal data in a MapReduce platform that minimizes the network traffic and guarantees that the input data is locally available (in the local storage) to the machine where the next operation is executed and efficiently processed.
This invention also includes a new functionality for MapReduce operations, allowing internal and external JOINs to be implemented directly in a single reduce operation, with two datasets aligned by key.
The main objectives of the internal data management are:
• For every operation (map or reduce) to be run on a machine of the distributed cluster, input data is available locally in that machine (input data is stored in a hard disk attached to that machine).
• The internal data structure allows an efficient implementation of input/output handling operations.
• The strategy for distributing data across nodes is implemented in the data handling operations themselves, so there is no need for a centralized process to locate a file (the equivalent of an HDFS NameNode or a GFS master), thus avoiding a single point of failure.
• The internal data structure allows efficient appends to a data-set.
• The data distribution across the cluster allows map and reduce operations to have multiple input data streams.
• These multiple input data streams, in the case of a reduce operation, aligned by key, allow JOINs, both internal and external, to be implemented directly, with any number of tables, without any extra data shuffling and without size limitations.
• The data management keeps latency as low as possible. Moreover, this low response time scales linearly with the number of nodes in the cluster.
This data management strategy is meant to distribute data among a cluster of machines, where every node in the cluster will have all the needed data locally available, either to be processed or stored.
This method can optimize the performance of a Big-Data processing cluster implementing the MapReduce paradigm.
This new method also enables a new implementation of straightforward JOIN operations in the MapReduce platform.
Operations management:
Any operation (parse, map, reduce) is distributed among the nodes in the cluster. Each node (or worker) executes a number of instances of these operations. Each of these instances implies a sequence of sub-operations:
1. Input data is read. If needed (for reduce operations), a sort on the key is performed.
2. The body of the operation itself is run. This is the piece of code written by the developer that forms part of a particular solution.
3. Output data is distributed to the nodes of the platform.
4. In each node, data generated everywhere is collected and stored locally.
The first three steps are performed distributed across all the workers in the platform. With the proposed invention, it is guaranteed that data and processing are always local to the worker, and distribution of the data is performed in memory.
The invention relies on how sub-operations 1), 3) and 4) are performed and on how datasets, consisting of groups of key-value pairs, are created, distributed and read. The logical sequences of data handling and distribution are depicted in Figure 7 and explained in the following datasets management section.
Datasets management
Output data is distributed to the nodes of the platform:
At the output of an operation (either map or reduce), key-value pairs are generated and emitted to a specific dataset (in the proposed invention, an operation can have multiple output datasets, so different key-value pairs types can be produced from the same operation).
Key-value pairs are distributed to their storage node without passing through the local disk of the emitting node, as this sub-operation is performed in memory and data is sent through the platform network layer. This process is performed without needing to communicate with a central controller, as there is a fixed strategy for data distribution in the platform, and every worker knows this strategy.
The schema is shown in Figure 8. When an instance of an operation (parse, map or reduce) running in a worker (#2, for example) generates and emits a key-value pair (dotted line), it is distributed through the platform network layer to the final storage node (or nodes, if redundancy is enabled). The distribution criterion is based on the key field.
A hash function is computed on the key. The space of possible results of the hash function is partitioned in order to manage a more practical number of groups. After the hash on the key and the partitioning function, the key-value pair has an associated number that identifies which worker it is going to be sent to, and that also determines how it will be stored there. This number is called the hash group.
A standard function can be used to compute the hash and to perform the partitioning of the data, globally optimized and associated to the type of the key, or it can be user provided. This gives the developer the freedom to control the data (and so the task) distribution for specific problems.
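A minimal sketch of this hash-group assignment (not the platform's actual code) is given below; the choice of std::hash, the 65536 hash groups (a figure the text later mentions as typical), the eight workers and the contiguous range-per-worker strategy are all assumptions made for illustration.

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <string>

    constexpr uint32_t kNumHashGroups = 65536;  // typical figure cited in the text
    constexpr uint32_t kNumWorkers = 8;         // illustrative cluster size

    // 1) hash the key, 2) partition the hash space into hash groups
    uint32_t hash_group(const std::string& key) {
        return static_cast<uint32_t>(std::hash<std::string>{}(key)) % kNumHashGroups;
    }

    // Fixed strategy known by every worker: contiguous ranges of hash groups
    // map to workers, so no central controller has to be consulted.
    uint32_t worker_for(uint32_t hg) {
        return hg / ((kNumHashGroups + kNumWorkers - 1) / kNumWorkers);
    }

    int main() {
        std::string key = "user42";
        uint32_t hg = hash_group(key);
        std::cout << "key " << key << " -> hash group " << hg
                  << " -> worker " << worker_for(hg) << std::endl;
        return 0;
    }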
Knowing the hash group of the key, the platform (and every worker) also knows which worker the data belongs to (and so, which node must execute the operations on the data with this same hash group). The data is thus sent to the predefined worker (or workers) (Figure 9). The output is managed in memory, and then sent through the network, without being written down to disk. This is one of the core aspects of the invention. In other solutions ([2]), the output of the operation is written down to local disk. Then it is sorted, shuffled and copied to the distributed file system, where it is merged. This extra data movement has a big impact on performance.
In each node, data generated everywhere is collected and stored locally:
In each node (Figure 10), the key-values produced at all the operation instances running in the other workers (and in itself), with a hash group assigned to that worker, are received by a dedicated process (Consolidator). For every hash group, key-values are kept in memory until either the source operations running in the workers have finished, or a chunk large enough to be dumped to a file has been collected (e.g., 1 GByte).
All the key-values must be available in memory because they have to be written to the disk file as a whole, and have to be sorted beforehand. Then it is possible to dump them to a disk file, so a local file holding all the key-values accumulated to be processed (or replicated) at the worker is created. If there are more key-value pairs pending, they will be accumulated in the next file.
Since this file size limitation (e.g., 1 GB) can produce dataset fragmentation, a compacting command is provided that can read back all the dataset files generated at each worker and reorganize the hash groups into a new set of disk files. Fragmentation of the hash groups among a large number of disk files can have a negative impact on performance, so this operation is very useful when the same dataset will be used as input to different operations. The key-value pair distribution strategy, together with this possibility of compacting hash groups, is the basis for allowing appends to the dataset while keeping good performance.
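The sketch below illustrates the consolidation step just described, under stated assumptions: the name Consolidator, the per-hash-group buffering and the size threshold come from the text, while the data types, the tiny demo threshold and the printout that stands in for the actual file dump are invented for the example.

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical consolidator: buffer incoming key-values per hash group and
    // dump them to a local file once the accumulated size reaches a threshold
    // (1 GB in the text; a tiny value is used here only for the demo).
    class Consolidator {
    public:
        explicit Consolidator(std::size_t max_bytes) : max_bytes_(max_bytes) {}

        void receive(uint32_t hash_group, const std::string& key, const std::string& value) {
            buffer_[hash_group].emplace_back(key, value);
            bytes_ += key.size() + value.size();
            if (bytes_ >= max_bytes_) flush();
        }

        void flush() {
            if (buffer_.empty()) return;
            // std::map keeps hash groups ordered, so a real dump could write one
            // contiguous section per hash group, as in the file structure below.
            std::cout << "dumping file #" << ++file_count_
                      << " with " << buffer_.size() << " hash groups\n";
            buffer_.clear();
            bytes_ = 0;
        }

    private:
        std::size_t max_bytes_;
        std::size_t bytes_ = 0;
        int file_count_ = 0;
        std::map<uint32_t, std::vector<std::pair<std::string, std::string>>> buffer_;
    };

    int main() {
        Consolidator c(64);  // 64-byte threshold, purely for illustration
        c.receive(100, "user7", "30");
        c.receive(100, "user42", "120");
        c.receive(200, "user99", "10");
        c.flush();  // dump whatever remains when the source operations finish
        return 0;
    }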
The pre-known (although flexible and modifiable) distribution of hash group values among the workers guarantees the locality of operations and their corresponding input data. This simplifies task distribution and boosts performance. This is another core aspect of the invention.
Input data is read:
The structure of the disk files holding the datasets has been designed to offer the best performance for the input data reading sub-operation.
The structure of a local data file is the same for all the workers and for all the key-value datasets (Figure 11). The file has three parts:
• A header with information about the file contents and its type and coding
• One structure per hash group identifier (typically 65536, but this number can be changed in the platform when dealing with very large datasets). Each structure has:
o The offset to the file section where the key-values which correspond to the selected hash group are stored
o The size of the set of key-value pairs belonging to the hash group (this information could be computed from the offset of the next hash group, but the extra size cost simplifies the management)
• The sequence of all the key-values that correspond to the hash groups represented in the file.
The files are intended to be 1 GB (configurable) in size. If the dataset key-values to be processed by the worker are larger than the file size, the worker's section of the dataset will be held in more than one file. If there are not enough key-values to fill the 1 GB file, it can be smaller.
As presented in Figure 11, for all the dataset files in every worker, space is reserved to represent all the possible hash groups (from 0 to 65536). The distribution strategy guarantees that in every worker, only key-values having specific hash groups will arrive and be stored, and so only a subset of this space will be used (a different subset in different workers). This represents a small penalty, but it simplifies data handling and offers a simple way to work with duplicated files across the cluster nodes, and so it gives some fault tolerance.
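The following C++ declarations sketch that three-part layout; the field names, widths and the fixed-size index are assumptions made for illustration, not an actual file format specification.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>

    constexpr std::size_t kNumHashGroups = 65536;  // typical figure from the text

    struct FileHeader {
        char     magic[8];        // identifies the file type and coding
        uint32_t version;         // version of the contents/coding
        uint64_t num_key_values;  // total number of key-values held in this file
    };

    struct HashGroupEntry {
        uint64_t offset;  // offset of this hash group's key-value section in the file
        uint64_t size;    // size of that section (stored explicitly to simplify management)
    };

    struct LocalDataFileLayout {
        FileHeader header;                                 // part 1: header
        std::array<HashGroupEntry, kNumHashGroups> index;  // part 2: one entry per hash group
        // part 3: the variable-length sequence of key-values for the hash groups
        // actually present in this worker's file follows the index on disk.
    };

    int main() {
        std::cout << "index size: " << sizeof(HashGroupEntry) * kNumHashGroups
                  << " bytes for " << kNumHashGroups << " hash groups\n";
        return 0;
    }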
When the key-values have to be read, in order to be processed, the worker retrieves its designated hash groups. Knowing the size of the data associated with every hash group, and the memory size available for the operation, it reads the maximum size of key-values in a single chunk, thus reducing the number of I/O operations and maximizing throughput. In each file, the chunk corresponding to the requested key-values is large enough to allow an efficient implementation of the reading function.
In Figure 12, the key-values of a dataset have been stored in three files in the worker local disk. In order to retrieve the data corresponding to a set of hash groups (from hash group N to hash group N+k, for a total size of 1 GB), the platform has to read from the three files. As one hash group is distributed among more and more disk files, the reading function may become less and less efficient (many readings of small size). The compacting functionality presented previously performs a defragmentation of the hash group distribution, and allows recovering the reading efficiency.
If needed (if the operation to be run is a Reduce), a local sort is run on every hash group. Each sort operation is very efficient since it is performed with the key-values from each hash group, and is performed in memory after reading the hash group.
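A hedged sketch of this read-then-sort path is shown below; reading the per-file sections, concatenating them into a single chunk and sorting in memory follow the description above, while the data types and the sample values are invented for the example.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    using KV = std::pair<std::string, std::string>;

    // Hypothetical read path: gather the sections for hash groups [N, N+k]
    // from every local file of the dataset, then sort in memory on the key
    // before handing the key-values to a reduce instance.
    std::vector<KV> read_hash_group_range(const std::vector<std::vector<KV>>& file_sections) {
        std::vector<KV> chunk;
        for (const auto& section : file_sections)      // one large read per file
            chunk.insert(chunk.end(), section.begin(), section.end());
        std::sort(chunk.begin(), chunk.end(),          // local in-memory sort on the key
                  [](const KV& a, const KV& b) { return a.first < b.first; });
        return chunk;
    }

    int main() {
        std::vector<std::vector<KV>> sections = {
            {{"user7", "30"}, {"user42", "120"}},   // section stored in file 1
            {{"user42", "95"}},                     // section stored in file 2
            {{"user99", "10"}}};                    // section stored in file 3
        for (const auto& kv : read_hash_group_range(sections))
            std::cout << kv.first << " -> " << kv.second << "\n";
        return 0;
    }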
With this architecture, there is no restriction on the number of input datasets an operation can have. In particular, this property simplifies and speeds up operations like JOINs, very frequent in database-oriented, big-data applications. This structure is key to the data management in the platform. This functionality is fully described in the next section, as it constitutes one of the main claims of the invention.
Multiple inputs, multiple outputs operations:
In the previous section, a new data management strategy for MapReduce key-value datasets has been presented, with a fixed distribution of key-value pairs by hash group.
This distribution strategy guarantees that all the key-value pairs with a common key, from all the datasets in the platform, are stored in the local disk of the same machine.
Another innovation in this MapReduce implementation is to allow any map or reduce operation to have several input or output datasets (multiple-input map operations do not make much sense).
An immediate consequence of having all the key-value pairs with a common key, from all the datasets in the platform, stored in the local disk of the same machine, is that it is very simple and efficient to implement multiple input reduce operations (Figure 13). All the key-value pairs from all the datasets in the platform, with a common key, are kept in the same machine, so reduce operations will always have local access to any required dataset.
Obviously, the only requirement is that all input datasets must have the same data type as key, so the key can be compared among them.
JOIN operations implementation:
As a result of both previous innovations, reduce operations with several input datasets will have the datasets automatically partitioned and sorted in the same way, so the basic condition to perform a JOIN type operation is automatically met in the platform. This is possible because the dataset distribution strategy has put all the key-values with a common key, from its input datasets, in the local disk of the same machine. So there is no conflict (as there would be in a platform based on a distributed file system, like Apache Hadoop and HDFS) in assigning reducers to machines where the local data lives.
This implementation of JOINs is as efficient as a standard reduce operation and has no size restrictions (unlike map-side or reduce-side JOINs in Hadoop).
This is easily illustrated with an example. The job is to cross Call Data Records (CDRs) with the clients table, in order to detect patterns associated with demographic or contract information. There is a table of CDRs, and a table of Users, both having the UserId as key field (Table 1 and Table 2).
The first part of the job (to accumulate the duration of all the calls from a user) can be implemented in just one reduce operation. In Telefonica's platform formalism:
reduce accum_per_Contract
    in  CDR    # key: UserId,   value: CDRInfo
    in  User   # key: UserId,   value: UserInfo
    out Consum # key: Contract, value: Consumption
The task could be implemented as a reduce operation that has as first input the sequence of CDRs, and as second input the User table. CDRs and Users have been partitioned and sorted, and each instance of accum_per_Contract is run in a machine where all the needed key-value pairs have been previously stored. Furthermore, each instance is invoked with all the key-value pairs corresponding to a unique key, and only those. So the JOIN operation is really simple:
class accum_per_Contract : public samson::Reduce
{
public:
    void run(samson::KVSetStruct* inputs, samson::KVWriter* writer)
    {
        // This function is invoked once per key...
        if ((inputs[0].num_kvs == 0) || (inputs[1].num_kvs == 0))
        {
            return;
        }
        // ... so in each run, only info from one UserId will be processed.
        // If there is more than one entry in the User table, it is an error.
        if (inputs[1].num_kvs != 1)
        {
            samson::system::UInt userId;
            userId.parse(inputs[1].kvs[0]->key);
            OLM_E(("UserId %lu with multiple entries in User table", userId.value));
            return;
        }
        samson::system::UInt userId;
        CDRInfo cdr;
        UserInfo user;
        // Extract the user information from the User table
        userId.parse(inputs[1].kvs[0]->key);
        user.parse(inputs[1].kvs[0]->value);
        // And now it just processes every CDR from that user.
        // The platform guarantees that all her/his CDRs (if any),
        // and only those, will be available in the stream of inputs[0].
        size_t dur_accum = 0;
        for (size_t i = 0; i < inputs[0].num_kvs; ++i)
        {
            cdr.parse(inputs[0].kvs[i]->value);
            dur_accum += cdr.duration;
        }
        // And now it emits the selected information from the UserInfo,
        // with the accumulated duration
        writer->emit(0, &user.Contract, &dur_accum);
    }
};
Embodiments of the Invention:
The invention has been implemented in a Big-Data platform, a MapReduce framework (in experimental prototype status). With this framework, different problems relevant to Telefonica business and other typical problems have been programmed and evaluated.
A few examples of problems implemented:
• Social Network Analysis (SNA): Analysis of CDRs to detect user communities.
• Top_hits: Analysis of web navigation logs, in order to detect the most active hits per category, and recommend them according to the user profile.
• Navigation Traffic (OSN): Analysis of patterns navigation in mobile users, classified by type of URL, type of device, etc.
• Mobility: Analysis of CDRs to detect mobility patterns.
Advantages of the Invention
The invention is meant to solve, in a very efficient way, the problem of data management in a medium-size MapReduce (or another parallel paradigm, like Dryad) platform.
This solution offers the best performance for some special types of applications that are not so well solved with the MapReduce solutions available in the market.
The main advantages of the invention are:
1. When running a map or reduce operation in a machine, needed data are always found in the machine's local disk.
2. When generating new key-value pairs, data are not necessarily stored on an intermediate disk, but are emitted to the final worker machine, where they will be stored (due to memory constraints, they may be temporarily kept on disk until the final sort and consolidation).
3. Advantages 1 ) and 2) guarantee a low latency response, a must when working on streaming or almost real-time scenarios.
4. As the data distribution strategy is implemented in the data handling operations themselves, there is no need for a centralized process to locate a file (the equivalent of an HDFS NameNode or a GFS master), thus avoiding a single point of failure.
5. Advantage 4) allows a more robust distributed implementation.
6. Time response scales linearly with the number of machines in the cluster.
7. The output key-value pairs of new operations can be appended to the datasets, keeping the consistency and locality of the data, without blurring through the cluster nodes.
8. Advantages 6) and 7) allow the implementation of streaming solutions on big data problems, closing the gap between batch processing and real-time processing.
9. Map and Reduce operations can have any number of inputs and outputs, simplifying thus the number of cycles to be run to complete a job.
10. In reduce operations, all the input datasets have the key-value pairs with the same key stored in the same machine, so input data are always local.
11. Advantages 9) and 10) enable the possibility to have very efficient and straightforward JOIN operations, implemented as a reduce. The strategy works for inner and outer JOINs, and has no memory limit on the table sizes.
ACRONYMS
CDR Call Data Record
GFS Google File System
HDFS Hadoop Distributed File System
MR Map-Reduce
SNA Social Network Analysis
URL Uniform Resource Locator
REFERENCES
[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified data processing on large clusters", in Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150, San Francisco, California, 2004.
[2] Tom White, "Hadoop: The Definitive Guide", Chapter 3, O'Reilly / Yahoo Press, 2010.
[3] Jimmy Lin and Chris Dyer, "Data-Intensive Text Processing with MapReduce", Morgan & Claypool, 2010.
[4] Spyros Blanas et al., "A Comparison of Join Algorithms for Log Processing in MapReduce", SIGMOD 2010, Indianapolis, 2010.
Claims
1.- A method for managing and processing data in a distributed computing platform, said distributed computing platform comprising at least one cluster with a plurality of nodes with computing capacity, the method performing at least map and reduce operations on said data, wherein the method is characterized in that each of said plurality of nodes has said data locally available and map, reduce, parsing and/or storing operations are performed on data accessed locally without accessing a remote or centralised storage unit.
2.- The method of claim 1, wherein said plurality of nodes define a strategy for said data distribution across said plurality of nodes depending on the task performed.
3.- The method of claim 2, wherein said plurality of nodes distribute said data to the final storage node of said plurality of nodes.
4.- The method of claim 2, wherein said data is distributed to at least part of said plurality of nodes.
5.- The method of claim 2, wherein said data is distributed to all of said plurality of nodes.
6.- The method of any of the previous claims, wherein said strategy for said data distribution is based on a key field.
7.- The method of claim 2, comprising performing said data distribution in a memory of said plurality of nodes.
8.- The method of claim 1, further comprising performing said reduce operations on a plurality of different data.
9.- The method of claim 8, wherein the data of said plurality of different data that are the same between them, i.e. that share the same key, are stored in the same node of said plurality of nodes.
10.- A distributed computing platform system for managing and processing data, said distributed computing platform comprising at least one cluster with a plurality of nodes with computing capacity for performing at least map and reduce operations on said data, wherein the system is characterized in that said plurality of nodes comprise local storage means and are configured to perform map, reduce, parsing and/or storing operations on data retrieved from the local storage means of the respective node, without accessing a remote or centralised storage unit.
11.- The distributed computing platform system of claim 10, which implements a method according to any of the previous claims.