
WO2022250910A1 - Scaling deep graph learning in distributed setting - Google Patents

Scaling deep graph learning in distributed setting

Info

Publication number
WO2022250910A1
WO2022250910A1 (PCT/US2022/027744)
Authority
WO
WIPO (PCT)
Prior art keywords
machine
graph
node
nodes
machines
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2022/027744
Other languages
French (fr)
Inventor
Anand Padmanabha Iyer
Swapnil Sunilkumar GANDHI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC filed Critical Microsoft Technology Licensing LLC
Priority to EP22725615.3A (EP4348507A1)
Publication of WO2022250910A1
Anticipated expiration
Legal status: Ceased


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0499 Feedforward networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/09 Supervised learning
    • G06N3/098 Distributed learning, e.g. federated learning
    • G06N3/0985 Hyperparameter optimisation; Meta-learning; Learning-to-learn

Definitions

  • the subject matter disclosed herein generally relates to methods, systems, and machine-readable storage media for improving the operation of predictive systems based on Graph Neural Networks (GNNs).
  • GNNs are one of the fastest-growing subareas in deep learning (e.g., Deep Neural Networks (DNNs)), which has become a powerful tool for several challenging applications in diverse fields such as computer vision, speech recognition, and natural language processing, producing results on par with human experts.
  • a key challenge in GNNs is in scaling computations for large input graphs.
  • Some systems may have graphs with data sizes that could be several terabytes or even petabytes of data. Processing this large amount of data requires a large amount of computer resources and long training periods. What is needed are systems that can speed up the processing of large GNNs.
  • Figure 1 illustrates the iterative graph propagation for learning in a GNN, according to some example embodiments.
  • Figure 2 illustrates the embedding of nodes in a GNN, according to some example embodiments.
  • Figure 3 illustrates an architectural framework for distributed graph processing, according to some example embodiments.
  • Figure 4 illustrates an example of building a two-layer neighborhood graph, according to some example embodiments.
  • Figure 5 illustrates a scheme for partitioning a graph structure and corresponding features for distributed processing, according to some example embodiments.
  • Figure 6 illustrates a P3 (Pipelined Push-Pull) method for partitioning a graph structure and corresponding features for distributed processing, according to some example embodiments.
  • Figure 7 illustrates the process for generating embeddings, according to some example embodiments.
  • Figure 8 illustrates the pipeline processing for creating the embeddings, according to some example embodiments.
  • Figure 9 illustrates the performance improvements achieved using the P3 method, according to some example embodiments.
  • Figure 10 illustrates the training and use of a machine-learning model, according to some example embodiments.
  • Figure 11 is a flowchart of a method for generating embeddings for nodes in a GNN, according to some example embodiments.
  • Figure 12 is a block diagram illustrating an example of a machine upon or by which one or more example process embodiments described herein may be implemented or controlled.
  • Example methods, systems, and computer programs are directed to generating embeddings in GNNs such that the embeddings of the nodes in the GNN represent the proximity among related nodes. Examples merely typify possible variations. Unless explicitly stated otherwise, components and functions are optional and may be combined or subdivided, and operations may vary in sequence or be combined or subdivided. In the following description, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of example embodiments. It will be evident to one skilled in the art, however, that the present subject matter may be practiced without these specific details.
  • methods and systems are provided for scaling the training of GNNs in a distributed environment.
  • the scalability challenges in GNNs are fundamentally different from those in existing deep-learning and distributed graph processing, where commonly used techniques, such as intelligent partitioning of the graph, do not yield desired results.
  • a new approach for distributed GNN training is presented that eliminates overhead when partitioning the processing across different machines and uses a new pipelined push-pull parallelism-based execution for the learning task.
  • the system provides an Application Programming Interface (API) that captures different classes of GNN architectures. When used with a caching strategy, the evaluation shows that the system accelerates processing by a factor of seven when compared with existing distributed GNN frameworks.
  • an execution strategy that combines intra-layer model parallelism with data parallelism to significantly reduce network communications and allow for pipelining computing and communication operations.
  • P3 scales to large graphs gracefully and achieves significant performance benefits.
  • One method includes operations for assigning each node from a graph of the GNN to one machine from a group of machines, assigning each feature from a group of features to one machine, and calculating, at each machine, an embedding for each node assigned to the machine.
  • calculating an embedding for a first node in a first machine comprises calculating a computation graph for the first node, obtaining data for the features assigned to the first machine, computing partial activations for a first layer of the graph, pulling partial activations for the first node from other machines, and processing the partial activations through a remainder of the layers to obtain an embedding for the first node.
  • FIG. 1 illustrates the iterative graph propagation for learning in a GNN, according to some example embodiments.
  • a graph (e.g., 102, 104) is a structure amounting to a set of objects in which some pairs of the objects are related. The objects correspond to mathematical abstractions called nodes 106 (also referred to as vertices or points) and each of the related pairs of vertices is called an edge (also referred to as link or line).
  • a graph is depicted in diagrammatic form as a set of dots or circles for the vertices, joined by lines or curves for the edges.
  • a neural network is a computing system based on consideration of biological neural networks of animal brains. Such systems progressively improve performance, which is referred to as learning, to perform tasks, typically without task-specific programming. For example, in image recognition, a neural network may be taught to identify images that contain an object by analyzing example images that have been tagged with a name for the object and, having learned the object and name, may use the analytic results to identify the object in untagged images.
  • a neural network is based on a collection of connected units called neurons, where each connection, called a synapse, between neurons can transmit a unidirectional signal with an activating strength that varies with the strength of the connection. The receiving neuron can activate and propagate a signal to downstream neurons connected to it, typically based on whether the combined incoming signals, which are from potentially many transmitting neurons, are of sufficient strength, where strength is a parameter.
  • a DNN is a stacked neural network, which is composed of multiple layers.
  • the layers are composed of nodes, which are locations where computation occurs, loosely patterned on a neuron in the human brain, which fires when it encounters sufficient stimuli.
  • a node combines input from the data with a set of coefficients, or weights, that either amplify or dampen that input, which assigns significance to inputs for the task the algorithm is trying to learn. These input-weight products are summed, and the sum is passed through what is called a node’s activation function, to determine whether and to what extent that signal progresses further through the network to affect the ultimate outcome.
  • a DNN uses a cascade of many layers of non-linear processing units for feature extraction and transformation.
  • Each successive layer uses the output from the previous layer as input.
  • Higher-level features are derived from lower-level features to form a hierarchical representation.
  • the layers following the input layer may be convolution layers that produce feature maps that are filtering results of the inputs and are used by the next convolution layer.
  • a regression, which is structured as a set of statistical processes for estimating the relationships among variables, can include a minimization of a cost function.
  • the cost function may be implemented as a function to return a number representing how well the neural network performed in mapping training examples to correct outputs.
  • backpropagation is used, where backpropagation is a common method of training artificial neural networks that are used with an optimization method such as a stochastic gradient descent (SGD) method.
  • Use of backpropagation can include propagation and weight update.
  • When an input is presented to the neural network, it is propagated forward through the neural network, layer by layer, until it reaches the output layer.
  • the output of the neural network is then compared to the desired output, using the cost function, and an error value is calculated for each of the nodes in the output layer.
  • the error values are propagated backwards, starting from the output, until each node has an associated error value which roughly represents its contribution to the original output.
  • Backpropagation can use these error values to calculate the gradient of the cost function with respect to the weights in the neural network.
  • the calculated gradient is fed to the selected optimization method to update the weights to attempt to minimize the cost function.
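  • For illustration only (not part of the original disclosure), the following minimal PyTorch sketch shows the forward pass, cost-function evaluation, backpropagation, and SGD weight update described above; the network shape, learning rate, and random data are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# A small feedforward network standing in for the neural network described above.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
cost_fn = nn.CrossEntropyLoss()                           # cost function comparing output to desired output
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # stochastic gradient descent

inputs = torch.randn(8, 16)          # a batch of training examples (random placeholders)
targets = torch.randint(0, 4, (8,))  # desired outputs (labels)

outputs = model(inputs)              # forward propagation, layer by layer
loss = cost_fn(outputs, targets)     # error value computed by the cost function

optimizer.zero_grad()
loss.backward()                      # backpropagation: error values propagated backwards, gradients computed
optimizer.step()                     # gradients fed to SGD to update the weights
```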
  • GNNs are neural network architectures that operate on a graph.
  • the aim of a GNN is for each node in the graph to learn an embedding containing information about its neighborhood (nodes directly connected to the target node via edges).
  • the nodes in the input graph are associated with features and labels.
  • Typical tasks in GNNs include node classification (predicting the label of a node), link prediction (predicting the possibility of a link between given nodes), and graph classification (predicting the category for the graph).
  • GNNs combine feature information with graph structure to learn representations of nodes, which are low-dimensional vectors that are referred to as embeddings.
  • There are several GNN architectures, including GraphSAGE, Graph Convolution Networks (GCNs), and Graph Attention Networks (GATs). While each of these types of GNNs has its own advantages, they fundamentally differ in how the graph structure is used to learn the embeddings and what neural network transformations are used to aggregate neighborhood information.
  • GNNs learn embeddings by combining iterative graph propagation and DNN operations (e.g., matrix multiplication and convolution).
  • the graph structure is used to determine what to propagate and NNs direct how aggregations are done.
  • Each node creates a k-layer computation graph based on its k-hop neighborhood and the associated features, which can be executed on a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU) using existing DNN architectures.
  • While traditional DNNs train on samples that are independent of each other (e.g., images), GNNs impose dependencies in the form of the graph structure. It is common to have a large number of dense features associated with every node (e.g., ranging from hundreds to several thousands) in the graph, and the k-hop computation graphs created by each node can be very large. Some techniques can help to process these nodes, such as sampling the neighbors at each layer, but depending on the graph structure, even a sampled computation graph and the features may not fit in the memory of a single GPU, making scalability a fundamental issue in GNNs. With the prevalence of large graphs, e.g., billions of nodes and edges, enabling GNN training in a distributed fashion is an important and challenging problem.
  • a system, referred to as P3, is presented that efficiently distributes the training of GNNs on large input graphs. Due to the data dependency in existing distributed training of GNNs, a major fraction of time is spent in network communication to generate the embedding computation graph with features. Further, relying on distributed graph processing techniques, such as advanced partitioning schemes, is useful in the context of graph processing but does not benefit GNNs, and may even be detrimental in some cases. Further yet, due to the high network traffic, GPUs in distributed GNN training are underutilized, spending as much as 80% of the time blocked on communication operations. P3 addresses these problems to eliminate these inefficiencies.
  • P3 introduces a different approach by partitioning the computation graph and features into several subgraphs, which are trained on multiple GPUs, and uses an online partitioner and memory management techniques to leverage CPU and GPU computation capabilities.
  • P3 leverages a key characteristic of GNNs: unlike traditional DNNs where the inputs are small and model parameters are large (e.g., 8 billion for Megatron), GNNs have small model parameters but large inputs due to the large feature vectors associated with each node that account for the majority of network traffic during the GNN execution. Based on this, P3 distributes the graph structure and the features across the machines independently, relying on a random hash partitioner that is fast, computationally simple, and that incurs minimal overhead.
  • While calculating the embeddings, P3 takes a new approach. Instead of creating the computation graph by pulling the neighborhood of a node and the associated features, P3 pulls the graph structure, which is significantly smaller, and proposes push-pull parallelism, a novel approach to executing the computation graph that combines intra-layer model parallelism with data parallelism. P3 does not move features across the network; instead, it pushes the computation graph structure in the most compute-intensive layer (layer 1) to the machines, which results in an even spread of the execution of layer 1. P3 proceeds with the rest of the k layers by pulling the activations, which are much smaller, and switches to data parallelism.
  • Due to the partitioning strategy and the push-pull parallelism execution, P3 is able to use a simple pipelining technique that overlaps most of the computation and communication efficiently, thus effectively hiding (the already small) communication latencies. Further, the partitioning strategy also enables P3 to propose a simple caching mechanism that greedily caches graph and feature partitions on multiple machines if memory permits, resulting in further reduction of network communications.
  • Figure 2 illustrates the embedding of nodes in a GNN, according to some example embodiments.
  • GNNs map graph nodes to a d-dimensional embedding space such that similar nodes in the graph are embedded close to each other.
  • nodes C and D are connected in input graph 202, which means that they are proximate to each other.
  • the embedding ENC(C) 206 of node C results in Z_C and the embedding ENC(D) of node D results in Z_D, within the embedding space 204. Since the nodes are proximate, their embeddings Z_C and Z_D are also proximate. However, node G is not proximate to nodes C or D, so the embedding Z_G is not proximate to Z_C and Z_D.
  • GNNs combine feature information associated with the nodes and the graph structure using information propagated and transformed from its neighborhood.
  • the NN that transforms information at each hop is called a layer in the GNN; hence, a 2-hop (more generally, k-hop) neighborhood translates to a 2-layer (k-layer) GNN.
  • The embedding z_v of node v after K layers of neighborhood aggregation is referred to as h_v^K and can be obtained with the following equations:

    h_v^k = σ( W_k · COMBINE( h_v^{k-1}, AGGREGATE( { h_u^{k-1} : u ∈ N(v) } ) ) ), for k = 1, ..., K
    z_v = h_v^K

  • h_v^k is the feature vector of node v at layer k, W_k is the trainable weight matrix that is shared by all nodes for layer k, N(v) are the neighbors of node v, and σ is a non-linearity applied at each layer for tuning the model.
  • h_v^0 is initialized using the node features. The choice of AGGREGATE and COMBINE defines the layers and how the embeddings are computed.
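  • The following is a minimal sketch (not from the original disclosure) of the layer update above, assuming mean aggregation, concatenation as COMBINE, and ReLU as the non-linearity; the dimensions and neighbor lists are illustrative placeholders.

```python
import torch
import torch.nn as nn

class GNNLayer(nn.Module):
    """One layer k: h_v^k = sigma(W_k * COMBINE(h_v^{k-1}, AGGREGATE({h_u^{k-1} : u in N(v)})))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(2 * in_dim, out_dim)  # trainable weight matrix W_k shared by all nodes

    def forward(self, h, neighbors):
        # h: [num_nodes, in_dim] embeddings from layer k-1; neighbors[v] lists the nodes in N(v)
        agg = torch.stack([h[nbrs].mean(dim=0) for nbrs in neighbors])  # AGGREGATE (mean)
        combined = torch.cat([h, agg], dim=1)                           # COMBINE (concatenation)
        return torch.relu(self.W(combined))                             # sigma = ReLU

# h^0 is initialized with the node features; stacking K layers yields z_v = h_v^K.
features = torch.randn(6, 8)                          # 6 nodes, 8 features each (placeholders)
neighbors = [[1, 4], [2, 3], [0], [0, 4], [1], [2]]   # placeholder neighbor lists
layer1, layer2 = GNNLayer(8, 16), GNNLayer(16, 16)
z = layer2(layer1(features, neighbors), neighbors)    # 2-layer GNN embeddings
```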
  • the embeddings may be used for generating recommendations. For example, if the user buys two products in an Internet commerce website, then the commerce website assigns a link to the products, because they are related, at least by the fact that the same customer bought both products at the same time. The commerce website may generate a very large graph with all these nodes and the links between the nodes. When a customer buys a product, the commerce website can recommend related products by selecting products with embeddings that are in proximity to the embedding of the product that the customer is buying.
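  • As a hedged illustration of such a recommendation lookup (assuming the embeddings have already been computed, and using cosine similarity, which is one possible proximity measure):

```python
import torch

def recommend(product_id, embeddings, k=3):
    """Return the k products whose embeddings are closest (cosine similarity) to the given product."""
    query = embeddings[product_id]
    sims = torch.nn.functional.cosine_similarity(query.unsqueeze(0), embeddings)
    sims[product_id] = float("-inf")          # exclude the product itself
    return torch.topk(sims, k).indices.tolist()

embeddings = torch.randn(1000, 64)            # placeholder embeddings for 1000 products
print(recommend(42, embeddings))              # products related to product 42
```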
  • the system creates a neighborhood graph for a given product, e.g., a one- or two-hop neighborhood.
  • the GNN computes the neighborhood graphs, but because there may be millions of nodes, the graph may be really large with billions of data points.
  • the graph is distributed across multiple machines for training.
  • Each machine fetches this information from other machines and then processes the computation graph to create the embeddings for the nodes assigned to the machine. Fetching this information from all other machines requires a large amount of computing resources, including network use.
  • When the machine fetches the neighborhood information, it also has to fetch metadata (e.g., how many people purchased the item, the price of the item, the popularity of the item, how many items are in stock), and this metadata is used to generate the embeddings.
  • This information about the node is referred to as the features of the node, and the embedding of the node is referred to as the label of the node.
  • Figure 3 illustrates an architectural framework for distributed graph processing, according to some example embodiments.
  • Existing frameworks for GNNs, such as the Deep Graph Library (DGL), partition the input graph across the machines in the cluster.
  • machine M1 302 includes a host-memory area 304 for CPU operations and a device memory area 306 for GPU operations.
  • the computation graph for each node in the batch is generated by pulling the k-hop neighborhood of each node along with the associated features 312 to the K-hop sampler 314 (circle 2). This requires communication with other machines (e.g., M2, M3, M4) in the cluster.
  • standard DNN techniques, such as mini-batch training, are used to execute the computation: mini-batches are created and copied to the GPU memory, and then the model computation is triggered to obtain the GNN model.
  • the first challenge is the formation of communication bottlenecks due to data dependencies.
  • Unlike traditional DNNs, where the training data are independent of each other (e.g., images), GNNs impose a dependency between the training inputs in the form of graph structure.
  • While the batch size provided could be small (e.g., 1000), the computation graph samples could be exponentially larger due to the k-hop neighborhood and the associated features 312.
  • a major reason for such a large size is not the graph structure but the features, whose sizes typically range from hundreds to several thousands.
  • the 2-hop neighborhoods could be up to an order of magnitude larger than the 1-hop neighborhood.
  • the resulting computation graph may easily exceed the memory of a single GPU or even main memory of a server.
  • a common technique used to address such neighborhood explosion is sampling, where, instead of getting all the neighbors of the node in each hop, a fixed number is selected.
  • An example is described below with reference to Figure 4, where the two-hop computation graph for node 1 is generated by sampling two neighbors at each hop.
  • the size of the computation graph could grow substantially, based on the sampling used and the number of layers. Since the neighborhood data and the features are obtained through the network, distributed training of GNNs spends a major fraction of time in network communication.
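  • A simple sketch of the fixed-fan-out sampling described above (e.g., two neighbors per hop, as in the Figure 4 example); the in-neighbor lists below are made-up placeholders, except that nodes 2 and 5 point to node 1 as in Figure 4.

```python
import random

def sample_computation_graph(node, in_neighbors, num_hops=2, fanout=2, seed=None):
    """Sample up to `fanout` in-neighbors per node at each hop to build a k-hop computation graph."""
    rng = random.Random(seed)
    layers = [[node]]                                  # index 0 holds the target node
    for _ in range(num_hops):
        frontier = []
        for v in layers[-1]:
            nbrs = in_neighbors.get(v, [])
            frontier.extend(rng.sample(nbrs, min(fanout, len(nbrs))))
        layers.append(frontier)                        # index k holds the sampled k-hop frontier
    return layers

# Placeholder in-neighbor lists (nodes with outgoing links to the key) for a 6-node graph.
in_neighbors = {1: [2, 5], 2: [3, 6], 5: [4, 6], 3: [1], 4: [2], 6: [5]}
print(sample_computation_graph(1, in_neighbors))       # e.g., [[1], [2, 5], [3, 6, 4, 6]]
```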
  • a second challenge is related to the ineffectiveness of partitioning, which is a common approach used in distributed graph processing of existing GNN frameworks.
  • partitioning schemes incur a cost in terms of computation and memory overhead.
  • the benefits of partitioning are severely limited as the layers in the GNN increase. It is noted that GNNs use k-hop neighborhoods to compute the embedding, and while partitioning schemes reduce communication, they only optimize communication at the first hop. Thus, as the number of layers increases, all partitioning schemes tend to fail.
  • a third challenge relates to GPU underutilization.
  • Existing GNN frameworks utilize DNN techniques, such as data parallelism, to execute the generated computation graph.
  • In data-parallel execution, each machine executes a different set of samples.
  • A large fraction (e.g., up to 80%) of the time, the GPUs are waiting on communication.
  • GPUs are heavily under-utilized in distributed GNN training due to network communication bottlenecks.
  • Figure 4 illustrates an example of building a two-layer neighborhood graph, according to some example embodiments.
  • Graph 402 includes 6 nodes, numbered from 1 to 6, and matrix 404 includes the values of F features for the six nodes (the feature vectors for each node), where each row corresponds to one node and each column to one of the features.
  • the two-hop neighborhood sampling 406 for node 1 is calculated with a first layer 410 including the nodes with outgoing links to node 1 (nodes 2 and 5) and the second layer 408 with the nodes with outgoing links to the nodes in the first layer 410.
  • the NN aggregates the features of nodes 2 and 5 for the computation associated with node 1.
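  • As a tiny numeric illustration (feature values are placeholders, not taken from the figure), mean aggregation of the feature vectors of nodes 2 and 5 for node 1 could look as follows:

```python
import torch

# Rows of a feature matrix like matrix 404 in Figure 4; values here are placeholders.
features = torch.tensor([[1.0, 2.0, 3.0],   # node 1
                         [4.0, 0.0, 2.0],   # node 2
                         [0.0, 1.0, 1.0],   # node 3
                         [2.0, 2.0, 2.0],   # node 4
                         [6.0, 4.0, 0.0],   # node 5
                         [1.0, 1.0, 5.0]])  # node 6

# Node 1 aggregates its sampled neighbors (nodes 2 and 5); mean aggregation is one common choice.
agg_for_node_1 = features[[1, 4]].mean(dim=0)   # indices 1 and 4 are nodes 2 and 5 (0-based)
print(agg_for_node_1)                           # tensor([5., 2., 1.])
```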
  • Figure 5 illustrates a scheme for partitioning a graph structure and corresponding features for distributed processing, according to some example embodiments.
  • Figure 5 shows an example partitioning scheme for the graph 402 in Figure 4.
  • diagram 502 is for node 1 at machine M1
  • diagram 504 is for nodes 2 and 5 at machine M2
  • diagram 506 is for node 3 at machine M3
  • diagram 508 is for nodes 4 and 6 at machine M4.
  • the diagrams show the one-hop graphs for the nodes, and the corresponding machine gets the feature vector for the corresponding node or nodes.
  • machine M1 gets the feature vector for node 1
  • machine M2 gets the feature vectors for nodes 2 and 5.
  • Figure 6 illustrates the P3 method for partitioning a graph structure and corresponding features for distributed processing, according to some example embodiments.
  • P3 presents an approach to distributed GNN training that reduces the overhead of computation graph generation.
  • P3 provides independent hash partitioning of graph and features. In general, partitioning the input graph in an intelligent manner does not benefit GNN architectures significantly due to the characteristics of GNNs.
  • P3 uses a simple partitioning scheme for the graph and associated features.
  • the nodes in the input graph are partitioned using a random hash partitioner, and the edges are co-located with their incoming nodes.
  • Other embodiments may use other simple partitioning mechanisms, such as round robin and the like.
  • the random hash partitioner is equivalent to the 1D partitioning schemes available in some distributed graph processing frameworks and is computationally simple. Unlike other schemes (e.g., 2D partitioning), this scheme does not require preprocessing steps (e.g., creating local identifiers) or maintaining a separate routing table to find the partition where a node is present, since the partition can be computed on the fly.
  • P3 uses a hash-partitioning method for distributing the load to the different machines; that is, the identifier (ID) of the node (or some other value associated with the node) is hashed and the result indicates the machine assigned to the node.
  • diagram 602 is for node 1 at machine M1
  • diagram 604 is for nodes 2 and 5 at machine M2
  • diagram 606 is for node 3 at machine M3
  • diagram 608 is for nodes 4 and 6 at machine M4.
  • each feature is assigned to one of the machines; that is, each machine gets assigned one column from the matrix 404 of Figure 4 that includes the data for all the nodes for the corresponding feature. Further, as described in more detail below with reference to Figure 7, there are four sample sets, so each machine gets data for one of the features for all four sample sets.
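  • The following sketch (an assumption-laden illustration, not the actual P3 implementation) shows one way to realize the independent partitioning described above: node IDs are hashed to machines with edges co-located at their incoming node, and feature columns are assigned to machines by their index.

```python
def node_partition(node_id, num_machines):
    """Random-hash partitioner: the node ID is hashed and the result picks the machine."""
    return hash(node_id) % num_machines

def partition_graph(edges, num_machines):
    """Co-locate each edge (src -> dst) with its incoming (destination) node."""
    parts = [[] for _ in range(num_machines)]
    for src, dst in edges:
        parts[node_partition(dst, num_machines)].append((src, dst))
    return parts

def partition_features(num_features, num_machines):
    """Assign each feature column to one machine (here simply column index mod machine count)."""
    return {f: f % num_machines for f in range(num_features)}

edges = [(2, 1), (5, 1), (3, 2), (6, 5), (1, 3), (4, 6)]     # placeholder edge list
print(partition_graph(edges, num_machines=4))
print(partition_features(num_features=4, num_machines=4))    # one column per machine, as in Figure 6
```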
  • Figure 7 illustrates the process for generating embeddings, according to some example embodiments.
  • P3 uses push-pull parallelism, and with the input graph and features partitioned, P3 adopts the mini-batch computation for GNNs, where P3 first generates the computation graph for a node and then executes it. In this example, there are four nodes, and each machine is assigned the computations for one of the nodes.
  • each machine computes the computation graph for the nodes whose embeddings are computed at the machine.
  • each machine pulls the k-hop neighborhood for the node. Pulling refers to copying the data, possibly over the network, if the data is not local.
  • If sampling is enabled, P3 pulls the sampled k-hop neighborhood; otherwise, P3 pulls the full k-hop neighborhood. It is noted that unlike existing GNN frameworks, the features are not pulled in either case to calculate the k-hop neighborhood, thereby significantly reducing the network communication necessary to create the computation graph.
  • P3 utilizes a hybrid parallelism approach that combines model parallelism and data parallelism, referred to herein as push-pull parallelism.
  • While model parallelism is rarely used in traditional DNNs, due to the underutilization of resources and the difficulty in determining how to partition the model, P3 uses model parallelism to its advantage. Due to the nature of GNNs, the model (embedding computation graphs) is easy to partition cleanly since the boundaries (hops) are clear. Further, due to P3's partitioning strategy, the model parallelism does not suffer from underutilization of resources.
  • the training data 703 includes four sample sets S1-S4. It is noted that layer 1 is the most compute-intensive, as it requires input features from layer 0 (having the most fan-out), which are evenly spread in P3 due to the partitioning scheme.
  • Each of the features 704 is assigned to one machine, such as by using the hash algorithm.
  • machine M1 702 receives the one or more features 704 assigned to it. In this example, each machine is assigned one of the features, but other embodiments may have machines that receive more than one feature. Since there are four sample sets S1-S4, machine M1 702 receives the values of the first feature from the four sample sets, represented as four planes in Figure 7.
  • each machine (e.g., machine M1 702) starts the forward pass in a model-parallel fashion with layer 1M 718.
  • Each machine computes partial activations 706 for layer 1M using the partition of input features the machine owns.
  • An activation function decides whether a neuron should be activated by calculating a weighted sum of its inputs and adding a bias to it.
  • the purpose of the activation function is to introduce non-linearity into the output of a neuron.
  • a NN has neurons whose outputs depend on their weights, biases, and respective activation functions.
  • the weights and biases of the neurons are updated on the basis of the error at the output. This process is known as back-propagation.
  • Activation functions make the back-propagation possible since the gradients are supplied along with the error to update the weights and biases.
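  • A one-neuron sketch of the computation described above (weighted sum of the inputs plus a bias, passed through a non-linear activation such as ReLU); the weights and inputs are placeholders.

```python
import torch

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a non-linear activation (ReLU)."""
    weighted_sum = torch.dot(inputs, weights) + bias
    return torch.relu(weighted_sum)

x = torch.tensor([0.5, -1.0, 2.0])
w = torch.tensor([0.3, 0.8, -0.1])   # placeholder weights learned during training
b = torch.tensor(0.05)
print(neuron(x, w, b))               # here the weighted sum is negative, so ReLU outputs tensor(0.)
```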
  • the machine assigned to each node pulls 720 the partial activations 706 from the other machines.
  • the node receiving the partial activations aggregates them using a reduce operation 716 to generate partial activations 708.
  • P3 switches to data-parallel mode to calculate layer 1D.
  • the partial activations 708 are then passed through the rest of the layer 1D operations (if any, e.g., non-linear operations that cannot be partially computed) to obtain the final activations 710 for layer 1.
  • the computation proceeds in a data-parallel fashion to obtain the embedding 712, at which point the forward pass ends.
  • the reason for calculating the activations is that the layers could be nonlinear.
  • the aggregation function adds up the activations, but it could also include operations such as finding the maximum, which is a non-linear function and cannot be combined in an arbitrary order; it has to be combined in a particular order to get the desired results. If there is a non-linearity in the graph neural network, the layer cannot be processed in one operation, which means distributed processing cannot be done for the non-linear part. This is why there is a layer 1M and a layer 1D.
  • the non-linear part has to be computed, perforce, at the corresponding machine, because it has to be ordered. In some GNNs where there is no non-linearity, layer 1D is not used or becomes a no-operation.
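  • The following single-process sketch illustrates the push-pull forward step described above: each machine computes partial activations for layer 1M from the feature slice it owns, the partial activations are pulled and combined with a sum reduce, layer 1D applies the non-linearity, and the remaining layers proceed data-parallel. The tensor shapes are arbitrary assumptions, and the pulls that P3 would perform over the network between machines are simulated here with local lists.

```python
import torch
import torch.nn as nn

num_machines, feat_per_machine, hidden = 4, 16, 32

# Each machine holds only its slice of the layer-1 weights and its slice of the input features.
layer1M_shards = [nn.Linear(feat_per_machine, hidden, bias=False) for _ in range(num_machines)]
feature_shards = [torch.randn(8, feat_per_machine) for _ in range(num_machines)]  # 8 nodes per mini-batch

# Layer 1M (model parallel): every machine computes partial activations from the features it owns.
partial_activations = [layer1M_shards[m](feature_shards[m]) for m in range(num_machines)]

# The owner machine pulls the partial activations and aggregates them with a reduce (sum) operation.
reduced = torch.stack(partial_activations).sum(dim=0)

# Layer 1D (data parallel): the non-linear part of layer 1 that cannot be partially computed.
activations_layer1 = torch.relu(reduced)

# The remaining layers proceed in a data-parallel fashion to produce the embeddings.
layer2 = nn.Linear(hidden, hidden)
embeddings = torch.relu(layer2(activations_layer1))
```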
  • the backward pass proceeds similarly to existing GNN frameworks in a data-parallel fashion, invoking global gradient synchronizations until layer 1D (circle 6 722). At layer 1D, P3 pushes 724 the error gradient to all machines in the cluster and switches to model parallelism. Each machine now has the error gradients to apply the backward pass for layer 1M locally (circle 8), and the backward pass phase ends. Thus, the errors calculated in the backpropagation are sent to all the machines, and each machine can now do its error computation.
  • the errors are calculated based on the partial results obtained and the ground truth known from the training set. This is the backward process where the partial errors are distributed back to the machines. Then, the next epoch (e.g., iteration) proceeds after adjusting parameters (e.g., hyperparameters) for the next epoch. For example, the process may be repeated for 100 epochs, 1000 epochs, or some other value.
  • The additional steps in P3, namely the need to push the graph structure in layer 1, the aggregation of partial activations during the forward pass, and the additional push of gradients in the backward pass, may seem like overheads that may lead to inefficiencies compared to just pulling the features along with the graph structure and executing everything locally as in existing GNN frameworks.
  • However, P3's approach results in significant savings in network communication.
  • P3 does not pull features at all, which tremendously reduces network traffic; typically, the 2-hop neighborhood in the GNN computation graphs is an order of magnitude larger than the 1-hop neighborhood.
  • the sizes of the activations and gradients are small in GNNs (due to the smaller number of hidden dimensions); thus, transferring them incurs much less overhead when compared to transferring features.
  • Models may be run against a training dataset for several epochs (e.g., iterations), in which the training dataset is repeatedly fed into the model to refine its results. Once an epoch is run, the models are evaluated and the values of their variables are adjusted in an attempt to better refine the model in an iterative fashion. In various aspects, the evaluations are biased against false negatives, biased against false positives, or evenly biased with respect to the overall accuracy of the model. Each model develops a rule or algorithm over several epochs by varying the values of one or more variables affecting the inputs to more closely map to a desired result.
  • the training data 703 is distributed to the machines, where each machine gets all the features for the nodes assigned to the machine. Every machine ends up with the graph structure and the features for the nodes. Once a machine has all the information, it pushes the information to its GPU, and then the GPU starts processing layer by layer.
  • Figure 8 illustrates the pipeline processing for creating the embeddings, according to some example embodiments.
  • With P3's push-pull parallelism, GNN computation graph creation and execution incur significantly less network communication compared to existing GNN frameworks.
  • However, P3 needs to communicate more times, because P3 pushes the graph structure of layer 1, partial activations in the forward pass, and gradients in the backward pass. Further, since P3 is focused on distributed settings, data copying is necessary between the CPU and the GPU.
  • GNN frameworks such as DGL already overlap computation and communication: while the CPU is busy creating the computation graph, the GPU is used to execute an already prepared mini-batch.
  • P3 uses a simple pipelining mechanism.
  • Pipelining refers to the process where, while one graph is pulled to the CPU for one product, the graph of a different product is processed in the GPU, so the GPU is working while the information is being pulled.
  • With P3's pipelining, P3 can achieve almost 85% utilization of the GPU, compared to existing systems that only use about 15% of the GPU.
  • P3 pulls both the graph and the metadata 308. In existing systems, these two operations are coupled, but P3 decouples them so they can be performed separately.
  • P3 executes four phases per mini-batch: a model-parallel phase in the forward pass 802, a data-parallel phase in the forward pass 802, a data-parallel phase in the backward pass 804, and a model-parallel phase in the backward pass 804.
  • This provides the opportunity to schedule four mini-batches of computation before a data dependency is created between phases.
  • Figure 8 shows the operations for four machines M1-M4.
  • the arrows 806 denote sample data dependency for the operations.
  • the digits in each cell correspond to the layer being calculated.
  • M means that the machine is in a model-parallel phase and executing in model-parallel fashion, while D means that the machine is in a data-parallel phase, where only one machine processes the information for a given sample.
  • the weights 808, W1, W2, and W3, correspond to each of the respective layers and are used for the error gradient calculation during backpropagation.
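  • As a simplified illustration of overlapping computation and communication (not the actual P3 scheduler, which interleaves the four phases per mini-batch shown in Figure 8), a background thread can prepare the next mini-batch while the current one is being trained; the prepare and train functions below are placeholders.

```python
import queue
import threading

def prepare_minibatch(batch_id):
    """Placeholder for pulling the graph structure for a mini-batch (CPU/network bound)."""
    return {"batch": batch_id, "graph": f"computation graph {batch_id}"}

def train_minibatch(minibatch):
    """Placeholder for executing the prepared mini-batch on the GPU."""
    print("training on", minibatch["batch"])

ready = queue.Queue(maxsize=4)   # up to four mini-batches in flight, mirroring the four phases

def producer(num_batches):
    for b in range(num_batches):
        ready.put(prepare_minibatch(b))   # overlaps with the GPU work done by the consumer loop
    ready.put(None)                       # sentinel: no more batches

threading.Thread(target=producer, args=(8,), daemon=True).start()
while (mb := ready.get()) is not None:
    train_minibatch(mb)
```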
  • P3 provides an API that developers can use to speed up new GNN architectures.
  • the following table details one sample embodiment of the API:
  • P3 uses PyTorch and is adapted from the DGL repository.
  • PyTorch's DistributedDataParallel and Distributed RPC Framework for multi-machine model training are utilized, which supports a combination of intra-layer model parallelism and data parallelism.
  • P3 uses a pipelined schedule of two forward and two backward passes and rpc.remote calls to move mini-batch samples across machine boundaries.
  • P3 users can register a DistTensor, which is a tensor that enables storing node (and edge) data across machine boundaries.
  • DistTensor is sharded and stored across a cluster of machines, where sharding is determined using partitioning meta-data.
  • partitioning meta-data for sharding ensures that the corresponding dimensions in different DistTensors are split consistently.
  • NN operations can manipulate only the local shard of a DistTensor without synchronizing.
  • A DistTensor is synchronized by calling its sync method and passing a reduction operation (e.g., sum).
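  • The DistTensor interface is only described at a high level above; the following is a hypothetical single-process mock (not the actual P3 API) suggesting how a sharded tensor with a sync method taking a reduction operation might be used.

```python
import torch

class DistTensorMock:
    """Hypothetical stand-in for DistTensor: each machine manipulates only its local shard,
    and sync() applies a reduction across shards (simulated here in a single process)."""
    def __init__(self, shards):
        self.shards = shards                      # one local shard per machine

    def local(self, machine):
        return self.shards[machine]               # NN operations touch only the local shard

    def sync(self, op="sum"):
        stacked = torch.stack(self.shards)
        return stacked.sum(dim=0) if op == "sum" else stacked.max(dim=0).values

# Four machines each hold partial data for the same nodes; sync reduces across machine boundaries.
dist = DistTensorMock([torch.randn(8, 32) for _ in range(4)])
dist.local(0).add_(1.0)                           # a local, unsynchronized update on machine 0
full = dist.sync(op="sum")                        # reduction (e.g., sum) across the shards
```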
  • Figure 9 illustrates the performance improvements achieved using the P3 method, according to some example embodiments.
  • Chart 902 shows a performance comparison between P3 and other existing methods.
  • the horizontal axis is for three different graphs: OGB-Product, OGB-Paper, and UK-2006-5.
  • OGB-Product has 2.4 million nodes, 124 million edges, and 100 features.
  • OGB-Paper has 111 million nodes, 1.6 billion edges, and 128 features.
  • UK-2006-5 has 78 million nodes, 2.9 billion edges, and 256 features.
  • GCN and GraphSage are two types of GNN architectures.
  • Figure 10 illustrates the training and use of a machine-learning model, according to some example embodiments.
  • ML models 1016 are utilized to perform operations associated with GNNs, such as providing product recommendations.
  • ML is an application that provides computer systems the ability to perform tasks, without explicitly being programmed, by making inferences based on patterns found in the analysis of data.
  • Machine learning explores the study and construction of algorithms, also referred to herein as tools, that may learn from existing data and make predictions about new data.
  • Such machine learning algorithms operate by building an ML model 1016 from example training data 1012 in order to make data-driven predictions or decisions expressed as outputs or assessments 1020.
  • Although example embodiments are presented with respect to a few machine-learning tools, the principles presented herein may be applied to other machine-learning tools.
  • Data representation refers to the method of organizing the data for storage on a computer system, including the structure for the identified features and their values.
  • In ML, it is typical to represent the data in vectors or matrices of two or more dimensions.
  • data representation is important so that the training is able to identify the correlations within the data.
  • There are two common modes for ML: supervised ML and unsupervised ML.
  • Supervised ML uses prior knowledge (e.g., examples that correlate inputs to outputs or outcomes) to learn the relationships between the inputs and the outputs.
  • the goal of supervised ML is to learn a function that, given some training data, best approximates the relationship between the training inputs and outputs so that the ML model can implement the same relationships when given inputs to generate the corresponding outputs.
  • Unsupervised ML is the training of an ML algorithm using information that is neither classified nor labeled and allowing the algorithm to act on that information without guidance. Unsupervised ML is useful in exploratory analysis because it can automatically identify structure in data.
  • the training data 1012 comprises examples of values for features 1002.
  • the training data comprises labeled data with examples of values for the features 1002 and labels indicating the outcome, such as product features, product embeddings, and so forth.
  • the machine-learning algorithms utilize the training data 1012 to find correlations among identified features 1002 that affect the outcome.
  • a feature 1002 is an individual measurable property of a phenomenon being observed.
  • the concept of a feature is related to that of an explanatory variable used in statistical techniques such as linear regression. Choosing informative, discriminating, and independent features is important for effective operation of ML in pattern recognition, classification, and regression.
  • Features may be of different types, such as numeric features, strings, and graphs.
  • the ML program, also referred to as an ML algorithm or ML tool, analyzes the training data 1012 based on identified features 1002 and configuration parameters 1011 defined for the training.
  • the result of the training 1014 is the ML model 1016 that is capable of taking inputs to produce assessments.
  • Training an ML algorithm involves analyzing large amounts of data (e.g., from several gigabytes to a terabyte or more) in order to find data correlations.
  • the ML algorithms utilize the training data 1012 to find correlations among the identified features 1002 that affect the outcome or assessment 1020.
  • the training data 1012 includes labeled data, which is known data for one or more identified features 1002 and one or more outcomes, such as product information, graph information, including node and edge information, and so forth.
  • the ML algorithms usually explore many possible functions and parameters before finding what the ML algorithms identify to be the best correlations within the data; therefore, training may make use of large amounts of computing resources and time.
  • ML algorithms include configuration parameters 1011, and the more complex the ML algorithm, the more parameters there are that are available to the user.
  • the configuration parameters 1011 define variables for an ML algorithm in the search for the best ML model.
  • the training parameters include model parameters and hyperparameters. Model parameters are learned from the training data, whereas hyperparameters are not learned from the training data, but instead are provided to the ML algorithm.
  • model parameters include maximum model size, maximum number of passes over the training data, data shuffle type, regression coefficients, decision tree split locations, and the like.
  • Hyperparameters may include the number of hidden layers in a neural network, the number of hidden nodes in each layer, the learning rate (perhaps with various adaptation schemes for the learning rate), the regularization parameters, types of nonlinear activation functions, and the like. Finding the correct (or the best) set of hyperparameters can be a very time-consuming task that makes use of a large amount of computer resources.
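  • As an illustrative configuration (all values are arbitrary examples, not from the disclosure), the hyperparameters above can be kept separate from the model parameters, which are the weights learned during training:

```python
import torch.nn as nn

# Hyperparameters are provided to the algorithm (values here are arbitrary examples);
# the model parameters (the weights inside the layers) are learned from the training data.
hyperparameters = {"hidden_layers": 2, "hidden_units": 32, "learning_rate": 0.01, "epochs": 100}

layers, in_dim = [], 128
for _ in range(hyperparameters["hidden_layers"]):
    layers += [nn.Linear(in_dim, hyperparameters["hidden_units"]), nn.ReLU()]
    in_dim = hyperparameters["hidden_units"]
layers.append(nn.Linear(in_dim, 1))
model = nn.Sequential(*layers)    # the tensors inside these layers are the learned model parameters
```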
  • FIG. 11 is a flowchart of a method 1100 for generating embeddings for nodes in a GNN, according to some example embodiments. While the various operations in this flowchart are presented and described sequentially, one of ordinary skill will appreciate that some or all of the operations may be executed in a different order, be combined or omitted, or be executed in parallel.
  • Operation 1102 is for assigning each node from a graph of a GNN to one machine from a plurality of machines. From operation 1102, the method 1100 flows to operation 1104, where each feature, from a plurality of features, is assigned to one machine from the plurality of machines.
  • the method 1100 flows to operation 1106 for calculating, at each machine, an embedding for each node assigned to the machine. Calculating the embedding for a first node in a first machine comprises operations 1108-1112.
  • a computation graph is calculated for the first node for a predetermined number of layers of the graph.
  • the method 1100 flows to operation 1109 for obtaining, from a set of training data, data for the features assigned to the first machine.
  • the method 1100 flows to operation 1110, where the first machine computes partial activations for a first layer of the graph.
  • the method 1100 flows to operation 1111, where the first machine pulls partial activations for the first node from other machines.
  • the method 1100 flows to operation 1112 for processing the partial activations through a remainder of the layers to obtain an embedding for the first node.
  • processing the partial activations includes aggregating the partial activations for the first node using a reduce operation.
  • processing the partial activations further includes passing the aggregated partial activations through a remainder of the layers of the graph to obtain the embedding for the rest of the nodes.
  • the method 1100 further comprises performing a backward pass to obtain error gradients for the embeddings of the nodes.
  • the method 1100 further comprises repeating the calculation of the embeddings and the backward pass for a plurality of epochs after adjusting parameters of the GNN.
  • calculating the computation graph for the first node includes identifying nodes with outgoing links to the first node, and assigning the identified nodes to the first layer of the graph of the first node.
  • nodes that are proximate in the graph have embeddings that are proximate to each other.
  • the nodes represent products for sale, wherein products that are related have embeddings that are proximate to each other.
  • the method 1100 further comprises detecting an interaction with a first product in a user interface, identifying related products to the first product based on the embeddings of the related products, and presenting the related products in the user interface.
  • Another general aspect is for a system comprising a plurality of machines, each machine comprising a memory comprising instructions and one or more computer processors.
  • the instructions when executed by the one or more computer processors, cause the one or more computer processors to perform operations comprising: assigning each node from a graph of a GNN to one machine from the plurality of machines; assigning each feature from a plurality of features to one machine from the plurality of machines; and calculating, at each machine, an embedding for each node assigned to the machine, the calculating for a first node in a first machine comprising: calculating a computation graph for the first node for a predetermined number of layers of the graph; obtaining, from a set of training data, data for the features assigned to the first machine; computing, by the first machine, partial activations for a first layer of the graph; pulling, by the first machine, partial activations for the first node from other machines; and processing the partial activations through a remainder of the layers to obtain an embedding for the first node.
  • a machine-readable storage medium includes instructions that, when executed by a machine, cause the machine to perform operations comprising: assigning each node from a graph of a GNN to one machine from a plurality of machines; assigning each feature from a plurality of features to one machine from the plurality of machines; and calculating, at each machine, an embedding for each node assigned to the machine, the calculating for a first node in a first machine comprising: calculating a computation graph for the first node for a predetermined number of layers of the graph; obtaining, from a set of training data, data for the features assigned to the first machine; computing, by the first machine, partial activations for a first layer of the graph; pulling, by the first machine, partial activations for the first node from other machines; and processing the partial activations through a remainder of the layers to obtain an embedding for the first node.
  • FIG 12 is a block diagram illustrating an example of a machine 1200 upon or by which one or more example process embodiments described herein may be implemented or controlled.
  • the machine 1200 may operate as a standalone device or may be connected (e.g., networked) to other machines.
  • the machine 1200 may operate in the capacity of a server machine, a client machine, or both in server-client network environments.
  • the machine 1200 may act as a peer machine in a peer-to-peer (P2P) (or other distributed) network environment.
  • P2P peer-to-peer
  • machine shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein, such as via cloud computing, software as a service (SaaS), or other computer cluster configurations.
  • SaaS software as a service
  • Circuitry is a collection of circuits implemented in tangible entities that include hardware (e.g., simple circuits, gates, logic). Circuitry membership may be flexible over time and with underlying hardware variability. Circuitries include members that may, alone or in combination, perform specified operations when operating. In an example, hardware of the circuitry may be immutably designed to carry out a specific operation (e.g., hardwired).
  • the hardware of the circuitry may include variably connected physical components (e.g., execution units, transistors, simple circuits) including a computer-readable medium physically modified (e.g., magnetically, electrically, by moveable placement of invariant massed particles) to encode instructions of the specific operation.
  • the instructions enable embedded hardware (e.g., the execution units or a loading mechanism) to create members of the circuitry in hardware via the variable connections to carry out portions of the specific operation when in operation.
  • the computer-readable medium is communicatively coupled to the other components of the circuitry when the device is operating.
  • any of the physical components may be used in more than one member of more than one circuitry.
  • execution units may be used in a first circuit of a first circuitry at one point in time and reused by a second circuit in the first circuitry, or by a third circuit in a second circuitry, at a different time.
  • the machine 1200 may include a hardware processor 1202 (e.g., a central processing unit (CPU), a hardware processor core, or any combination thereof), a graphics processing unit (GPU) 1203, a main memory 1204, and a static memory 1206, some or all of which may communicate with each other via an interlink (e.g., bus) 1208.
  • the machine 1200 may further include a display device 1210, an alphanumeric input device 1212 (e.g., a keyboard), and a user interface (UI) navigation device 1214 (e.g., a mouse).
  • the display device 1210, alphanumeric input device 1212, and UI navigation device 1214 may be a touch screen display.
  • the machine 1200 may additionally include a mass storage device (e.g., drive unit) 1216, a signal generation device 1218 (e.g., a speaker), a network interface device 1220, and one or more sensors 1221, such as a Global Positioning System (GPS) sensor, compass, accelerometer, or another sensor.
  • the machine 1200 may include an output controller 1228, such as a serial (e.g., universal serial bus (USB)), parallel, or other wired or wireless (e.g., infrared (IR), near field communication (NFC)) connection to communicate with or control one or more peripheral devices (e.g., a printer, card reader).
  • the mass storage device 1216 may include a machine-readable medium 1222 on which is stored one or more sets of data structures or instructions 1224 (e.g., software) embodying or utilized by any one or more of the techniques or functions described herein.
  • the instructions 1224 may also reside, completely or at least partially, within the main memory 1204, within the static memory 1206, within the hardware processor 1202, or within the GPU 1203 during execution thereof by the machine 1200.
  • one or any combination of the hardware processor 1202, the GPU 1203, the main memory 1204, the static memory 1206, or the mass storage device 1216 may constitute machine-readable media.
  • While the machine-readable medium 1222 is illustrated as a single medium, the term “machine-readable medium” may include a single medium, or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers), configured to store the one or more instructions 1224.
  • The term “machine-readable medium” may include any medium that is capable of storing, encoding, or carrying instructions 1224 for execution by the machine 1200 and that cause the machine 1200 to perform any one or more of the techniques of the present disclosure, or that is capable of storing, encoding, or carrying data structures used by or associated with such instructions 1224.
  • Non-limiting machine-readable medium examples may include solid-state memories, and optical and magnetic media.
  • a massed machine-readable medium comprises a machine-readable medium 1222 with a plurality of particles having invariant (e.g., rest) mass. Accordingly, massed machine-readable media are not transitory propagating signals.
  • massed machine-readable media may include non-volatile memory, such as semiconductor memory devices (e.g., Electrically Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM)) and flash memory devices; magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • the instructions 1224 may further be transmitted or received over a communications network 1226 using a transmission medium via the network interface device 1220.
  • the term “or” may be construed in either an inclusive or exclusive sense. Moreover, plural instances may be provided for resources, operations, or structures described herein as a single instance. Additionally, boundaries between various resources, operations, modules, engines, and data stores are somewhat arbitrary, and particular operations are illustrated in a context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within a scope of various embodiments of the present disclosure. In general, structures and functionality presented as separate resources in the example configurations may be implemented as a combined structure or resource. Similarly, structures and functionality presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within a scope of embodiments of the present disclosure as represented by the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Abstract

Methods, systems, and computer programs are presented for generating embeddings for nodes in a graph neural network (GNN). One method includes operations for assigning each node from a graph of the GNN to one machine from a group of machines, assigning each feature from a group of features to one machine, and calculating, at each machine, an embedding for each node assigned to the machine. Further, calculating an embedding for a first node in a first machine comprises calculating a computation graph for the first node, obtaining data for the features assigned to the first machine, computing partial activations for a first layer of the graph, pulling partial activations for the first node from other machines, and processing the partial activations through a remainder of the layers to obtain an embedding for the first node.

Description

SCALING DEEP GRAPH LEARNING IN DISTRIBUTED SETTING
TECHNICAL FIELD
The subject matter disclosed herein generally relates to methods, systems, and machine-readable storage media for improving the operation of predictive systems based on Graph Neural Networks (GNNs).
BACKGROUND
GNNs are one of the fastest-growing subareas of deep learning (e.g., Deep Neural Networks (DNNs)), which has become a powerful tool for several challenging applications in diverse fields such as computer vision, speech recognition, and natural language processing, producing results on par with human experts. Recently, there has been significant interest in GNNs that operate on graph-structured data, and due to the expressiveness of graphs in capturing the rich relational information between input elements, GNNs have enabled breakthroughs in many important domains, including recommendation systems, knowledge graphs, and drug discovery.
A key challenge in GNNs is scaling computations to large input graphs. Some systems operate on graphs whose data sizes reach several terabytes or even petabytes. Processing this amount of data requires substantial computing resources and long training periods. What is needed are systems that can speed up the processing of large GNNs.
BRIEF DESCRIPTION OF THE DRAWINGS
Various of the appended drawings merely illustrate example embodiments of the present disclosure and cannot be considered as limiting its scope.
Figure 1 illustrates the iterative graph propagation for learning in a GNN, according to some example embodiments.
Figure 2 illustrates the embedding of nodes in a GNN, according to some example embodiments.
Figure 3 illustrates an architectural framework for distributed graph processing, according to some example embodiments.
Figure 4 illustrates an example of building a two-layer neighborhood graph, according to some example embodiments.
Figure 5 illustrates a scheme for partitioning a graph structure and corresponding features for distributed processing, according to some example embodiments.
Figure 6 illustrates a P3 (Pipelined Push-Pull) method for partitioning a graph structure and corresponding features for distributed processing, according to some example embodiments.
Figure 7 illustrates the process for generating embeddings, according to some example embodiments.
Figure 8 illustrates the pipeline processing for creating the embeddings, according to some example embodiments.
Figure 9 illustrates the performance improvements achieved using the P3 method, according to some example embodiments.
Figure 10 illustrates the training and use of a machine-learning model, according to some example embodiments.
Figure 11 is a flowchart of a method for generating embeddings for nodes in a GNN, according to some example embodiments.
Figure 12 is a block diagram illustrating an example of a machine upon or by which one or more example process embodiments described herein may be implemented or controlled.
DETAILED DESCRIPTION
Example methods, systems, and computer programs are directed to generating embeddings in GNNs such that the embeddings of the nodes in the GNN represent the proximity among related nodes. Examples merely typify possible variations. Unless explicitly stated otherwise, components and functions are optional and may be combined or subdivided, and operations may vary in sequence or be combined or subdivided. In the following description, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of example embodiments. It will be evident to one skilled in the art, however, that the present subject matter may be practiced without these specific details.
In one aspect, methods and systems are provided for scaling the training of GNNs in a distributed environment. The scalability challenges in GNNs are fundamentally different from those in existing deep learning and distributed graph processing, where commonly used techniques, such as intelligent partitioning of the graph, do not yield the desired results. A new approach for distributed GNN training is presented that eliminates overhead when partitioning the processing across different machines and uses a new pipelined push-pull parallelism-based execution for the learning task. The system provides an Application Programming Interface (API) that captures different classes of GNN architectures. The evaluation shows that, when used with a caching strategy, the system accelerates processing by a factor of seven compared with existing distributed GNN frameworks.
A system is presented, referred to as P3 for Pipelined Push-Pull, with an execution strategy that combines intra-layer model parallelism with data parallelism to significantly reduce network communications and allow for pipelining computing and communication operations. P3 scales to large graphs gracefully and achieves significant performance benefits.
Methods, systems, and computer programs are presented for generating embeddings for nodes in a GNN. One method includes operations for assigning each node from a graph of the GNN to one machine from a group of machines, assigning each feature from a group of features to one machine, and calculating, at each machine, an embedding for each node assigned to the machine. Further, calculating an embedding for a first node in a first machine comprises calculating a computation graph for the first node, obtaining data for the features assigned to the first machine, computing partial activations for a first layer of the graph, pulling partial activations for the first node from other machines, and processing the partial activations through a remainder of the layers to obtain an embedding for the first node.
Figure 1 illustrates the iterative graph propagation for learning in a GNN, according to some example embodiments. A graph (e.g., 102, 104) is a structure amounting to a set of objects in which some pairs of the objects are related. The objects correspond to mathematical abstractions called nodes 106 (also referred to as vertices or points) and each of the related pairs of vertices is called an edge (also referred to as link or line). Typically, a graph is depicted in diagrammatic form as a set of dots or circles for the vertices, joined by lines or curves for the edges.
A neural network (NN) is a computing system based on consideration of biological neural networks of animal brains. Such systems progressively improve performance, which is referred to as learning, to perform tasks, typically without task-specific programming. For example, in image recognition, a neural network may be taught to identify images that contain an object by analyzing example images that have been tagged with a name for the object and, having learned the object and name, may use the analytic results to identify the object in untagged images. A neural network is based on a collection of connected units called neurons, where each connection, called a synapse, between neurons can transmit a unidirectional signal with an activating strength that varies with the strength of the connection. The receiving neuron can activate and propagate a signal to downstream neurons connected to it, typically based on whether the combined incoming signals, which are from potentially many transmitting neurons, are of sufficient strength, where strength is a parameter.
A DNN is a stacked neural network, which is composed of multiple layers. The layers are composed of nodes, which are locations where computation occurs, loosely patterned on a neuron in the human brain, which fires when it encounters sufficient stimuli. A node combines input from the data with a set of coefficients, or weights, that either amplify or dampen that input, which assigns significance to inputs for the task the algorithm is trying to learn. These input-weight products are summed, and the sum is passed through what is called a node’s activation function, to determine whether and to what extent that signal progresses further through the network to affect the ultimate outcome. A DNN uses a cascade of many layers of non-linear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input. Higher-level features are derived from lower-level features to form a hierarchical representation. The layers following the input layer may be convolution layers that produce feature maps that are filtering results of the inputs and are used by the next convolution layer.
During training of a DNN architecture, a regression, which is structured as a set of statistical processes for estimating the relationships among variables, can include a minimization of a cost function. The cost function may be implemented as a function to return a number representing how well the neural network performed in mapping training examples to correct outputs. In training, if the cost function value is not within a pre-determined range, based on the known training images, backpropagation is used, where backpropagation is a common method of training artificial neural networks that are used with an optimization method such as a stochastic gradient descent (SGD) method.
Use of backpropagation can include propagation and weight update. When an input is presented to the neural network, it is propagated forward through the neural network, layer by layer, until it reaches the output layer. The output of the neural network is then compared to the desired output, using the cost function, and an error value is calculated for each of the nodes in the output layer. The error values are propagated backwards, starting from the output, until each node has an associated error value which roughly represents its contribution to the original output. Backpropagation can use these error values to calculate the gradient of the cost function with respect to the weights in the neural network. The calculated gradient is fed to the selected optimization method to update the weights to attempt to minimize the cost function.
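For illustration, this forward pass, error calculation, backpropagation, and weight update can be sketched in a few lines of PyTorch (the network shape, loss function, and learning rate here are arbitrary choices made only for the sketch, not part of the disclosure):

    import torch
    from torch import nn

    # A small illustrative network trained for one step with SGD.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    loss_fn = nn.CrossEntropyLoss()                            # the cost function
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    inputs = torch.randn(8, 16)                                # a batch of training examples
    labels = torch.randint(0, 4, (8,))                         # their desired outputs

    outputs = model(inputs)                                    # forward pass, layer by layer
    loss = loss_fn(outputs, labels)                            # compare to the desired output
    loss.backward()                                            # backpropagate error gradients
    optimizer.step()                                           # update weights to reduce the cost
    optimizer.zero_grad()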
GNNs are neural network architectures that operate on a graph. The aim of a GNN is for each node in the graph to learn an embedding containing information about its neighborhood (nodes directly connected to the target node via edges). In a GNN, the nodes in the input graph are associated with features and labels. Typical tasks in GNNs include node classification (predicting the label of a node), link prediction (predicting the possibility of a link between given nodes), and graph classification (predicting the category for the graph).
To perform these tasks, GNNs combine feature information with graph structure to learn representations of nodes, which are low-dimensional vectors referred to as embeddings. Thus, learning such embeddings is one of the goals of GNNs. Several GNN architectures are available, including GraphSAGE, Graph Convolution Networks (GCNs), and Graph Attention Networks (GATs). While each of these types of GNNs has its own advantages, they fundamentally differ in how the graph structure is used to learn the embeddings and what neural network transformations are used to aggregate neighborhood information.
At a high level, GNNs learn embeddings by combining iterative graph propagation and DNN operations (e.g., matrix multiplication and convolution). The graph structure is used to determine what to propagate and NNs direct how aggregations are done. Each node creates a k-layer computation graph based on its k-hop neighborhood and the associated features, which can be executed on a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU) using existing DNN architectures.
One of the distinctions between GNNs and DNNs is their input data dependencies: while traditional DNNs train on samples that are independent of each other (e.g., images), GNNs impose dependencies in the form of the graph structure. It is common to have a large number of dense features associated with every node (e.g., ranging from hundreds to several thousands) in the graph, and the k-hop computation graphs created by each node can be very large. Some techniques can help to process these nodes, such as sampling the neighbors at each layer, but depending on the graph structure, even a sampled computation graph and the features may not fit in the memory of a single GPU, making scalability a fundamental issue in GNNs. With the prevalence of large graphs, e.g., billions of nodes and edges, enabling GNN training in a distributed fashion is an important and challenging problem.
A system, referred to as P3, is presented that efficiently distributes the training of GNNs on large input graphs. Due to the data dependency in existing distributed training of GNNs, a major fraction of time is spent in network communication to generate the embedding computation graph with features. Further, relying on distributed graph processing techniques, such as advanced partitioning schemes, is useful in the context of graph processing but does not benefit GNNs, and may even be detrimental in some cases. Further yet, due to the high-network traffic issue, GPUs in distributed GNN training are underutilized, spending as much as 80% of the time blocked on communication operations. P3 addresses these problems to eliminate these inefficiencies.
P3 introduces a different approach by partitioning the computation graph and features into several subgraphs, which are trained in multiple GPUs and use an online partitioner and memory management techniques to leverage CPU and GPU computation capabilities.
P3 leverages a key characteristic of GNNs: unlike traditional DNNs where the inputs are small and model parameters are large (e.g., 8 billion for Megatron), GNNs have small model parameters but large inputs due to the large feature vectors associated with each node that account for the majority of network traffic during the GNN execution. Based on this, P3 distributes the graph structure and the features across the machines independently, relying on a random hash partitioner that is fast, computationally simple, and that incurs minimal overhead.
While calculating the embeddings, P3 takes a new approach. Instead of creating the computation graph by pulling the neighborhood of a node and the associated features, P3 pulls the graph structure, which is significantly smaller, and proposes push-pull parallelism, a novel approach to executing the computation graph that combines intra-layer model parallelism with data parallelism. P3 does not move features across the network; instead, it pushes the computation graph structure in the most compute-intensive layer (layer 1) to the machines, which results in an even spread of the execution of layer 1. P3 proceeds with the rest of the k layers by pulling the activations, which are much smaller, and switches to data parallelism.
Due to the partitioning strategy and the push-pull parallelism execution, P3 is able to use a simple pipelining technique that overlaps most of the computation and communication efficiently, thus effectively hiding (the already small) communication latencies. Further, the partitioning strategy also enables P3 to propose a simple caching mechanism that greedily caches graph and feature partitions on multiple machines if memory permits, resulting in further reduction of network communications.
The combination of these techniques enables P3 to outperform previous GNN solutions by a factor of up to seven times in time reduction for calculating the embeddings. Further, P3 is able to support much larger graphs and to scale with ease.
Figure 2 illustrates the embedding of nodes in a GNN, according to some example embodiments. GNNs map graph nodes to a d-dimensional embedding space such that similar nodes in the graph are embedded close to each other.
In the illustrated example, nodes C and D are connected in the input graph 202, which means that they are proximate to each other. The embedding ENC(C) 206 of node C results in Z_C and the embedding ENC(D) of node D results in Z_D, within the embedding space 204. Since the nodes are proximate, their embeddings Z_C and Z_D are also proximate. However, node G is not proximate to nodes C or D, so the embedding Z_G is not proximate to Z_C and Z_D.
To obtain these embeddings, GNNs combine feature information associated with the nodes and the graph structure using information propagated and transformed from its neighborhood. The NN which transforms information at each hop is called a layer in the GNN; hence, a 2-hop (k-hop) neighborhood translates to a 2 (k) layer GNN.
The embedding z_v of node v after K layers of neighborhood aggregation is referred to as h_v^{(K)} and can be obtained with the following equations:
h_{N(v)}^{(k)} = AGGREGATE^{(k)}({ h_u^{(k-1)} : u ∈ N(v) })
h_v^{(k)} = σ(W^{(k)} · COMBINE(h_v^{(k-1)}, h_{N(v)}^{(k)}))
Here, h_v^{(k)} is the feature vector of node v at layer k, W^{(k)} is the trainable weight matrix that is shared by all nodes for layer k, N(v) are the neighbors of node v, and σ is a non-linear activation applied to the result. Further, h_v^{(0)} is initialized using the node features. The choice of AGGREGATE and COMBINE defines the layers and how the embeddings are computed.
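For illustration, a single layer of this recurrence can be sketched in PyTorch as follows, assuming a mean aggregator for AGGREGATE, concatenation for COMBINE, and ReLU for σ (the shapes and names are assumptions made only for the sketch):

    import torch
    from torch import nn

    class SAGELayer(nn.Module):
        # One layer k: AGGREGATE = mean over N(v), COMBINE = concatenation, sigma = ReLU.
        def __init__(self, in_ft, out_ft):
            super().__init__()
            self.weight = nn.Linear(2 * in_ft, out_ft)   # plays the role of W^(k)

        def forward(self, h, neighbors):
            # h: [num_nodes, in_ft] embeddings h^(k-1); neighbors[v]: index tensor for N(v)
            agg = torch.stack([h[nbrs].mean(dim=0) for nbrs in neighbors])
            combined = torch.cat([h, agg], dim=1)
            return torch.relu(self.weight(combined))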
The embeddings may be used for generating recommendations. For example, if the user buys two products in an Internet commerce website, then the commerce website assigns a link to the products, because they are related, at least by the fact that the same customer bought both products at the same time. The commerce website may generate a very large graph with all these nodes and the links between the nodes. When a customer buys a product, the commerce website can recommend related products by selecting products with embeddings that are in proximity to the embedding of the product that the customer is buying.
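As an illustration, such a recommendation lookup over the learned embeddings might be sketched as follows (the cosine-similarity scoring and the function name are assumptions, not the disclosed implementation):

    import torch
    import torch.nn.functional as F

    def recommend(embeddings, product_id, k=5):
        # embeddings: [num_products, d] learned node embeddings; return the k products
        # whose embeddings are closest to the embedding of the purchased product.
        query = embeddings[product_id].unsqueeze(0)
        scores = F.cosine_similarity(embeddings, query, dim=1)
        scores[product_id] = float('-inf')               # do not recommend the product itself
        return torch.topk(scores, k).indices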
To create these embeddings, the system creates a neighborhood graph for a given product, e.g., a one- or two-hop neighborhood. The GNN computes the neighborhood graphs, but because there may be millions of nodes, the graph may be really large with billions of data points. To calculate the embeddings, the graph is distributed across multiple machines for training.
Each machine fetches this information from other machines and then processes the computation graph to create the embeddings for the nodes assigned to the machine. Fetching this information from all other machines requires a large amount of computing resources, including network use. When a machine fetches the neighborhood information, the machine has to also fetch metadata (e.g., how many people purchased the item, price of the item, popularity of the item, how many items in stock), and this metadata is used to generate the embeddings. This information about the node is referred to as the features of the node, and the embedding of the node is referred to as the label of the node.
Figure 3 illustrates an architectural framework for distributed graph processing, according to some example embodiments. Existing frameworks for GNNs, such as the Deep Graph Library (DGL), support distributed training by combining distributed graph processing techniques with DNN techniques. The input graph is partitioned across the machines in the cluster. In this example, machine M1 302 includes a host-memory area 304 for CPU operations and a device memory area 306 for GPU operations.
Given a batch size, referred to as a training sample, from the trainer 310, the computation graph for each node in the batch is generated by pulling the k-hop neighborhood of each node along with the associated features 312 to the K-hop sampler 314 (circle 2). This requires communication with other machines (e.g., M2, M3, M4) in the cluster. Once the computation graphs are constructed, standard DNN techniques, such as mini-batch training, are used to execute the computation, where mini-batches are created and copied to the GPU memory and then the model computation is triggered to obtain the GNN model.
However, there are several challenges for GNN training in this architecture. The first challenge is the formation of communication bottlenecks due to data dependencies. Unlike traditional DNNs where the training data are independent of each other (e.g., images), GNNs impose a dependency between the training inputs in the form of graph structure. Thus, even though the batch size provided could be small (e.g., 1000), the computation graph samples could be exponentially larger due to the k-hop neighborhood and the associated features 312. A major reason for such large size is not the graph structure but the features, whose sizes typically range in hundreds to several thousands. In many real-world graphs, consisting of billions of nodes and edges, the 2-hop neighborhoods could be up to an order of magnitude larger than the 1-hop neighborhood.
When combined with the features, the resulting computation graph may easily exceed the memory of a single GPU or even main memory of a server. A common technique used to address such neighborhood explosion is sampling, where, instead of getting all the neighbors of the node in each hop, a fixed number is selected. An example is described below with reference to Figure 4, where the two-hop computation graph for node 1 is generated by sampling two neighbors at each hop. However, even with this sampling, the size of the computation graph could grow substantially, based on the sampling used and the number of layers. Since the neighborhood data and the features are obtained through the network, distributed training of GNNs spends a major fraction of time in network communication.
A second challenge is the ineffectiveness of partitioning, a common approach used in the distributed graph processing of existing GNN frameworks. There are two issues. First, many partitioning schemes incur a cost in terms of computation and memory overhead. Second, the benefits of partitioning are severely limited as the layers in the GNN increase. It is noted that GNNs use k-hop neighborhoods to compute the embedding, and while partitioning schemes reduce communication, they only optimize communication at the first hop. Thus, when the number of layers increases, all partitioning schemes tend to fail.
A third challenge relates to GPU underutilization. Existing GNN frameworks utilize DNN techniques, such as data parallelism, to execute the generated computation graph. In data parallel execution, each machine executes a different set of samples. However, due to the data dependency, there is a communication bottleneck. A large fraction (e.g., up to 80%) of the time, the GPUs are waiting on communication. Thus, GPUs are heavily under-utilized in distributed GNN training due to network communication bottlenecks.
Figure 4 illustrates an example of building a two-layer neighborhood graph, according to some example embodiments. Graph 402 includes 6 nodes, numbered from 1 to 6, and matrix 404 includes the values of F features for the six nodes (the feature vectors for each node), where each row corresponds to one node and each column to one of the features.
The two-hop neighborhood sampling 406 for node 1 is calculated with a first layer 410 including the nodes with outgoing links to node 1 (nodes 2 and 5) and the second layer 408 with the nodes with outgoing links to the nodes in the first layer 410. The NN aggregates the features of nodes 2 and 5 for the computation associated with node 1.
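A minimal sketch of this sampled k-hop construction over an adjacency map is shown below (the data layout and the helper name are assumptions made only for illustration):

    import random

    def sample_k_hop(in_adj, seed, fanouts):
        # in_adj: dict mapping a node to the nodes with outgoing links to it
        # seed: the node whose computation graph is built (e.g., node 1)
        # fanouts: neighbors sampled per hop, e.g., [2, 2] for the two-layer example
        layers, frontier = [], [seed]
        for fanout in fanouts:
            sampled = []
            for v in frontier:
                nbrs = in_adj.get(v, [])
                sampled.extend(random.sample(nbrs, min(fanout, len(nbrs))))
            layers.append(sampled)
            frontier = sampled        # the next hop expands from the sampled nodes
        return layers                 # layers[0] is the 1-hop layer, layers[1] the 2-hop layer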
Figure 5 illustrates a scheme for partitioning a graph structure and corresponding features for distributed processing, according to some example embodiments. Figure 5 shows an example partitioning scheme for the graph 402 in Figure 4.
The work for one or more nodes is distributed to the different machines. In this example, diagram 502 is for node 1 at machine M1, diagram 504 is for nodes 2 and 5 at machine M2, diagram 506 is for node 3 at machine M3, and diagram 508 is for nodes 4 and 6 at machine M4.
The diagrams show the one-hop graphs for the nodes, and the corresponding machine gets the feature vector for the corresponding node or nodes. Thus, machine M1 gets the feature vector for node 1, and machine M2 gets the feature vectors for nodes 2 and 5.
Figure 6 illustrates the P3 method for partitioning a graph structure and corresponding features for distributed processing, according to some example embodiments. P3 presents an approach to distributed GNN training that reduces the overhead with computation graph generation. P3 provides independent hash partitioning of graph and features. In general, partitioning the input graph in an intelligent manner does not benefit GNN architectures significantly due to the characteristics of GNNs. P3 uses a simple partitioning scheme for the graph and associated features.
In some example embodiments, the nodes in the input graph are partitioned using a random hash partitioner, and the edges are co-located with their incoming nodes. Other embodiments may use other simple partitioning mechanisms, such as round robin and the like. The random hash partitioner is equivalent to 1D partitioning schemes available in some distributed graph processing frameworks and is computationally simple. Unlike other schemes (e.g., 2D partitioning), this scheme does not require preprocessing steps (e.g., creating local identifiers) or maintaining a separate routing table to find the partition where a node is present, as the partition can be computed on the fly.
This partitioning of the graph ensures that P3 can support large graphs. In several cases, the graph structure (nodes and edges without the features) of real-world graphs can be held in the main memory of modern server class machines. In these cases, P3 can simply replicate the entire graph structure in every machine, which can further reduce the communication requirements.
While the graph structure may fit in memory, the input features rarely do. Typical GNNs work on input graphs where the feature vector sizes range from hundreds to several thousands. P3 partitions the input features along the feature dimension; that is, if the dimension of features is F and a machine cluster has N machines, then P3 assigns F/N features of every node to each of the machines. This is in contrast to existing partitioning schemes tuned for machine learning (ML) tasks.
P3 uses a hash-partitioning method for distributing the load to the different machines; that is, the identifier (ID) of the node (or some other value associated with the node) is hashed and the result indicates the machine assigned to the node.
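For illustration, the two independent assignments can be sketched as follows (a minimal sketch: the use of Python's built-in hash and of contiguous feature columns are assumptions, and a production system would use a stable hash function):

    def node_machine(node_id, num_machines):
        # Random hash partitioner for the graph structure; edges are co-located
        # with their incoming nodes, so only the node assignment is needed.
        return hash(node_id) % num_machines

    def feature_slice(features, machine_id, num_machines):
        # features: a [num_nodes, F] array (e.g., NumPy or torch); each machine stores
        # F/N of the feature columns of every node, independent of the node assignment.
        num_features = features.shape[1]
        width = num_features // num_machines
        start = machine_id * width
        return features[:, start:start + width]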
In the illustrated example, the distribution to the four machines is the same as in Figure 5, but other examples may generate a different assignment of nodes to machines. Thus, diagram 602 is for node 1 at machine M1, diagram 604 is for nodes 2 and 5 at machine M2, diagram 606 is for node 3 at machine M3, and diagram 608 is for nodes 4 and 6 at machine M4.
In this example, there are four features, and each feature is assigned to one of the machines; that is, each machine gets assigned one column from the matrix 404 of Figure 4 that includes the data for all the nodes for the corresponding feature. Further, as described in more detail below with reference to Figure 7, there are four sample sets, so each machine gets data for one of the features for all four sample sets.
Thus, in P3, the data is independently partitioned and it does not matter what features end up in what machine. That is, there is no relation between the feature slice in the machine and the nodes assigned to that machine.
This independent, simple partitioning of the graph and features enables P3's techniques to accelerate the training process. Breaking up the input along the feature dimension enables P3 to achieve work balance when computing embeddings. The hash-based partitioner ensures that the nodes, and the features in the layers farther from the node, are spread across the cluster of machines evenly. The simplicity of independently partitioning the structure and features also lets P3 cache structure and features independently in P3’s caching mechanism.
Figure 7 illustrates the process for generating embeddings, according to some example embodiments. P3 uses push-pull parallelism, and with the input graph and features partitioned, P3 adopts the mini-batch computation for GNNs, where P3 first generates the computation graph for a node and then executes it. In this example, there are four nodes, and each machine is assigned the computations for one of the nodes.
At the beginning of every mini -batch, each machine computes the computation graph for the nodes whose embeddings are computed at the machine. To generate the computation graph, each machine pulls the k-hop neighborhood for the node. Pulling refers to copying the data, possibly over the network, if the data is not local.
If the GNN architecture supports a sampling-based embedding computation, P3 pulls the sampled k-hop neighborhood; otherwise, P3 pulls the full k-hop neighborhood. It is noted that unlike existing GNN frameworks, the features are not pulled in either case to calculate the k-hop neighborhood, thereby significantly reducing the network communication necessary to create the computation graph.
If the entire graph structure is available in every machine, this is a local operation. Otherwise, this results in minimal network communication as the graph structure is small. At the end of this process, P3 ends up with the k-layer computation graph of each node in the minibatch at the machine which owns the node. In existing GNN frameworks, the machine owning the node ends up with both the computation graph and all the features for every layer.
In the case of existing GNNs, each machine can now independently execute the complete computation graph with features obtained in a data parallel fashion, starting at layer 1 and invoking global gradient synchronization at every layer boundary in the backward pass. However, P3 utilizes a hybrid parallelism approach that combines model parallelism and data parallelism, referred to herein as push-pull parallelism.
While model parallelism is rarely used in traditional DNNs, due to the underutilization of resources and difficulty in determining how to partition the model, P3 uses model parallelism to its advantage. Due to the nature of GNNs, the model (embedding computation graphs) is easy to partition cleanly since the boundaries (hops) are clear. Further, due to P3's partitioning strategy, the model parallelism does not suffer from underutilization of resources.
To start the computation-graph execution, P3 pushes 714 the computation graph for layer 1 to all the machines. In this case, the training data 703 includes four sample sets S1-S4. It is noted that layer 1 is the most compute intensive, as it requires input features from layer 0 (having the most fan-out), which are evenly spread in P3 due to the partitioning scheme.
Each of the features 704 is assigned to one machine, such as by using the hash algorithm. Thus, machine M1 702 receives one or more features 704 assigned to machine M1 702. In the illustrated example, each machine receives one feature, and each node is assigned one of the features, but other embodiments may have machines that receive more than one feature. Since there are four sample sets S1-S4, machine M1 702 receives the values of the first feature from the four sample sets, represented as four planes in Figure 7.
Once each machine (e.g., machine M1 702) obtains the computational graph for layer 1, the machine starts the forward pass in a model parallel fashion with layer 1M 718. Each machine computes partial activations 706 for layer 1M using the partition of input features the machine owns.
The intermediate results between the layers are referred to as activations. An activation function decides whether a neuron should be activated by calculating a weighted sum of its inputs and adding a bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron. An NN has neurons that operate according to their weights, biases, and respective activation functions. In an NN, the weights and biases of the neurons are updated on the basis of the error at the output. This process is known as back-propagation. Activation functions make back-propagation possible since the gradients are supplied along with the error to update the weights and biases.
Since all GPUs in the cluster collectively execute the layer which requires input from the most fan-out, this avoids underutilization of GPUs in P3. During testing, it was observed that GPUs in existing GNN frameworks (e.g., DGL) spend 80% of the time blocked on network communications compared to 15% blocking for P3.
Once the partial activations 706 are computed, the machine assigned to each node pulls 720 the partial activations 706 from the other machines. The node receiving the partial activations aggregates them using a reduce operation 716 to generate partial activations 708.
At this point, P3 switches to data parallelism mode to calculate layer 1D. The partial activations 708 are then passed through the rest of the layer 1D operations (if any, e.g., non-linear operations that cannot be partially computed) to obtain the final activations 710 for layer 1. The computation proceeds in a data-parallel fashion to obtain the embedding 712, at which point the forward pass ends.
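A highly simplified sketch of this layer-1 flow, from the perspective of the machine that owns a node, is shown below (the pull mechanism, the weight slicing, and the use of a sum reduction followed by a ReLU are assumptions made only for illustration):

    import torch

    def layer1_forward(local_features, w1_local, pull_remote_partials):
        # Model parallel phase (layer 1M): a partial activation computed from only the
        # slice of the input features (and of the layer-1 weights) that this machine owns.
        local_partial = local_features @ w1_local          # [batch, hidden]
        # The owner machine pulls the other machines' partial activations, reduces them,
        # and then applies the remaining layer 1D operations locally.
        reduced = local_partial
        for partial in pull_remote_partials():             # list of [batch, hidden] tensors
            reduced = reduced + partial
        return torch.relu(reduced)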
The reason for calculating the activations in this way is that the layers could be non-linear. The aggregation function adds up the activations, and the aggregation could include functions such as finding the maximum, which is a non-linear function and cannot be combined in an arbitrary order; it has to be combined in a particular order to get the desired results. If there is a non-linearity in the graph neural network, the layer cannot be processed in one operation, which means distributed processing cannot be done for the non-linear part. This is why there is a layer 1M and a layer 1D. The non-linear part has to be, perforce, at the corresponding machine, because it has to be ordered. In some GNNs where there is no non-linearity, layer 1D is not used or is a no-operation.
The backward pass proceeds in similar fashion to existing GNN frameworks in a data parallel fashion, invoking global gradient synchronizations until layer 1D (circle 6) 722. At layer 1D, P3 pushes 724 the error gradient to all machines in the cluster and switches to model parallelism. Each machine now has the error gradients to apply the backward pass for layer 1M locally (circle 8), and the backward pass phase ends. Thus, the errors calculated in the back propagation are sent to all the machines, and each machine can now do its error computation.
Because supervised learning is used, the errors are calculated based on the partial results obtained and the ground truth known from the training set. This is the backward process where the partial errors are distributed back to the machines. Then, the next epoch (e.g., iteration) proceeds after adjusting parameters (e.g., hyperparameters) for the next epoch. For example, the process may be repeated for 100 epochs, 1000 epochs, or some other value.
While the partial activation computation in a model parallel fashion seemingly works in the general sense, it is restricted to transformations that can be aggregated from partial results. However, in certain GNN architectures (e.g., GAT), layer 1M in itself may introduce non-linear transformations. P3 relies on developer input to determine the tensors that require global synchronizations during model parallel execution to ensure correctness.
At first glance, the additional steps in P3, namely the need to push the graph structure in layer 1, the aggregation of partial activations during the forward pass, and the additional push of gradients in the backward pass, may seem like overheads that could lead to inefficiencies compared to just pulling the features along with the graph structure and executing everything locally as in existing GNN frameworks. However, P3's approach results in significant savings in network communication. First, P3 does not pull features at all, which tremendously reduces network traffic; typically, the 2-hop neighborhood in GNN computation graphs is an order of magnitude larger than the 1-hop neighborhood. Second, regardless of the number of layers in the GNN, only layer 1 needs to be partially computed and aggregated. Finally, the sizes of the activations and gradients are small in GNNs (due to the smaller number of hidden dimensions); thus, transferring them incurs much less overhead when compared to transferring features.
To illustrate this, an experiment was run with a representative GNN: a 2-layer GraphSAGE on the open-source OGB-Product dataset on four machines. A thousand labeled nodes were used to compute the embeddings, with neighborhood sampling (fan-out: 25, 10). The nodes were associated with feature vectors of size 100, and there were 16 hidden dimensions. At layer 0 (2 hops), there were 188339 nodes, and at layer 1 (1 hop), there were 24703 nodes. Pulling features along with the graph structure incurred 71.84 MB of network traffic. On the other hand, the activation matrix is just of size input × hidden dimension. P3 only needed to transfer the partial activations from three other machines, thus incurring just 5 MB (3 × 24703 × 16 values) of traffic. Hence, by distributing the activation computation of the layer that holds the largest number of features, P3 was able to significantly reduce network communications.
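The 5 MB figure can be checked with a quick back-of-the-envelope calculation, assuming 4-byte floating-point values:

    other_machines, layer1_nodes, hidden_dim, bytes_per_value = 3, 24703, 16, 4
    partial_activation_traffic = other_machines * layer1_nodes * hidden_dim * bytes_per_value
    print(partial_activation_traffic / 1e6)   # ~4.7 MB, roughly the 5 MB reported above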
Models may be run against a training dataset for several epochs (e.g., iterations), in which the training dataset is repeatedly fed into the model to refine its results. Once an epoch is run, the models are evaluated and the values of their variables are adjusted in an attempt to better refine the model in an iterative fashion. In various aspects, the evaluations are biased against false negatives, biased against false positives, or evenly biased with respect to the overall accuracy of the model. Each model develops a rule or algorithm over several epochs by varying the values of one or more variables affecting the inputs to more closely map to a desired result.
In existing systems, the training data 703 is distributed to the machines, where each machine gets all the features for the nodes assigned to it. Every machine ends up with the graph structure and the features for its nodes. Once a machine has all the information, it pushes the information to the GPU in the machine, and the GPU then starts processing layer by layer.
Figure 8 illustrates the pipeline processing for creating the embeddings, according to some example embodiments. Although P3’s push-pull parallelism GNN computation graph creation and execution incurs significantly less network communication compared to existing GNN frameworks, P3 needs to communicate more times because P3 pushes the graph structure of layer 1, partial activations in the forward pass, and gradients in the backward pass. Further, since P3 is focused on distributed settings, data copying is necessary between CPU and GPU.
During the communication phase, the computation is stalled unless it is overlapped with communication using pipelining techniques. It is noted that GNN frameworks, such as DGL, already overlap computation and communication: while the CPU is busy creating the computation graph, the GPU is used to execute an already prepared mini batch.
P3 uses a simple pipelining mechanism. Pipelining refers to the process where, while one graph is pulled to the CPU for one product, the graph of a different product is processed in the GPU, so the GPU is working while the information is being pulled. With P3's pipelining, P3 can achieve almost 85% utilization of the GPU, compared to existing systems that only use about 15% of the GPU.
As discussed earlier, P3 pulls both the graph and the metadata 308. In existing systems, these two operations are coupled, but P3 decouples them so they can be performed separately.
Due to the approach in P3 to enable push-pull parallelism (switching between model and data parallelism at layer 1), P3 executes four phases per mini-batch: a model parallel phase in the forward pass 802, a data parallel phase in the forward pass 802, a data parallel phase in the backward pass 804, and a model parallel phase in the backward pass 804. This provides the opportunity to schedule four mini-batches of computation before a data dependency is created between phases. Figure 8 shows the operations for four machines M1-M4. The arrows 806 denote sample data dependency for the operations. The digits in each cell correspond to the layer being calculated.
An M suffix means that the machine is in a model parallel phase and executing that mini-batch in a model parallel fashion together with the other machines. A D suffix means that only one machine is processing the information for that mini-batch (the data parallel phase).
Thus, computations are overlapped, as shown in Figure 8. In the forward pass 802, the data parallel phase of minibatch 3 (denoted as 3D) has a data dependency on the model parallel phase (3M) in the forward pass. Hence, when phase 3M starts communication, two forward and two backward passes from other mini-batches are scheduled. The two forward and two backward strategy allows optimal overlap.
The weights 808, W1, W2, and W3, correspond to each of the respective layers and are used for the error gradient calculation during back propagation.
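The interleaving in Figure 8 can be pictured with the toy scheduler below, which simply staggers the four phases of successive mini-batches by one time slot; the actual P3 scheduler additionally follows the two forward and two backward strategy and honors the data dependencies indicated by the arrows 806:

    def toy_schedule(num_minibatches=4):
        # Phases per mini-batch: forward model parallel, forward data parallel,
        # backward data parallel, backward model parallel.
        phases = ['F-M', 'F-D', 'B-D', 'B-M']
        slots = {}
        for mb in range(1, num_minibatches + 1):
            for offset, phase in enumerate(phases):
                slots.setdefault(mb + offset, []).append((mb, phase))
        return slots   # time slot -> list of (mini-batch, phase) running concurrently

    # e.g., toy_schedule()[4] -> [(1, 'B-M'), (2, 'B-D'), (3, 'F-D'), (4, 'F-M')]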
The use of independent partitioning for the graph structure and features allows P3 to employ a caching scheme that can reduce the already minimal communication overhead. Depending on the graph and the size of the features, either the graph or the feature may be accommodated in fewer machines than are available. By default, the features and the graph are partitioned without duplication across the available machines. However, when host memory is available, P3 uses a simple greedy approach to utilize all the available free memory by caching the partitions of the graph and/or features on multiple machines using a user-defined setting. In some example embodiments, caching is used to try to fit the input in the minimum number of machines and create copies (e.g., caches) of partitions in other machines.
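One possible greedy policy is sketched below (the smallest-partition-first ordering and the bookkeeping structures are assumptions; in P3 the behavior is governed by a user-defined setting):

    def greedy_cache(partition_sizes, free_memory):
        # partition_sizes: {partition_id: bytes}; free_memory: {machine_id: free bytes}
        # Greedily replicate partitions onto machines that still have room, smallest
        # partitions first, returning (partition, machine) cache placements.
        placements = []
        for pid, size in sorted(partition_sizes.items(), key=lambda kv: kv[1]):
            for machine in free_memory:
                if size <= free_memory[machine]:
                    free_memory[machine] -= size
                    placements.append((pid, machine))
        return placements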
Further, P3 provides an API that developers can use to speed up new GNN architectures. The following table details one sample embodiment of the API:
Table 1
The following is an example use of the API:

    class GraphSAGE:
        def __init__(graph, in_ft, out_ft):
            graph['n'] = DistTensor(graph['n_p'], op='sum')
            fc_self = fc_neigh = Linear(in_ft, out_ft)

        # Generates message
        def scatter_udf(s_ft, e_ft, d_ft):
            return s_ft

        # Aggregates messages
        def gather_udf(m_ft):
            return mean(m_ft)

        # Combines aggregate and prior feature
        def transform(v_ft, a_ft):
            return fc_self(v_ft) + fc_neigh(a_ft)

        # Applies non-linear operation
        def apply(v_ft, t_ft):
            return ReLU(t_ft)

        def forward(graph, feat):
            graph['m'] = scatter(graph, feat, scatter_udf)
            graph['n_p'] = gather(graph['m'], gather_udf)
            graph['n_p'] = transform(feat, graph['n_p'])
            return apply(feat, graph['n'])
In one example implementation, P3 uses PyTorch and is adapted from the DGL repository. PyTorch's DistributedDataParallel and Distributed RPC Framework are utilized for multi-machine model training, which combines intra-layer model parallelism and data parallelism. P3 uses a pipelined schedule of two forward and two backward passes and rpc.remote calls to move mini-batch samples across machine boundaries.
In one example implementation, P3 users can register a DistTensor, which is a tensor that enables storing node (and edge) data across machine boundaries. A DistTensor is sharded and stored across a cluster of machines, where sharding is determined using partitioning metadata. Using partitioning metadata for sharding ensures that the corresponding dimensions in different DistTensors are split consistently. While training using model parallelism, NN-operations can manipulate only the local shard of a DistTensor without synchronizing. If an NN-operation, at any point during model-parallel training, needs a remote shard of a DistTensor, the user has to ensure that the DistTensor is synchronized by calling its sync method and passing a reduction operation (e.g., sum). P3 automatically synchronizes a DistTensor when training switches from model to data parallelism; therefore, no such restriction applies to NN-operations that are executed while the model is being trained using data parallelism, as these operations can manipulate any shard of the DistTensor without incurring network communication. It is noted that, in P3, all layers except layer 1 are trained using data parallelism, and the above restrictions do not apply to the NN-operations associated with them, as all shards of the DistTensor are available locally. To enable automatic differentiation, P3 records when DistTensor shards are sent (and received) over RPC during the forward pass and leverages PyTorch's Distributed Autograd and Distributed Optimizer to perform the distributed backward pass and parameter updates.
Figure 9 illustrates the performance improvements achieved using the P3 method, according to some example embodiments. Chart 902 shows a performance comparison between P3 and other existing methods. The horizontal axis is for three different graphs: OGB-Product, OGB-Paper, and UK-2006-5.
OGB-Product has 2.4 million nodes, 124 million edges, and 100 features. OGB-Paper has 111 million nodes, 1.6 billion edges, and 128 features. UK-2006-5 has 78 million nodes, 2.9 billion edges, and 256 features. Further, GCN and GraphSAGE are two types of GNN architectures.
For each case, a comparison was made between DGL and P3. The results show the clear advantage of P3 over prior methodologies. The vertical axis is for the average epoch time in seconds.
Figure 10 illustrates the training and use of a machine-learning model, according to some example embodiments. In some example embodiments, ML models 1016 are utilized to perform operations associated with GNNs, such as providing product recommendations.
ML is an application that provides computer systems the ability to perform tasks, without explicitly being programmed, by making inferences based on patterns found in the analysis of data. Machine learning explores the study and construction of algorithms, also referred to herein as tools, that may learn from existing data and make predictions about new data. Such machine learning algorithms operate by building an ML model 1016 from example training data 1012 in order to make data-driven predictions or decisions expressed as outputs or assessments 1020. Although example embodiments are presented with respect to a few machine-learning tools, the principles presented herein may be applied to other machine-learning tools.
Data representation refers to the method of organizing the data for storage on a computer system, including the structure for the identified features and their values. In ML, it is typical to represent the data in vectors or matrices of two or more dimensions. When dealing with large amounts of data and many features, data representation is important so that the training is able to identify the correlations within the data.
There are two common modes for ML: supervised ML and unsupervised ML. Supervised ML uses prior knowledge (e.g., examples that correlate inputs to outputs or outcomes) to learn the relationships between the inputs and the outputs. The goal of supervised ML is to learn a function that, given some training data, best approximates the relationship between the training inputs and outputs so that the ML model can implement the same relationships when given inputs to generate the corresponding outputs. Unsupervised ML is the training of an ML algorithm using information that is neither classified nor labeled and allowing the algorithm to act on that information without guidance. Unsupervised ML is useful in exploratory analysis because it can automatically identify structure in data.
The training data 1012 comprises examples of values for features 1002. In some example embodiments, the training data comprises labeled data with examples of values for the features 1002 and labels indicating the outcome, such as product features, product embeddings, and so forth. The machine-learning algorithms utilize the training data 1012 to find correlations among identified features 1002 that affect the outcome. A feature 1002 is an individual measurable property of a phenomenon being observed. The concept of a feature is related to that of an explanatory variable used in statistical techniques such as linear regression. Choosing informative, discriminating, and independent features is important for effective operation of ML in pattern recognition, classification, and regression. Features may be of different types, such as numeric features, strings, and graphs.
During training 1014, the ML program, also referred to as a ML algorithm or ML tool, analyzes the training data 1012 based on identified features 1002 and configuration parameters 1011 defined for the training. The result of the training 1014 is the ML model 1016 that is capable of taking inputs to produce assessments.
Training an ML algorithm involves analyzing large amounts of data (e.g., from several gigabytes to a terabyte or more) in order to find data correlations. The ML algorithms utilize the training data 1012 to find correlations among the identified features 1002 that affect the outcome or assessment 1020. In some example embodiments, the training data 1012 includes labeled data, which is known data for one or more identified features 1002 and one or more outcomes, such as product information, graph information, including node and edge information, and so forth.
The ML algorithms usually explore many possible functions and parameters before finding what the ML algorithms identify to be the best correlations within the data; therefore, training may make use of large amounts of computing resources and time.
Many ML algorithms include configuration parameters 1011, and the more complex the ML algorithm, the more parameters there are that are available to the user. The configuration parameters 1011 define variables for an ML algorithm in the search for the best ML model. The training parameters include model parameters and hyperparameters. Model parameters are learned from the training data, whereas hyperparameters are not learned from the training data, but instead are provided to the ML algorithm.
Some examples of model parameters include maximum model size, maximum number of passes over the training data, data shuffle type, regression coefficients, decision tree split locations, and the like. Hyperparameters may include the number of hidden layers in a neural network, the number of hidden nodes in each layer, the learning rate (perhaps with various adaptation schemes for the learning rate), the regularization parameters, types of nonlinear activation functions, and the like. Finding the correct (or the best) set of hyperparameters can be a very time-consuming task that makes use of a large amount of computer resources.
When the ML model 1016 is used to perform an assessment, new data 1018 is provided as an input to the ML model 1016, and the ML model 1016 generates the assessment 1020 as output.
Figure 11 is a flowchart of a method 1100 for generating embeddings for nodes in a GNN, according to some example embodiments. While the various operations in this flowchart are presented and described sequentially, one of ordinary skill will appreciate that some or all of the operations may be executed in a different order, be combined or omitted, or be executed in parallel. Operation 1102 is for assigning each node from a graph of a GNN to one machine from a plurality of machines. From operation 1102, the method 1100 flows to operation 1104, where each feature, from a plurality of features, is assigned to one machine from the plurality of machines.
From operation 1104, the method 1100 flows to operation 1106 for calculating, at each machine, an embedding for each node assigned to the machine. Calculating the embedding for a first node in a first machine comprises operations 1108-1112.
At operation 1108, a computation graph is calculated for the first node for a predetermined number of layers of the graph.
From operation 1108, the method 1100 flows to operation 1109 for obtaining, from a set of training data, data for the features assigned to the first machine.
From operation 1109, the method 1100 flows to operation 1110, where the first machine computes partial activations for a first layer of the graph.
From operation 1110, the method 1100 flows to operation 1111, where the first machine pulls partial activations for the first node from other machines.
From operation 1111, the method 1100 flows to operation 1112 for processing the partial activations through a remainder of the layers to obtain an embedding for the first node.
In one example, processing the partial activations includes aggregating the partial activations for the first node using a reduce operation.
In one example, processing the partial activations further includes passing the aggregated partial activations through a remainder of the layers of the graph to obtain the embedding for the rest of the nodes.
In one example, the method 1100 further comprises performing a backward pass to obtain error gradients for the embeddings of the nodes.
In one example, the method 1100 further comprises repeating the calculation of the embeddings and the backward pass for a plurality of epochs after adjusting parameters of the GNN.
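A hedged sketch of this outer training loop, assuming the forward and backward passes are supplied as callables (forward_fn and backward_fn are hypothetical placeholders for the per-machine computations of method 1100 and its backward pass):

def train(gnn_params, forward_fn, backward_fn, num_epochs, learning_rate):
    # forward_fn returns (embeddings, loss); backward_fn returns a dict of
    # error gradients keyed like gnn_params. Both are supplied by the caller.
    for _ in range(num_epochs):
        embeddings, loss = forward_fn(gnn_params)               # forward pass
        gradients = backward_fn(gnn_params, embeddings, loss)   # backward pass
        for name in gnn_params:                                 # adjust parameters
            gnn_params[name] -= learning_rate * gradients[name]
    return gnn_params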
In one example, calculating the computation graph for the first node includes identifying nodes with outgoing links to the first node, and assigning the identified nodes to the first layer of the graph of the first node.
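A minimal sketch of this construction, assuming the graph is given as a list of directed edges and that deeper layers are built by repeating the same in-neighbor lookup (the edge-list representation is an assumption for illustration):

def computation_graph(edges, target_node, num_layers):
    # The first layer is the set of nodes with an outgoing link to the target
    # node; each further layer repeats the lookup on the previous frontier.
    layers, frontier = [], {target_node}
    for _ in range(num_layers):
        frontier = {src for (src, dst) in edges if dst in frontier}
        layers.append(frontier)
    return layers

edges = [(1, 3), (2, 3), (5, 1), (6, 2)]
print(computation_graph(edges, target_node=3, num_layers=2))  # [{1, 2}, {5, 6}]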
In one example, nodes that are proximate in the graph have embeddings that are proximate to each other.
In one example, the nodes represent products for sale, wherein products that are related have embeddings that are proximate to each other.
In one example, the method 1100 further comprises detecting an interaction with a first product in a user interface, identifying related products to the first product based on the embeddings of the related products, and presenting the related products in the user interface.
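One way to realize this interaction flow, sketched with a cosine-similarity nearest-neighbor lookup over the learned embeddings (the product identifiers, the embedding values, and the top_k parameter are illustrative assumptions):

import numpy as np

def related_products(embeddings, product_ids, query_id, top_k=3):
    # Products whose embeddings are closest (by cosine similarity) to the
    # product the user interacted with are treated as related.
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = normalized[product_ids.index(query_id)]
    scores = normalized @ query
    order = np.argsort(-scores)
    return [product_ids[i] for i in order if product_ids[i] != query_id][:top_k]

product_ids = ["shoe", "sock", "laptop", "sneaker"]
embeddings = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.85, 0.15]])
print(related_products(embeddings, product_ids, "shoe", top_k=2))
# ['sneaker', 'sock'] -- the items whose embeddings are nearest to "shoe"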
Another general aspect is for a system comprising a plurality of machines, each machine comprising a memory comprising instructions and one or more computer processors. The instructions, when executed by the one or more computer processors, cause the one or more computer processors to perform operations comprising: assigning each node from a graph of a GNN to one machine from the plurality of machines; assigning each feature from a plurality of features to one machine from the plurality of machines; and calculating, at each machine, an embedding for each node assigned to the machine, the calculating for a first node in a first machine comprising: calculating a computation graph for the first node for a predetermined number of layers of the graph; obtaining, from a set of training data, data for the features assigned to the first machine; computing, by the first machine, partial activations for a first layer of the graph; pulling, by the first machine, partial activations for the first node from other machines; and processing the partial activations through a remainder of the layers to obtain an embedding for the first node.
In yet another general aspect, a machine-readable storage medium (e.g., a non-transitory storage medium) includes instructions that, when executed by a machine, cause the machine to perform operations comprising: assigning each node from a graph of a GNN to one machine from a plurality of machines; assigning each feature from a plurality of features to one machine from the plurality of machines; and calculating, at each machine, an embedding for each node assigned to the machine, the calculating for a first node in a first machine comprising: calculating a computation graph for the first node for a predetermined number of layers of the graph; obtaining, from a set of training data, data for the features assigned to the first machine; computing, by the first machine, partial activations for a first layer of the graph; pulling, by the first machine, partial activations for the first node from other machines; and processing the partial activations through a remainder of the layers to obtain an embedding for the first node.
In view of the disclosure above, various examples are set forth below. It should be noted that one or more features of an example, taken in isolation or combination, should be considered within the disclosure of this application.
Figure 12 is a block diagram illustrating an example of a machine 1200 upon or by which one or more example process embodiments described herein may be implemented or controlled. In alternative embodiments, the machine 1200 may operate as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine 1200 may operate in the capacity of a server machine, a client machine, or both in server-client network environments. In an example, the machine 1200 may act as a peer machine in a peer-to-peer (P2P) (or other distributed) network environment. Further, while only a single machine 1200 is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein, such as via cloud computing, software as a service (SaaS), or other computer cluster configurations.
Examples, as described herein, may include, or may operate by, logic, a number of components, or mechanisms. Circuitry is a collection of circuits implemented in tangible entities that include hardware (e.g., simple circuits, gates, logic). Circuitry membership may be flexible over time and with underlying hardware variability. Circuitries include members that may, alone or in combination, perform specified operations when operating. In an example, hardware of the circuitry may be immutably designed to carry out a specific operation (e.g., hardwired). In an example, the hardware of the circuitry may include variably connected physical components (e.g., execution units, transistors, simple circuits) including a computer-readable medium physically modified (e.g., magnetically, electrically, by moveable placement of invariant massed particles) to encode instructions of the specific operation. In connecting the physical components, the underlying electrical properties of a hardware constituent are changed (for example, from an insulator to a conductor or vice versa). The instructions enable embedded hardware (e.g., the execution units or a loading mechanism) to create members of the circuitry in hardware via the variable connections to carry out portions of the specific operation when in operation. Accordingly, the computer-readable medium is communicatively coupled to the other components of the circuitry when the device is operating. In an example, any of the physical components may be used in more than one member of more than one circuitry. For example, under operation, execution units may be used in a first circuit of a first circuitry at one point in time and reused by a second circuit in the first circuitry, or by a third circuit in a second circuitry, at a different time.
The machine (e.g., computer system) 1200 may include a hardware processor 1202 (e.g., a central processing unit (CPU), a hardware processor core, or any combination thereof), a graphics processing unit (GPU) 1203, a main memory 1204, and a static memory 1206, some or all of which may communicate with each other via an interlink (e.g., bus) 1208. The machine 1200 may further include a display device 1210, an alphanumeric input device 1212 (e.g., a keyboard), and a user interface (UI) navigation device 1214 (e.g., a mouse). In an example, the display device 1210, alphanumeric input device 1212, and UI navigation device 1214 may be a touch screen display. The machine 1200 may additionally include a mass storage device (e.g., drive unit) 1216, a signal generation device 1218 (e.g., a speaker), a network interface device 1220, and one or more sensors 1221, such as a Global Positioning System (GPS) sensor, compass, accelerometer, or another sensor. The machine 1200 may include an output controller 1228, such as a serial (e.g., universal serial bus (USB)), parallel, or other wired or wireless (e.g., infrared (IR), near field communication (NFC)) connection to communicate with or control one or more peripheral devices (e.g., a printer, card reader).
The mass storage device 1216 may include a machine-readable medium 1222 on which is stored one or more sets of data structures or instructions 1224 (e.g., software) embodying or utilized by any one or more of the techniques or functions described herein. The instructions 1224 may also reside, completely or at least partially, within the main memory 1204, within the static memory 1206, within the hardware processor 1202, or within the GPU 1203 during execution thereof by the machine 1200. In an example, one or any combination of the hardware processor 1202, the GPU 1203, the main memory 1204, the static memory 1206, or the mass storage device 1216 may constitute machine-readable media.
While the machine-readable medium 1222 is illustrated as a single medium, the term “machine-readable medium” may include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) configured to store the one or more instructions 1224.
The term “machine-readable medium” may include any medium that is capable of storing, encoding, or carrying instructions 1224 for execution by the machine 1200 and that cause the machine 1200 to perform any one or more of the techniques of the present disclosure, or that is capable of storing, encoding, or carrying data structures used by or associated with such instructions 1224. Non-limiting machine-readable medium examples may include solid-state memories, and optical and magnetic media. In an example, a massed machine-readable medium comprises a machine-readable medium 1222 with a plurality of particles having invariant (e.g., rest) mass. Accordingly, massed machine-readable media are not transitory propagating signals. Specific examples of massed machine-readable media may include non-volatile memory, such as semiconductor memory devices (e.g., Electrically Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM)) and flash memory devices; magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
The instructions 1224 may further be transmitted or received over a communications network 1226 using a transmission medium via the network interface device 1220.
Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein. The embodiments illustrated herein are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed. Other embodiments may be used and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. The Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.
As used herein, the term “or” may be construed in either an inclusive or exclusive sense. Moreover, plural instances may be provided for resources, operations, or structures described herein as a single instance. Additionally, boundaries between various resources, operations, modules, engines, and data stores are somewhat arbitrary, and particular operations are illustrated in a context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within a scope of various embodiments of the present disclosure. In general, structures and functionality presented as separate resources in the example configurations may be implemented as a combined structure or resource. Similarly, structures and functionality presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within a scope of embodiments of the present disclosure as represented by the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims

1. A system comprising a plurality of machines, each machine comprising: a memory comprising instructions; and one or more computer processors, wherein the instructions, when executed by the one or more computer processors, cause the system to perform operations comprising: assigning each node from a graph of a graph neural network (GNN) to one machine from the plurality of machines; assigning each feature from a plurality of features to one machine from the plurality of machines; and calculating, at each machine, an embedding for each node assigned to the machine, the calculating the embedding for a first node in a first machine comprising: calculating a computation graph for the first node for a predetermined number of layers of the graph; obtaining, from a set of training data, data for the features assigned to the first machine; computing, by the first machine, partial activations for a first layer of the graph; pulling, by the first machine, partial activations for the first node from other machines; and processing the partial activations through a remainder of the layers to obtain an embedding for the first node.
2. The system as recited in claim 1, wherein processing the partial activations includes: aggregating the partial activations for the first node using a reduce operation.
3. The system as recited in claim 1, wherein processing the partial activations further includes: passing the aggregated partial activations through a remainder of the layers of the graph to obtain the embedding for the rest of the nodes.
4. The system as recited in claim 1, wherein the instructions further cause the one or more computer processors to perform operations comprising: performing a backward pass to obtain error gradients for the embeddings of the nodes.
5. The system as recited in claim 4, wherein the instructions further cause the one or more computer processors to perform operations comprising: repeating the calculation of the embeddings and the backward pass for a plurality of epochs after adjusting parameters of the GNN.
6. The system as recited in claim 1, wherein calculating the computation graph for the first node includes: identifying nodes with outgoing links to the first node; and assigning the identified nodes to the first layer of the graph of the first node.
7. The system as recited in claim 1, wherein nodes that are proximate in the graph have embeddings that are proximate to each other.
8. The system as recited in claim 1, wherein the nodes represent products for sale, wherein products that are related have embeddings that are proximate to each other.
9. The system as recited in claim 1, wherein the instructions further cause the one or more computer processors to perform operations comprising: detecting an interaction with a first product in a user interface; identifying related products to the first product based on the embeddings of the related products; and presenting the related products in the user interface.
10. A computer-implemented method comprising: assigning each node from a graph of a graph neural network (GNN) to one machine from a plurality of machines; assigning each feature from a plurality of features to one machine from the plurality of machines; and calculating, at each machine, an embedding for each node assigned to the machine, the calculating the embedding for a first node in a first machine comprising: calculating a computation graph for the first node for a predetermined number of layers of the graph; obtaining, from a set of training data, data for the features assigned to the first machine; computing, by the first machine, partial activations for a first layer of the graph; pulling, by the first machine, partial activations for the first node from other machines; and processing the partial activations through a remainder of the layers to obtain an embedding for the first node.
11. The method as recited in claim 10, wherein processing the partial activations includes: aggregating the partial activations for the first node using a reduce operation.
12. The method as recited in claim 11, wherein processing the partial activations further includes: passing the aggregated partial activations through a remainder of the layers of the graph to obtain the embedding for the rest of the nodes.
13. The method as recited in claim 10, further comprising: performing a backward pass to obtain error gradients for the embeddings of the nodes.
14. A system comprising means to perform any method of claims 10-13.
15. At least one machine-readable media including instructions that, when executed by a machine, cause the machine to perform any method of claims 10-13.