
CN113449842B - A distributed automatic differentiation method and related device - Google Patents

A distributed automatic differentiation method and related device

Info

Publication number
CN113449842B
Authority
CN
China
Prior art keywords
operator
computing node
target
computing
graph
Prior art date
Legal status
Active
Application number
CN202010231550.XA
Other languages
Chinese (zh)
Other versions
CN113449842A (en)
Inventor
杨振章
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010231550.XA priority Critical patent/CN113449842B/en
Publication of CN113449842A publication Critical patent/CN113449842A/en
Application granted
Publication of CN113449842B publication Critical patent/CN113449842B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G06F 17/13 Differential equations
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Operations Research (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Testing And Monitoring For Control Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

An embodiment of this application provides a distributed automatic differentiation method and a related device. The method includes: obtaining a forward propagation computation graph of a neural network model; based on the model parameters of the neural network model stored on a computing node being the same as the model parameters stored on at least one other computing node in the distributed computing system, inserting a target operator between the input of the model parameter of the computing node and the operator that uses the model parameter for computation, to obtain a target computation graph, where the target operator passes the model parameter through and its derivative algorithm sums the derivatives of the model parameter over all computing nodes that hold the same model parameter; and generating a back-propagation computation graph based on the derivative algorithms of the operators in the target computation graph. By implementing the embodiments of this application, user intervention in deriving the back-propagation process of the model can be avoided, development efficiency is improved, and development difficulty is reduced.

Description

A distributed automatic differentiation method and related device

Technical Field

The present application relates to the field of artificial intelligence (AI), and in particular to a distributed automatic differentiation method and a related device.

Background Art

With the development of deep learning, neural network models have become larger and larger, so that a single machine can no longer hold and compute a model, and a distributed computing system is needed. A distributed computing system includes multiple computing nodes that can compute in parallel. Training a neural network model involves two computing processes: forward propagation and backpropagation. In forward propagation, the neural network computes and stores the intermediate variables (including the outputs) of each layer in order from the input layer to the output layer; in backpropagation, the network computes and stores, in order from the output layer to the input layer, the intermediate variables of each layer and the derivatives (also called gradients) of the parameters. When a distributed computing system is used to train a neural network model, the common approaches are data parallelism, model parallelism, and hybrid parallelism.

The basic idea of data parallelism is that different computing nodes hold the same model parameters to be trained (also called model copies), but each computing node is responsible for a different part of the training data. It is therefore suitable for scenarios with a large amount of sample data and a relatively small model.

In model parallelism, different computing nodes hold the same sample data to be trained (also called data copies), but each computing node is responsible for a different part of the neural network model, and the model parameters may differ across nodes. It is therefore suitable for scenarios where the model is large but the amount of sample data is small.

As sample data sets and models keep growing, traditional data parallelism or model parallelism alone no longer fits within the memory of a single node, and a hybrid parallel approach is needed. Hybrid parallelism can be regarded as a combination of data parallelism and model parallelism. The difference is that the forward propagation computation of hybrid parallelism may involve communication operators (such as all-reduce, all-gather, and send/receive operators).
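
For reference, a brief numeric illustration of what an all-reduce collective computes (a sketch added here, simulated on a Python list instead of real devices; it is not code from the application):

```python
def all_reduce_sum(per_node_values):
    """Simulate all-reduce(sum): afterwards every node holds the sum of all nodes' values."""
    total = sum(per_node_values)
    return [total] * len(per_node_values)

# Four computing nodes each hold one partial result before the collective.
print(all_reduce_sum([1.0, 2.0, 3.0, 4.0]))   # -> [10.0, 10.0, 10.0, 10.0]
```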

In implementations of back-propagation computation, for networks that do not involve cross-node communication, the industry usually applies automatic differentiation (AD) to all operators from the input layer to the output layer to obtain the derivatives of the parameters. For hybrid parallelism, however, if the forward propagation involves communication operators, then during back propagation, going from the output layer to the input layer, automatic differentiation can only be applied to the parts without communication operators; for the parts involving communication operators, developers have to manually derive the differentiation logic and manually insert the related operators to pass derivatives across nodes. This is time-consuming and labor-intensive, correctness cannot be strictly guaranteed, and development is therefore inefficient and difficult.

Summary of the Invention

The embodiments of this application provide a distributed automatic differentiation method and a related device, which can automatically handle the differentiation in back propagation for various distributed networks such as data parallelism, model parallelism, and hybrid parallelism, avoiding user intervention in the derivation, thereby improving development efficiency and reducing development difficulty.

In a first aspect, an embodiment of this application provides a distributed automatic differentiation method applied to a computing node in a distributed computing system, including: obtaining a forward propagation computation graph of a neural network model, where the forward propagation computation graph includes multiple operators, the multiple operators include local operators and communication operators, a local operator is an operator whose computation can be completed within the computing node, and a communication operator is an operator whose computation depends on communication between the computing node and at least one other computing node in the distributed computing system; based on the model parameters of the neural network model stored on the computing node being the same as the model parameters stored on at least one other computing node in the distributed computing system, inserting a target operator between the input of the model parameter of the computing node and the operator that uses the model parameter for computation, to obtain a target computation graph, where the target operator passes the model parameter through, and the derivative algorithm of the target operator sums the derivatives of the model parameter over all computing nodes that hold the same model parameter; and generating a back-propagation computation graph of the neural network model based on the derivative algorithms of some or all operators in the target computation graph.

Here, an operator represents a basic computation method, and each operator corresponds to a derivative algorithm in back propagation; that is, a derivative algorithm can be obtained by differentiating the operator. A derivative algorithm itself may be a single basic computation method or a combination of several basic computation methods.

The target operator may also be called a mirror operator; the target computation graph is formed by inserting the mirror operator into the forward propagation computation graph. The input of the mirror operator is the model parameter, and the output of the mirror operator serves as the input of the operator that originally consumed the parameter. Since the role of the mirror operator in forward propagation is to pass the model parameter through, its output is also the model parameter, i.e., the model parameter is still ultimately fed into the operator that originally used it.
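
A minimal sketch of what this insertion looks like on a toy graph representation (the `Node` class and its fields are assumptions made for illustration, not the application's actual data structures):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    name: str
    op: str                                   # e.g. "parameter", "matmul", "mirror"
    inputs: List["Node"] = field(default_factory=list)
    attrs: Dict[str, str] = field(default_factory=dict)

def insert_mirror(param: Node, consumer: Node, group: str) -> Node:
    """Insert a mirror node between a mirrored parameter and the operator that
    consumes it. In the forward graph the mirror just passes the parameter
    through; its backward rule will all-reduce the gradient over `group`."""
    mirror = Node(f"mirror_{param.name}", "mirror", inputs=[param],
                  attrs={"group": group})
    consumer.inputs = [mirror if n is param else n for n in consumer.inputs]
    return mirror

# Usage: parameter w originally feeds a matmul operator; after insertion the
# matmul reads w through the mirror node instead.
w = Node("w", "parameter")
dense = Node("dense", "matmul", inputs=[w])
insert_mirror(w, dense, group="group_0")
```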

It can be seen that the embodiments of this application design a general distributed differentiation scheme. Developers only need to implement the distributed forward propagation network, without caring how back propagation of the distributed network is carried out or on which computing nodes the model parameters are mirrored (i.e., where model copies exist). A computing node of the distributed computing system can automatically compute the derivative algorithms of the communication operators, call the derivative algorithms of the communication operators to pass (communicate) the derivative results across nodes, and automatically insert mirror operators and generate and output the back-propagation computation graph. This avoids the drawback of the prior art that requires user intervention in the derivation, saves labor and material costs, improves development efficiency, and reduces development difficulty. The solution of this application is suitable for solving the automatic differentiation problems of today's mainstream distributed networks such as data parallelism, model parallelism, and hybrid parallelism, and improves the user experience.

Based on the first aspect, in a possible embodiment, generating the back-propagation computation graph of the neural network model based on the derivative algorithms of some or all operators in the target computation graph includes: obtaining the derivative algorithm of the communication operator, the derivative algorithm of the local operator, and the derivative algorithm of the target operator in the target computation graph; and generating the back-propagation computation graph according to the derivative algorithm of the communication operator, the derivative algorithm of the local operator, and the derivative algorithm of the target operator.

Specifically, in order from the output layer of the neural network model to the input layer, the derivative algorithm of each local operator, communication operator, and (if present) mirror operator can be called from the database in turn to automatically generate the complete back-propagation computation graph.
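
A simplified sketch of this reverse walk (added for illustration; the lookup table below stands in for the database of derivative algorithms and its entries are assumptions, not the application's actual contents):

```python
# Map each operator to its registered derivative algorithm (illustrative only).
derivative_registry = {
    "matmul": "matmul_grad",         # local operator         -> its derivative algorithm
    "all_gather": "reduce_scatter",  # communication operator -> its derivative algorithm
    "mirror": "all_reduce_sum",      # mirror operator        -> gradient all-reduce
}

def build_backward_graph(forward_ops):
    """forward_ops: operator names listed from the input layer to the output layer."""
    backward_ops = []
    for op in reversed(forward_ops):          # walk from output layer to input layer
        backward_ops.append(derivative_registry[op])
    return backward_ops

print(build_backward_graph(["mirror", "matmul", "all_gather"]))
# -> ['reduce_scatter', 'matmul_grad', 'all_reduce_sum']
```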

The derivative algorithms of the local operators may be pre-stored in local storage.

Based on the first aspect, in a possible embodiment, the derivative algorithm of the local operator is stored in the local storage of the computing node, and the method further includes: performing a differentiation operation on the communication operator in the forward propagation computation graph to obtain the derivative algorithm of the communication operator.

In other words, a computing node of the distributed computing system can automatically compute the derivative algorithms of the communication operators, call them to pass (communicate) the derivative results across nodes, and automatically insert mirror operators and generate and output the back-propagation computation graph, thereby avoiding the drawback of the prior art that requires user intervention in the derivation, improving development efficiency and reducing development difficulty.

Based on the first aspect, in a possible embodiment, the method further includes: determining the coefficient (sens) of the input data of the back-propagation computation graph by identifying whether the output layer of the neural network model is computed redundantly by two or more computing nodes in the distributed computing system (i.e., identifying in the forward propagation computation graph whether the output layer is computed more than once).

Based on the first aspect, in a possible embodiment, determining the sens of the input data of the back-propagation computation graph by identifying whether the loss of the output layer of the neural network model is computed redundantly by two or more computing nodes in the distributed computing system includes: if it is computed redundantly, modifying the value of sens according to the number of computing nodes that compute it redundantly; that is, the value of the input-data coefficient sens can be set to 1/n of its original value. For example, the original value of sens may default to 1.0.

In other words, if the loss is computed redundantly, each computing node starts from the forward loss value (i.e., the output of the forward propagation of the neural network), and the computing nodes jointly complete the back propagation of the overall computation; in the first layer of back propagation, sens needs to be set to 1/n, i.e., each machine contributes 1/n at the starting point of the reverse differentiation.
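
A brief check of why 1/n is the right coefficient (a worked equation added here, assuming the per-node parameter derivatives are subsequently summed by an all-reduce): if n nodes each start back propagation from the same loss L with coefficient sens, the aggregated parameter gradient is

$$\sum_{k=1}^{n} \text{sens}\cdot\frac{\partial L}{\partial w} \;=\; n\cdot\text{sens}\cdot\frac{\partial L}{\partial w},$$

so choosing sens = 1/n makes the aggregated result equal the single-node gradient ∂L/∂w.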

If it is not computed redundantly, the value of sens is kept unchanged, i.e., the original value is kept. For example, the original value of sens may default to 1.0.

Based on the first aspect, in a possible embodiment, inserting the target operator between the input of the model parameter of the computing node and the operator that uses the model parameter, to obtain the target computation graph, includes: determining that the model parameters of the neural network model stored on the computing node are the same as the model parameters stored on at least one other computing node in the distributed computing system, where the computing nodes storing the same model parameters belong to the same communication group; and inserting the target operator between the model parameter of the computing node and the operator that uses the model parameter, to obtain the target computation graph, where the output of the target operator equals its input, and the derivative algorithm of the target operator sums the derivatives of the model parameter over all computing nodes in the communication group.

In the embodiments of this application, which model parameters a Mirror operator corresponds to can be defined in advance. In forward propagation, the input of the Mirror operator is the model parameter copy w and its output equals the input w; it performs no actual computation in forward propagation and merely marks on which computing nodes the parameter is mirrored. In back propagation, the Mirror operator takes as input the derivative dout computed by the next layer of the neural network, and its output dw equals an all-reduce sum of dout over the communication group.
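
The forward and backward rules of the Mirror operator can be summarized in a few lines (a sketch; the `all_reduce_sum` collective passed in below is an assumed helper, not the application's actual implementation):

```python
def mirror_forward(w):
    # Forward propagation: pass the mirrored model parameter through unchanged.
    return w

def mirror_backward(dout, group, all_reduce_sum):
    # Backward propagation: dout is the derivative produced by the next layer;
    # the parameter gradient dw is the all-reduce sum of dout over the
    # communication group whose nodes hold copies of the same parameter.
    dw = all_reduce_sum(dout, group)
    return dw
```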

Based on the first aspect, in a possible embodiment, the derivative algorithm of the target operator is an all_reduce summation operator.

Based on the first aspect, in a possible embodiment, after the derivative algorithm of each operator in the target computation graph is obtained to generate the back-propagation computation graph of the neural network model, the method further includes: training the neural network model according to the forward propagation computation graph and the back-propagation computation graph.

Based on the first aspect, in a possible embodiment, the distributed computing system may include multiple computing devices, each of which can serve as a computing node, and different computing devices can be used to complete the forward and backward computations of the neural network model in parallel. A computing device may be a terminal or a server.

In a second aspect, an embodiment of this application provides a computing node applied to a distributed computing system, including:

a single-machine automatic differentiation module, configured to obtain a forward propagation computation graph of a neural network model, where the forward propagation computation graph includes multiple operators, the multiple operators include local operators and communication operators, a local operator is an operator whose computation can be completed within the computing node, and a communication operator is an operator whose computation depends on communication between the computing node and at least one other computing node in the distributed computing system; and

a mirror operator module, configured to, based on the model parameters of the neural network model stored on the computing node being the same as the model parameters stored on at least one other computing node in the distributed computing system, insert a target operator between the input of the model parameter of the computing node and the operator that uses the model parameter, to obtain a target computation graph, where the target operator passes the model parameter through, and the derivative algorithm of the target operator sums the derivatives of the model parameter over all computing nodes that hold the same model parameter;

where the single-machine automatic differentiation module is further configured to generate a back-propagation computation graph of the neural network model based on the derivative algorithms of some or all operators in the target computation graph.

It can be seen that the embodiments of this application design a general distributed differentiation scheme. Developers only need to implement the distributed forward propagation network, without caring how back propagation of the distributed network is carried out or on which computing nodes the model parameters are mirrored (i.e., where model copies exist). The single-machine automatic differentiation module of the computing node can automatically compute the derivative algorithms of the communication operators and call them to pass (communicate) the derivative results across nodes; the mirror operator module can automatically insert mirror operators; and the single-machine automatic differentiation module can generate and output the back-propagation computation graph. This avoids the drawback of the prior art that requires user intervention in the derivation, saves labor and material costs, improves development efficiency, and reduces development difficulty. The solution of this application is suitable for solving the automatic differentiation problems of today's mainstream distributed networks such as data parallelism, model parallelism, and hybrid parallelism, and improves the user experience.

Based on the second aspect, in a possible embodiment, the single-machine automatic differentiation module is specifically configured to: obtain the derivative algorithm of the communication operator, the derivative algorithm of the local operator, and the derivative algorithm of the target operator in the target computation graph; and generate the back-propagation computation graph according to the derivative algorithm of the communication operator, the derivative algorithm of the local operator, and the derivative algorithm of the target operator.

Based on the second aspect, in a possible embodiment, the derivative algorithm of the local operator is stored in the local storage of the computing node; the computing node further includes a communication operator derivation module, configured to: before the derivative algorithm of each operator in the target computation graph is obtained, perform a differentiation operation on the communication operator in the forward propagation computation graph to obtain the derivative algorithm of the communication operator.

Based on the second aspect, in a possible embodiment, the computing node further includes a sens processing module, configured to determine the coefficient (sens) of the input data of the back-propagation computation graph by identifying whether the output layer of the neural network model is computed redundantly by two or more computing nodes in the distributed computing system.

Based on the second aspect, in a possible embodiment, the sens processing module is specifically configured to: when the output layer is computed redundantly, modify the value of sens according to the number of computing nodes that compute it redundantly; and when it is not computed redundantly, keep the value of sens unchanged.

Based on the second aspect, in a possible embodiment, the mirror operator module is specifically configured to: determine that the model parameters of the neural network model stored on the computing node are the same as the model parameters stored on at least one other computing node in the distributed computing system, where the computing nodes storing the same model parameters belong to the same communication group; and insert the target operator between the model parameter of the computing node and the operator that uses the model parameter, to obtain the target computation graph, where the output of the target operator equals its input, and the derivative algorithm of the target operator sums the derivatives of the model parameter over all computing nodes in the communication group.

Based on the second aspect, in a possible embodiment, the derivative algorithm of the target operator is an all_reduce summation operator.

In a third aspect, an embodiment of this application provides a computing node, including: a memory, a communication interface, and a processor coupled to the memory and the communication interface. The memory is configured to store program instructions, the processor is configured to execute the program instructions, and the communication interface is configured to communicate with other computing nodes of a distributed computing system under the control of the processor. When executing the program instructions, the processor performs the steps of the method described in any embodiment of the first aspect.

In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described in any embodiment of the first aspect.

In a fifth aspect, an embodiment of the present invention provides a computer program product. The computer program product includes program instructions, and when the computer program product is executed by a computing device, the computing device performs the method provided by any possible embodiment of the first aspect.

It can be seen that the embodiments of this application design a general distributed differentiation scheme applicable to the automatic differentiation problems of today's mainstream distributed networks such as data parallelism, model parallelism, and hybrid parallelism. Developers only need to implement the distributed forward network, without caring how the backward pass of the distributed network is carried out or on which nodes the parameters are replicated. A computing node of the distributed computing system can automatically compute the derivative algorithms of the communication operators, call them to pass (communicate) the derivative results across nodes, and automatically perform mirror operator insertion, sens computation, and the generation and output of the back-propagation computation graph, thereby avoiding the drawback of the prior art that requires user intervention in the derivation, saving labor and material costs, improving development efficiency, reducing development difficulty, and thus improving the user experience.

BRIEF DESCRIPTION OF THE DRAWINGS

To describe the technical solutions in the embodiments of this application or the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application, and a person of ordinary skill in the art may still derive other drawings from these drawings without creative effort.

FIG. 1 shows an example of a forward propagation computation graph;

FIG. 2 shows an example of a back-propagation computation graph;

FIG. 3 is a schematic diagram of a system architecture provided in an embodiment of this application;

FIG. 4 is a schematic diagram of a system architecture of a computing node provided in an embodiment of this application;

FIG. 5 is a schematic structural diagram of a distributed automatic differentiation device provided in an embodiment of this application;

FIG. 6 is a schematic diagram of a communication operator derivation module differentiating a communication operator according to an embodiment of this application;

FIG. 7 is a schematic flowchart of a method applied to a mirror operator module according to an embodiment of this application;

FIG. 8 is an example diagram of a scenario of performing mirror operator insertion according to an embodiment of this application;

FIG. 9 is a schematic flowchart of a method applied to a sens processing module according to an embodiment of this application;

FIG. 10 is an example diagram of a scenario of performing sens processing according to an embodiment of this application;

FIG. 11 is a schematic flowchart of a method applied to a single-machine automatic differentiation module according to an embodiment of this application;

FIG. 12 is a schematic flowchart of a distributed automatic differentiation method according to an embodiment of this application;

FIG. 13 is a schematic flowchart of another distributed automatic differentiation method according to an embodiment of this application;

FIG. 14 is a schematic diagram of a computing node interacting with a client and business personnel according to an embodiment of the present invention;

FIG. 15 is a schematic structural diagram of a computing node according to an embodiment of the present invention.

DETAILED DESCRIPTION

The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings in the embodiments of this application. Obviously, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.

It should be noted that the terms used in the embodiments of this application are only for the purpose of describing specific embodiments and are not intended to limit this application. The singular forms "a", "said", and "the" used in the embodiments of this application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in the embodiments of this application refers to and includes any or all possible combinations of one or more of the associated listed items.

To facilitate a better understanding of the technical solutions described in this application, the technical terms involved in the embodiments of this application are first explained below.

(1) Computational graph and operator:

A computational graph is a graph structure that describes the computation process of a neural network model. If the computation is clearly modular and there are clear temporal and logical dependencies between the modules, a directed graph structure can usually be used to describe it; that is, the computation process of a neural network model can be abstracted as a directed graph composed of tensor data and operators. In practice, a directed graph structure has two basic elements: the operators, and the connection relations between operators; the connection relations indicate the order in which the operators are executed.

In general, describing a neural network model with a computational graph helps to grasp the entire neural network computation task as a whole, and the computational-graph representation also makes it convenient to schedule computation tasks and execute them in parallel. Since the computation of a neural network model involves a forward propagation computation process and a back-propagation computation process, in the embodiments of this application the computational graph used in the forward propagation computation process is called the forward propagation computation graph, and the computational graph used in the back-propagation computation process is called the back-propagation computation graph. FIG. 1 shows an example of a forward propagation computation graph, and FIG. 2 shows an example of a back-propagation computation graph.

The computational graph of a neural network model may include multiple operators, where an operator represents a basic computation method; for example, addition, subtraction, multiplication, and division are four operators. A forward propagation computation graph may include multiple operators; FIG. 1 shows three operators (operators A1, B1, and C1), the computation proceeds in order from the input layer of the neural network to the output layer, and the arrows indicate the flow of data. Each operator corresponds to a derivative algorithm in back propagation; that is, a derivative algorithm can be obtained by differentiating the operator. A derivative algorithm itself may be a single basic computation method or a combination of several basic computation methods. These derivative algorithms form the back-propagation computation graph. FIG. 2 shows three derivative algorithms (derivative algorithm A2, derivative algorithm B2, and derivative algorithm C2), the computation proceeds in order from the output layer of the neural network to the input layer, and the arrows indicate the flow of data.
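
A toy numeric version of such a pair of graphs (a sketch added for illustration; the three elementwise functions below are assumed examples and are not the operators of FIG. 1):

```python
import math

# Forward propagation computation graph: three chained operators, input layer -> output layer.
def A1(x): return x * 2.0
def B1(x): return x + 1.0
def C1(x): return math.tanh(x)

x = 0.5
a = A1(x); b = B1(a); out = C1(b)        # forward pass, intermediates stored

# Back-propagation computation graph: each operator's derivative algorithm,
# applied in order from the output layer back to the input layer (chain rule).
dout = 1.0                               # starting derivative of the output
db = dout * (1.0 - math.tanh(b) ** 2)    # derivative algorithm of C1
da = db * 1.0                            # derivative algorithm of B1
dx = da * 2.0                            # derivative algorithm of A1
```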

In the embodiments of this application, operators can further be classified into local operators and communication operators. A local operator is an operator whose computation can be completed on the local computing node, i.e., within a single computing node; a communication operator is an operator whose computation requires communication between the local computing node and at least one other computing node in the distributed computing system, i.e., an operator that depends on computing nodes other than the current one to complete its computation. For example, addition, subtraction, multiplication, and division are local operators, while operators such as send (sending to other nodes), receive (receiving from other nodes), all-reduce, and all-gather can be regarded as communication operators. The forward propagation computation graph of a neural network model under a hybrid parallel scheme may contain communication operators.

(2) Neural network model:

Deep learning frameworks usually use computational graphs as the main data structure for describing neural network models. The neural network model involved in the embodiments of this application may include a deep neural network (DNN) model, a convolutional neural network (CNN) model, an extreme learning machine (ELM) model, a region-based convolutional neural network (Region-CNN, RCNN) model for object detection, a long short-term memory (LSTM) model, a fully connected neural network, or other neural network models, which is not limited in the embodiments of this application.

For example, for a CNN, the various types of operators involved may include neural network operators. Common neural network operators include convolution/deconvolution operators, pooling operators, activation operators, softmax (classifier) operators, fully connected operators, and so on. Forward propagation receives input data at the input layer of the CNN, processes it through the forward propagation computation graph, and outputs data to the output layer; back propagation computes the gradients of the inputs from the output derivatives (gradients) of the output layer, processes them through the back-propagation computation graph, and passes them back to the input layer.

(3) Distributed computing system:

In the embodiments of this application, a distributed computing system is a system that uses multiple computing nodes for parallel computation. In a specific implementation, the distributed computing system may include multiple computing devices, each of which can serve as a computing node, and different computing devices can be used to complete the forward and backward computations of the neural network model in parallel. A computing device may be a terminal or a server, which is not limited in this application.

In another implementation, when a computing device has a multi-core processor, each computing core can also serve as a computing node. The most common structure of current multi-core processors is a shared-memory multi-core structure: the processor contains multiple computing cores, each with its own cache, register file, computing unit, and instruction control unit, and all computing cores share the same global memory. Multiple computing cores can be used to process computation tasks with a high degree of parallelism; for example, different computing cores can be used to complete the forward and backward computations of a neural network model in parallel.

In another specific implementation, a computing node may also be a virtual machine, a container, a computational instance, or a process running on abstract hardware resources. For example, in a cloud server or data center, multiple virtual machines are deployed and communicate through virtual switches, forming a distributed computing system, and each virtual machine serves as a computing node that executes a specific computation method.

Referring to FIG. 3, FIG. 3 shows a system architecture 300 provided in an embodiment of this application. The execution device 210 may be the distributed computing system described in the embodiments of this application. The execution device 210 may be implemented by one or more computing devices (for example, servers), optionally in cooperation with other devices such as data storage, routers, and load balancers; the execution device 210 may be deployed at one physical site or distributed across multiple physical sites. The computing devices in the execution device 210 may use the data in the data storage system 250, or call the program code in the data storage system 250, to implement the methods described in the embodiments of this application. In this process, a computing device uses operators as the specific elements for executing computation tasks and provides, for each operator, a function to be executed on a CPU or an artificial intelligence processor; according to the computational graph, the computing device executes the kernel function corresponding to each operator in the computational graph to complete the computation of the neural network model.

Users can operate their respective user devices (for example, local device 301 and local device 302) to interact with the execution device 210. Each local device can be any computing device, such as a personal computer, a computer workstation, a smartphone, a tablet computer, a smart camera, a smart car or another type of cellular phone, a media consumption device, a wearable device, a set-top box, or a game console.

The local device of each user can interact with the execution device 210 through a communication network of any communication mechanism/standard, and the communication network may be a wide area network, a local area network, a point-to-point connection, or any combination thereof.

In another implementation, one or more aspects of the execution device 210 may be implemented by each local device; for example, the local device 301 may provide local data to, or feed back computation results to, the execution device 210.

It should be noted that all functions of the execution device 210 may also be implemented by a local device. For example, the local device 301 may implement the functions of the execution device 210 and provide services to its own user, or provide services to the user of the local device 302.

Referring to FIG. 4, the following describes, by way of example, a system architecture of a computing node according to an embodiment of this application; the computing node may be a computing node in a distributed computing system. As shown in FIG. 4, the system architecture includes hardware at the bottom layer (for example, NPU, ARM, GPU, chips, and other hardware) as well as software modules and databases at the upper layer. The software modules specifically include a distributed forward propagation network module 41 and a distributed automatic differentiation device 42. The databases include a local operator library 43, a derivative algorithm library 44 of local operators, a communication operator library 45 (also called collective communication 45), and a derivative algorithm library 46 of communication operators. The local operator library 43 may store various local operators, the communication operator library 45 may store various communication operators, the derivative algorithm library 44 may store the derivative algorithms of the local operators, and the derivative algorithm library 46 may store the derivative algorithms of the communication operators. Each local operator maps to one derivative algorithm, and each communication operator also maps to one derivative algorithm.

In a specific implementation, the distributed automatic differentiation device 42 in the embodiments of this application may be integrated into an artificial intelligence (AI) computing framework (not shown in the figure), and the AI computing framework may call the operators in the above databases, or call the forward propagation computation graph, for use by the distributed automatic differentiation device.

The distributed forward propagation network module 41 is configured to provide the forward propagation computation graph of the neural network model at this computing node. The forward propagation computation graph includes multiple operators, including local operators and communication operators, where a local operator is an operator whose computation can be completed within the computing node, and a communication operator is an operator whose computation requires communication between the computing node and at least one other computing node in the distributed computing system.

The distributed automatic differentiation device 42 is configured to obtain the forward propagation computation graph, perform inference automatically, and automatically complete the differential computation of back propagation, so as to generate the back-propagation computation graph of the neural network model at this computing node.

In a possible embodiment, the computing node can perform the computation of the distributed neural network model in a data parallel, model parallel, or hybrid parallel manner.

In a possible embodiment, the computing node may also be configured to train the neural network model according to the forward propagation computation graph and the back-propagation computation graph.

Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a distributed automatic differentiation device provided in an embodiment of this application; the distributed automatic differentiation device may be the distributed automatic differentiation device 42 applied to the computing node shown in FIG. 4. As shown in FIG. 5, the device includes a communication operator derivation module 51, a mirror operator module 52, a sens processing module 53, and a single-machine automatic differentiation module 54. Some or all of these modules may be implemented in the form of software code. Specifically:

The communication operator derivation module 51 is configured to generate, at the granularity of communication operators, the reverse logic of each communication operator, i.e., the derivative algorithm of the communication operator. Each communication operator has its corresponding reverse logic, and the communication operator derivation module is capable of "differentiating" all communication operators; the computed derivative algorithms of the communication operators can be saved in the derivative algorithm library of communication operators shown in FIG. 4.

The mirror operator module 52 is configured to, when the model parameters of the neural network model on this computing node are the same as the model parameters held by at least one other computing node, insert a target operator between the model parameter of this computing node and the operator that uses the model parameter; in the embodiments of this application, the target operator may also be called a mirror operator (Mirror operator). Specifically, it is possible to identify in the forward propagation computation graph which model parameters are replicated across multiple computing nodes, and to insert Mirror operators at the corresponding positions of the forward propagation computation graph.

In other words, when the model parameters (or training parameters) of the neural network model are mirrored on multiple computing nodes of the distributed computing system (as in data parallelism, model parallelism, or hybrid parallelism), a Mirror operator can be introduced into the forward propagation computation graph to form a target computation graph. The algorithm of the Mirror operator is a forward pass-through that transparently transmits the model parameter, and the derivative algorithm of the Mirror operator is an all_reduce(SUM) operator, i.e., a gradient all-reduce is performed in the backward pass.

Sens处理模块53,用于识别分布式计算系统中计算流程相同的计算节点的数量n,即识别各个计算节点正向传播过程的最后一个算子的loss值是否被重复计算(即在正向传播计算图中识别输出层是否被重复计算),如果重复计算,则将反向传播的输入梯度的系数sens除以n。The Sens processing module 53 is used to identify the number n of computing nodes with the same computing process in the distributed computing system, that is, to identify whether the loss value of the last operator in the forward propagation process of each computing node is repeatedly calculated (that is, to identify whether the output layer is repeatedly calculated in the forward propagation calculation graph). If it is repeatedly calculated, the coefficient sens of the input gradient of the back propagation is divided by n.

单机自动微分模块54,用于根据神经网络模型的正向传播计算图或者目标计算图(假如存在Mirror算子的话)进行反向自动微分,自动插入本地算子对应的导数算法和通信算子对应的导数算法,以生成神经网络模型在本计算节点处的反向传播计算图,从而实现在数据并行或者混合并行的方式完成分布式自动微分。The stand-alone automatic differentiation module 54 is used to perform reverse automatic differentiation according to the forward propagation calculation graph or the target calculation graph (if a Mirror operator exists) of the neural network model, and automatically insert the derivative algorithm corresponding to the local operator and the derivative algorithm corresponding to the communication operator to generate the reverse propagation calculation graph of the neural network model at this computing node, thereby realizing distributed automatic differentiation in a data parallel or hybrid parallel manner.

为了更好理解本申请的技术方案,下面进一步对分布式自动微分装置中的各个模块的功能进行详细描述。In order to better understand the technical solution of the present application, the functions of each module in the distributed automatic differentiation device are further described in detail below.

对于通信算子求导模块:可参见图6,每个正向传播计算图的通信算子,都对应一段反向逻辑(若干个算子的组合),对于每个通信算子,通信算子求导模块可为该通信算子定义并推导其对应的“导数”,从而获得该通信算子的导数算法,并将该导数算法保存到导数算法库以便于后续使用。For the communication operator derivation module: see Figure 6. Each communication operator in the forward propagation calculation graph corresponds to a section of reverse logic (a combination of several operators). For each communication operator, the communication operator derivation module can define and derive the corresponding "derivative" for the communication operator, thereby obtaining the derivative algorithm of the communication operator, and save the derivative algorithm to the derivative algorithm library for subsequent use.

举例来说,通信算子和其对应的导数算法可如下表1所示:For example, the communication operator and its corresponding derivative algorithm may be shown in Table 1 below:

表1Table 1

需要说明的是,上述表1仅用于解释本申请方案而非限定。It should be noted that the above Table 1 is only used to explain the present application scheme and is not intended to be limiting.
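The body of Table 1 is not reproduced here. Purely as an illustration of what such a communication operator derivative algorithm library could look like, the sketch below registers a few pairings that are common in distributed training frameworks (for example, the adjoint of all_gather being a reduce-scatter sum, and the adjoint of broadcast being a reduce to the root). The operator names, pairings, and node format are assumptions for illustration, not the contents of Table 1.

```python
# Illustrative sketch only: a registry mapping each communication operator to a
# function that builds its derivative ("reverse logic"). The operator names and
# pairings are assumptions, not the actual contents of Table 1.

COMM_GRAD_REGISTRY = {}

def register_comm_grad(op_name):
    def decorator(builder):
        COMM_GRAD_REGISTRY[op_name] = builder
        return builder
    return decorator

@register_comm_grad("all_reduce_sum")
def grad_all_reduce_sum(node, dout):
    # The adjoint of an all_reduce(SUM) is commonly another all_reduce(SUM)
    # of the incoming gradients over the same group.
    return {"op": "all_reduce_sum", "inputs": [dout], "group": node["group"]}

@register_comm_grad("all_gather")
def grad_all_gather(node, dout):
    # Each node keeps the gradient slice of its own contribution, summed over
    # the group: commonly expressed as a reduce_scatter(SUM).
    return {"op": "reduce_scatter_sum", "inputs": [dout], "group": node["group"]}

@register_comm_grad("broadcast")
def grad_broadcast(node, dout):
    # The broadcast source accumulates the gradients of all receivers:
    # a reduce(SUM) back to the root.
    return {"op": "reduce_sum_to_root", "inputs": [dout],
            "group": node["group"], "root": node.get("root", 0)}

def lookup_comm_grad(node, dout):
    """Return the derivative sub-graph for a communication operator node."""
    return COMM_GRAD_REGISTRY[node["op"]](node, dout)
```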

Mirror operator module: when data parallelism, model parallelism, or hybrid parallelism is used, if the training sample data is partitioned while the same model parameter w is mirrored on multiple computing nodes, i.e., distributed on several computing nodes at the same time, each computing node performs its forward propagation computation with different training samples but the same copy of the model parameter. Mathematically, the derivative with respect to that parameter must sum the derivatives obtained for the copies on all of those computing nodes.

Mirroring the copies of a model parameter onto the computing nodes is itself an operation, but this mirroring is generally not placed within the scope of forward propagation (that is, the copies of the parameter already exist on the computing nodes before the forward computation starts), whereas back propagation differentiates only what lies within the scope of forward propagation. Distributed automatic differentiation therefore needs a description of which parameters are mirrored on which nodes.

The mirror operator module can identify which model parameters are replicated across multiple computing nodes during forward propagation and insert a Mirror operator at the corresponding position of the forward propagation computation graph. Referring to FIG. 7, in one implementation the process may include the following steps:

S401. The mirror operator module determines the model parameters of the neural network model for which the local computing node needs derivatives.

S402. For each of these model parameters, it analyzes whether copies exist on multiple computing nodes; if so, it records the identifiers (IDs) of the computing nodes involved.

S403. It creates a communication group (group) for the computing nodes corresponding to the multiple copies.

S404. It creates a mirror operator for each parameter that has multiple copies and records the corresponding communication group in the mirror operator.

S405. It inserts the mirror operator between the input of the corresponding parameter and the operator that originally used the parameter for computation, thereby obtaining the target computation graph. That is, the input of the mirror operator is the parameter, and the output of the mirror operator serves as the input of the operator that originally consumed the parameter. Since the role of the mirror operator in forward propagation is to pass the parameter through unchanged, its output is also the parameter, so the parameter is still ultimately fed into the operator that originally consumed it. A minimal sketch of this pass is given below.
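The following is a minimal sketch of the S401–S405 pass, under the assumption that the forward graph is held as a simple list of node dictionaries and that the placement of parameter copies is known; all names and the graph representation are hypothetical and only illustrate the insertion logic.

```python
# Minimal sketch of the mirror-insertion pass (S401-S405). The graph/node
# representation and helper names are assumptions made for illustration.

def insert_mirror_operators(forward_graph, params_needing_grad, placement):
    """forward_graph: list of node dicts with 'op', 'inputs' (names), 'output' (name).
    params_needing_grad: parameter names the local node must differentiate (S401).
    placement: dict mapping parameter name -> list of node IDs holding a copy (S402).
    Returns the target graph with mirror nodes inserted (S405)."""
    target_graph = list(forward_graph)
    for param in params_needing_grad:
        replicas = placement.get(param, [])
        if len(replicas) < 2:               # no copy elsewhere, nothing to do
            continue
        group = tuple(sorted(replicas))     # S403: communication group for the replicas
        mirror_out = param + "_mirrored"
        mirror_node = {"op": "mirror", "inputs": [param],
                       "output": mirror_out, "group": group}   # S404
        # S405: rewire every consumer of the parameter to read the mirror output.
        for node in target_graph:
            node["inputs"] = [mirror_out if x == param else x for x in node["inputs"]]
        target_graph.insert(0, mirror_node)
    return target_graph
```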

In the embodiments of the present application, the mirror operator module may define in advance which model parameters a Mirror operator corresponds to. In forward propagation, the input of the Mirror operator is the model parameter copy w and its output equals its input w; it performs no actual computation in the forward pass and merely marks on which computing nodes the parameter is mirrored. In back propagation, the input of the Mirror operator is the derivative dout computed by the next layer of the neural network, and its output dw equals an all-reduce summation of dout over the communication group (group).
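As a hedged sketch of this behavior (the collective primitive and the class interface below are assumptions, not a specific framework API): the forward pass is an identity over the parameter copy, and the backward pass is an all-reduce sum of the incoming gradient over the operator's communication group.

```python
# Sketch of the Mirror operator's forward/backward behavior. `all_reduce_sum`
# stands in for whatever collective the underlying framework provides.

class MirrorOperator:
    def __init__(self, group, all_reduce_sum):
        self.group = group                  # communication group holding the parameter copies
        self.all_reduce_sum = all_reduce_sum

    def forward(self, w):
        # Forward: pure pass-through, no computation; only marks where w is mirrored.
        return w

    def backward(self, dout):
        # Backward: dw = all_reduce(SUM) of dout over every member of the group.
        return self.all_reduce_sum(dout, group=self.group)
```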

As shown in FIG. 8, suppose that in the distributed computing system the communication group (group) corresponding to model parameter w contains two computing nodes, whose communication ranks are 0 and 1 (machine 0 and machine 1 in the figure). The two computing nodes hold the same copy of the model parameter w, but each holds different training sample data, a1 and a2 respectively. In the forward propagation computation graph, the mirror operator module then inserts the mirror operator between the parameter w and the original operator OP. During forward computation the output of the mirror operator equals its input; that is, the input of the mirror operator is the parameter w and its output serves as the input of operator OP. Since the output of the mirror operator is also the parameter w, the parameter w is still ultimately fed into operator OP.

During back propagation, the backward output of the mirror operator equals an all-reduce summation of the input dout over machine 0 and machine 1 (i.e., over every member of the communication group).

Sens processing module: the Sens processing module may be used to determine the coefficient sens of the input data of the back propagation computation graph. Referring to FIG. 9, in one implementation the process may include the following steps:

S501. The Sens processing module identifies whether the output layer of the neural network is computed redundantly.

S502. If it is computed redundantly, the Sens processing module determines the number n of computing nodes performing the redundant computation and divides the coefficient sens of the back propagation input data by n; the result becomes the new coefficient sens of that input data. In other words, the value of the coefficient sens is set to 1/n of its original value; by way of example, the original value of sens may default to 1.0. The input data is typically the derivative of the loss (for example, 1). When n = 2, for instance, the final input of back propagation is 1/2.

S503. If there is no redundant computation, the Sens processing module keeps the coefficient sens of the back propagation input data unchanged, i.e., at its original value. By way of example, the original value of sens may default to 1.0. A minimal sketch of this logic follows.
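The following is a minimal sketch of S501–S503, assuming the caller already knows how many computing nodes redundantly compute the output layer; the default value and function names are illustrative.

```python
# Sketch of the sens adjustment (S501-S503). DEFAULT_SENS and the helper that
# counts redundant nodes are illustrative assumptions.

DEFAULT_SENS = 1.0

def adjust_sens(num_redundant_nodes, sens=DEFAULT_SENS):
    """S502: divide sens by n when the output layer is computed redundantly by n nodes;
    S503: otherwise keep sens unchanged."""
    if num_redundant_nodes and num_redundant_nodes > 1:
        return sens / num_redundant_nodes
    return sens

# Example: two machines redundantly compute the loss -> each starts back propagation with 0.5.
assert adjust_sens(2) == 0.5
assert adjust_sens(0) == 1.0
```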

In the embodiments of the present application, the Sens processing module can identify whether the output layer of forward propagation (generally the loss) is computed redundantly. If the loss is computed redundantly, each computing node uses the forward loss value (i.e., the output of the network's forward propagation) as the starting point, and the computing nodes jointly complete the back propagation of the overall computation; in the first layer of back propagation, sens must then be set to 1/n, i.e., each machine contributes 1/n at the starting point of reverse differentiation.

For example, referring to FIG. 10, suppose that machine 0 and machine 1 of the distributed computing system jointly complete the forward propagation computation.

In the forward computation flow of machine 0, the input is data b. b is broadcast through the broadcast operator so that machine 0 and machine 1 each hold a copy, denoted c (b and c are identical); the multiplication operator M computes the product of c and the local data d1 to obtain cd1. The result of machine 0 and the result of machine 1 are then merged once through the communication operator all_gather to obtain [cd1, cd2], and finally the operator Reduce-SUM computes the sum of squares of [cd1, cd2] to obtain the loss value of the output layer.

In the forward computation flow of machine 1, machine 1 receives the data c broadcast by computing node 0 and computes, through the multiplication operator M, the product of c and the local data d2 to obtain cd2. The result of machine 0 and the result of machine 1 are merged once through the communication operator all_gather to obtain [cd1, cd2], and finally the operator Reduce-SUM computes the sum of squares of [cd1, cd2] to obtain the loss value of the output layer.
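To make the FIG. 10 flow concrete, the following sketch simulates the two machines in ordinary Python, modeling broadcast and all_gather as plain list operations; this only illustrates the data flow and is not an implementation of the communication operators, and the sample values are made up.

```python
# Simulation of the FIG. 10 forward flow on two "machines". Collectives are
# modeled as plain Python operations purely for illustration.

b = 3.0                  # input data held by machine 0
d = [2.0, 5.0]           # local data: d1 on machine 0, d2 on machine 1

c = [b, b]                                     # broadcast: both machines now hold c (= b)
p = [c[i] * d[i] for i in range(2)]            # multiplication operator M: cd1, cd2
gathered = [p, p]                              # all_gather: both machines hold [cd1, cd2]
loss = [sum(x * x for x in g) for g in gathered]   # Reduce-SUM of squares on each machine

# Both machines end up with the same loss value, i.e. the output layer is
# computed redundantly (n = 2).
assert loss[0] == loss[1]
```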

It can be seen that the forward flows of the two machines output the same loss; that is, machine 0 and machine 1 duplicate the computation of the loss and obtain one and the same loss value. During back propagation, the Sens processing module therefore recognizes that the loss is computed redundantly and that the number of computing nodes performing the redundant computation is 2 (i.e., n = 2), and it divides the back propagation sens by 2. For example, with the default sens value of 1.0, the back propagation input of both machine 0 and machine 1 in the flow of FIG. 10 is 1/2.
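The division by n can be checked on this example. Assuming, as is common, that the derivative of all_gather delivers to each machine the sum over the group of the gradient slices for its own contribution (a reduce-scatter sum), the short derivation below shows that starting each machine with sens = 1/n reproduces the gradient of a single, non-duplicated loss; the derivation rests on that assumption and is an illustration, not text taken from the embodiment.

```latex
% Each of the n = 2 machines computes loss = (cd_1)^2 + (cd_2)^2 and starts
% back propagation with coefficient s (sens):
\frac{\partial\,\mathrm{loss}}{\partial\,[cd_1,\,cd_2]} = s \cdot 2\,[cd_1,\,cd_2]
\quad\text{(per machine)}
% Assuming the derivative of all_gather sums, over the group, the gradient
% slice belonging to machine i (a reduce-scatter sum):
\frac{\partial\,\mathrm{loss}}{\partial\, cd_i} = \sum_{j=1}^{n} s \cdot 2\, cd_i = 2\,n\,s\, cd_i
% The single, non-duplicated loss would give 2\,cd_i, so
2\,n\,s\, cd_i = 2\, cd_i \;\Longleftrightarrow\; s = \tfrac{1}{n},
% i.e. with n = 2 each machine starts back propagation with sens = 1/2.
```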

Single-machine automatic differentiation module: the single-machine automatic differentiation module can generate the back propagation computation graph according to the reverse logic (derivative algorithm) of every operator in the forward propagation computation graph (or the target computation graph), including the local operators, the communication operators, and any mirror operators. Referring to FIG. 11, in one implementation the process may include the following steps:

In S601, the single-machine automatic differentiation module calls the derivative algorithm of each operator in order from the output layer of the neural network to the input layer. If an operator of the forward propagation is a local operator, then in S602 the module looks up the corresponding derivative algorithm in the local operator derivative algorithm library. If the operator is a communication operator, then in S603 the corresponding derivative algorithm is found in the communication operator derivative algorithm library. If the operator is a mirror operator, then in S604 the all_reduce(SUM) operator determined by the mirror operator module is used. Finally, in S605, the single-machine automatic differentiation module outputs the back propagation computation graph.
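A minimal sketch of this S601–S605 traversal is given below, reusing the illustrative node format and derivative registries sketched earlier; the graph representation, library interfaces, and the simplification of keeping a single gradient per input are assumptions for illustration.

```python
# Sketch of the single-machine automatic differentiation pass (S601-S605).
# `local_grad_lib` and `comm_grad_lib` stand for the two derivative algorithm
# libraries; builders take (node, dout) and return a gradient node dict.

def build_backprop_graph(target_graph, local_grad_lib, comm_grad_lib, sens=1.0):
    backprop_graph = []
    grad_of = {}                                   # output name -> gradient expression
    ordered = list(reversed(target_graph))         # S601: from output layer to input layer
    if ordered:
        grad_of[ordered[0]["output"]] = {"op": "const", "value": sens}
    for node in ordered:
        dout = grad_of.get(node["output"])
        if dout is None:
            continue
        if node["op"] == "mirror":                 # S604: all_reduce(SUM) over the group
            grad_node = {"op": "all_reduce_sum", "inputs": [dout], "group": node["group"]}
        elif node["op"] in comm_grad_lib:          # S603: communication operator derivative
            grad_node = comm_grad_lib[node["op"]](node, dout)
        else:                                      # S602: local operator derivative
            grad_node = local_grad_lib[node["op"]](node, dout)
        backprop_graph.append(grad_node)
        for name in node["inputs"]:
            grad_of[name] = grad_node              # simplified: one gradient per input
    return backprop_graph                          # S605: output the back propagation graph
```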

It should be noted that, for convenience, the method embodiments described in the embodiments of the present application are all expressed as combinations of series of action steps; those skilled in the art should understand, however, that the specific implementation of the technical solution of the present application is not limited by the order of the described series of action steps.

Based on the system architecture and functional modules described above, the distributed automatic differentiation method provided in the embodiments of the present application is now described. The method can be applied to a computing node in a distributed computing system. Referring to FIG. 12, which is a flowchart of a distributed automatic differentiation method, the method includes, but is not limited to, the following steps:

S101. The computing node obtains the forward propagation computation graph of the neural network model.

The forward propagation computation graph includes a plurality of operators, among which are local operators and communication operators. A local operator is an operator whose computation can be completed within the computing node; a communication operator is an operator whose computation requires communication between the computing node and at least one other computing node of the distributed computing system.

Specifically, the forward propagation computation graph of the neural network model may be configured in advance, for example obtained from a distributed forward network configured by the user, and the content of the forward propagation computation graph is stored in the local storage of the computing node.

The forward propagation computation graph, local operators, and communication operators have already been described above and are not repeated here.

S102. When a model parameter of the neural network model held by the computing node is identical to a model parameter held by at least one other computing node, the computing node inserts a target operator between the model parameter of the computing node and the operator that uses the model parameter for computation, obtaining a target computation graph.

The algorithm of the target operator passes the model parameter through unchanged in forward propagation, i.e., its input equals its output, and the derivative algorithm of the target operator sums, over all computing nodes that hold the same model parameter, the derivatives at that model parameter; for example, the derivative algorithm of the target operator is the all_reduce(SUM) operator.

Specifically, in the embodiments of the present application a mirror operator whose attribute is a communication group (group) may be defined through the mirror operator module; in forward propagation its output equals its input, and in back propagation its output equals an allreduce summation of its input over the communication group. The distributed forward propagation network supplied by the user is analyzed; for every model parameter that requires a derivative, the computing nodes holding a common copy of the model are identified, a communication group is created for those computing nodes, and in the forward propagation computation graph the mirror operator is inserted between the model parameter and the operator originally used to compute with that model parameter, thereby obtaining the target computation graph.

For the specific implementation of this step, reference may be made to the descriptions of the embodiments of FIG. 7 and FIG. 8, which are not repeated here.

S103. The computing node obtains the derivative algorithm of each operator in the target computation graph.

Specifically, the computing node may obtain the derivative algorithm of each communication operator, of each local operator, and of the target operator in the target computation graph. The derivative algorithms of the local operators are stored in the local storage of the computing node. The computing node may use the communication operator derivation module to differentiate each communication operator of the forward propagation computation graph, obtain the derivative algorithm of each communication operator, and save it into the communication operator derivative algorithm library.

The way the communication operators are differentiated has already been described above and is not repeated here.

S104. The computing node generates the back propagation computation graph of the neural network model according to the derivative algorithm of each operator in the target computation graph (including the local operators, the communication operators, and the mirror operator).

Specifically, through the single-machine automatic differentiation module and following the forward propagation flow on this computing node (for example the target computation graph), the computing node calls, in order from the output layer to the input layer, the derivative algorithm of each local operator, communication operator, and possibly present mirror operator from the library, and automatically generates the complete back propagation computation graph.

For the specific implementation of this step, reference may be made to the description of the embodiment of FIG. 11, which is not repeated here.

It can be seen that the embodiments of the present application design a general distributed differentiation scheme: developers only need to implement the distributed forward propagation network, without needing to care how back propagation of the distributed network proceeds or on which computing nodes the model parameters are mirrored (i.e., where model copies exist). The computing nodes of the distributed computing system can automatically compute the derivative algorithms of the communication operators and invoke them to pass derivative results across nodes (communication), and they automatically perform the insertion of the mirror operators and the generation and output of the back propagation computation graph. This avoids the drawback of the prior art that users must take part in the derivation, saves labor and material costs, improves development efficiency, and lowers development difficulty. The solution of the present application is applicable to the automatic differentiation problems of today's mainstream distributed networks such as data parallelism, model parallelism, and hybrid parallelism, and improves the user experience.

Based on the system architecture and functional modules described above, refer to FIG. 13, which is a flowchart of another distributed automatic differentiation method. The method can be applied to a computing node in a distributed computing system and includes, but is not limited to, the following steps:

S201. On each computing node, the derivative algorithm of each local operator is saved in advance into the local operator derivative algorithm library; the derivative algorithm of each communication operator is derived and saved into the communication operator derivative algorithm library; and the forward propagation computation graph of the neural network model is saved.

Specifically, each computing node may pre-configure in local storage a local operator library, a derivative algorithm library for the local operators, and a communication operator library. For each communication operator in the communication operator library, the communication operator derivation module provided in the embodiments of the present application may be used to define its "derivative" and describe the reverse logic corresponding to that communication operator, thereby generating the derivative library of the communication operators.

Specifically, all the computing nodes jointly carry out the computation of one complete neural network model, and in the distributed computation of the model (for example data parallelism, model parallelism, or hybrid parallelism) the forward propagation computation graphs of the individual computing nodes may differ, for example in their model parameters, input data, and computation rules. Each computing node can obtain its own forward propagation computation graph through the distributed forward propagation network. The distributed forward propagation network may be pre-configured; for example, it may be entered by the user through an external interface provided by the distributed computing system. The interface may take various forms, such as a web interface, a command-line tool, or a REST interface, which are not limited here.

S202. When a computing node needs to derive the back propagation computation graph, it obtains its forward propagation computation graph from local storage.

S203. The computing node identifies which model parameters are replicated on multiple computing nodes and inserts Mirror operators at the corresponding positions of the forward propagation computation graph, obtaining the target computation graph.

Specifically, in the embodiments of the present application a mirror operator whose attribute is a communication group (group) may be defined through the mirror operator module; in forward propagation its output equals its input, and in back propagation its output equals an allreduce summation of its input over the communication group. The distributed forward propagation network supplied by the user is analyzed; for every model parameter that requires a derivative, the computing nodes holding a common copy of the model are identified, a communication group is created for those computing nodes, and in the forward propagation computation graph the mirror operator is inserted between the model parameter and the operator originally used to compute with that model parameter, thereby obtaining the target computation graph.

It can be understood that the target computation graph contains the content of the original forward propagation computation graph together with the content of the newly added mirror operators. The target computation graph can serve as a logical intermediate variable used to derive the final back propagation computation graph and need not be visible externally.

For the specific implementation of this step, reference may be made to the descriptions of the embodiments of FIG. 7 and FIG. 8, which are not repeated here.

S204. The computing node determines the coefficient sens of the input data of the back propagation computation graph.

Specifically, through the Sens processing module the computing node analyzes the forward propagation flows (target computation graphs) of the computing nodes of the system after the mirror operators have been inserted, and thereby identifies whether the output layer of the neural network is computed redundantly. If so, it determines the coefficient sens of the input data of the current node's back propagation computation graph according to the number of computing devices performing the redundant computation.

For the specific implementation of this step, reference may be made to the descriptions of the embodiments of FIG. 9 and FIG. 10, which are not repeated here.

S205. The computing node generates and outputs the back propagation computation graph according to the derivative algorithm of each operator (including the local operators, the communication operators, and any mirror operators) in the forward propagation computation graph (or the target computation graph).

Specifically, through the single-machine automatic differentiation module and following the forward propagation flow on this computing node (for example the target computation graph), the computing node calls, in order from the output layer to the input layer, the derivative algorithm of each local operator, communication operator, and possibly present mirror operator from the library, and automatically generates the complete back propagation computation graph.

For the specific implementation of this step, reference may be made to the description of the embodiment of FIG. 11, which is not repeated here.

S206. Optionally, each computing node trains the neural network model according to the forward propagation computation graph and the back propagation computation graph, thereby implementing a distributed model training process.

It can be understood that, after each computing node has obtained its corresponding back propagation computation graph, it can train the neural network model using its forward propagation computation graph, its back propagation computation graph, and the training sample data. The specific training process is well known to those skilled in the art and is not expanded upon here; a purely illustrative sketch of one training step is given below.
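The following is a purely illustrative sketch of how the two graphs could be used for one training step; the graph executor `run_graph` and the plain SGD update are assumptions, not part of the embodiment.

```python
# Illustrative single training step using the forward and backward graphs.
# `run_graph` stands for whatever graph executor the framework provides.

def train_step(run_graph, forward_graph, backprop_graph, params, batch, lr=0.01):
    loss = run_graph(forward_graph, inputs=batch, params=params)
    grads = run_graph(backprop_graph, inputs=batch, params=params)
    for name, g in grads.items():                  # plain SGD update as an example
        params[name] = params[name] - lr * g
    return loss, params
```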

It can be seen that the embodiments of the present application design a general distributed differentiation scheme applicable to the automatic differentiation problems of today's mainstream distributed networks such as data parallelism, model parallelism, and hybrid parallelism. Developers only need to implement the distributed forward network, without needing to care how the reverse pass of the distributed network proceeds or on which nodes the parameters have copies. The computing nodes of the distributed computing system can automatically compute the derivative algorithms of the communication operators and invoke them to pass derivative results across nodes (communication), and they automatically perform the insertion of the mirror operators, the sens computation, and the generation and output of the back propagation computation graph. This avoids the drawback of the prior art that users must take part in the derivation, saves labor and material costs, improves development efficiency, lowers development difficulty, and thus improves the user experience.

Based on the same inventive concept, related devices of the embodiments of the present invention are provided below.

Referring to FIG. 14, FIG. 14 is a schematic diagram of a computing node 700 interacting with a client 740 and with operations personnel according to an embodiment of the present invention. The computing node 700 may include a plurality of processors 710 and a plurality of memories 720. The computing node 700 may be the computing node described in any of the foregoing embodiments.

The memory 720 may be used to store program code and data (such as sample data, model parameters, databases, and the code of the various functional modules). The computing node 700 also provides an external interface; the interface may take various forms, such as a web interface, a command-line tool, or a REST interface, which are not limited here.

The processor 710 may be used to run the various functional modules described above; specifically, the processor 710 may invoke the program instructions in the memory 720 to execute the method described in any of the foregoing method embodiments.

It should be understood that the computing node 700 is only one example provided by the embodiments of the present application; the computing node 700 may have more or fewer components than shown, may combine two or more components, or may be implemented with a different configuration of components.

Referring to FIG. 15, FIG. 15 is a schematic structural diagram of another computing node 800 provided by an embodiment of the present invention. The computing node 800 includes one or more processors 811, a communication interface 812, and a memory 813, which may be connected by a bus 814. In some possible implementations, the computing node 800 may be deployed, for example, in a single application server, in a terminal, or in a server cluster.

The communication interface 812 may be a wired interface (for example an Ethernet interface) used for communication and interaction with other computing nodes or with users.

The memory 813 may include volatile memory (Volatile Memory), such as random access memory (Random Access Memory, RAM); it may also include non-volatile memory (Non-Volatile Memory), such as read-only memory (Read-Only Memory, ROM), flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD), or a solid-state drive (Solid-State Drive, SSD); the memory may also include a combination of the above types of memory. The memory 813 may store program code and data (such as sample data, model parameters, databases, and the code of the various functional modules).

The processor 811 includes one or more general-purpose processors, where a general-purpose processor may be any type of device capable of processing electronic instructions, including a central processing unit (Central Processing Unit, CPU), a microprocessor, a microcontroller, a main processor, a controller, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), and the like. The processor 811 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 813, which enable the computing node 800 to provide a wide range of services. For example, the processor 811 can execute programs or process data to perform at least part of the method embodiments disclosed in the present application.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions; when the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one network site, computer, server, or data center to another network site, computer, server, or data center by wired means (for example coaxial cable, optical fiber, or digital subscriber line) or wireless means (for example infrared or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example a floppy disk, hard disk, or magnetic tape), an optical medium (for example a DVD), a semiconductor medium (for example a solid-state drive), or the like.

In the above embodiments, the descriptions of the individual embodiments have different emphases; for parts not described in detail in one embodiment, reference may be made to the relevant descriptions of the other embodiments.

Claims (23)

1. A distributed automatic differentiation method, applied to a computing node in a distributed computing system, characterized in that the method comprises:
obtaining a forward propagation computation graph of a neural network model, wherein the forward propagation computation graph comprises a plurality of operators, the plurality of operators comprising a local operator and a communication operator, the local operator being an operator whose computation can be completed within the computing node, and the communication operator being an operator whose computation relies on communication between the computing node and at least one computing node, other than the computing node, in the distributed computing system;
based on a model parameter of the neural network model stored by the computing node being identical to a model parameter stored by at least one computing node, other than the computing node, in the distributed computing system, inserting a target operator between the input of the model parameter on the computing node and the operator that computes with the model parameter, to obtain a target computation graph, wherein the target operator is used to pass the model parameter through transparently, and the derivative algorithm of the target operator is used to sum the derivatives of the model parameter over all computing nodes that have the same model parameter;
generating a back propagation computation graph of the neural network model based on the derivative algorithms of some or all of the operators in the target computation graph.

2. The method according to claim 1, characterized in that generating the back propagation computation graph of the neural network model based on the derivative algorithms of some or all of the operators in the target computation graph comprises:
obtaining the derivative algorithm of the communication operator, the derivative algorithm of the local operator, and the derivative algorithm of the target operator in the target computation graph;
generating the back propagation computation graph according to the derivative algorithm of the communication operator, the derivative algorithm of the local operator, and the derivative algorithm of the target operator.

3. The method according to claim 2, characterized in that the derivative algorithm of the local operator is stored in the local storage of the computing node, and the method further comprises:
performing a derivation operation on the communication operator in the forward propagation computation graph to obtain the derivative algorithm of the communication operator.

4. The method according to any one of claims 1 to 3, characterized in that the method further comprises:
determining a coefficient sens of the input data of the back propagation computation graph by identifying whether the output layer of the neural network model is computed redundantly by two or more computing nodes in the distributed computing system.

5. The method according to claim 4, characterized in that determining the sens of the input data of the back propagation computation graph by identifying whether the output layer of the neural network model is computed redundantly by two or more computing nodes in the distributed computing system comprises:
if the output layer is computed redundantly, modifying the value of sens according to the number of computing nodes performing the redundant computation;
if the output layer is not computed redundantly, keeping the value of sens unchanged.

6. The method according to any one of claims 1 to 3 and 5, characterized in that inserting the target operator between the input of the model parameter on the computing node and the operator that computes with the model parameter, to obtain the target computation graph, comprises:
determining that the model parameter of the neural network model stored by the computing node is identical to the model parameter stored by at least one computing node, other than the computing node, in the distributed computing system, wherein computing nodes storing the same model parameter belong to the same communication group;
inserting the target operator between the model parameter of the computing node and the operator that computes with the model parameter, to obtain the target computation graph, wherein the output of the target operator equals the input of the target operator, and the derivative algorithm of the target operator sums the derivatives of the model parameter over all computing nodes in the communication group.

7. The method according to claim 4, characterized in that inserting the target operator between the input of the model parameter on the computing node and the operator that computes with the model parameter, to obtain the target computation graph, comprises:
determining that the model parameter of the neural network model stored by the computing node is identical to the model parameter stored by at least one computing node, other than the computing node, in the distributed computing system, wherein computing nodes storing the same model parameter belong to the same communication group;
inserting the target operator between the model parameter of the computing node and the operator that computes with the model parameter, to obtain the target computation graph, wherein the output of the target operator equals the input of the target operator, and the derivative algorithm of the target operator sums the derivatives of the model parameter over all computing nodes in the communication group.

8. The method according to claim 6, characterized in that the derivative algorithm of the target operator is an all_reduce summation operator.

9. The method according to claim 7, characterized in that the derivative algorithm of the target operator is an all_reduce summation operator.

10. The method according to any one of claims 1 to 3, 5, and 7 to 9, characterized in that, after obtaining the derivative algorithm of each operator in the target computation graph to generate the back propagation computation graph of the neural network model, the method further comprises:
training the neural network model according to the forward propagation computation graph and the back propagation computation graph.

11. The method according to claim 4, characterized in that, after obtaining the derivative algorithm of each operator in the target computation graph to generate the back propagation computation graph of the neural network model, the method further comprises:
training the neural network model according to the forward propagation computation graph and the back propagation computation graph.

12. The method according to claim 6, characterized in that, after obtaining the derivative algorithm of each operator in the target computation graph to generate the back propagation computation graph of the neural network model, the method further comprises:
training the neural network model according to the forward propagation computation graph and the back propagation computation graph.

13. A computing node, applied to a distributed computing system, characterized in that the computing node comprises:
a single-machine automatic differentiation module, configured to obtain a forward propagation computation graph of a neural network model, wherein the forward propagation computation graph comprises a plurality of operators, the plurality of operators comprising a local operator and a communication operator, the local operator being an operator whose computation can be completed within the computing node, and the communication operator being an operator whose computation relies on communication between the computing node and at least one computing node, other than the computing node, in the distributed computing system;
a mirror operator module, configured to: based on a model parameter of the neural network model stored by the computing node being identical to a model parameter stored by at least one computing node, other than the computing node, in the distributed computing system, insert a target operator between the input of the model parameter on the computing node and the operator that computes with the model parameter, to obtain a target computation graph, wherein the target operator is used to pass the model parameter through transparently, and the derivative algorithm of the target operator is used to sum the derivatives of the model parameter over all computing nodes that have the same model parameter;
wherein the single-machine automatic differentiation module is further configured to generate a back propagation computation graph of the neural network model based on the derivative algorithms of some or all of the operators in the target computation graph.

14. The computing node according to claim 13, characterized in that the single-machine automatic differentiation module is specifically configured to:
obtain the derivative algorithm of the communication operator, the derivative algorithm of the local operator, and the derivative algorithm of the target operator in the target computation graph;
generate the back propagation computation graph according to the derivative algorithm of the communication operator, the derivative algorithm of the local operator, and the derivative algorithm of the target operator.

15. The computing node according to claim 14, characterized in that the derivative algorithm of the local operator is stored in the local storage of the computing node, and the computing node further comprises a communication operator derivation module configured to:
before the derivative algorithm of each operator in the target computation graph is obtained, perform a derivation operation on the communication operator in the forward propagation computation graph to obtain the derivative algorithm of the communication operator.

16. The computing node according to any one of claims 13 to 15, characterized in that the computing node further comprises a sens processing module configured to:
determine a coefficient sens of the input data of the back propagation computation graph by identifying whether the output layer of the neural network model is computed redundantly by two or more computing nodes in the distributed computing system.

17. The computing node according to claim 16, characterized in that the sens processing module is specifically configured to: when the output layer is computed redundantly, modify the value of sens according to the number of computing nodes performing the redundant computation; and when the output layer is not computed redundantly, keep the value of sens unchanged.

18. The computing node according to any one of claims 13 to 15 and 17, characterized in that the mirror operator module is specifically configured to:
determine that the model parameter of the neural network model stored by the computing node is identical to the model parameter stored by at least one computing node, other than the computing node, in the distributed computing system, wherein computing nodes storing the same model parameter belong to the same communication group;
insert the target operator between the model parameter of the computing node and the operator that computes with the model parameter, to obtain the target computation graph, wherein the output of the target operator equals the input of the target operator, and the derivative algorithm of the target operator sums the derivatives of the model parameter over all computing nodes in the communication group.

19. The computing node according to claim 16, characterized in that the mirror operator module is specifically configured to:
determine that the model parameter of the neural network model stored by the computing node is identical to the model parameter stored by at least one computing node, other than the computing node, in the distributed computing system, wherein computing nodes storing the same model parameter belong to the same communication group;
insert the target operator between the model parameter of the computing node and the operator that computes with the model parameter, to obtain the target computation graph, wherein the output of the target operator equals the input of the target operator, and the derivative algorithm of the target operator sums the derivatives of the model parameter over all computing nodes in the communication group.

20. The computing node according to claim 18, characterized in that the derivative algorithm of the target operator is an all_reduce summation operator.

21. The computing node according to claim 19, characterized in that the derivative algorithm of the target operator is an all_reduce summation operator.

22. A computing device, characterized in that the computing device comprises a memory, a communication interface, and a processor coupled to the memory and the communication interface, wherein the memory is configured to store program instructions, the processor is configured to execute the program instructions, and the communication interface is configured to communicate, under the control of the processor, with other computing nodes of a distributed computing system; and when executing the program instructions, the processor performs the steps of the method according to any one of claims 1 to 12.

23. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 12 are implemented.
CN202010231550.XA 2020-03-27 2020-03-27 A distributed automatic differentiation method and related device Active CN113449842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010231550.XA CN113449842B (en) 2020-03-27 2020-03-27 A distributed automatic differentiation method and related device


Publications (2)

Publication Number Publication Date
CN113449842A (en) 2021-09-28
CN113449842B (en) 2024-09-27

Family

ID=77808111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010231550.XA Active CN113449842B (en) 2020-03-27 2020-03-27 A distributed automatic differentiation method and related device

Country Status (1)

Country Link
CN (1) CN113449842B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114330698B (en) * 2022-03-15 2022-08-05 之江实验室 Neural model storage system and method of brain-like computer operating system
CN114897146B (en) * 2022-05-18 2023-11-03 北京百度网讯科技有限公司 Model generation method, device and electronic device
CN116560847B (en) * 2023-05-19 2023-10-27 北京百度网讯科技有限公司 Task processing methods, devices, electronic equipment and storage media

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007048371A2 (en) * 2005-10-28 2007-05-03 Forschungszentrum Jülich GmbH Method for the reconstruction of one-dimensional or multidimensional data
CN109144729A (en) * 2018-08-27 2019-01-04 联想(北京)有限公司 The data processing method and distributed system of distributed system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11072067B2 (en) * 2015-11-16 2021-07-27 Kindred Systems Inc. Systems, devices, and methods for distributed artificial neural network computation
US10671908B2 (en) * 2016-11-23 2020-06-02 Microsoft Technology Licensing, Llc Differential recurrent neural network
US11106998B2 (en) * 2017-05-10 2021-08-31 Petuum Inc System with hybrid communication strategy for large-scale distributed deep learning
US11080363B2 (en) * 2017-12-15 2021-08-03 International Business Machines Corporation Free-form discovery of differential equations
WO2020046719A1 (en) * 2018-08-31 2020-03-05 D5Ai Llc Self-supervised back propagation for deep learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007048371A2 (en) * 2005-10-28 2007-05-03 Forschungszentrum Jülich GmbH Method for the reconstruction of one-dimensional or multidimensional data
CN109144729A (en) * 2018-08-27 2019-01-04 联想(北京)有限公司 The data processing method and distributed system of distributed system

Also Published As

Publication number Publication date
CN113449842A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
JP7433373B2 (en) Distributed training method, device, electronic device, storage medium and computer program for deep learning models
CN110908667B (en) Method and device for joint compilation of neural network and electronic equipment
US11500959B2 (en) Multiple output fusion for operations performed in a multi-dimensional array of processing units
US20220391678A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
CN112513886B (en) Information processing method, information processing apparatus, and information processing program
CN114008594A (en) Scheduling operations on the computational graph
CN113449842B (en) A distributed automatic differentiation method and related device
CN112084038A (en) Memory allocation method and device of neural network
CN114327399B (en) Distributed training method, device, computer equipment, storage medium and product
CN109669772A (en) Parallel execution method and equipment of computational graph
WO2024120050A1 (en) Operator fusion method used for neural network, and related apparatus
Ying et al. Bluefog: Make decentralized algorithms practical for optimization and deep learning
CN113391907B (en) A task placement method, device, equipment and medium
CN116841710A (en) Task scheduling method, task scheduling system and computer storage medium
CN114356738B (en) Method and related products for estimating the time required to execute a neural network model
CN117764122B (en) Computational graph processing method, device, electronic device and storage medium
CN111935026A (en) Data transmission method, device, processing equipment and medium
CN114021708B (en) Data processing method, device and system, electronic equipment and storage medium
CN119225918A (en) Operator processing method and related device
CN116431205A (en) Data stream policy generation method and device, electronic equipment and storage medium
WO2024082551A1 (en) Operator fusion method, computing apparatus, computing device and readable storage medium
CN111488216B (en) Data processing method, device and electronic equipment
CN118034695A (en) Computational graph compilation method, compilation device, computing device and storage medium
CN114841337A (en) Neural network model optimization method and device, electronic equipment and storage medium
CN119005271B (en) Neural network model parallel optimization method and device based on operator partitioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant